Speeding up ETL data cleansing

While I was at the PASS Summit this year, I was sitting in Paul Turley’s (b|t) session on creating integrated BI solutions. Paul was talking about how ETL takes up most of a BI solution’s development lifetime, no matter how well you plan for it. That’s when someone from the audience asked: “…methodology or tool in the MS stack to shorten the time taken for ETL?”. Paul’s answer was: “There’s no silver bullet, but I’m going to say it’s Power Query”. This got me thinking, since my team had been struggling for a few weeks to get data cleansed on the project they were working on.

Despite having a data profiling document, which described how the source data mapped to the data model, along with the rules that needed to be applied to that data, such as nullability, constraints, valid domain values and so forth, bad data continued to come in. Apart from the bad data, the feeds kept throwing up scenarios that had never been identified, or even thought of, before. Though we usually do expect some amount of bad data or anomalies, which can be reported on, this was pretty bad. We needed a data set that would not fail the ETL, so that it could at least be tested against all the business logic we had identified as requirements.

The issue was that the team had to build the Integration Services packages first, and only on running them find out about the various data problems that kept spilling out. So it took a while for the developer to identify the problem, realize it was a data issue, report it to the client, have it fixed by the client (and sometimes fix the package), and repeat. There were times when I had to jump in to help figure out what the data problems were. But because getting onto a tool (especially a non-Microsoft one), connecting to the source, and then writing SQL queries to retrieve and check the data was quite a slow process, I preferred using Power Query.

Power BI is a self-service tool that can connect to many different types of data source. The best thing about it is the ease of pulling in data: building joins, filtering, sorting and performing all the data manipulations you want to do is made super easy. And you can save it all in just one file and use it the next time you need it. So I get it (or at least formed an impression in my head), and related to it when Paul named Power Query as the tool to speed up the lifetime of ETL development.

Explaining how you could go about it is a post for the future, but I am presenting it at the Sri Lankan Data Community Meetup on Nov 29th in Colombo, if you are interested 😉

Ugly is Beautiful: BCP in SSIS

We’ve always (or more often than not) favored using the out-of-the-box functionality available with SSIS when dumping data from one location to another. For instance, a project that I recently worked on had a requirement to dump data from several tables out to corresponding comma-separated (CSV) files. The approach I took was to use a data flow task for each table: I simply added a data flow source for the SQL table (with a stored procedure to select out the data) and a data flow destination for the CSV, dragged the little green arrow, and the fields were automatically mapped (because the table and the CSV file had the same column names). Then I repeated this for each table. Pretty and simple, no? Yes.
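For context, each source was nothing more than a thin stored procedure over the table, along these lines (a minimal sketch; the procedure, table and column names here are made up for illustration):

-- Hypothetical export procedure used as the OLE DB source for one table.
-- The column names match the CSV headers, so SSIS maps the fields automatically.
CREATE PROCEDURE dbo.usp_ExportProducts
AS
BEGIN
    SET NOCOUNT ON;

    SELECT ProductID,
           ProductName,
           ListPrice
    FROM dbo.Product;
END;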

Dumping out the data of five very large tables, however, took almost 35 minutes. Too much time: according to the requirement, the whole thing had to run within half an hour.

Help came in the form of an ugly betty: the Bulk Copy Program (BCP), the old, plain-looking yet trusted command line tool of SQL Server. The result: a whopping 29-minute reduction. The whole thing ran in just 6 minutes!

Ugly is indeed beautiful.
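For the curious, the swap boiled down to one bcp call per table, kicked off from an Execute Process Task. A minimal sketch, assuming an export procedure like the hypothetical usp_ExportProducts above, with made-up server, database and path names:

bcp "EXEC StagingDB.dbo.usp_ExportProducts" queryout "D:\Export\Products.csv" -c -t, -S MyServer -T

Here -c exports character data, -t, sets the comma as the field terminator, and -T uses Windows authentication (swap it for -U and -P if SQL logins are in use). One thing to keep in mind: unlike the flat file destination, bcp does not write a header row, so add one separately if the downstream consumer expects it.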

Random SSIS Advice: Limit using data flow tasks that sort, union, join

When using a Data Flow Task in Integration Services, try to avoid performing data manipulations such as unions, sorts and joins with the data flow transformations. Instead, try to include all join, sort and union operations within your source queries themselves. This way you can do away with the run-time issues, usually various memory restriction errors, that crop up when running your package against large amounts of data.
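In other words, do something like this in the OLE DB source (a rough sketch with made-up table names), instead of wiring up Sort and Merge Join components in the data flow:

-- Push the join and the ordering into the source query
-- rather than using Sort + Merge Join transformations.
SELECT o.OrderID,
       o.OrderDate,
       c.CustomerName
FROM dbo.Orders AS o
INNER JOIN dbo.Customer AS c
    ON c.CustomerID = o.CustomerID
ORDER BY o.OrderID;

The Sort transformation is fully blocking and Merge Join is semi-blocking, so both buffer rows in memory, which is typically where those errors come from; the database engine, on the other hand, can lean on its indexes.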


Many Flat Files. One Connection Manager.


I came upon this solution completely by chance when developing my latest Integration Services (SSIS) project, and it has indeed made my life very easy. I had about 50 flat files, each with a different number of fields, and each of those files had to be populated from a different table. This meant that I had to create a Flat File Connection Manager per flat file. Imagine 50 flat file connection managers. Imagine configuring each of them when deploying to QA, Live and all those levels.

And then I considered the MULTIFLATFILE Connection Manager, which can be configured to connect to multiple flat files. The issue is that all the files need to have the same metadata, i.e. the same number of columns and data types. Also, when you set it up, at design time you can only see the first file that you selected; the rest of the files expose themselves only at run time. That was not going to solve my problem, since each of my flat files had different metadata. They were like tables in a database. And each “table” was different from the other. Table… Database…!!! That was when I remembered a connection string format for text files from some time ago.

Connection Manager for Flat Files 01
Connection Manager for Flat Files 02

So I built myself an OLE DB connection manager using the good ol’ Jet 4.0 OLE DB provider, the resulting connection string being:

Provider=Microsoft.Jet.OLEDB.4.0; Data Source=D:\FlatFiles\; Extended Properties="Text;HDR=Yes;FMT=Delimited"

Note that the data source points to a folder rather than a file – the folder that contains the CSV files. The end result is a Connection Manager that connects to a folder, which is analogous to a database containing tables. The tables, in this case, are the CSV files. This is a view of how the CSV files from the folder below are displayed in an OLE DB Source task:

A view of the CSV files in a folder
A view of how each CSV is displayed as a table in a database
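With that connection manager in place, an OLE DB Source can query a file in the folder as if it were a table. A small sketch, assuming a file named Products.csv sits in D:\FlatFiles\:

-- Under the Jet text driver, the file name acts as the table name.
SELECT *
FROM [Products.csv];

If the defaults for delimiters, headers or data types don’t suit, they can be tuned per file with a schema.ini file placed in the same folder.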

Unpivot Transformation with Multiple Destination Columns

The need to unpivot data during an ETL operation is quite common. One method is to write the source query in such a way that the data starts off unpivoted, using the UNPIVOT T-SQL statement in SQL Server 2005 and later. This, however, would obviously not work on a source that does not support unpivoting, such as a text file, Office Excel, or even older versions of popular RDBMSs. The other method, of course, is to use the Unpivot Transformation in Integration Services.
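For reference, the source-query approach looks roughly like this in T-SQL (a sketch, using made-up column names in the spirit of the example that follows):

-- Turn the Quantity1/Quantity2/Quantity3 columns into one row per quantity.
SELECT ProductName,
       QuantityColumn,
       Quantity
FROM Product_Source
UNPIVOT
(
    Quantity FOR QuantityColumn IN (Quantity1, Quantity2, Quantity3)
) AS u;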

The Unpivot Transformation is quite a simple component, especially if you need to unpivot a bunch of columns to a single destination column. Unpivoting to multiple destination columns, however, seems a tad challenging the first time round; at least it did for me, until I figured out how. This post is all about unpivoting columns to multiple destination columns.

Consider this scenario. A simple Product table (which I shall call Product_Source from this point forward):

Source Table (Product_Source)

Needs to be transformed into the following Product table (which I shall call Product_Destination from this point forward):

Destination Table (Product_Destination)

Configuring the Unpivot transformation for this would be quite simple, and would look like this:

Unpivot Transformation for Single Destinations

The problem starts when you need the Product_Destination table looking like this, with multiple destination columns (Quantity and Price):

Destination Table (Product_Destination) with Multiple Destination Columns

You add the price columns as well (as in the image below) to the Unpivot transformation,

Unpivot Transformation for Multiple Destination Columns

and get the following validation error (encountered by a lot of people):

PivotKeyValue is not valid. In an UnPivot transform with more than one unpivoted DestinationColumn, the set of PivotKeyValues per destination must match exactly.

The reason is this: if you look closely at the Pivot Key Value column of the Unpivot transformation, you will notice that for the Quantity destination column the set of values is ‘Quantity1’, ‘Quantity2’ and ‘Quantity3’, whereas for the Price destination column the set of values is ‘Price1’, ‘Price2’ and ‘Price3’, which clearly do not match. The solution is quite trivial: change the default values in the Pivot Key Value column of the offending destination column to match the values of the other destination columns. Like this:

Mapping for Unpivot Transformation for Multiple Destination Columns

And you’re good to go!
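For comparison, when the source does support SQL (SQL Server 2008 or later), the same multi-column unpivot can be pushed into the source query with CROSS APPLY and a VALUES constructor; a sketch, again with the made-up column names from above:

-- Unpivot Quantity1..3 and Price1..3 into paired Quantity/Price rows.
SELECT s.ProductName,
       v.Label,
       v.Quantity,
       v.Price
FROM Product_Source AS s
CROSS APPLY (VALUES
    ('1', s.Quantity1, s.Price1),
    ('2', s.Quantity2, s.Price2),
    ('3', s.Quantity3, s.Price3)
) AS v (Label, Quantity, Price);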

I’ve attached a sample solution of this scenario, so that you too can try it out first-hand.