Speeding up ETL data cleansing

While I was at the PASS Summit this year, I was sitting in Paul Turley’s (b|t) session on creating integrated BI solutions. Paul was talking about how ETL takes up most of a BI solution’s development lifetime, regardless of how well you plan for it. That’s when someone from the audience asked: “…methodology or tool in the MS stack to shorten the time taken for ETL?”. Paul’s answer was: “There’s no silver bullet, but I’m going to say it’s Power Query”. This got me thinking, since for a few weeks my team had been struggling to get data cleansed on the project they were working on.

Despite having a data profiling document, which described how source data mapped to the data model along with the rules that needed to be applied to the data, such as nullability, constraints, valid domain values and so forth, bad data continued to come in. On top of the bad data, the feeds kept surfacing scenarios that had not been identified or even thought of before. Though we usually do expect some amount of bad data or anomalies, which can be reported on, this was pretty bad. We needed a data set that would not fail the ETL, so that it could at least be tested against all the business logic we had identified as requirements.

The issue was that the team had to build the Integration Services packages first, and only by running them would they finally find out the various data problems that kept spitting out. So it took a while for a developer to identify a problem, realize it was a data issue, report it to the client, have it fixed by the client (and sometimes fix the package as well), and repeat. There were times when I had to jump in to help figure out what the data problems were. But because getting onto a tool (especially a non-Microsoft one), connecting to the source, and then writing SQL queries to retrieve and check the data was quite a slow process, I preferred using Power Query.
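
To give an idea of the kind of checking involved, this is roughly the sort of ad-hoc profiling query one ends up writing against a source to test it against the rules in the profiling document. The table and column names below are purely illustrative, not from the actual project:

    -- A minimal profiling sketch, assuming a hypothetical source table dbo.CustomerSource
    -- with columns CustomerCode, CountryCode and BirthDate (names are illustrative only).
    SELECT
        COUNT(*)                                                AS TotalRows,
        SUM(CASE WHEN CustomerCode IS NULL THEN 1 ELSE 0 END)   AS NullCustomerCodes,   -- nullability rule
        SUM(CASE WHEN CountryCode NOT IN ('LK', 'US', 'GB')
                 THEN 1 ELSE 0 END)                             AS InvalidCountryCodes, -- domain value rule
        SUM(CASE WHEN BirthDate > GETDATE() THEN 1 ELSE 0 END)  AS FutureBirthDates     -- range rule
    FROM dbo.CustomerSource;

With Power Query the same checks can be done by just pulling the data in and filtering and inspecting it interactively, which is exactly why it felt so much faster.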

Power BI is a self-service tool; it can connect to many different types of data sources, and the best thing about it is the ease of pulling in data: building joins, filtering, sorting and performing whatever data manipulations you want is made super easy. And you can save it all in just one file and use it the next time you need it. So I got it (or at least formed an impression in my head), and related to it, when Paul named Power Query as the tool to speed up the lifetime of ETL development.

Explaining how you could go about it is another post for the future, but I am presenting it at the Sri Lankan Data Community Meetup on Nov 29th in Colombo, if you are interested 😉

Unpivot Transformation with Multiple Destination Columns

The need to unpivot data during an ETL operation is quite common. One method is to write the source query in such a way that the data starts off unpivoted, using the UNPIVOT T-SQL statement available in SQL Server 2005 and later. This, however, will obviously not work for a source that does not support unpivoting, such as a text file, an Excel workbook, or even older versions of popular RDBMSs. The other method, of course, is to use the Unpivot Transformation in Integration Services.
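
For reference, the source-query approach looks roughly like this in T-SQL, assuming the Product_Source table described below has columns ProductID, Quantity1, Quantity2 and Quantity3 (the column names are assumptions based on the scenario):

    -- A sketch of unpivoting in the source query itself (SQL Server 2005 and later).
    -- Product_Source and its columns are assumed names based on the scenario below.
    SELECT  ProductID,
            QuantityColumn,   -- the name of the source column the value came from
            Quantity          -- the unpivoted value
    FROM    Product_Source
    UNPIVOT (Quantity FOR QuantityColumn IN (Quantity1, Quantity2, Quantity3)) AS unpvt;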

The Unpivot Transformation is quite a simple component, especially if you need to unpivot a bunch of columns to a single destination column. Unpivoting to multiple destination columns, however, seemed a tad challenging the first time round, at least to me, until I figured out how to do it. This post is all about unpivoting columns to multiple destination columns.

Consider this scenario. A simple Product table (which I shall call Product_Source from this point forward):

Source Table (Product_Source)

Needs to be transformed into the following Product table (which I shall call Product_Destination from this point forward):

Destination Table (Product_Destination)

Configuring the Unpivot transformation for this would be quite simple, and would look like this:

Unpivot Transformation for Single Destinations

The problem starts when you need the Product_Destination table looking like this, with multiple destination columns (Quantity and Price):

Destination Table (Product_Destination) with Multiple Destination Columns

You add the price columns as well (as in the image below) to the Unpivot transformation,

Unpivot Transformation for Multiple Destination Columns

and get the following validation error (encountered by a lot of people):

PivotKeyValue is not valid. In an UnPivot transform with more than one unpivoted DestinationColumn, the set of PivotKeyValues per destination must match exactly.

The reason is this: if you look closely at the Pivot Key Value column of the Unpivot transformation, you will notice that for the Quantity destination column the set of values is ‘Quantity1’, ‘Quantity2’ and ‘Quantity3’, whereas for the Price destination column the set of values is ‘Price1’, ‘Price2’ and ‘Price3’, which clearly do not match. The solution is quite trivial: change the default values in the Pivot Key Value column of the offending destination column to match the values of the other destination column. Like this:

Mapping for Unpivot Transformation for Multiple Destination Columns

And you’re good to go!

I’ve attached a sample solution of this scenario, so that you too can try it out first-hand.
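
And if you would rather keep the logic in the source query, the same multiple-destination unpivot can be sketched in T-SQL, where each row of the list pairs a Quantity column with its matching Price column under one common pivot key value, which is exactly the matching rule the transformation enforces. The column names are assumed from the scenario; CROSS APPLY with a VALUES constructor needs SQL Server 2008 or later, while on 2005 a UNION ALL derived table does the same job:

    -- A sketch of a multi-destination unpivot in the source query (names assumed from the scenario).
    SELECT  p.ProductID,
            v.PivotKeyValue,
            v.Quantity,
            v.Price
    FROM    Product_Source AS p
    CROSS APPLY (VALUES ('Quantity1', p.Quantity1, p.Price1),
                        ('Quantity2', p.Quantity2, p.Price2),
                        ('Quantity3', p.Quantity3, p.Price3)
                ) AS v (PivotKeyValue, Quantity, Price);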

Come Undone

Recently I was working on some ETL packages to transfer some flat file information to a SQL Server 2005 staging table. The particular package I was working on required removing some header and trailing records and then transferring the rest of the data. Some of the trailing records could only be identified by a particular junk character that appeared towards the end of the file. This resulted in me checking for this character in a Script Task in the SSIS package and then removing the line, among other things. The code for this header and trailer removing phase was around 40 to 50 lines long. The problem was that the script never executed. And when I went back to check the script, the entire code in the Script Task was missing! (As if some techno David Copperfield had waved his wand over my code.) I rewrote the code twice over, tried saving the entire project and all kinds of silly things, yet it wouldn’t keep my code… Commenting it out line by line wouldn’t work either.

Only removing line by line showed me the perpetrator:

If str.Contains("?") Then 'ASCII Code 26 (Some fonts show it using a 'question mark' while others show a 'box')

So I went and copied this character into the command prompt to see what it actually was (this is because different types of junk characters are all displayed using “?”). Guess what showed up on the command prompt: ^Z (a.k.a. Undo).

Which meant that every time I saved and closed the script, the whole set of code came undone! After a quick word of admonishment to myself for not using proper standards, I quickly rewrote the script once again, but this time using the ASCII value of the character instead of the literal character.

UPDATE: The above package was created using SP1 of SQL Server 2005. I just tried out the same thing with SP2, and the issue seems partly fixed: the code doesn’t disappear when I close the Script Editor window, but it is cleared once I close the entire project and open it up again, unless of course the offending character is removed in its entirety…

The 5 Year Difference

Data Warehousing/Data Marting nearly always requires ETL. And most of the time it is the ETLing of data to create dimensions. The one type of dimension I have worked with most is the type now known as a Slowly Changing Dimension (a term I did not hear used much before the advent of SQL Server 2005, but still).

The ETL tool that I had been using, Data Transformation Services (DTS) in SQL Server 2000, was a decent enough tool for the purpose. I was quite content with it until along came SQL Server 2005, carrying with it DTS’s new form: SSIS (SQL Server Integration Services). SSIS gives you things that could only be dreamt of when working with DTS.

Let me now explain this magic by walking through how I would go about ETLing data from a source to a destination using each of the two technologies: first DTS, then SSIS.

The source is a SQL Server 2000 table of Products already arranged in a parent-child hierarchy, and looking like this:



The destination for the DTS process will be a SQL Server 2000 table laid out in a structure which would later make up a star schema, like this:


while the destination for the SSIS process will be a SQL Server 2005 table laid out just like the one above, like this:

while the data after the ETL in both cases should look like this:

DTS Steps (SQL Server 2000)

  • Create a new DTS package under Data Transformation Services in SQL Server Enterprise Manager.
  • Create two database connections.
    • One for the source
    • The other for the destination
  • Add a Data Driven Query Task.
  • Write a SQL statement to build a nice result set to transfer, under the Source tab.
  • Set the destination.
  • Create a Lookup so that we can check whether the record being transferred is a new record or an existing record.
  • Write an ActiveX script with logic to check whether the record exists in the destination (using the Lookup created earlier) and appropriately call the Insert Query or the Update Query. (Takes quite a long time.)
  • Write the Insert Query and the Update Query and set their properties (a sketch of what these two queries might look like follows this list).
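
For the curious, here is roughly what that last pair of queries might look like for a simple Type 1 product dimension. The table and column names are assumptions for illustration; the ? placeholders are the parameter markers the Data Driven Query Task binds to the columns of the source result set:

    -- Assumed dimension table: DimProduct (ProductKey identity, ProductAltKey, ProductName, CategoryName).
    -- Insert Query: called by the ActiveX script when the Lookup finds no matching row.
    INSERT INTO DimProduct (ProductAltKey, ProductName, CategoryName)
    VALUES (?, ?, ?);

    -- Update Query: called when the Lookup finds the product already in the dimension.
    UPDATE DimProduct
    SET    ProductName  = ?,
           CategoryName = ?
    WHERE  ProductAltKey = ?;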

SSIS Steps (SQL Server 2005)

  • Create a new Integration Services project in Business Intelligence Development Studio.
  • Create two connection managers for the source and destination.
  • Add a Data Flow Task.
  • Add a Data Flow Source to the task and set the source to it.
  • Add a Slowly Changing Dimension object. (The wizard opens up to unveil the magic!)
  • Write a SQL statement to build a nice result set to transfer.
  • Continue with the wizard and voila!!! No ActiveX scripts, no lookups to worry about, no Insert Queries or Update Queries!!! Your ETL is done! Basically the only thing you wrote was the source query, which again can be built using the Query Builder (a sketch of such a source query follows this list).
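
As an aside, the “nice result set” in both cases comes from flattening the parent-child Product table. SQL Server 2000 has no recursive CTEs, so a fixed-depth self-join is the era-appropriate way to sketch it; the column names ProductID, ParentID and ProductName are assumptions, since the actual layout was only shown in the screenshots:

    -- A sketch of a source query flattening a two-level parent-child Product table
    -- into star-schema friendly columns (ProductID, ParentID, ProductName are assumed names).
    SELECT  child.ProductID,
            child.ProductName,
            parent.ProductName AS CategoryName
    FROM    Product AS child
            LEFT JOIN Product AS parent
                   ON child.ParentID = parent.ProductID;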

So now I can build my dimensions quickly, and without writing any code; just a series of clicks and drags!

To VIEW or not to VIEW

Imagine this scenario:
An Oracle 9i database with tens of thousands of transactional records, which accumulate during the day. This data was to be ETLed into a Microsoft SQL Server 2000 data mart (in reality, a database), and a dimensional data mart (a SQL Server 2000 Analysis Services database) would then be created on top of this relational data mart. All this should happen during the night and be ready the following morning. That’s because, when the day dawns, the decision makers arrive. And when they do not find properly updated cubes in the data mart, they would rather slice and dice my… um… donkey!

Now to summarize what we have in the SQL Server 2000 database… We have dimension tables, transaction tables (which are used for staging purposes) and fact tables (from which the cubes will be created). The dimension tables are populated first, then the transaction tables, and then the fact tables (which get their data from the transaction tables).

Another hitch, apart from getting all the data into the cubes before dawn, was that some transactions (i.e. those from up to 2 months back) could change! Yes, you can well imagine my annoyance!

But as I had to follow orders, I came up with a plan: delete the transactional data from the current date back to 2 months before, on both the transaction and fact tables, and then perform the ETL from that point onward. This, of course, increased the number of records I had to ETL every night. But since the process was going OK and the decision makers were able to slice and dice their cubes every day, I was fine.
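
In T-SQL terms, the nightly run boiled down to something like the following before the transfer kicked off. The table and column names (TransactionStaging, FactSales, TransactionDate) are illustrative only:

    -- A sketch of the rolling two-month reload window (table and column names are assumed).
    DECLARE @CutOff DATETIME
    SET @CutOff = DATEADD(MONTH, -2, GETDATE())

    DELETE FROM FactSales          WHERE TransactionDate >= @CutOff
    DELETE FROM TransactionStaging WHERE TransactionDate >= @CutOff
    -- ...then re-run the Oracle-to-staging and staging-to-fact loads from @CutOff onward.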

The Problem:
The problem started 2 months later, when the number of people performing transactions started to increase. In the end, the entire ETL was crawling on its belly for more than 12 hours!!! There were actually two things contributing to this:

  1. The large number of transactions
  2. The under-configured server (I couldn’t take the blame alone, could I?)

Coming back to the large number of transactions, I discovered something staggering! The ETL from Oracle to the staging transaction tables was happening quite fast (a little more than 2 hours), but populating the fact tables was what was eating it all up! Imagine that! Transferring data from a couple of transaction tables to a fact table (both residing in the same database) taking close to half a day!

Something worth noting is how the transaction-table-to-fact-table ETL was designed: stored procedures using cursors to look up around half a dozen dimensions, twice or thrice over.

The Solution:
The solution I implemented: I dropped all the stored procedures, dropped all the fact tables, created views on top of the transaction tables, and made them look exactly like the fact tables. The next step was to fool the cubes into believing the views were actually the fact tables, which was quite simple. Then we changed the server to match the required configuration. And voila! Man and machine solved the issue. The process now takes 1 1/2 hours tops to finish.
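
To make that concrete, here is a minimal sketch of the kind of view that stood in for a fact table; the cube was then simply pointed at the view instead of the dropped table. All object and column names here are illustrative, not the actual ones:

    -- A minimal sketch of a view masquerading as a fact table (all names are illustrative).
    -- Dimension keys are resolved with set-based joins instead of per-row cursor lookups.
    CREATE VIEW vFactSales
    AS
    SELECT  d.ProductKey,
            t.TransactionDate,
            t.Quantity,
            t.Amount
    FROM    TransactionStaging AS t
            INNER JOIN DimProduct AS d
                    ON d.ProductAltKey = t.ProductCode;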

Conclusion:
Which of the solutions actually did the trick? I didn’t want to find out. I’d like to believe it was both!