Ugly is Beautiful: BCP in SSIS

We’ve always (or most often than not) favor using the out-of-the-box functionality available with SSIS when dumping data from one location to another. Say for instance, a project that I recently worked on, had a requirement to dump out data from several tables to corresponding comma separated (CSV) files. The approach I took was to use a data flow task for each table, and simply added a data flow source for the SQL table (wrote a stored procedure to select out the data) and a  data flow destination for the CSV, dragged the little green arrow, and the fields were automatically set (because the table and the CSV file had the same column names). Then repeated this for each table. Pretty and Simple no? Yes.

Dumping out the data of 5 very large tables however, took almost 35 minutes. Too much time. The whole thing had to run within half an hour according to the requirement.

Help came in the form of an ugly betty: Bulk Copy Program (BCP). The old, plain-looking yet trusted command line tool of SQL Server. The result: A whopping 29 minute reduction. The whole thing ran in just 6 minutes!

Ugly is indeed beautiful.

Advertisements

Random SSIS Advice: Limit using data flow tasks that sort, union, join

When using a Data Flow Task in Integration Services, try to avoid data manipulations such as unions, sorts and joins using data flow tasks. Try to include all join, sort and union operations within your source queries itself. This way, you can do away with run-time issues that crop up when running your package with large amounts of data. Usually various memory restriction errors.

Continue reading Random SSIS Advice: Limit using data flow tasks that sort, union, join

Many Flat Files. One Connection Manager.

One Connection Manager. Many Flat Files

I came about this solution, completely by chance when developing my latest Integration Services (SSIS) project, and it has indeed made my life very easy. I had about 50 flat files, each with a different number of fields, and each of those files had to be populated from different tables. This meant that I had to create a Flat File Connection Manager per flat file. Then, imagine 50 flat file connection managers. Imagine configuring each of them when deploying to QA, Live and all those levels.

And, then I considered the MULTIFLATFILE Connection Manager. Here you can configure the connection manager to connect to multiple flat files. The issue here is that all the files needed to have the same metadata: i.e. the same number of columns and data types. Also, when you set it up, at design time you can only see the first file that you had selected. The rest of the files will expose themselves only during run-time. That was not going solve my problem, since each of my flat files were of different meta data. They were like tables in a database. And each “table” was different from the other. Table… Database…!!! That was when I remembered a connection string format for text files from some time ago.

Connection Manager for Flat Files 01 Connection Manager for Flat Files 02

So I built myself an OLE DB connection manager using the good ol’ Jet 4.0 OLE DB provider, the resulting connection string being:

Provider=Microsoft.Jet.OLEDB.4.0; Data Source=D:\FlatFiles\; Extended Properties="Text;HDR=Yes;FMT=Delimited"

Note that the data source points to a folder rather than a file – the folder that contains the CSV files. The end resulting being a Connection Manager that connects to a folder that is analogous to a database containing tables. The tables in this case, are the CSV files. This is a view of how the CSV files from the folder below is displayed in an OLE DB Source task:

A view of the CSV files in a folder A view of how each CSV is displayed as a table in a database

Unpivot Transformation with Multiple Destination Columns

The need to unpivot data during an ETL operation is quite common. One method to perform the unpivoting is writing the source query in such a way that the data starts off unpivoted using the UNPIVOT T-SQL statement in SQL Server 2005 and later. This however would obviously not work out on a source that does not support unpivoting such as a text file or Office Excel or even older versions of popular RDBMSs. The other method of course is to use the Unpivot Transformation in Integration Services.

The Unpivot Transformation is quite a simple component, especially if you need to unpivot a bunch of columns to a single destination column. Unpivoting to multple destination columns however, seems a tad challenging the first time round, at least for me it did. Until I figured out how to. This post is all about unpivoting a columns to multiple destination columns.

Consider this scenario. A simple Product table (which I shall call Product_Source this point forward):

Source Table (Product_Source)

Needs to be transformed into the following Product table (which I shall call Product_Destination this point forward):

Destination Table (Product_Destination)

Configuring the Unpivot transformation for this would be quite simple, and would look like this:

Unpivot Transformation for Single Destinations

The problem starts when you need the Product_Destination table looking like this, with multiple destination columns (Quantity and Price):

Destination Table (Product_Destination) with Multiple Destination Columns

You add the price columns as well (as in the image below) to the Unpivot transformation,

Unpivot Transformation for Multiple Destination Columns

and get the following validation error (encountered by a lot of people):

PivotKeyValue is not valid. In an UnPivot transform with more than one unpivoted DestinationColumn, the set of PivotKeyValues per destination must match exactly.

The reason is this. If you look closely at the Pivot Key Value column of the Unpivot transformation you would notice that for the Quantity destination column, the set of values are ‘Quantity1’, ‘Quantity2’ and ‘Quantity3’. Whereas, for the Price destination column, the set of values are ‘Price1’, ‘Price2’ and ‘Price3’, which are clearly not matching. The solution to the problem here is quite trivial. Change the default values of the Pivot Key Value column of the offending destination column to match the values of the other destination columns. Like this:

Mapping for Unpivot Transformation for Multiple Destination Columns

And you’re good to go!

I’ve attached a sample solution of this scenario, so that you too could try it out first hand.

Populating an Intermediate Fact Table

This is but one scenario where and how we use an intermediate fact table. Intermediate fact tables are used in cases where you have a many-to-many relationship between two dimensions (also known as Many-to-many dimensions). To introduce a simple example:

Imagine a bank. Account holders. Bank accounts. A bank account can be held by multiple account holders (joint account), whereas an account holder may hold multiple accounts. Transactions are usually recorded based on accounts. However, when business users need to analyze the business data they would like to analyze from the Account holder point of view as well, hence the data model may have to be put up to look something like this:

IntermediateFactTable

Check out these links to know more about many-to-many dimensions; The Many to Many Relationship, Many-to-Many dimensions in Analysis Services or try binging it.

My scenario is this. The source system has Groups and Users. Both of them have been identified as dimensions (DimGroup and DimUser). ETLs already created and tested. Since a user can exist in multiple groups and also obviously a group can have multiple users in them, makes this an ideal many-to-many scenario. Something simple and noteworthy in this scenario is that historical tracking is not necessary. Which means I need not store the history of a user belonging to a particular group 3 months ago and now does not.

This is how the data population was done (image contains detailed steps):

PopulatingIntermediateTable

Reasoning: The data to be stored in the target is made up of only two columns that are composite (GroupID and UserID), hence using the slowly changing dimension would not do. The records that come in from the source after getting looked up for the appropriate keys in their respective dimension tables (i.e. DimGroup and DimUser) either shall be inserted if not matched with the destination (i.e. DimGroupDimUser) or be sent on their way (matching records). These matching records are right-outer-joined against records from the destination (i.e. DimGroupDimUser) in order to be deleted (where matching records’ UserID == NULL and matching records’ GroupID == NULL). There are no updates that can happen here. If an update happens at the source; a record will be deleted and another inserted at the destination (This is more or less why the Slowly Changing Dimension cannot be used in this scenario).

Of course, I feel that this could be done in other (maybe even better ways). I would love to hear your comments on this.