To Key or Not To Key – That is the Question

Forgive the cliché up there, but I could not resist it!

In data warehousing (and data marting), we usually use the popular dimensional model (or star schema) to design the data warehouse (or data mart). And it is a common conclusion that your fact tables are going to be huge, and that they would be updated on a scheduled basis. Could be monthly, could be daily, and in recent times even hourly or sometimes (God forbid) by the minute!

Now, there have been arguments in the past as to whether we actually need to create primary key constraints, and especially whether foreign key constraints need to be created between the fact table and the related dimension tables. Do we need to create indexes on the different columns?

Some may be shocked by such a question. I mean, is this something that needs to be doubted? Of course you need keys. Whereas others may be quite liberal and say, No! Of course you don’t need these keys and indexes. There are valid points to counter-argue each approach. There can also be a middle path.

In order to fuel this fire, I have a question going on in the Beyond Relational BI Quiz 2011. This is a great place where you can prove your point of view. Points will be awarded based on how well your arguments are presented. And then you get a chance to win an Apple iPad if your aggregate score for the whole quiz is at the top.

The question goes as follows:

You are designing a data mart that results in a star schema on a SQL Server 2008 R2 server. The fact table has 7 dimensions and initially contains records in the region of 2,000,000. You expect the fact table to grow by around 50,000 records each month. Data loads happen on a weekly basis, and sometimes on an ad-hoc basis twice or thrice a month. Would you or would you not design primary key constraints, foreign key constraints and indexes for the data mart? If so, how and why would you proceed, or why would you not? Discuss.
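To make the debate concrete, here is a minimal T-SQL sketch of one possible answer: surrogate keys on the dimensions, a composite clustered primary key on the fact table, and foreign key constraints back to the dimensions. All table and column names are my own assumptions (and only two of the seven dimensions are shown); whether you would keep, drop or merely disable any of these constraints and indexes during loads is exactly the point up for argument.

    -- One possible design (illustrative names only): fully keyed and constrained.
    CREATE TABLE dbo.DimDate
    (
        DateKey      int  NOT NULL PRIMARY KEY,   -- e.g. 20110131
        CalendarDate date NOT NULL
    );

    CREATE TABLE dbo.DimProduct
    (
        ProductKey  int IDENTITY (1, 1) NOT NULL PRIMARY KEY,
        ProductName nvarchar(100)       NOT NULL
    );

    CREATE TABLE dbo.FactSales
    (
        DateKey    int   NOT NULL,
        ProductKey int   NOT NULL,
        Quantity   int   NOT NULL,
        Amount     money NOT NULL,
        CONSTRAINT PK_FactSales PRIMARY KEY CLUSTERED (DateKey, ProductKey),
        CONSTRAINT FK_FactSales_DimDate    FOREIGN KEY (DateKey)    REFERENCES dbo.DimDate (DateKey),
        CONSTRAINT FK_FactSales_DimProduct FOREIGN KEY (ProductKey) REFERENCES dbo.DimProduct (ProductKey)
    );

    -- The opposing camp would create FactSales as a plain heap with no constraints at all,
    -- trusting the ETL to guarantee integrity. A middle path might keep the keys but
    -- disable the foreign key checks for the duration of a load:
    ALTER TABLE dbo.FactSales NOCHECK CONSTRAINT FK_FactSales_DimDate;
    -- ...load the data...
    ALTER TABLE dbo.FactSales WITH CHECK CHECK CONSTRAINT FK_FactSales_DimDate;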

Like I said, points will be awarded based on how well you present your arguments. Click Here to Submit an Answer. Good luck!

My Data Mart is Better than Your Data Warehouse

Everyone talks about data warehouses when it comes to business intelligence (BI), but seldom about data marts. They mostly ignore the term ‘data mart’ even though it is what they are actually working with. Even yours truly is guilty of that more often than not. Alright, so what’s so ding-dong special about a data mart? And what’s this fuss that I’m trying to pull?


The 5 Year Difference

Data Warehousing/Data Marting has nearly always required ETL. And most of the time it was the ETLing of data to create dimensions. And the one type of dimension I have worked with most is the type now known as the Slowly Changing Dimension (the term was not used much before the advent of SQL Server 2005, but still).

The ETL tool that I had been using, Data Transformation Services (DTS) in SQL Server 2000, was a decent enough tool for the purpose. I was quite content with it until along came SQL Server 2005, carrying with it DTS’s new form: SSIS (SQL Server Integration Services). SSIS gives you things which could only be dreamt of when working with DTS.

Let me now explain this magic through how I would go about ETLing data from the source to a destination using each of the two technologies: first DTS, and then SSIS.

The source is a SQL Server 2000 table of Products already arranged in a parent-child hierarchy, and looking like this:



The destination for the DTS process will be a SQL Server 2000 table laid out in a structure which would later make up a star schema, like this:


while the destination for the SSIS process will be a SQL Server 2005 table laid out just like the one above, like this:

while the data after the ETL in both cases should look like this:
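For readers who want something more concrete than the screenshots, here is a rough T-SQL sketch of what a parent-child Products source table and the flattened, star-schema-style destination table might look like. The table and column names here are assumptions on my part, purely for illustration:

    -- Assumed source (SQL Server 2000): products arranged in a parent-child hierarchy.
    CREATE TABLE dbo.Products
    (
        ProductID       int          NOT NULL PRIMARY KEY,
        ProductName     varchar(100) NOT NULL,
        ParentProductID int          NULL REFERENCES dbo.Products (ProductID)  -- NULL for top-level entries
    );

    -- Assumed destination: the hierarchy flattened into fixed columns,
    -- the shape a star-schema dimension table typically takes.
    CREATE TABLE dbo.DimProduct
    (
        ProductKey   int IDENTITY (1, 1) NOT NULL PRIMARY KEY,
        ProductID    int          NOT NULL,   -- business key from the source
        ProductName  varchar(100) NOT NULL,
        CategoryName varchar(100) NULL        -- the parent, flattened into its own column
    );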

DTS Steps (SQL Server 2000)

  • Create a new DTS package under Data Transformation Services in SQL Server Enterprise Manager.
  • Create two database connections.
    • One for the source
    • The other for the destination
  • Add a Data Driven Query Task.
  • Write a SQL statement to build a nice result set to transfer, under the Source tab.
  • Set the destination.
  • Create a Lookup so that we can check whether the record being transferred is a new record or an existing record.
  • Write an ActiveX script with logic to check whether the record exists in the destination (using the Lookup created earlier) and appropriately call the Insert Query or the Update Query. (Takes quite a long time.)
  • Write the Insert Query and the Update Query and set their properties (a rough sketch of what these two queries might look like follows this list).
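Purely as an assumption of what a dimension load like the Products one above might need, the two parameterised queries behind the Data Driven Query Task could look something like this (the ? placeholders get bound to columns of the incoming row):

    -- Hypothetical Insert Query for the Data Driven Query Task.
    INSERT INTO dbo.DimProduct (ProductID, ProductName, CategoryName)
    VALUES (?, ?, ?);

    -- Hypothetical Update Query, matched on the business key.
    UPDATE dbo.DimProduct
    SET    ProductName  = ?,
           CategoryName = ?
    WHERE  ProductID    = ?;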

SSIS Steps (SQL Server 2005)

  • Create a new Integration Services project in Business Intelligence Development Studio.
  • Create two connection managers for the source and destination.
  • Add a Data Flow Task.
  • Add a Data Flow Source to the task and set the source to it.
  • Add a Slowly Changing Dimension object. (The wizard opens up to unveil the magic!)
  • Write a SQL statement to build a nice result set to transfer.
  • Continue with the wizard and voila!!! No ActiveX scripts, no lookups to worry about, no Insert Queries nor Update Queries!!! – Your ETL is done! Basically the only thing you wrote was the source query (something like the sketch just after this list), which again can be built using the Query Builder.
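The source query itself is nothing fancy. Assuming the Products table sketched earlier, it might be as simple as a self-join that brings each product’s parent along as a flattened column:

    -- Hypothetical source query: flatten the parent-child Products table
    -- into the columns the dimension expects.
    SELECT  p.ProductID,
            p.ProductName,
            parent.ProductName AS CategoryName
    FROM    dbo.Products AS p
            LEFT JOIN dbo.Products AS parent
                ON parent.ProductID = p.ParentProductID;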

So now I can build my dimensions quickly, and without writing any code; just a series of clicks and drags!

To VIEW or not to VIEW

Imagine this scenario:
An Oracle 9i database with tens of thousands of transactional records, which accumulate during the day. This data was to be ETLed to a Microsoft SQL Server 2000 data mart (in reality, a database), and a dimensional data mart (a SQL Server 2000 Analysis Services database) would then be created on top of this relational data mart. All this should happen during the night and should be ready the following morning. That’s because, when the day dawns, the decision makers would arrive. And if they do not find properly updated cubes in the data mart, they would rather slice and dice my… um… donkey!

Now, to summarize what we have in the SQL Server 2000 database… We have dimension tables, transaction tables (which are used for staging purposes) and fact tables (from which the cubes will be created). The dimension tables are populated first, then the transaction tables, and then the fact tables (which get their data from the transaction tables).

Another hitch, apart from getting all the data into the cubes before dawn, was that some transactions (i.e. those from up to 2 months back) could change! Yes, you can well imagine my annoyance!

But as I had to follow orders, I came up with a plan: delete the transactional data from the current date back to 2 months before, on both the transaction and fact tables, and then perform the ETL from that point onward. This, of course, increased the number of records I had to ETL every night. But since the process was going ok and the decision makers were able to slice and dice their cubes every day, I was fine.
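In T-SQL terms, and with hypothetical table and column names, the nightly routine amounted to a sliding 2-month reload along these lines:

    -- Sketch of the nightly sliding-window reload (names are assumptions).
    DECLARE @WindowStart datetime;
    SET @WindowStart = DATEADD(month, -2, CONVERT(char(8), GETDATE(), 112));  -- midnight, 2 months back

    -- Clear the last 2 months from the fact and staging transaction tables...
    DELETE FROM dbo.FactTransactions   WHERE TransactionDate >= @WindowStart;
    DELETE FROM dbo.TransactionStaging WHERE TransactionDate >= @WindowStart;

    -- ...and the ETL then re-populates everything from @WindowStart onward.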

The Problem:
The problem started 2 months later, when the number of people performing transactions started to increase. In the end, the entire ETL was crawling on its belly for more than 12 hours!!! There were actually two things affecting this:

  1. The large number of transactions
  2. The under-configured server (I couldn’t take the blame alone, could I?)

Coming back to the large number of transactions, I discovered something staggering! The ETL from Oracle to the staging transaction tables was happening quite fast (a little more than 2 hours), but populating the fact tables was what was eating it all up! Imagine that! Transferring data from a couple of transaction tables to a fact table (both residing in the same database) was taking close to half a day!

Something worth noting is how the transaction-table-to-fact-table ETL was designed: stored procedures using cursors to look up around half a dozen dimensions, twice or thrice over.

The Solution:
The solution I implemented was to drop all the stored procedures, drop all the fact tables, and create views on top of the transaction tables that looked exactly like the fact tables. The next step was to fool the cubes into believing the views were actually the fact tables, which was quite simple. Then we changed the server to match the required configuration. And voila! Man and machine solved the issue. The process now takes 1 1/2 hours tops to finish.
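As a rough sketch (again with hypothetical names), such a view simply joins the staging transactions to the dimension tables and exposes the surrogate keys and measures in the shape the cube expects, replacing the row-by-row cursor lookups with a single set-based join:

    -- Hypothetical view standing in for the dropped fact table.
    -- The dimension lookups that the cursors did row by row become plain joins,
    -- resolved when the cube processes.
    CREATE VIEW dbo.FactTransactions
    AS
    SELECT  d.DateKey,
            p.ProductKey,
            c.CustomerKey,
            t.Quantity,
            t.Amount
    FROM    dbo.TransactionStaging AS t
            JOIN dbo.DimDate     AS d ON d.CalendarDate = t.TransactionDate  -- assuming day granularity
            JOIN dbo.DimProduct  AS p ON p.ProductID    = t.ProductID
            JOIN dbo.DimCustomer AS c ON c.CustomerID   = t.CustomerID;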

Conclusion:
Which of the solutions actually did the trick? I didn’t want to find out. I’d like to believe it was both!