Imagine this scenario:
An Oracle 9i database with tens of thousands of transactional data, which accumulates during the day. This data was to be ETLed to a Microsoft SQL Server 2000 data mart (in reality, a database), and a dimensional data mart (SQL Server 2000 Analysis Services database) will be then created on top of this relational data mart. All this should happen during the night and should be ready the following morning. That’s because; when the day dawns, the decision makers would arrive. And when they do not find the properly updated cubes in the data mart, they would rather slice and dice my… um… donkey!
Now to summarize what we have in the SQL Server 2000 database… We have dimension tables, transaction tables (which are being used for staging purposes) and fact tables (from which the cubes will be created). The dimension tables will be populated first, then the transaction tables and then the fact tables (which would get the data from the transactional data)
Another hitch apart from all the data coming into the cubes before dawn, was that some transactions (i.e. the ones upto the previous 2 months) could change! Yes, you could well imagine my annoyance!
But as I had to follow orders, I came up with a plan to delete transactional data from the current date to upto 2 months before on the transaction and fact tables and then perform the ETL from that point onward. But this would increase the amount of records which I had to ETL every night. But since the process was going ok and the decision makers were able to slice and dice their cubes everyday, I was fine.
The problem started 2 months later, when the number of people who perform transactions started to increase. In the end the entire ETL was crawling on its belly for more than 12 hours!!! There were actually two things affecting this:
- The large number of transactions
- The under-configured server (I couldn’t take the blame alone, could I?)
Coming back to the large number of transactions, I discovered something staggering! The ETL from Oracle to the staging transaction tables was happening quite fast (a little more than 2 hours), but the populating of data onto the fact tables was what was eating it all up! Imagine That! Transferring data from a couple of transactional tables to a fact table (both residing on the same database) taking close to half a day!
Something worth noting is how the transaction table to fact table ETL was designed: Stored Procedures using cursors to lookup around half a dozen dimensions, twice or thrice over.
The solution that I implemented was: dropped all the stored procedures, dropped all the fact tables, created views on top of the transaction tables, and made them look exactly like the fact tables. The next step was to fool the cubes into believing the views were actually the fact tables, which was quite simple. Then we changed the server, to match the required configuration. And voila! Man and machine solved the issue. The process now takes 1 1/2 hours tops to finish.
Which of the solutions actually did the trick? I didn’t want to find out. I’d like to believe it was both!