Data Warehousing

Comparing Results using T-SQL

Posted on June 25, 2012November 18, 2014 by Gogula G. Aryalingam

Usually, in data migration or ETL projects there is a need for comparing source data with that of the destination once the data is loaded. The challenge is in comparing each field from the source with its counterpart on the destination, for each record – and you can have a whole lot of fields.

The query that you would write to do the comparison, would be a large tangle of comparison operators – even for tables of a few columns, increasing in complexity as the number of columns grow.

A simpler solution would be to first convert the source and the destination into tables of just three fields: The Key field, the Field field and the Value field. (In case you have a multi-column key, you would obviously have more than three fields – but the principle is the same)

Consider the following source:

Table 1: Source
Seq_code	Product_Name	Color	Unit_Price	Category_Name	Weight
C01	Crystal parrot	None	130.00	ORNAMENT	.75
C02	Silver pirate	Silver	35.00	ORNAMENT	1
C03	Pearl necklace	White	220.00	JEWELRY	.35
C05	Ceramic dog	Blue	15.00	ORNAMENT	.35

and then consider the following destination (assuming that data is already loaded through some sort of ETL process):

Table 2: Destination
Code	Name	Color	Price	Category	Weight
C01	Crystal parrot	None	130.00	Ornament	75
C02	Silver pirate	Silver	35.00	Ornament	1
C03	Pearl necklace	White	220.00	Jewelry	35
C05	Ceramic dog	Blue	15.00	Ornament	35

The Need

You need to compare these two, in order to pull out any discrepancies.

The Plan

You first mold each of these (i.e. the source and destination) to look like this:

Table 3: Pivoted Destination
Key	Field	Value
C01	Name	Crystal parrot
C01	Color	None
C01	Price	130.00
C01	Category	Ornament
C01	Weight	75
C02	Name	Silver pirate
…	…	…
C05	Weight	35

Here, you have the Key field (Code) of each record repeated for each field of that record along with the name of the field and the value of the field.

Before you do the molding you need to ensure that the field names in the original tables are the same. I.e. Seq_code in the source and Code in the destination should be changed to have the same name: Code, maybe.

Hence, the molded source should look like this:

Table 4: Pivoted Source
Key	Field	Value
C01	Name	Crystal parrot
C01	Color	None
C01	Price	130.00
C01	Category	ORNAMENT
C01	Weight	.75
C02	Name	Silver pirate
…	…	…
C05	Weight	.35

You now do the comparison using INNER JOIN on the Key + Field columns, and a WHERE to filter out the values that do not match. The molding can be done using UNPIVOT, so that there is no need to create any intermediate tables.

The Code

WITH
-- Pivot the source data
SourcePivot
AS
(
	SELECT	[Code],
			[Field],
			[Value]
	FROM	(
				SELECT	[Code] =		[Seq_Code],
						[Name] =		[Product_Name],
						[Color],
						[Price] =		CAST([Unit_Price] AS VARCHAR(20)),	-- Data type conversion since all
						[Category] =	[Category_Name],					-- fields have to be of the same
						[Weight] =		CAST([Weight] AS VARCHAR(20))		-- data type.
				FROM	Source
			) AS Source
			UNPIVOT
			(
				[Value] FOR [Field] IN ([Name], [Color], [Price], [Category], [Weight])
			) AS UnPvt
),

-- Pivot the destination data
DestinationPivot
AS
(
	SELECT	[Code],
			[Field],
			[Value]
	FROM	(
				SELECT	[Code],
						[Name],
						[Color],
						[Price] =	CAST([Price] AS VARCHAR(20)),
						[Category],
						[Weight] =	CAST(CAST([Weight] AS DECIMAL(8, 2)) AS VARCHAR(20))
				FROM	Destination
			) AS Destination
			UNPIVOT
			(
				[Value] FOR [Field] IN ([Name], [Color], [Price], [Category], [Weight])
			) AS UnPvt
)

-- Query for discrepancies
SELECT	SP.[Code],
		SP.[Field],
		SP.[Value] AS [SourceValue],
		DP.[Value] AS [DestinationValue]
FROM	SourcePivot SP
		INNER JOIN DestinationPivot DP
			ON SP.[Code] = DP.[Code]
			AND SP.[Field] = DP.[Field]
WHERE	SP.[Value] &amp;lt;&amp;gt; DP.[Value] COLLATE SQL_Latin1_General_CP1_CS_AS -- Use a case-sensitive collation on case insensitive
																	  -- tables/datbase if case difference is a concern.

The Result

Table 5: Comparison Result
Code	Field	SourceValue	DestinationValue
C01	Category	ORNAMENT	Ornament
C01	Weight	0.75	75
C02	Category	ORNAMENT	Ornament
C03	Category	JEWELRY	Jewelry
C03	Weight	0.35	35
C05	Category	ORNAMENT	Ornament
C05	Weight	0.35	35

<<Download sample code>>

To Key or Not To Key – That is the Question

Posted on June 6, 2011June 8, 2011 by Gogula G. Aryalingam

Forgive the cliché up there, but I could not resist it!

In data warehousing (and data marting), we usually use the popular dimensional model (or the star schema) to design the data warehouse (or data mart). And it is a common conclusion that your fact tables are going to be huge. And would be updated on a scheduled basis. Could be monthly, could be daily and from recent times, even hourly or sometimes (God forbid) on the minute!

Now, there have been arguments in the past as to whether we actually need to create primary key constraints, and especially if foreign key constraints need to be created between the fact table and the related dimension tables. Do we need to create indexes on the different columns.

Some may be shocked by such a question. I mean is this something that need to be doubted? Of course you need keys. Where else some others may be quite liberal and say, No! Of course you don’t need these keys and indexes. There are valid points to counter-argue each approach. There can also be a middle path.

In order to fuel this fire I have a question going on in the Beyond Relational BI Quiz 2011. This is a great place where you can prove your point of what you think. Points will be rewarded based on how well your arguments are presented. And then you get a chance to win an Apple iPad if your aggregate of the whole quiz is at the top.

The question goes as follows:

You are designing a data mart that results in a star schema on an SQL Server 2008 R2 server. A fact table with 7 dimensions, initially containing records in the region of 2,000,000. You expect the fact table to grow increasing by around 50,000 each month. Data loads happen on a weekly basis, and sometimes on an ad-hoc basis twice or thrice a month. Would you or would you not design primary key constraints, foreign key constrains and indexes for the data mart? If so how and why would you proceed, or why would you not? Discuss.

Like I said, points would be awarded based on how well you present your arguments. Click Here to Submit an Answer. Good luck!

Now Everyone Can BI

Posted on April 8, 2011October 4, 2011 by Gogula G. Aryalingam

If you had flown AirAsia, you would probably know where I stole the title to this post from. Microsoft did the same thing too. Heh heh! No, not the stealing bit, but providing for (almost) every small organization to use BI – Something which only the big bucks could afford.

Continue reading Now Everyone Can BI