Speeding up ETL data cleansing

While I was at the PASS Summit this year, I was sitting in Paul Turleys’s (b|t) session on creating integrated BI solutions. Paul was talking about how ETL takes up most of a BI solution’s development lifetime, regardless of how you planned for it. That’s when someone from the audience asked the question: “…methodology or tool in the MS stack to shorten the time taken for ETL?”. Paul’s answer was: “There’s no silver bullet, But I’m going to say it’s Power Query”. This got me thinking, since for a few weeks, my team had been struggling to get data cleansed on the project that they were working on.

Despite having a data profiling document, which documented how source data mapped to the data model, along with the rules that needed to be applied to these data, such as nullability, constraints, valid domain values and so forth, bad data continued to come in. Apart from bad data, there were scenarios that were not identified nor thought of before that were coming in with the data. Though we do usually expect some amount of bad data or anomalies, which can be reported on, this was pretty bad. We needed to have a data set that would not fail the ETL so that it can be at least tested for all business logic that we had identified as requirements.

The issue was that the team had to build the Integration Services packages first, before running it to finally find out the various data problems that kept spitting out. So, it took a while for the developer to identify the problem, realize the data issue, report it to the client, have it fixed by the client, (sometimes do a fix on the package) and repeat. There were times when I had to jump in to help figure out what the data problems were. But because getting onto a tool (specially non-Microsoft), connecting to the source, and then writing SQL queries to retrieve and check the data was quite a slow process, I preferred using Power Query.

Power BI is a self-service tool; it can connect to many different types of data source, the best thing about it is the ease of pulling in data, building joins, filtering, sorting and perform all the data manipulations that you want do is made super easy. And you could just save it all in one just file, and use it the next time you want it. So I get it (or at least formed an impression in my head), and related to it when Paul said “Power Query” as the tool to speed up the life time of ETL development.

Explaining how you could go about it is another post for the future, but I am presenting it at the Sri Lankan Data Community Meetup on Nov 29th in Colombo,if you are interested 😉

Advertisements

Power BI: Another Player in the Game

First up, Power BI for Office 365

Power BI for Office 365

An year ago till a couple of months ago, when we spoke of Power BI, it meant just one thing: Power BI for Office 365. Power BI for Office 365 is portrayed as a self-service business intelligence (or data analytics) tool on the cloud. The product is basically;

  • Excel 2013 with a bunch of add-ons for report authoring,
  • and special sites called Power BI sites on Office 365 for sharing and collaboration of the reports created using Excel, plus a few more goodies

Going further, Excel 2013 is the essential component of Power BI for Office 365. It is where all the data extraction, clean-up, consolidation and modeling happens (Power Query and Power Pivot). It is also where the reports and analytics is done (Power View and Power Map). The best thing about this is that all these add-on that I have mentioned are free, and can be used even without purchasing Power BI for Office 365 (PowerMap requires that you have an Office 365 ProPlus subscription). Then, if you need to share it with others, you to put the workbook on a Power BI site. This enables others in your team to see the data and collaborate on it as well. Apart from this, Power BI for Office 365 allows

  • analysis using natural language questions (in English) with Q&A,
  • creating and managing a repository of queries and data sources (including on-premise), on the cloud,
  • and access data from the Power BI Windows Store App so that information is available on the go on Windows tables

The pricing for Power BI for Office 365 currently looks like this, and has more than what I have described here. For more information I recommend reading this, the Power BI for Office 365 service description and FAQs.

And now, there is a new player in the game. And it’s from Microsoft itself…

Continue reading Power BI: Another Player in the Game

Build your Own Date Dimension for Power BI – Part 2

In my previous post I talked about and demoed how to get connected to a date data set on the Azure Marketplace. The purpose was to access date information to build a date dimension for a Power BI solution. In this post I will focus on how one would pull in data from the said data set through Power Query, massage it to suit my requirements and finally model it in Power Pivot for consumption.

To do this, let us make use of a simple scenario where I want a date dimension with Years, Months and Dates that go from Year 2000 to 2015. There are two major steps that I will perform:

  1. Extract, Transform and Load the date data to Power Pivot
  2. Structure the loaded data table into a Date dimension

Continue reading Build your Own Date Dimension for Power BI – Part 2

Build your Own Date Dimension for Power BI – Part 1

The most common dimension used in a BI solution is the Date dimension. This is obvious since almost any type of analysis involves looking at data from some aspect of date; year, month, or even hours and minutes in some cases. In traditional BI scenarios, a Date dimension table is always created as a practice, as is an ETL to populate it. The usual case is a stored procedure which generates a range of dates between two given dates, and scheduled to run on a regular basis: yearly, quarterly etc.

In the case of a Power BI solution, one would not have the luxury of a date-generator stored procedure. Instead we could go for a purpose-created Excel file with all necessary fields with a range of dates. The issue with this would be that if you plan to use it across various disparate solutions, the Excel would have to be quite comprehensive with a large range of dates to support various scenarios. Locality will also have to be taken into consideration, such as a British calendar or a US one for example.

Continue reading Build your Own Date Dimension for Power BI – Part 1

Self-Service Business Intelligence Presentations

For those who are interested, here are the two presentations that I presented at Dev Day 2014 and SLASSCOM TechTalks: Smart Data Engineering on the 17th and 26th of November, respectively.

Serve Yourself: Self-Service Business Intelligence

(Dev Day 2014)

Self-Service Business Intelligence

(SLASSCOM TechTalks: Smart Data Engineering. Lightning Talk)