So I got this wonderful opportunity to present at Dev Day 2014, here in Colombo, a couple weeks ago, and since my craze with Power BI was still at an all time high, I submitted to talk about self-service business intelligence. Despite the concept being around in the mainstream for at least a couple of years, it was decided that it was a fantastic topic to be presented on, and I was given the go ahead. A week later, I got another opportunity to present the same, albeit in a constricted version, as part of a series of lightning talks at the first ever SLASSCOM TechTalks session, titled “Smart Data Engineering”.
This post is a follow-up to these presentations, where I try to explain my demo that I performed on stage.
The two events mostly catered to developers and other techies. These are the type of folk who easily get bored with only slides to look at. So I needed to build a demo that was not only simple, but non-business-intrinsic and Sri Lanka-specific so that I could give more perspective, and the audience would understand it better.
So, I decided to do what any patriotic presenter would:
Googled Bing’d Googled for “Sri Lanka Government Data” because I was looking for some statistics from government departments, and also since the truth would not come from anywhere else. Right? The first link directed me to the Department of Census and Statistics – Wonderful!. I randomly click on Agriculture and what do I see at the top? Slaughter Statistics by District… Being the meat eater that I am, I decided that this could turn into an interesting, yet simple demo.
Self-service BI allows business users to mash-up data from different sources all by themselves, including corporate data that is periodically collected and processed for analytics purposes (data marts, cubes), corporate data that is lying around in various systems and previously not considered for analytics, data that comes off various devices deployed by the organization that exist in unstructured forms, and data from the public domain (tweets, exchange rates, stock market). The users can then perform data discovery using visual elements such as interactive graphs to get insight into their business.
My demo however, only focused on public data, since the main focus was to showcase to the audience how easy it is to pull in data from the Internet and make some sense out of it. Looking at the data source (which is just a web page), we see three tables; one each for cattle, goats/sheep and swine (pigs).
Each table has the district and the year with the intersection of these containing the number of souls that had departed to meet their maker. These tables are in a pivoted structure – good for reporting, but not good enough to create an analytic model. What I want it to look like, is something like this:
I also want add a new column to indicate the type of animal as well, because the source has three separate tables for each, and I want them all in one easy to use table; something like this:
Let’s get this started…
Deciding on a tool, I went with the tool of my choice, Excel 2013 and its free add-ins (part of Power BI), since a whole lot of us out there use Excel on a very regular basis. Other tools include Tableau and Qlikview.
Step 1: Extracting, Cleaning, Transforming and Storing the data
Data in a business intelligence solution is required to be in a clean and transformed state, so that it suits analysis/analytics. However, most source systems in the organization are structured differently, while some are not structured at all. Hence, we need this step called ETL (Extract, Transform and Load) that extracts source data, cleans it up, transforms it into an analysis-friendly structure before storing it in the BI system.
When doing Self-Service BI, the ETL step is usually performed by the business user themselves. Excel 2013 provides an add-in called Power Query to perform self-service ETL within the Power BI space. Power Query can connect to a variety of data sources including Hadoop clusters and Facebook. It allows for easy transformation of data, with a very small learning curve (or in most cases, no curve at all). Once the cleaning and transformation is done, the data needs to be loaded into a model. That’s where Power Pivot comes in. Power Pivot is an add-in that is used to model the data and store it along with the workbook. It is able hold millions and millions of records and manipulate them very fast.
So the first part of the demo is going to be that: Pull out the data, clean, transform and load it into a model. This video shows how the three tables of slaughter data is pulled in one-by-one, cleaned up and built into a query each. Finally all three queries are merged together to form one data set before being loaded into the data model. Notice the plethora of clean-up and transformation options that are available in Power Query.
In part 2, I will write about how to visualize and analyze this data. For now, if you are interested in trying out these stuff, Power Query is available here, and watch out for updated versions that are published on a regular basis. It is free and works only with Excel 2013. Power Query comes pre-built into Excel 2013, and can be enabled by going to File > Options > Add-ins > COM Add-ins and selecting Microsoft Office PowerPivot for Excel 2013 in Excel.