Through some recent data science work, I have had the privilege of exploring many tools of the trade: RapidMiner, Python, TensorFlow and Azure Machine Learning, to name a few. The experience has been highly enriching, but I felt there was no Swiss Army knife that could handle the initial, and most critical, stage of a data science project: the hypothesis stage.

During this stage, scientists typically need to prep the data quickly, find the correlation patterns and establish hypotheses. It requires them to fail fast by discarding hypotheses that do not hold up, spotting spurious correlations and staying focussed on the right path. I recently explored Power BI for this purpose and would like to share my findings in this post.

Business Problem

Let us take the business case of a juice vendor, say Julie. Julie sells various kinds of juice and collects data about her business operations daily. Say we have the following data for the month of July. It is pretty much when, where, what and for how much.

[Figure: Julie's daily sales data for July]

Now, say I am a data scientist trying to help Julie increase her sales and give her some insight into what she should focus on to get the best bang for her buck. I have been tasked with building an estimation model for Julie based on linear regression.

Feature Engineering

I will start by analysing the correlations between the features and our target variable, i.e. Revenue. This begins with importing the data into Power BI and taking care of the following basics:

  1. Replace null values with the mean value of the feature
  2. Remove any duplicate rows
  3. Engineer some new features, as listed below
Feature: Day Type
DAX formula: Day Type = IF(WEEKDAY(Lemonade2016[Date],3) >= 5,"Weekend","Weekday")
Purpose: To distinguish between a weekday and a weekend day. I wanted to test the hypothesis that a weekend day might generate more sales than a weekday.

Feature: Total Items Sold
DAX formula: Lemon + Orange

Feature: Revenue
DAX formula: Total Items Sold * Price
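
For readers who want to reproduce the preparation steps above outside Power BI, here is a minimal Python sketch. It is only an illustration: the sample rows and the Temperature and Price values are made up, and the helper names (fill_nulls_with_mean, dedupe, add_features) are hypothetical. The column names Lemon, Orange, Price and Date are taken from the formulas above.

```python
from datetime import date
from statistics import mean

# Hypothetical sample rows; the values are invented for illustration only.
rows = [
    {"Date": date(2016, 7, 1), "Lemon": 97, "Orange": 67, "Temperature": 70.0, "Price": 0.25},
    {"Date": date(2016, 7, 2), "Lemon": 98, "Orange": 67, "Temperature": None, "Price": 0.25},
    {"Date": date(2016, 7, 2), "Lemon": 98, "Orange": 67, "Temperature": None, "Price": 0.25},
    {"Date": date(2016, 7, 3), "Lemon": 110, "Orange": 77, "Temperature": 74.0, "Price": 0.25},
]


def fill_nulls_with_mean(rows, column):
    """Step 1: replace missing values in a column with the column mean."""
    column_mean = mean(r[column] for r in rows if r[column] is not None)
    return [
        {**r, column: column_mean if r[column] is None else r[column]}
        for r in rows
    ]


def dedupe(rows):
    """Step 2: drop exact duplicate rows, keeping the first occurrence."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def add_features(r):
    """Step 3: add Day Type, Total Items Sold and Revenue."""
    # date.weekday() numbers Monday as 0 and Sunday as 6, which matches
    # WEEKDAY(date, 3) in DAX, so the same ">= 5" test marks the weekend.
    day_type = "Weekend" if r["Date"].weekday() >= 5 else "Weekday"
    total_items = r["Lemon"] + r["Orange"]
    return {
        **r,
        "Day Type": day_type,
        "Total Items Sold": total_items,
        "Revenue": total_items * r["Price"],
    }


prepared = [add_features(r) for r in dedupe(fill_nulls_with_mean(rows, "Temperature"))]
for r in prepared:
    print(r["Date"], r["Day Type"], r["Total Items Sold"], r["Revenue"])
```

The sketch follows the order of the list above (fill nulls, then dedupe); if duplicates are common, removing them first would keep them from skewing the mean.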

Data preparation and feature engineering were a breeze in Power BI, thanks to its extensive support for DAX, calculated columns and measures. The dataset now looks like this:

[Figure: the dataset after feature engineering]

Hypotheses Development

Once we had our dataset ready in Power BI, the next task was to analyse the patterns between Revenue and other features.

Hypothesis 1: There is a positive correlation between Temperature and Revenue.

Result: Passed

Hypothesis 2: There are more sales on a weekend day.

Result: Failed

I derived these results from the visualizations below, which I built quickly on the Power BI platform.

[Figure: Power BI visualizations for Hypotheses 1 and 2]
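
The two hypotheses can also be cross-checked numerically. The sketch below is not how the results above were derived (those came from the visuals); it simply shows one way to compute a Pearson correlation for Hypothesis 1 and a per-group average for Hypothesis 2. All values are made up, and the function names pearson and mean_by_group are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


def mean_by_group(values, groups):
    """Average of values for each distinct group label."""
    buckets = {}
    for value, group in zip(values, groups):
        buckets.setdefault(group, []).append(value)
    return {group: mean(members) for group, members in buckets.items()}


# Hypothetical observations, invented for illustration only.
temperature = [70.0, 72.0, 75.0, 80.0, 85.0, 90.0]
revenue = [40.0, 42.0, 47.0, 55.0, 60.0, 70.0]
day_type = ["Weekday", "Weekend", "Weekend", "Weekday", "Weekday", "Weekday"]

# Hypothesis 1: a clearly positive r supports "hotter days, more revenue".
r = pearson(temperature, revenue)
print(f"Temperature vs Revenue: r = {r:.3f}")

# Hypothesis 2: compare average revenue on weekdays and weekend days.
averages = mean_by_group(revenue, day_type)
print(averages)
```

With only a month of data, a difference in group averages is a hint rather than proof, so it is worth pairing a check like this with the visual inspection.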

Next, off to some advanced hypothesis development. Shall we?

I needed to understand the relationship between the leaflets handed out on a particular day and Revenue. Time to pull in some heavy plumbing, so I decided to tow R into the mix. Power BI comes with (almost!) built-in support for R, and I was able to quickly produce a co plot using just 6-8 lines of R in the R script editor of Power BI.

[Figure: co plot script in the Power BI R script editor]

An interesting insight was how the correlation differs depending on the day. This was made possible by the Power BI slicer, as shown below:

[Figure: Power BI slicer]

Power BI + R = Advanced Insights

If you need to analyse the dynamics between various features and how these dynamics impact your target variable, i.e. Revenue, you can easily model that in Power BI. Below is a dynamic co plot that shows the incremental relationship between Leaflets, Revenue and Temperature.

The six panels at the bottom should be read in conjunction with the six steps in the top box. The bottom-left panel corresponds to the first step of Leaflets and the top-right panel to the last. Basically, it shows how the correlation between Temperature and Revenue changes from one Leaflets bin to the next.

[Figure: co plot of Temperature, Revenue and Leaflets in Power BI]
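
The idea behind the panels can be sketched in a few lines of Python: sort the days by Leaflets, split them into six bins, and compute the Temperature-Revenue correlation inside each bin. This is a simplified stand-in, not the R script used above: it uses non-overlapping, equal-count bins (R's co plot uses overlapping intervals by default), the data is synthetic, and the names pearson and correlation_by_bin are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns None when either side has no variation."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread) if spread > 0 else None


def correlation_by_bin(records, n_bins):
    """Split records into n_bins equal-count Leaflets bins (lowest first)
    and return ((min_leaflets, max_leaflets), correlation) for each bin."""
    ordered = sorted(records, key=lambda rec: rec["Leaflets"])
    count = len(ordered)
    results = []
    for i in range(n_bins):
        chunk = ordered[i * count // n_bins:(i + 1) * count // n_bins]
        bounds = (chunk[0]["Leaflets"], chunk[-1]["Leaflets"])
        r = pearson(
            [rec["Temperature"] for rec in chunk],
            [rec["Revenue"] for rec in chunk],
        )
        results.append((bounds, r))
    return results


# Synthetic records built from an arbitrary deterministic rule,
# purely so that the sketch has something to run on.
records = []
for i in range(24):
    leaflets = 50 + 5 * i
    temperature = 60 + (7 * i) % 24
    revenue = 0.5 * temperature + 0.1 * leaflets + (5 * i) % 3
    records.append({"Leaflets": leaflets, "Temperature": temperature, "Revenue": revenue})

panels = correlation_by_bin(records, n_bins=6)
for (low, high), r in panels:
    print(f"Leaflets {low}-{high}: r = {r:.3f}")
```

Reading the printed rows from first to last corresponds to reading the panels from bottom left to top right.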

I ended my experiment by building a basic regression model that predicts your Revenue when you enter Temperature, Price and Leaflets. Below is the code for the model, in case you are keen.

[Figure: regression model code in the Power BI R script editor]
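
The model itself is only available as the screenshot above, so here is a rough Python counterpart rather than a transcription of that R script. It fits an ordinary least squares model with three predictors by solving the normal equations. The observations are made up, the "true" rule used to generate Revenue is an arbitrary assumption chosen so the fit can be checked, and the names fit_linear_regression and predict_revenue are hypothetical.

```python
def fit_linear_regression(features, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.
    Returns [intercept, coefficient_1, coefficient_2, ...]."""
    design = [[1.0, *row] for row in features]  # prepend the intercept column
    k = len(design[0])
    # Augmented matrix [X'X | X'y].
    system = [
        [sum(x[i] * x[j] for x in design) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(design, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(system[row][col]))
        if abs(system[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; the model cannot be fitted")
        system[col], system[pivot] = system[pivot], system[col]
        for row in range(k):
            if row != col:
                factor = system[row][col] / system[col][col]
                system[row] = [a - factor * b for a, b in zip(system[row], system[col])]
    return [system[i][k] / system[i][i] for i in range(k)]


def predict_revenue(coefficients, temperature, price, leaflets):
    """Apply the fitted coefficients to one new observation."""
    intercept, b_temperature, b_price, b_leaflets = coefficients
    return intercept + b_temperature * temperature + b_price * price + b_leaflets * leaflets


# Hypothetical (Temperature, Price, Leaflets) observations.
observations = [
    (70.0, 0.30, 90.0), (75.0, 0.30, 100.0), (80.0, 0.25, 95.0),
    (85.0, 0.25, 120.0), (90.0, 0.35, 110.0), (95.0, 0.35, 130.0),
    (100.0, 0.25, 125.0),
]
# Revenue generated from an assumed rule so the fit has a known answer.
revenue = [5.0 + 0.5 * t - 20.0 * p + 0.1 * l for t, p, l in observations]

coefficients = fit_linear_regression(observations, revenue)
estimate = predict_revenue(coefficients, temperature=88.0, price=0.30, leaflets=115.0)
print([round(c, 4) for c in coefficients], round(estimate, 2))
```

Because the sample Revenue is generated without noise, the fitted coefficients should land very close to the assumed rule; real data would of course leave residuals to inspect.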

Power BI is a very simple and powerful tool for the exploratory data scientist in you. Give it a go.