This blog post is the third and last in an ongoing series of blog posts on “Anybody can do data science – with SAP Analytics Cloud”.
My statistics professor once told me, that if you torture data long enough it will confess to anything. In this blog post, I will attempt to bring “science” into this interrogation room and explain these confessions from the perspective of data science. Our tool of choice for the interrogation will be SAP Analytics Cloud’s Smart Predict. My hope is to rouse your curiosity and inspire you to start using Smart Predict to answer your own questions
Part 3- Will I enjoy this wine?
In the first blog post of this series, we laid the ground work for solving our problem. We understood the data, loaded it into SAC and trained our very first model. In the second part, we understood the results and how we can use it to get predictions. In this last part, we return to the capability that is home base for SAP Analytics Cloud – which is visualising the data.
The previous 2 parts set us rolling with our basic regression model and predictions on wine quality. We could tell from the model’s influencer contributions which features had intrinsic variations that tied them closely to our predicted outcome, wine quality. Let’s now take a look at these influencers a little more closely and understand the variations that helped the model. We are hoping our adventures with data exploration will help us find a new feature that we did not provide to the model earlier, and may potentially improve our model.
So put on your Sherlock hat, grab your magnifying glass, ’cause we’re going sleuthing!
Step 5: Modify the dataset slightly
- Under the folder we have been saving our datasets in so far, you should find the original wine dataset.
- Click on winequality-white to edit it. On the extreme right, you will find the list of columns identified either as measures (numbers which you intend to aggregate) and dimensions (categories which you intend to group the numbers by). Notice that citric acid has incorrectly been identified as a dimension. We would also like quality to be a category, so we may analyse behaviour within it. We will flip the status of these two columns.
- Click on quality and change this to dimension.
- Click on citric acid and change it to a measure.
- Lets add a counter to the file, as a key does not exist. Click on the Custom Expression Editor icon.
- Write the definition for the counter, which will be a simple constant value of 1 and click on the tick mark to create the column.
- Save your file to complete the work. Remember windows shortcuts like Ctrl S (or Cmd S in Mac) work.
Step 6: Visualise the original data
Now if you remember, for our white wine quality prediction model, the key attributes contributing to a reliable prediction were density, residual sugar, volatile acidity and alcohol. Let’s see how these attributes vary as wine quality changes.
- Click on the plus sign at the top of your folder to create a new story.
- Choose a template to start with. Here I will select “Add a canvas page”, so I have a lot more freedom with design options. You might want to play with the options and settle for your own personal favourite.
- Select the chart option to make a bar chart of quality vs average density.
- Select the dataset for white wine, as these are the attributes we want to explore.
- We want to use the average Density in the chart against quality. In this case, average Density is the measure and Quality is the dimension. If we select Density from the default list, it would end up summing up densities within a quality bucket and that’s not what we want. So let’s define the average measure in a calculation.
- Select the “Create calculation” option.
- Under Type, select the Calculated Measure option, give the measure a name and define it as displayed below.
- Now select Quality as the dimension.
- Now, if we see the chart, we don’t notice a trend as the numbers are very close to each other. I will edit the formula slightly in order to tease out the trend.
- Doesn’t look much different does it? We can actually not spot the trend due to the scale of the y axis. Adjust the min and max of the y axis and see the effect. Click on the 3 tiny dots and edit axis.
- I set the min and max value for the y axis. Hit apply.
- Voila! There is a trend after all.
- Change the chart type of line if you will to see the dramatic dip in density in higher quality buckets.
- This would indeed have been a hard to identify trend, if we were not working backwards from the machine’s influencer variables. As average density increase, the wine quality appears to reduce. The shift in density is slight, but the trend is unmistakable and it is no surprise that the model picked this up. In practice, any feature that demonstrates a strong variability with the outcome should be fed into the predictive scenario model.
- For this toy example, I dipped into the model before playing with the data. But typically, the Exploratory data analysis (or EDA) is done before the modelling. We typically use the EDA phase to identify features or attributes that we think could be most useful for the model. We also create new features from a calculation of existing features and see if a correlation with the outcome can be observed. Accordingly we may add these features to our earlier model.
- I repeat the steps above to define a measures for Avg Sugar & Avg Alcohol (that I know helped the model) and observe their behaviour against Wine Quality. I also rename the sheet to Wine Playbook.
- I also discover, based on some reading up that Molecular Sulphur Dioxide may have a role to play in wine quality. The formula for which as referenced from this website is free sulphur dioxide / (1+ 10 ^ (pH – 1.81)). I use this formula to calculate this feature in the dataset.
- And use it in my front end to plot a new chart. Indeed there seems to be some recognisable trend. I can go back to my model and use this feature and see if it improves the model quality.
- I also add in a few KPI charts on the side.
- I jazz up the screen a bit and save my work.
- I want to see if my newly discovered feature helps my model, so I return to the predictive scenario and build a new model with Molecular Sulphur dioxide. I discover much to my delight, that we have narrowed our RMSE down a little more! The confidence at has shrunk ever so slightly. This is a measure of how robust our model is, so at 99% confidence we’re still doing pretty well. Effectively, I can now expect the quality score to be ±0.78 from the truth (as compared to ±0.80 earlier) with 99% confidence.
Insights and takeaways
- The data exploration in this last part of the blog series, has lead me to discover a feature for Molecular Sulphur dioxide, that is not present naturally in the data, but appears to have some correlation with quality score. I go back to my earlier model, use this new feature and discover that it does indeed improve model quality. This is called feature engineering which is an invariably iterative process. Each time a potential new feature is identified, we go back to the predictive scenario, create a new model and compare its performance with our previous models.
- Ultimately the best way to identify influencers for your model is visualise the data and identify patterns that you can feed to the machine and this is why a visualisation software will always be handy tool in the back pocket of the data scientist.
- In this case, we carved out a very small chunk for our predictions, but in practice we will have a more fleshed out dataset for prediction. After all, the predictions are why we are going through the whole process at all. The same steps illustrated here can be used to visualise the predictions and realise the vision of augmented analytics.
You can apply the learnings here, to any other problem statement. Define a problem, pick a dataset, load it into SAC, define the model parameters, train and apply the model to a new data set. In fact, the next time you are wondering which wine to buy, and are not sure if the seller’s wine rating is reliable, you know you now have a bespoke interrogation room tucked away in the cloud! Create an excel with attributes of the wine. Open up your SAC model and apply it to this new data set, and make an informed decision!
See? Really, anybody can do data science!!
Learn more about SAC
- Want to play around with SAP Analytics Cloud yourself? Download a trial version here.
- Locate the “?” symbol on the top of your SAC window to access the help documents, including guided playlists, knowledge base, browse community, etc.
Learn more about this problem
- Paper reference: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553, 2009. Available at: [Web Link]
- Data: The UC Irvine Machine Learning Repository here.