Friday, September 30, 2011

What to do when the data doesn’t fit the analytical question?

A smart response to this question could be: well, either we get new data, or we get a new question!
Let’s imagine our task is to find what makes members of the same group similar, for example home loan customers. Now imagine a situation where we have data ONLY for the home loan customers.
We can certainly examine all their characteristics, but there is no guarantee that those characteristics differ from those of purchasers of other banking products. What we need is a point of reference: additional data on customers who hold any product other than a home loan. In order to find out what is similar about home loan customers, we need to figure out what is different between them and everyone else, which is pretty much one and the same thing.
This is invariably a classification problem, except that we would be trying to solve it with a unary target variable (one where all purchasers have the same value for the product purchased). A classifier has nothing to learn from a target that never varies. Since we don’t have, and are not able to get, additional data for customers who hold other types of products, we need to go for the second-best scenario. Instead of “reformulating” the data through artful and creative data preparation to better fit the analytical question, we have no other option but to do the exact opposite: reformulate the analytical question to fit the data at hand.
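To make the unary-target problem concrete, here is a minimal sketch using information gain, the measure a decision tree uses to pick its splits. The data, the labels and the `high_income` attribute are all invented for illustration; the point is only that when every customer carries the same label, no attribute can show any gain.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum((n / total) * log2(total / n) for n in counts.values())


def information_gain(labels, split_mask):
    """Entropy reduction obtained by splitting labels on a boolean attribute."""
    left = [y for y, m in zip(labels, split_mask) if m]
    right = [y for y, m in zip(labels, split_mask) if not m]
    weighted = sum(
        len(part) / len(labels) * entropy(part)
        for part in (left, right)
        if part
    )
    return entropy(labels) - weighted


# Hypothetical attribute for six customers.
high_income = [True, True, False, False, True, False]

# Unary target: we only have home loan customers.
unary_target = ["home_loan"] * 6
unary_gain = information_gain(unary_target, high_income)

# With a reference group, the same attribute becomes informative.
binary_target = ["home_loan", "home_loan", "other", "other", "home_loan", "other"]
binary_gain = information_gain(binary_target, high_income)

print(unary_gain, binary_gain)
```

With the unary target the gain is zero whatever attribute is tried, while the made-up reference group lets the same attribute separate the two classes.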
This means our new question becomes: what are the groups of similar customers within the single class of loan customers, and how do they differ from the other groups of loan customers? That replaces the original question of what makes my “loan” customers similar. It is now a very different question, and by reformulating it we are also picking a new “tool” from our workbench: instead of using a classification algorithm, we fall back on a clustering method.
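As a rough illustration of the reformulated question, the sketch below groups loan customers with a small hand-written k-means. K-means is just one possible clustering method, chosen here for brevity and not prescribed by the post; the two features (age and loan amount, both assumed to be already scaled to the 0 to 1 range) and the six customers are hypothetical.

```python
from math import dist


def k_means(points, k, iterations=20):
    """Minimal k-means: returns (centroids, cluster label per point)."""
    # Deterministic start: pick k points spread evenly through the list.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
    return centroids, labels


# Hypothetical home loan customers: (scaled age, scaled loan amount).
customers = [
    (0.10, 0.20), (0.15, 0.25), (0.12, 0.18),
    (0.80, 0.85), (0.85, 0.90), (0.78, 0.88),
]

centroids, labels = k_means(customers, k=2)
print(labels)
print(centroids)
```

The centroids then serve as the description of each group, and comparing them answers the second half of the new question: how one group of loan customers differs from another. On real data the number of groups and the choice of features would need to be explored rather than fixed up front.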
So the usual premise, where the data and the analytical methods are functions of the business question, doesn’t work in this situation, and the practical solution is to alter the initial objective.

Goran Dragosavac

