Wednesday, August 10, 2011

What constitutes a good data mining model?

There are different types of data mining models, so the definition of a good model depends on the type of model.
A good explanatory model must be able to explain some facet of the business problem. The purpose of descriptive models is to extract patterns in the data that are non-trivial, previously unknown, potentially useful and actionable. Such a model should deepen your understanding of a specific business phenomenon and, if acted upon, these new insights can generate new business value.
Predictive models are different. The purpose of a predictive model is to generalize well to new data. First, we have to be able to compare the results to what actually happened in the real world. Did the predicted behavior actually happen? How many times was the model right, or wrong? What is the improvement over pre-modeling levels? Here, the basic assessment metrics used to choose the best model are accuracy rate, misclassification rate, lift, average squared error, etc.
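As a rough illustration of the metrics named above, here is a minimal Python sketch. The function names, the 0.5 cut-off, the top-20% lift window and the ten example records are all assumptions made up for this illustration; they do not come from any particular tool or project.

```python
def accuracy(actual, predicted):
    """Share of cases where the predicted class equals the actual class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def misclassification_rate(actual, predicted):
    """Share of cases the model got wrong (the complement of accuracy)."""
    return 1.0 - accuracy(actual, predicted)


def average_squared_error(actual, scores):
    """Mean squared difference between the 0/1 outcome and the predicted score."""
    return sum((a - s) ** 2 for a, s in zip(actual, scores)) / len(actual)


def lift(actual, scores, top_fraction=0.2):
    """Response rate among the highest-scored cases divided by the overall rate."""
    n_top = max(1, int(round(len(actual) * top_fraction)))
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top
    base_rate = sum(actual) / len(actual)
    return top_rate / base_rate


# Hypothetical outcomes (1 = event happened) and model scores for ten customers.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.2, 0.4]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

print("accuracy:", accuracy(actual, predicted))
print("misclassification rate:", misclassification_rate(actual, predicted))
print("average squared error:", average_squared_error(actual, scores))
print("lift in top 20%:", lift(actual, scores))
```

On this toy data the model is right 8 times out of 10, and the top 20% of scored customers respond 2.5 times as often as the population as a whole, which is the kind of comparison against pre-modeling levels described above.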
The question that I have been asked many times by business audiences is how they can trust the model, since they are required not only to sponsor model implementation, but also to stake their reputations on technologies that they often don’t quite understand. My response is always to look at the assessment measures on test data. How the model performs on the test dataset is the closest we will ever come to assessing its performance on a new dataset, where the model is required to generate accurate predictions.
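To make the idea of test data concrete, here is a small Python sketch of a holdout evaluation. Everything in it is an assumption for illustration: the data is synthetic, the "model" is just a single cut-off on one input, and the 70/30 split and helper names are arbitrary choices.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and hold out a fraction as test data."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy_at(rows, threshold):
    """Accuracy of the rule 'predict 1 when x >= threshold' on the given rows."""
    hits = sum(1 for x, y in rows if (1 if x >= threshold else 0) == y)
    return hits / len(rows)


def fit_threshold(train):
    """Pick the cut-off that gives the best accuracy on the training rows only."""
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: accuracy_at(train, t))


# Synthetic data: the outcome is 1 when a noisy version of x reaches 50.
rng = random.Random(0)
rows = []
for _ in range(200):
    x = rng.uniform(0, 100)
    y = 1 if x + rng.gauss(0, 10) >= 50 else 0
    rows.append((x, y))

train, test = train_test_split(rows)
threshold = fit_threshold(train)          # the model never sees the test rows
train_accuracy = accuracy_at(train, threshold)
test_accuracy = accuracy_at(test, threshold)

print("chosen threshold:", round(threshold, 2))
print("accuracy on training data:", round(train_accuracy, 3))
print("accuracy on test data:", round(test_accuracy, 3))
```

The number to show a business audience is the last one: it is measured on records that played no part in building the model, so it is the more honest estimate of how the model will behave on new data.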
Model accuracy is only one aspect of model quality; there are others, such as stability. At the same level of accuracy it is better to go for the simpler model with fewer variables, since such models are generally more robust and stable.
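One way to turn this preference into a rule is sketched below in Python. The candidate models, their accuracy figures and the one-percentage-point tolerance are hypothetical values chosen only to show the idea; what counts as "the same level of accuracy" is a judgment call for each project.

```python
def pick_simplest(candidates, tolerance=0.01):
    """Among models within `tolerance` of the best test accuracy,
    return the one that uses the fewest variables."""
    best = max(c["test_accuracy"] for c in candidates)
    close_enough = [c for c in candidates if best - c["test_accuracy"] <= tolerance]
    # Fewest variables first; ties are broken in favour of higher accuracy.
    return min(close_enough, key=lambda c: (c["n_variables"], -c["test_accuracy"]))


# Hypothetical candidate models, all assessed on the same test data.
candidates = [
    {"name": "full", "n_variables": 25, "test_accuracy": 0.842},
    {"name": "medium", "n_variables": 12, "test_accuracy": 0.838},
    {"name": "small", "n_variables": 6, "test_accuracy": 0.835},
    {"name": "tiny", "n_variables": 2, "test_accuracy": 0.790},
]

chosen = pick_simplest(candidates)
print("selected model:", chosen["name"], "with", chosen["n_variables"], "variables")
```

Here the six-variable model is selected: it gives up less than one point of accuracy against the 25-variable model, while the two-variable model is rejected because its accuracy drops well outside the tolerance.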
Another angle on what constitutes a good model comes purely from a business perspective. Models that are near perfect from a statistical perspective are of no use if they cannot be implemented, for whatever reason. On the other hand, we may have models that fall short of being statistically sound but that can still help us do things better than we could in the absence of such a model.
And lastly, the main question remains: how do the benefits generated by the model compare with its cost of production and implementation? The benefits of a good model always outweigh its costs.
Goran Dragosavac