Thursday, January 31, 2013


Nine Laws of Data Mining

by Tom Khabaza

This content was created during the first quarter of 2010 to publish the “Nine Laws of Data Mining”, which explain the reasons underlying the data mining process. If you prefer brevity, see my tweets: twitter.com/tomkhabaza. If you are a member of LinkedIn, see the “9 Laws of Data Mining” subgroup of the CRISP-DM group for a discussion forum. This page contains laws 1-4, with further laws on additional pages. The 9 Laws are also expressed as haikus here.

Data mining is the creation of new knowledge in natural or artificial form, by using business knowledge to discover and interpret patterns in data. In its current form, data mining as a field of practice came into existence in the 1990s, aided by the emergence of data mining algorithms packaged within workbenches so as to be suitable for business analysts. Perhaps because of its origins in practice rather than in theory, relatively little attention has been paid to understanding the nature of the data mining process. The development of the CRISP-DM methodology in the late 1990s was a substantial step towards a standardised description of the process that had already been found successful and was (and is) followed by most practising data miners.

 

Although CRISP-DM describes how data mining is performed, it does not explain what data mining is or why the process has the properties that it does. In this paper I propose nine maxims or “laws” of data mining (most of which are well-known to practitioners), together with explanations where known. This provides the start of a theory to explain (and not merely describe) the data mining process.

It is not my purpose to criticise CRISP-DM; many of the concepts introduced by CRISP-DM are crucial to the understanding of data mining outlined here, and I also depend on CRISP-DM’s common terminology. This is merely the next step in the process that started with CRISP-DM.

——————————————————————————————————

1st Law of Data Mining – “Business Goals Law”:

Business objectives are the origin of every data mining solution

This defines the field of data mining: data mining is concerned with solving business problems and achieving business goals. Data mining is not primarily a technology; it is a process, which has one or more business objectives at its heart. Without a business objective (whether or not this is articulated), there is no data mining.

Hence the maxim: “Data Mining is a Business Process”.

——————————————————————————————————

2nd Law of Data Mining – “Business Knowledge Law”:
Business knowledge is central to every step of the data mining process

This defines a crucial characteristic of the data mining process. A naive reading of CRISP-DM would see business knowledge used at the start of the process in defining goals, and at the end of the process in guiding deployment of results. This would be to miss a key property of the data mining process, that business knowledge has a central role in every step.

For convenience I use the CRISP-DM phases to illustrate:

· Business understanding must be based on business knowledge, and so must the mapping of business objectives to data mining goals. (This mapping is also based on data knowledge and data mining knowledge.)

· Data understanding uses business knowledge to understand which data is related to the business problem, and how it is related.

· Data preparation means using business knowledge to shape the data so that the required business questions can be asked and answered. (For further detail see the 3rd Law – the Data Preparation law).

· Modelling means using data mining algorithms to create predictive models and interpreting both the models and their behaviour in business terms – that is, understanding their business relevance.

· Evaluation means understanding the business impact of using the models.

· Deployment means putting the data mining results to work in a business process.

In summary, without business knowledge, not a single step of the data mining process can be effective; there are no “purely technical” steps. Business knowledge guides the process towards useful results, and enables the recognition of those results that are useful. Data mining is an iterative process, with business knowledge at its core, driving continual improvement of results.

The reason behind this can be explained in terms of the “chasm of representation” (an idea used by Alan Montgomery in data mining presentations of the 1990s). Montgomery pointed out that the business goals in data mining refer to the reality of the business, whereas investigation takes place at the level of data which is only a representation of that reality; there is a gap (or “chasm”) between what is represented in the data and what takes place in the real world. In data mining, business knowledge is used to bridge this gap; whatever is found in the data has significance only when interpreted using business knowledge, and anything missing from the data must be provided through business knowledge. Only business knowledge can bridge the gap, which is why it is central to every step of the data mining process.

——————————————————————————————————

3rd Law of Data Mining – “Data Preparation Law”:

Data preparation is more than half of every data mining process

It is a well-known maxim of data mining that most of the effort in a data mining project is spent in data acquisition and preparation. Informal estimates vary from 50 to 80 percent. Naive explanations might be summarised as “data is difficult”, and moves to automate various parts of data acquisition, data cleaning, data transformation and data preparation are often viewed as attempts to mitigate this “problem”. While automation can be beneficial, there is a risk that proponents of this technology will believe that it can remove the large proportion of effort which goes into data preparation. This would be to misunderstand the reasons why data preparation is required in data mining.

The purpose of data preparation is to put the data into a form in which the data mining question can be asked, and to make it easier for the analytical techniques (such as data mining algorithms) to answer it. Every change to the data of any sort (including cleaning, large and small transformations, and augmentation) means a change to the problem space which the analysis must explore. The reason that data preparation is important, and forms such a large proportion of data mining effort, is that the data miner is deliberately manipulating the problem space to make it easier for their analytical techniques to find a solution.

There are two aspects to this “problem space shaping”. The first is putting the data into a form in which it can be analysed at all – for example, most data mining algorithms require data in a single table, with one record per example. The data miner knows this as a general parameter of what the algorithm can do, and therefore puts the data into a suitable format. The second aspect is making the data more informative with respect to the business problem – for example, certain derived fields or aggregates may be relevant to the data mining question; the data miner knows this through business knowledge and data knowledge. By including these fields in the data, the data miner manipulates the search space to make it possible or easier for their preferred techniques to find a solution.
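Both aspects of problem space shaping can be sketched in a few lines of Python. This is a minimal illustration only: the transaction log, its field names (customer, amount, day) and the chosen derived fields are hypothetical, invented for the example rather than taken from any real project.

```python
from collections import defaultdict

# Hypothetical raw data: one row per transaction, not one row per customer.
transactions = [
    {"customer": "A", "amount": 20.0, "day": 1},
    {"customer": "A", "amount": 35.0, "day": 40},
    {"customer": "B", "amount": 5.0, "day": 3},
    {"customer": "B", "amount": 7.5, "day": 5},
    {"customer": "B", "amount": 6.0, "day": 90},
]

def one_record_per_customer(rows, today):
    """Aspect 1: reshape to a single table with one record per example.
    Aspect 2: add derived fields that business knowledge suggests matter."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["customer"]].append(row)

    table = []
    for customer, items in sorted(grouped.items()):
        amounts = [r["amount"] for r in items]
        last_day = max(r["day"] for r in items)
        table.append({
            "customer": customer,
            "n_purchases": len(items),             # aggregate
            "total_spend": sum(amounts),           # aggregate
            "mean_spend": sum(amounts) / len(amounts),
            "days_since_last": today - last_day,   # derived recency field
        })
    return table

prepared = one_record_per_customer(transactions, today=100)
for record in prepared:
    print(record)
```

Which aggregates to build (recency, frequency, spend) is exactly the kind of choice that comes from business knowledge rather than from the algorithm; a different business question would call for a different table.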

It is therefore essential that data preparation is informed in detail by business knowledge, data knowledge and data mining knowledge. These aspects of data preparation cannot be automated in any simple way.

This law also explains the otherwise paradoxical observation that even after all the data acquisition, cleaning and organisation that goes into creating a data warehouse, data preparation is still crucial to, and more than half of, the data mining process. Furthermore, even after a major data preparation stage, further data preparation is often required during the iterative process of building useful models, as shown in the CRISP-DM diagram.
——————————————————————————————————

4th Law of Data Mining – “NFL-DM”:

The right model for a given application can only be discovered by experiment

or “There is No Free Lunch for the Data Miner”

It is an axiom of machine learning that, if we knew enough about a problem space, we could choose or design an algorithm to find optimal solutions in that problem space with maximal efficiency. Arguments for the superiority of one algorithm over others in data mining rest on the idea that data mining problem spaces have one particular set of properties, or that these properties can be discovered by analysis and built into the algorithm. However, these views arise from the erroneous idea that, in data mining, the data miner formulates the problem and the algorithm finds the solution. In fact, the data miner both formulates the problem and finds the solution – the algorithm is merely a tool which the data miner uses to assist with certain steps in this process.

There are five factors which contribute to the necessity for experiment in finding data mining solutions:

1. If the problem space were well-understood, the data mining process would not be needed – data mining is the process of searching for as yet unknown connections.

2. For a given application, there is not only one problem space; different models may be used to solve different parts of the problem, and the way in which the problem is decomposed is itself often the result of data mining and not known before the process begins.

3. The data miner manipulates, or “shapes”, the problem space by data preparation, so that the grounds for evaluating a model are constantly shifting.

4. There is no technical measure of value for a predictive model (see 8th law).

5. The business objective itself undergoes revision and development during the data mining process, so that the appropriate data mining goals may change completely.

This last point, the ongoing development of business objectives during data mining, is implied by CRISP-DM but is often missed. It is widely known that CRISP-DM is not a “waterfall” process in which each phase is completed before the next begins. In fact, any CRISP-DM phase can continue throughout the project, and this is as true for Business Understanding as it is for any other phase. The business objective is not simply given at the start; it evolves throughout the process. This may be why some data miners are willing to start projects without a clear business objective – they know that business objectives are also a result of the process, and not a static given.

Wolpert’s “No Free Lunch” (NFL) theorem, as applied to machine learning, states that no one bias (as embodied in an algorithm) will be better than any other when averaged across all possible problems (datasets). This is because, if we consider all possible problems, their solutions are evenly distributed, so that an algorithm (or bias) which is advantageous for one subset will be disadvantageous for another. This is strikingly similar to what all data miners know, that no one algorithm is the right choice for every problem. Yet the problems or datasets tackled by data mining are anything but random, and most unlikely to be evenly distributed across the space of all possible problems – they represent a very biased sample, so why should the conclusions of NFL apply? The answer relates to the factors given above: because problem spaces are initially unknown, because multiple problem spaces may relate to each data mining goal, because problem spaces may be manipulated by data preparation, because models cannot be evaluated by technical means, and because the business problem itself may evolve. For all these reasons, data mining problem spaces are developed by the data mining process, and subject to constant change during the process, so that the conditions under which the algorithms operate mimic a random selection of datasets and Wolpert’s NFL theorem therefore applies. There is no free lunch for the data miner.
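The averaging argument behind NFL can be checked on a toy problem. The Python sketch below is an illustration under stated assumptions, not Wolpert's formal result: it assumes a tiny input space of eight points, treats every possible labelling of those points as an equally likely "problem", and compares three hypothetical learners (always_zero, majority, anti_majority) on the points they were not trained on.

```python
from itertools import product

INPUTS = list(range(8))   # a tiny, fully enumerable input space
TRAIN = INPUTS[:5]        # points the learner is shown
TEST = INPUTS[5:]         # off-training-set points

def always_zero(train_labels):
    return lambda x: 0

def majority(train_labels):
    # Predict whichever label was more common in training (5 points, no ties).
    guess = int(sum(train_labels.values()) * 2 > len(train_labels))
    return lambda x: guess

def anti_majority(train_labels):
    # Deliberately perverse bias: predict the less common training label.
    guess = 1 - int(sum(train_labels.values()) * 2 > len(train_labels))
    return lambda x: guess

def average_test_accuracy(learner):
    """Average off-training-set accuracy over ALL 2**8 target functions."""
    correct = total = 0
    for labels in product([0, 1], repeat=len(INPUTS)):
        target = dict(zip(INPUTS, labels))
        model = learner({x: target[x] for x in TRAIN})
        for x in TEST:
            correct += int(model(x) == target[x])
            total += 1
    return correct / total

results = {f.__name__: average_test_accuracy(f)
           for f in (always_zero, majority, anti_majority)}
print(results)
```

Every learner, including the perverse one, averages exactly 0.5 on unseen points, because for any fixed training labels the unseen labels take every combination equally often. Any advantage of one bias over another can only come from the problems not being uniformly distributed.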

This describes the data mining process in general. However, there may well be cases where the ground is already “well-trodden” – the business goals are stable, the data and its pre-processing are stable, an acceptable algorithm or algorithms and their role(s) in the solution have been discovered and settled upon. In these situations, some of the properties of the generic data mining process are lessened. Such stability is temporary, because both the relation of the data to the business (see 2nd law) and our understanding of the problem (see 9th law) will change. However, as long as this stability lasts, the data miner’s lunch may be free, or at least relatively inexpensive.

 

5th Law of Data Mining – “Watkins’ Law”:
There are always patterns

This law was first stated by David Watkins. We might expect that a proportion of data mining projects would fail because the patterns needed to solve the business problem are not present in the data, but this does not accord with the experience of practising data miners.

Previous explanations have suggested that this is because:

· There is always something interesting to be found in a business-relevant dataset, so that even if the expected patterns were not found, something else useful would be found (this does accord with data miners’ experience), and

· A data mining project would not be undertaken unless business experts expected that patterns would be present, and it should not be surprising that the experts are usually right.

However, Watkins formulated this in a simpler and more direct way: “There are always patterns”, and this accords more accurately with the experience of data miners than either of the previous explanations. Watkins later amended this to mean that in data mining projects about customer relationships, there are always patterns connecting customers’ previous behaviour with their future behaviour, and that these patterns can be used profitably (“Watkins’ CRM Law”). However, data miners’ experience is that this is not limited to CRM problems – there are always patterns in any data mining problem (“Watkins’ General Law”).

The explanation of Watkins’ General Law is as follows:

· The business objective of a data mining project defines the domain of interest, and this is reflected in the data mining goal.

· Data relevant to the business objective and consequent data mining goal is generated by processes within the domain.

· These processes are governed by rules, and the data that is generated by the processes reflects those rules.

· In these terms, the purpose of the data mining process is to reveal the domain rules by combining pattern-discovery technology (data mining algorithms) with the business knowledge required to interpret the results of the algorithms in terms of the domain.

· Data mining requires relevant data, that is data generated by the domain processes in question, which inevitably holds patterns from the rules which govern these processes.

To summarise this argument: there are always patterns because they are an inevitable by-product of the processes which produce the data. To find the patterns, start from the process or what you know of it – the business knowledge.
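A toy sketch of this argument in Python: a hypothetical domain process governed by one rule (orders of 100 or more receive a discount) generates data, and a deliberately naive search then recovers the rule from the data alone. The process, the rule and the function names are all invented for illustration.

```python
def domain_process(amounts):
    """Hypothetical business process governed by one rule:
    orders of 100 or more receive a discount."""
    return [{"amount": a, "discount": a >= 100} for a in amounts]

def best_threshold(records):
    """Naive pattern discovery: find the amount threshold that best
    separates discounted from non-discounted orders."""
    best = (None, -1)
    for t in sorted({r["amount"] for r in records}):
        hits = sum((r["amount"] >= t) == r["discount"] for r in records)
        if hits > best[1]:   # keep the first threshold with the most hits
            best = (t, hits)
    return best

records = domain_process(range(10, 201, 10))
threshold, hits = best_threshold(records)
print(threshold, hits, len(records))   # prints: 100 20 20
```

The pattern is found because the rule that produced the data put it there. Knowing that discounts depend on order amount, which is business knowledge, is what tells the data miner to search over amount in the first place.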

Discovery of these patterns also forms an iterative process with business knowledge; the patterns contribute to business knowledge, and business knowledge is the key component required to interpret the patterns. In this iterative process, data mining algorithms simply link business knowledge to patterns which cannot be observed with the naked eye.

If this explanation is correct, then Watkins’ law is entirely general. There will always be patterns for every data mining problem in every domain unless there is no relevant data; this is guaranteed by the definition of relevance.

——————————————————————————————————

6th Law of Data Mining – “Insight Law”:
Data mining amplifies perception in the business domain

How does data mining produce insight? This law approaches the heart of data mining – why it must be a business process and not a technical one. Business problems are solved by people, not by algorithms. The data miner and the business expert “see” the solution to a problem, that is the patterns in the domain that allow the business objective to be achieved. Thus data mining is, or assists as part of, a perceptual process. Data mining algorithms reveal patterns that are not normally visible to human perception. The data mining process integrates these algorithms with the normal human perceptual process, which is active in nature. Within the data mining process, the human problem solver interprets the results of data mining algorithms and integrates them into their business understanding, and thence into a business process.

This is similar to the concept of an “intelligence amplifier”. Early in the field of Artificial Intelligence, it was suggested that the first practical outcomes from AI would be not intelligent machines, but rather tools which acted as “intelligence amplifiers”, assisting human users by boosting their mental capacities and therefore their effective intelligence. Data mining provides a kind of intelligence amplifier, helping business experts to solve business problems in a way which they could not achieve unaided.

In summary: Data mining algorithms provide a capability to detect patterns beyond normal human capabilities. The data mining process allows data miners and business experts to integrate this capability into their own problem solving and into business processes.

——————————————————————————————————

7th Law of Data Mining – “Prediction Law”:
Prediction increases information locally by generalisation

The term “prediction” has become the accepted description of what data mining models do – we talk about “predictive models” and “predictive analytics”. This is because some of the most popular data mining models are often used to “predict the most likely outcome” (as well as indicating how likely the outcome may be). This is the typical use of classification and regression models in data mining solutions.

However, other kinds of data mining models, such as clustering and association models, are also characterised as “predictive”; this is a much looser sense of the term. A clustering model might be described as “predicting” the group into which an individual falls, and an association model might be described as “predicting” one or more attributes on the basis of those that are known.

Similarly we might analyse the use of the term “predict” in different domains: a classification model might be said to predict customer behaviour – more properly we might say that it predicts which customers should be targeted in a certain way, even though not all the targeted individuals will behave in the “predicted” manner. A fraud detection model might be said to predict whether individual transactions should be treated as high-risk, even though not all those so treated are in fact cases of fraud.

These broad uses of the term “prediction” have led to the term “predictive analytics” as an umbrella term for data mining and the application of its results in business solutions. But we should remain aware that this is not the ordinary everyday meaning of “prediction” – we cannot expect to predict the behaviour of a specific individual, or the outcome of a specific fraud investigation.

What, then, is “prediction” in this sense? What do classification, regression, clustering and association algorithms and their resultant models have in common? The answer lies in “scoring”, that is the application of a predictive model to a new example. The model produces a prediction, or score, which is a new piece of information about the example. The available information about the example in question has been increased, locally, on the basis of the patterns found by the algorithm and embodied in the model, that is on the basis of generalisation or induction. It is important to remember that this new information is not “data”, in the sense of a “given”; it is information only in the statistical sense.
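Scoring in this sense can be sketched as follows. The example is a minimal illustration with made-up data: the "model" is simply a per-segment response rate generalised from hypothetical historical examples, and the names fit_rate_model and score are assumptions of the sketch, not part of any particular tool.

```python
from collections import defaultdict

# Hypothetical historical examples: (segment, responded)
history = [
    ("student", 1), ("student", 0), ("student", 1), ("student", 1),
    ("retired", 0), ("retired", 0), ("retired", 1), ("retired", 0),
]

def fit_rate_model(examples):
    """Generalise from examples to a per-segment response rate."""
    counts = defaultdict(lambda: [0, 0])   # segment -> [responses, total]
    for segment, responded in examples:
        counts[segment][0] += responded
        counts[segment][1] += 1
    return {seg: hits / total for seg, (hits, total) in counts.items()}

def score(model, example):
    """Scoring: attach a new piece of information (the score) to an example.
    Assumes the example's segment was seen when the model was built."""
    scored = dict(example)
    scored["score"] = model[example["segment"]]
    return scored

model = fit_rate_model(history)
new_customer = {"id": 101, "segment": "student"}
print(score(model, new_customer))
# prints: {'id': 101, 'segment': 'student', 'score': 0.75}
```

The score of 0.75 was never observed for customer 101. It is information added locally to that one record by generalising from other records, which is why it should not be mistaken for data about the individual.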

——————————————————————————————————

8th Law of Data Mining – “Value Law”:

The value of data mining results is not determined by the accuracy or stability of predictive models

Accuracy and stability are useful measures of how well a predictive model makes its predictions. Accuracy means how often the predictions are correct (where they are truly predictions) and stability means how much (or rather how little) the predictions would change if the data used to create the model were a different sample from the same population. Given the central role of the concept of prediction in data mining, the accuracy and stability of a predictive model might be expected to determine its value, but this is not the case.
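One way to make these two definitions concrete is sketched below in Python. It rests on several assumptions that are not part of the text: a simulated population of two overlapping classes, a hypothetical one-parameter threshold model, and stability measured as the mean pairwise agreement between models rebuilt on fresh samples. Other definitions of stability are possible.

```python
import random

def draw_sample(rng, n_per_class=50):
    """Draw a fresh sample from the same simulated population."""
    sample = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    sample += [(rng.gauss(2.0, 1.0), 1) for _ in range(n_per_class)]
    return sample

def fit_threshold(sample):
    """Hypothetical model: predict 1 when x exceeds the midpoint
    between the two class means observed in the sample."""
    ones = [x for x, y in sample if y == 1]
    zeros = [x for x, y in sample if y == 0]
    cut = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
    return lambda x: int(x > cut)

def accuracy(model, data):
    """How often the predictions are correct."""
    return sum(model(x) == y for x, y in data) / len(data)

def stability(models, data):
    """How little predictions change between rebuilt models:
    mean pairwise agreement of their predictions on the same data."""
    preds = [[m(x) for x, _ in data] for m in models]
    pairs = [(i, j) for i in range(len(preds)) for j in range(i + 1, len(preds))]
    agree = [sum(a == b for a, b in zip(preds[i], preds[j])) / len(data)
             for i, j in pairs]
    return sum(agree) / len(agree)

rng = random.Random(0)                      # seeded for repeatability
holdout = draw_sample(rng, n_per_class=200)
models = [fit_threshold(draw_sample(rng)) for _ in range(5)]
print("accuracy of first model:", accuracy(models[0], holdout))
print("stability across rebuilds:", stability(models, holdout))
```

Both numbers are purely technical: nothing in either function refers to a business objective.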

The value of a predictive model arises in two ways:

· The model’s predictions drive improved (more effective) action, and

· The model delivers insight (new knowledge) which leads to improved strategy.

In the case of insight, accuracy is connected only loosely to the value of any new knowledge delivered. Some predictive capability may be necessary to convince us that the discovered patterns are real. However, a model which is incomprehensibly complex or totally opaque may be highly accurate in its predictions, yet deliver no useful insight, whereas a simpler and less accurate model may be much more useful for delivering insight.

The disconnect between accuracy and value in the case of improved action is less obvious, but still present, and can be highlighted by the question “Is the model predicting the right thing, and for the right reasons?” In other words, the value of a model derives as much from its fit to the business problem as it does from its predictive accuracy. For example, a customer attrition model might make highly accurate predictions, yet make its predictions too late for the business to act on them effectively. Alternatively an accurate customer attrition model might drive effective action to retain customers, but only for the least profitable subset of customers. A high degree of accuracy does not enhance the value of these models when they have a poor fit to the business problem.
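The attrition example can be put into numbers with a small Python sketch. All figures are invented: ten hypothetical customers, a contact cost of 10 per retention offer, and the simplifying assumption that every contacted churner is retained. Model A is accurate but only catches low-profit churners; model B is less accurate but catches the profitable ones.

```python
CONTACT_COST = 10  # hypothetical cost of one retention offer

# Hypothetical customers: (annual_profit, will_churn)
customers = [
    (1000, 1), (900, 1), (50, 1), (40, 1), (30, 1),
    (500, 0), (400, 0), (60, 0), (20, 0), (10, 0),
]

# 1 = model flags the customer as a likely churner
model_a = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]   # finds only low-profit churners
model_b = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0]   # finds high-profit churners, with false alarms

def accuracy(flags):
    """Fraction of customers whose churn outcome was predicted correctly."""
    return sum(f == churn for f, (_, churn) in zip(flags, customers)) / len(customers)

def net_value(flags):
    """Profit retained from correctly flagged churners, minus contact costs.
    Assumes, purely for illustration, that every contacted churner stays."""
    retained = sum(profit for f, (profit, churn) in zip(flags, customers)
                   if f and churn)
    return retained - CONTACT_COST * sum(flags)

for name, flags in (("A", model_a), ("B", model_b)):
    print(name, accuracy(flags), net_value(flags))
# prints:
# A 0.8 90
# B 0.4 1850
```

Under these assumptions the less accurate model is worth about twenty times more, because it fits the business problem of retaining profit rather than the technical problem of labelling customers.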

The same is true of model stability; although an interesting measure for predictive models, stability cannot be substituted for the ability of a model to provide business insight, or for its fit to the business problem. Neither can any other technical measure.

In summary, the value of a predictive model is not determined by any technical measure. Data miners should not focus on predictive accuracy, model stability, or any other technical metric for predictive models at the expense of business insight and business fit.

——————————————————————————————————

9th Law of Data Mining – “Law of Change”:
All patterns are subject to change

The patterns discovered by data mining do not last forever. This is well-known in many applications of data mining, but the universality of this property and the reasons for it are less widely appreciated.

In marketing and CRM applications of data mining, it is well-understood that patterns of customer behaviour are subject to change over time. Fashions change, markets and competition change, and the economy changes as a whole; for all these reasons, predictive models become out-of-date and should be refreshed regularly or when they cease to predict accurately.
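A simple way to notice that a model has "ceased to predict accurately" is to track its accuracy over a rolling window of recent outcomes. The Python sketch below is one possible approach, with an arbitrary window size and threshold chosen for the example; the function name needs_refresh is hypothetical.

```python
from collections import deque

def needs_refresh(outcomes, window=5, threshold=0.6):
    """Return the index at which rolling accuracy first drops below
    `threshold`, or None if it never does.
    `outcomes` is a sequence of 1 (prediction correct) or 0 (wrong)."""
    recent = deque(maxlen=window)   # keeps only the latest `window` outcomes
    for i, ok in enumerate(outcomes):
        recent.append(ok)
        if len(recent) == window and sum(recent) / window < threshold:
            return i
    return None

# Hypothetical stream of outcomes: the model degrades as behaviour shifts.
stream = [1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(needs_refresh(stream))   # prints: 8
```

A monitor of this kind only says when the old patterns have stopped holding. Deciding what has changed in the business, and therefore how to rebuild the model, remains a matter for business knowledge.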

The same is true in risk and fraud-related applications of data mining. Patterns of fraud change with a changing environment and because criminals change their behaviour in order to stay ahead of crime prevention efforts. Fraud detection applications must therefore be designed to detect new, unknown types of fraud, just as they must deal with old and familiar ones.

Some kinds of data mining might be thought to find patterns which will not change over time – for example in scientific applications of data mining, do we not discover unchanging universal laws? Perhaps surprisingly, the answer is that even these patterns should be expected to change.

The reason is that patterns are not simply regularities which exist in the world and are reflected in the data – these regularities may indeed be static in some domains. Rather, the patterns discovered by data mining are part of a perceptual process, an active process in which data mining mediates between the world as described by the data and the understanding of the observer or business expert. Because our understanding continually develops and grows, so we should expect the patterns also to change. Tomorrow’s data may look superficially similar, but it will have been collected by different means, for (perhaps subtly) different purposes, and have different semantics; the analysis process, because it is driven by business knowledge, will change as that knowledge changes. For all these reasons, the patterns will be different.

To express this briefly, all patterns are subject to change because they reflect not only a changing world but also our changing understanding.

Postscript

The 9 Laws of Data Mining are simple truths about data mining. Most of the 9 laws are already well-known to data miners, although some are expressed in an unfamiliar way (for example, the 5th, 6th and 7th laws). Most of the new ideas associated with the 9 laws are in the explanations, which express an attempt to understand the reasons behind the well-known form of the data mining process.

Why should we care why the data mining process takes the form that it does? In addition to the simple appeal of knowledge and understanding, there is a practical reason to pursue these questions.

The data mining process came into being in the form that exists today because of technological developments – the widespread availability of machine learning algorithms, and the development of workbenches which integrated these algorithms with other techniques and made them accessible to users with a business-oriented outlook. Should we expect technological change to change the data mining process? Eventually it must, but if we understand the reasons for the form of the process, then we can distinguish between technology which might change it and technology which cannot.

Several technological developments have been hailed as revolutions in predictive analytics, for example the advent of automated data preparation and model re-building, and the integration of business rules with predictive models in deployment frameworks. The 9 laws of data mining suggest, and their explanations demonstrate, that these developments will not change the nature of the process. The 9 laws, and further development of these ideas, should be used to judge any future claims of revolutionising the data mining process, in addition to their educational value for data miners.

I would like to thank Chris Thornton and David Watkins, who supplied the insights which inspired this work, and also to thank all those who have contributed to the LinkedIn “9 Laws of Data Mining” discussion group, which has provided invaluable food for thought.

 
