Nine Laws of Data Mining
by Tom Khabaza
This content was created
during the first quarter of 2010 to publish the “Nine Laws of Data Mining”,
which explain the reasons underlying the data mining process. If you prefer
brevity, see my tweets: twitter.com/tomkhabaza. If you are a
member of LinkedIn, see the “9 Laws
of Data Mining” subgroup of the CRISP-DM group for a discussion forum.
Data mining is the creation of
new knowledge in natural or artificial form, by using business knowledge to
discover and interpret patterns in data. In its current form, data mining as a
field of practice came into existence in the 1990s, aided by the emergence of
data mining algorithms packaged within workbenches so as to be suitable for
business analysts. Perhaps because of its origins in practice rather than in
theory, relatively little attention has been paid to understanding the nature
of the data mining process. The development of the CRISP-DM methodology in the
late 1990s was a substantial step towards a standardised description of the
process that had already been found successful and was (and is) followed by
most practising data miners.
Although CRISP-DM describes
how data mining is performed, it does not explain what data mining is or why
the process has the properties that it does. In this paper I propose nine
maxims or “laws” of data mining (most of which are well-known to
practitioners), together with explanations where known. This provides the start
of a theory to explain (and not merely describe) the data mining process.
It is not my purpose to
criticise CRISP-DM; many of the concepts introduced by CRISP-DM are crucial to
the understanding of data mining outlined here, and I also depend on CRISP-DM’s
common terminology. This is merely the next step in the process that started
with CRISP-DM.
——————————————————————————————————
1st Law of Data Mining –
“Business Goals Law”:
Business objectives are the
origin of every data mining solution
This defines the field of data
mining: data mining is concerned with solving business problems and achieving
business goals. Data mining is not primarily a technology; it is a process,
which has one or more business objectives at its heart. Without a business
objective (whether or not this is articulated), there is no data mining.
Hence the maxim: “Data Mining is
a Business Process”.
——————————————————————————————————
2nd Law of Data Mining –
“Business Knowledge Law”:
Business knowledge is central to every step of the data mining process
This defines a crucial
characteristic of the data mining process. A naive reading of CRISP-DM would
see business knowledge used at the start of the process in defining goals, and
at the end of the process in guiding deployment of results. This would be to
miss a key property of the data mining process, that business knowledge has a
central role in every step.
For convenience I use the
CRISP-DM phases to illustrate:
· Business understanding must be based on business knowledge, and so must
the mapping of business objectives to data mining goals. (This mapping is also
based on data knowledge and data mining knowledge).
· Data understanding uses business knowledge to understand which data is
related to the business problem, and how it is related.
· Data preparation means using business knowledge to shape the data so
that the required business questions can be asked and answered. (For further
detail see the 3rd Law – the Data Preparation law).
· Modelling means using data mining algorithms to create predictive models
and interpreting both the models and their behaviour in business terms – that
is, understanding their business relevance.
· Evaluation means understanding the business impact of using the models.
· Deployment means putting the data mining results to work in a business
process.
In summary, without business
knowledge, not a single step of the data mining process can be effective; there
are no “purely technical” steps. Business knowledge guides the process towards
useful results, and enables the recognition of those results that are useful.
Data mining is an iterative process, with business knowledge at its core,
driving continual improvement of results.
The reason behind this can be
explained in terms of the “chasm of representation” (an idea used by Alan
Montgomery in data mining presentations of the 1990s). Montgomery pointed out
that the business goals in data mining refer to the reality of the business,
whereas investigation takes place at the level of data which is only a
representation of that reality; there is a gap (or “chasm”) between what is
represented in the data and what takes place in the real world. In data mining,
business knowledge is used to bridge this gap; whatever is found in the data
has significance only when interpreted using business knowledge, and anything
missing from the data must be provided through business knowledge. Only
business knowledge can bridge the gap, which is why it is central to every step
of the data mining process.
——————————————————————————————————
3rd Law of Data Mining – “Data
Preparation Law”:
Data preparation is more than
half of every data mining process
It is a well-known maxim of
data mining that most of the effort in a data mining project is spent in data
acquisition and preparation. Informal estimates vary from 50 to 80 percent.
Naive explanations might be summarised as “data is difficult”, and moves to
automate various parts of data acquisition, data cleaning, data transformation
and data preparation are often viewed as attempts to mitigate this “problem”.
While automation can be beneficial, there is a risk that proponents of this technology
will believe that it can remove the large proportion of effort which goes into
data preparation. This would be to misunderstand the reasons why data
preparation is required in data mining.
The purpose of data
preparation is to put the data into a form in which the data mining question
can be asked, and to make it easier for the analytical techniques (such as data
mining algorithms) to answer it. Every change to the data of any sort
(including cleaning, large and small transformations, and augmentation) means a
change to the problem space which the analysis must explore. The reason that
data preparation is important, and forms such a large proportion of data mining
effort, is that the data miner is deliberately manipulating the problem space
to make it easier for their analytical techniques to find a solution.
There are two aspects to this
“problem space shaping”. The first is putting the data into a form in which it
can be analysed at all – for example, most data mining algorithms require data
in a single table, with one record per example. The data miner knows this as a
general parameter of what the algorithm can do, and therefore puts the data
into a suitable format. The second aspect is making the data more informative
with respect to the business problem – for example, certain derived fields or
aggregates may be relevant to the data mining question; the data miner knows
this through business knowledge and data knowledge. By including these fields
in the data, the data miner manipulates the search space to make it possible or
easier for their preferred techniques to find a solution.
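As a minimal illustration of both aspects of this "shaping", the following sketch (using Python and pandas, with hypothetical transactional fields; neither the library choice nor the field names come from the original text) restructures transaction-level data into one record per customer and adds a business-informed derived field:

```python
import pandas as pd

# Hypothetical transaction-level data: many rows per customer.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount":      [120.0, 80.0, 15.0, 22.0, 18.0, 400.0],
    "channel":     ["web", "store", "web", "web", "web", "store"],
})

# Aspect 1: put the data into the form the algorithm needs --
# a single table with one record per example (here, per customer).
customer_table = transactions.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    num_transactions=("amount", "count"),
    avg_spend=("amount", "mean"),
)

# Aspect 2: make the data more informative for the business question --
# a derived field chosen using business knowledge, e.g. reliance on the web channel.
web_share = (
    transactions.assign(is_web=transactions["channel"].eq("web"))
    .groupby("customer_id")["is_web"].mean()
    .rename("web_share")
)
customer_table = customer_table.join(web_share)

print(customer_table)
```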
It is therefore essential that
data preparation is informed in detail by business knowledge, data knowledge
and data mining knowledge. These aspects of data preparation cannot be
automated in any simple way.
This law also explains the
otherwise paradoxical observation that even after all the data acquisition,
cleaning and organisation that goes into creating a data warehouse, data
preparation is still crucial to, and more than half of, the data mining
process. Furthermore, even after a major data preparation stage, further data
preparation is often required during the iterative process of building useful
models, as shown in the CRISP-DM diagram.
——————————————————————————————————
4th Law of Data Mining –
“NFL-DM”:
The right model for a given
application can only be discovered by experiment
or “There is No Free Lunch
for the Data Miner”
It is an axiom of machine
learning that, if we knew enough about a problem space, we could choose or
design an algorithm to find optimal solutions in that problem space with
maximal efficiency. Arguments for the superiority of one algorithm over others
in data mining rest on the idea that data mining problem spaces have one
particular set of properties, or that these properties can be discovered by
analysis and built into the algorithm. However, these views arise from the
erroneous idea that, in data mining, the data miner formulates the problem and
the algorithm finds the solution. In fact, the data miner both formulates the
problem and finds the solution – the algorithm is merely a tool which the data
miner uses to assist with certain steps in this process.
There are 5 factors which
contribute to the necessity for experiment in finding data mining solutions:
1. If the problem space were well-understood, the data mining process would
not be needed – data mining is the process of searching for as yet unknown
connections.
2. For a given application, there is not only one problem space; different
models may be used to solve different parts of the problem, and the way in
which the problem is decomposed is itself often the result of data mining and
not known before the process begins.
3. The data miner manipulates, or “shapes”, the problem space by data
preparation, so that the grounds for evaluating a model are constantly
shifting.
4. There is no technical measure of value for a predictive model (see 8th
law).
5. The business objective itself undergoes revision and development during
the data mining process, so that the appropriate data mining goals may change
completely.
This last point, the ongoing
development of business objectives during data mining, is implied by CRISP-DM
but is often missed. It is widely known that CRISP-DM is not a “waterfall”
process in which each phase is completed before the next begins. In fact, any
CRISP-DM phase can continue throughout the project, and this is as true for
Business Understanding as it is for any other phase. The business objective is
not simply given at the start; it evolves throughout the process. This may be
why some data miners are willing to start projects without a clear business
objective – they know that business objectives are also a result of the
process, and not a static given.
Wolpert’s “No Free Lunch”
(NFL) theorem, as applied to machine learning, states that no one bias (as
embodied in an algorithm) will be better than any other when averaged across
all possible problems (datasets). This is because, if we consider all possible
problems, their solutions are evenly distributed, so that an algorithm (or
bias) which is advantageous for one subset will be disadvantageous for another.
This is strikingly similar to what all data miners know, that no one algorithm
is the right choice for every problem. Yet the problems or datasets tackled by
data mining are anything but random, and most unlikely to be evenly distributed
across the space of all possible problems – they represent a very biased
sample, so why should the conclusions of NFL apply? The answer relates to the
factors given above: because problem spaces are initially unknown, because
multiple problem spaces may relate to each data mining goal, because problem
spaces may be manipulated by data preparation, because models cannot be
evaluated by technical means, and because the business problem itself may
evolve. For all these reasons, data mining problem spaces are developed by the
data mining process, and subject to constant change during the process, so that
the conditions under which the algorithms operate mimic a random selection of
datasets and Wolpert’s NFL theorem therefore applies. There is no free lunch for
the data miner.
This describes the data mining
process in general. However, there may well be cases where the ground is
already “well-trodden” – the business goals are stable, the data and its
pre-processing are stable, an acceptable algorithm or algorithms and their
role(s) in the solution have been discovered and settled upon. In these
situations, some of the properties of the generic data mining process are
lessened. Such stability is temporary, because both the relation of the data to
the business (see 2nd law) and our understanding of the problem (see 9th law)
will change. However, as long as this stability lasts, the data miner’s lunch may
be free, or at least relatively inexpensive.
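As a concrete illustration of the 4th law, consider the following sketch, which assumes scikit-learn and a synthetic dataset (both are illustrative assumptions, not part of the original text). Rather than selecting an algorithm a priori, several candidates are compared by cross-validation and judged empirically on the problem at hand:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# A stand-in problem space; in a real project this would be the prepared data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

# No algorithm is assumed to be "right" in advance; each is tried
# and evaluated empirically on this particular problem.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```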
——————————————————————————————————
5th Law of Data Mining –
“Watkins’ Law”: There are always patterns
This law was first
stated by David Watkins. We might expect that a proportion of data mining
projects would fail because the patterns needed to solve the business problem
are not present in the data, but this does not accord with the experience of
practising data miners.
Previous explanations
have suggested that this is because:
· There is always something interesting to be found in a business-relevant dataset, so that even
if the expected patterns were not found, something else useful would be found
(this does accord with data miners’ experience), and
· A data mining project would not be undertaken unless business experts expected that patterns would be
present, and it should not be surprising that the experts are usually right.
However, Watkins
formulated this in a simpler and more direct way: “There are always patterns”,
and this accords more accurately with the experience of data miners than either
of the previous explanations. Watkins later amended this to mean that in data
mining projects about customer relationships, there are always patterns
connecting customers’ previous behaviour with their future behaviour, and that
these patterns can be used profitably (“Watkins’ CRM Law”). However, data
miners’ experience is that this is not limited to CRM problems – there are always
patterns in any data mining problem (“Watkins’ General Law”).
The explanation of
Watkins’ General Law is as follows:
· The
business objective of a data mining project defines the domain of interest, and
this is reflected in the data mining goal.
· Data
relevant to the business objective and consequent data mining goal is generated
by processes within the domain.
· These
processes are governed by rules, and the data that is generated by the
processes reflects those rules.
· In
these terms, the purpose of the data mining process is to reveal the domain
rules by combining pattern-discovery technology (data mining algorithms) with
the business knowledge required to interpret the results of the algorithms in
terms of the domain.
· Data
mining requires relevant data, that is data generated by the domain processes
in question, which inevitably holds patterns from the rules which govern these
processes.
To summarise this
argument: there are always patterns because they are an inevitable by-product
of the processes which produce the data. To find the patterns, start from the
process or what you know of it – the business knowledge.
Discovery of these
patterns also forms an iterative process with business knowledge; the patterns
contribute to business knowledge, and business knowledge is the key component
required to interpret the patterns. In this iterative process, data mining
algorithms simply link business knowledge to patterns which cannot be observed
with the naked eye.
If this explanation is
correct, then Watkins’ law is entirely general. There will always be patterns
for every data mining problem in every domain unless there is no relevant data;
this is guaranteed by the definition of relevance.
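Purely as an illustrative simulation (not from the original text), the following sketch generates data from a simple rule-governed process and shows that a model can recover the pattern which that rule leaves in the data; the field names and the rule itself are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# A hypothetical domain process governed by a rule:
# customers with high usage and a recent complaint tend to churn.
usage = rng.uniform(0, 100, size=2000)
complaint = rng.integers(0, 2, size=2000)
churn = ((usage > 60) & (complaint == 1)).astype(int)

# Imperfect recording: a small proportion of the labels is flipped.
flip = rng.random(2000) < 0.05
churn = np.where(flip, 1 - churn, churn)

X = np.column_stack([usage, complaint])

# The pattern left by the rule is present in the data and discoverable.
model = DecisionTreeClassifier(max_depth=3).fit(X, churn)
print(f"training accuracy: {model.score(X, churn):.3f}")
```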
——————————————————————————————————
6th Law of Data Mining –
“Insight Law”:
Data
mining amplifies perception in the business domain
How does data mining
produce insight? This law approaches the heart of data mining – why it must be
a business process and not a technical one. Business problems are solved by
people, not by algorithms. The data miner and the business expert “see” the
solution to a problem, that is the patterns in the domain that allow the
business objective to be achieved. Thus data mining is, or assists as part of,
a perceptual process. Data mining algorithms reveal patterns that are not
normally visible to human perception. The data mining process integrates these
algorithms with the normal human perceptual process, which is active in nature.
Within the data mining process, the human problem solver interprets the results
of data mining algorithms and integrates them into their business
understanding, and thence into a business process.
This is similar to the
concept of an “intelligence amplifier”. Early in the field of Artificial
Intelligence, it was suggested that the first practical outcomes from AI would
be not intelligent machines, but rather tools which acted as “intelligence
amplifiers”, assisting human users by boosting their mental capacities and
therefore their effective intelligence. Data mining provides a kind of
intelligence amplifier, helping business experts to solve business problems in
a way which they could not achieve unaided.
In summary: Data mining
algorithms provide a capability to detect patterns beyond normal human
capabilities. The data mining process allows data miners and business experts
to integrate this capability into their own problem solving and into business
processes.
——————————————————————————————————
7th Law of Data Mining –
“Prediction Law”:
Prediction increases information locally by generalisation
The term “prediction”
has become the accepted description of what data mining models do – we talk
about “predictive models” and “predictive analytics”. This is because some of
the most popular data mining models are often used to “predict the most likely
outcome” (as well as indicating how likely the outcome may be). This is the
typical use of classification and regression models in data mining solutions.
However, other kinds of
data mining models, such as clustering and association models, are also
characterised as “predictive”; this is a much looser sense of the term. A
clustering model might be described as “predicting” the group into which an
individual falls, and an association model might be described as “predicting”
one or more attributes on the basis of those that are known.
Similarly we might
analyse the use of the term “predict” in different domains: a classification
model might be said to predict customer behaviour – more properly we might say
that it predicts which customers should be targeted in a certain way, even
though not all the targeted individuals will behave in the “predicted” manner.
A fraud detection model might be said to predict whether individual
transactions should be treated as high-risk, even though not all those so
treated are in fact cases of fraud.
These broad uses of the
term “prediction” have led to the term “predictive analytics” as an umbrella
term for data mining and the application of its results in business solutions.
But we should remain aware that this is not the ordinary everyday meaning of
“prediction” – we cannot expect to predict the behaviour of a specific
individual, or the outcome of a specific fraud investigation.
What, then, is
“prediction” in this sense? What do classification, regression, clustering and
association algorithms and their resultant models have in common? The answer
lies in “scoring”, that is the application of a predictive model to a new
example. The model produces a prediction, or score, which is a new piece of
information about the example. The available information about the example in
question has been increased, locally, on the basis of the patterns found by the
algorithm and embodied in the model, that is on the basis of generalisation or
induction. It is important to remember that this new information is not “data”,
in the sense of a “given”; it is information only in the statistical sense.
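A minimal sketch of scoring in this sense, assuming scikit-learn and invented example data (neither is part of the original text): a model trained on past examples is applied to a new example, and the resulting score is a new piece of information about that example, obtained by generalisation.

```python
from sklearn.linear_model import LogisticRegression

# Past examples: two fields per customer plus a known outcome.
X_train = [[25, 0], [40, 1], [35, 1], [52, 0], [23, 1], [60, 0]]
y_train = [0, 1, 1, 0, 0, 1]   # e.g. responded to an offer or not

model = LogisticRegression().fit(X_train, y_train)

# A new example about which the outcome is not known.
new_example = [[45, 1]]

# Scoring: the model produces a prediction and a probability score.
# This is new information about the example, obtained by generalisation
# from the patterns in the training data -- not a "given" fact.
predicted_class = model.predict(new_example)[0]
score = model.predict_proba(new_example)[0, 1]
print(f"predicted outcome: {predicted_class}, score: {score:.2f}")
```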
——————————————————————————————————
8th Law of Data Mining –
“Value Law”:
The value of data mining
results is not determined by the accuracy or stability
of predictive models
Accuracy and stability
are useful measures of how well a predictive model makes its predictions.
Accuracy means how often the predictions are correct (where they are truly predictions)
and stability means how much (or rather how little) the predictions would
change if the data used to create the model were a different sample from the
same population. Given the central role of the concept of prediction in data
mining, the accuracy and stability of a predictive model might be expected to
determine its value, but this is not the case.
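For concreteness, the following sketch (illustrative only, assuming scikit-learn and a synthetic dataset) measures accuracy on held-out data and approximates stability by comparing the predictions of two models built on different samples of the same population:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Accuracy: how often the model's predictions are correct on unseen data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Stability (approximated): how much the predictions change when the model
# is rebuilt on a different sample from the same population.
rng = np.random.default_rng(0)
resample = rng.choice(len(X_train), size=len(X_train), replace=True)
model_b = DecisionTreeClassifier(random_state=0).fit(X_train[resample], y_train[resample])
agreement = np.mean(model.predict(X_test) == model_b.predict(X_test))

print(f"accuracy: {accuracy:.3f}, prediction agreement across samples: {agreement:.3f}")
```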
The value of a
predictive model arises in two ways:
· The model’s predictions drive improved (more effective) action, and
· The model delivers insight (new knowledge) which leads to improved strategy.
In the case of insight,
accuracy is connected only loosely to the value of any new knowledge delivered.
Some predictive capability may be necessary to convince us that the discovered
patterns are real. However, a model which is incomprehensibly complex or
totally opaque may be highly accurate in its predictions, yet deliver no useful
insight, whereas a simpler and less accurate model may be much more useful for
delivering insight.
The disconnect between
accuracy and value in the case of improved action is less obvious, but still
present, and can be highlighted by the question “Is the model predicting the
right thing, and for the right reasons?” In other words, the value of a model
derives as much from its fit to the business problem as it does from its
predictive accuracy. For example, a customer attrition model might make highly
accurate predictions, yet make its predictions too late for the business to act
on them effectively. Alternatively an accurate customer attrition model might
drive effective action to retain customers, but only for the least profitable
subset of customers. A high degree of accuracy does not enhance the value of
these models when they have a poor fit to the business problem.
The same is true of
model stability; although an interesting measure for predictive models,
stability cannot be substituted for the ability of a model to provide business
insight, or for its fit to the business problem. Neither can any other
technical measure.
In summary, the value of
a predictive model is not determined by any technical measure. Data miners
should not focus on predictive accuracy, model stability, or any other
technical metric for predictive models at the expense of business insight and
business fit.
——————————————————————————————————
9th Law of Data Mining –
“Law of Change”: All patterns are subject to change
The patterns discovered
by data mining do not last forever. This is well-known in many applications of
data mining, but the universality of this property and the reasons for it are
less widely appreciated.
In marketing and CRM
applications of data mining, it is well-understood that patterns of customer
behaviour are subject to change over time. Fashions change, markets and
competition change, and the economy changes as a whole; for all these reasons,
predictive models become out-of-date and should be refreshed regularly or when
they cease to predict accurately.
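A minimal sketch of what “refresh when they cease to predict accurately” might look like in code; the function, the threshold and the retraining step are illustrative assumptions, not from the original text:

```python
from sklearn.metrics import accuracy_score

def needs_refresh(model, recent_X, recent_y, baseline_accuracy, tolerance=0.05):
    """Flag a model for rebuilding when its accuracy on recent, labelled
    data has dropped materially below the accuracy measured at deployment."""
    current_accuracy = accuracy_score(recent_y, model.predict(recent_X))
    return current_accuracy < baseline_accuracy - tolerance

# Usage (assuming a fitted model and freshly labelled recent data):
# if needs_refresh(model, recent_X, recent_y, baseline_accuracy=0.82):
#     model = retrain_on_latest_data()   # hypothetical retraining step
```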
The same is true in risk
and fraud-related applications of data mining. Patterns of fraud change with a
changing environment and because criminals change their behaviour in order to
stay ahead of crime prevention efforts. Fraud detection applications must
therefore be designed to detect new, unknown types of fraud, just as they must deal
with old and familiar ones.
Some kinds of data
mining might be thought to find patterns which will not change over time – for
example in scientific applications of data mining, do we not discover
unchanging universal laws? Perhaps surprisingly, the answer is that even these
patterns should be expected to change.
The reason is that
patterns are not simply regularities which exist in the world and are reflected
in the data – these regularities may indeed be static in some domains. Rather,
the patterns discovered by data mining are part of a perceptual process, an
active process in which data mining mediates between the world as described by
the data and the understanding of the observer or business expert. Because our
understanding continually develops and grows, so we should expect the patterns
also to change. Tomorrow’s data may look superficially similar, but it will
have been collected by different means, for (perhaps subtly) different
purposes, and have different semantics; the analysis process, because it is
driven by business knowledge, will change as that knowledge changes. For all
these reasons, the patterns will be different.
To express this briefly,
all patterns are subject to change because they reflect not only a changing
world but also our changing understanding.
Postscript
The 9 Laws of Data
Mining are simple truths about data mining. Most of the 9 laws are already
well-known to data miners, although some are expressed in an unfamiliar way
(for example, the 5th, 6th and 7th laws). Most of the new ideas associated with
the 9 laws are in the explanations, which express an attempt to understand the
reasons behind the well-known form of the data mining process.
Why should we care why
the data mining process takes the form that it does? In addition to the simple
appeal of knowledge and understanding, there is a practical reason to pursue
these questions.
The data mining process
came into being in the form that exists today because of technological
developments – the widespread availability of machine learning algorithms, and
the development of workbenches which integrated these algorithms with other
techniques and made them accessible to users with a business-oriented outlook.
Should we expect technological change to change the data mining process? Eventually
it must, but if we understand the reasons for the form of the process, then we
can distinguish between technology which might change it and technology which
cannot.
Several technological
developments have been hailed as revolutions in predictive analytics, for
example the advent of automated data preparation and model re-building, and the
integration of business rules with predictive models in deployment frameworks.
The 9 laws of data mining suggest, and their explanations demonstrate, that
these developments will not change the nature of the process. The 9 laws, and
further development of these ideas, should be used to judge any future claims
of revolutionising the data mining process, in addition to their educational
value for data miners.
I would like to thank
Chris Thornton and David Watkins, who supplied the insights which inspired this
work, and also to thank all those who have contributed to the LinkedIn “9 Laws
of Data Mining” discussion group, which has provided invaluable food for
thought.