Introduction
In this blog I'm going to explain why data quality and data
lineage are so important for anybody looking for needles in haystacks using advanced
analytics or big data techniques. This is especially relevant if you are
trawling through large datasets looking for a rarely occurring phenomenon, and
the impact of getting it wrong matters. Ultimately it comes down to this: the
error rate in your end result needs to be smaller than the proportion of the
population that you are looking for. If it isn't, then your results will contain
more errors than valid results. Understanding Bayes' theorem is the key to understanding
why.
I discovered this phenomenon last year when I applied
Business Analytics to a particular problem that I thought would yield real
results. The idea was to try and predict a certain kind of (undesirable)
customer behaviour as early as possible, and so prevent it before it
happened. While the predictive models that came out of this exercise were good,
they were ultimately unusable in practice because the number of false positives
was unacceptably high. What surprised me most was the fact that the false
positives outnumbered the true positives. In other words, when the algorithm
predicted the undesirable outcome that we were looking for, the chances were
that it was wrong… and the impact of using it would have effectively meant
unacceptable discrimination against customers. I was surprised and disappointed
because this was a good algorithm with a good lift factor, but ultimately
unusable.
Bayes' Theorem
At the same time I was reading a book called Super Crunchers,
by Ian Ayres, which by the way I thoroughly recommend, especially if you are
looking for a readable account of what can and can’t be done through number
crunching. Towards the end of the book is an explanation of Bayes' theorem and
how to apply it when interpreting any kinds of predictive algorithms or tests.
Now, I learned Bayes' theorem when I was at school, but this chapter was a
really useful reminder. The theorem allows you to infer a hidden probability
based on known, measured probabilities. When I applied it to the problem I
described above, it made a lot of sense.
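To make that concrete, here is a minimal sketch of the calculation in Python. It assumes, as the rest of this post does, that the test can be summed up by a single accuracy figure that applies equally to both groups; the function name and parameters are just illustrative.

```python
def chance_flagged_case_is_genuine(prior, accuracy):
    """Bayes' theorem for a simple yes/no test.

    prior    -- how common the thing you're looking for is (e.g. 0.01 for 1%)
    accuracy -- probability the test gives the right answer for any individual,
                assumed here to be the same for both groups
    """
    true_positives = prior * accuracy               # genuine cases, correctly flagged
    false_positives = (1 - prior) * (1 - accuracy)  # everyone else, wrongly flagged
    return true_positives / (true_positives + false_positives)

print(chance_flagged_case_is_genuine(0.01, 0.99))  # 0.5 -- a coin toss
```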
How 99% accuracy can lead to a 50% error rate
What I learnt was this… if you are applying an algorithm to
a large population in order to find something that only affects a small
minority of that population then you need to have a good idea of how reliable
your end result is in order to work out how many of its predictions are wrong. The
end result is highly dependent on the quality of the data that you start with,
and then the quality of the analytics algorithms you use as well as any
intermediate processing that you do on it. What’s more, if the error rate is
about the same as the proportion of the population you are looking for, then around
half of the predictions will be false. So, if you are looking for something
that affects 1% of your population and you have an algorithm that is 99%
accurate, then half of its predictions will be wrong.
To demonstrate this, I’ll use a hypothetical credit
screening scenario. Imagine that a bank has a screening test for assessing
creditworthiness that is 99% accurate, and they apply it to all customers
who apply for loans. The bank knows from experience that 1% of their customers
will default. The question then is: of those customers identified as at risk of
default, how many will actually default? This is exactly the sort of problem
that Bayes' theorem answers.
Let’s see how this would work for screening 10,000 applicants.
Of the 10,000 customers, 100 will default on their loan. The
test is 99% accurate, so it will make 1 mistake in this group. One customer
will pass the screening and still default later. The other 99 will receive a
correct result and be identified as future defaulters. This would look pretty
attractive to anyone trying to reduce default rates.
In the group of 10,000 applicants, 9,900 will not default.
The test is 99% accurate, and so 99 of those customers will wrongly be
identified as credit risks. This would look unattractive to anyone who is paid
a commission on selling loans to customers.
So, we have a total of 198 customers identified as credit
risks, of which 99 will default and 99 will not. So in this case, if you are
identified as a credit risk, then you still have a 50% chance of being a good
payer… and that’s with a test that is 99% accurate. Incidentally, the chances
of a customer passing the credit check and then defaulting are now down to 1 in
10,000.
Accuracy = 99%            | Default predicted: True | Default predicted: False |  Total
--------------------------|-------------------------|--------------------------|-------
Customer defaults: True   |                      99 |                        1 |    100
Customer defaults: False  |                      99 |                    9,801 |  9,900
Total                     |                     198 |                    9,802 | 10,000
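The same numbers can be reproduced in a few lines. This is just a back-of-the-envelope sketch of the scenario above (10,000 applicants, a 1% default rate and a 99% accurate test); the variable names are mine.

```python
population = 10_000
default_rate = 0.01   # 1% of applicants will default
accuracy = 0.99       # the screening test is right 99% of the time

defaulters = population * default_rate          # 100
good_payers = population - defaulters           # 9,900

true_positives = defaulters * accuracy          # 99 defaulters correctly flagged
false_negatives = defaulters * (1 - accuracy)   # 1 defaulter slips through
false_positives = good_payers * (1 - accuracy)  # 99 good payers wrongly flagged
true_negatives = good_payers * accuracy         # 9,801 good payers correctly passed

flagged = true_positives + false_positives      # 198 customers flagged as risks
print(f"Flagged as credit risks: {flagged:.0f}")
print(f"Of those, actual defaulters: {true_positives:.0f} "
      f"({true_positives / flagged:.0%})")      # 99 (50%)
```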
It occurred to me that this logic is valid for any situation
where you are looking for needles in haystacks, and that is the kind of thing
that people are doing today under the banner of Big Data – trawling through
large volumes of data to find “hidden gems”. Other examples would include mass
screening of DNA samples to find people susceptible to cancer or heart disease
or trawling through emails looking for evidence of criminal behaviour.
Now, in the example I gave, I deliberately used clean
numbers where the size of the minority we’re looking for (1%) was equal to the
error rate. In reality they are unlikely to be equal. The graphic below shows
how the rate of false positives varies as the size of the minority and the
overall error rate change. What this graph shows very clearly is that, to be
useful, the end result of your prediction algorithm needs to generate fewer
errors in total than the size of the target population that you are trying to
find. Furthermore, if you are looking for small populations in large data sets,
then you need to know how reliable your prediction is, and that is highly
dependent on the reliability of the data you are starting with. If you are
looking for 1 in 1,000, then 99% accuracy isn't good enough, because around 90% of
your results will be wrong.
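A short sketch along the same lines shows how quickly the false positives take over as the target population shrinks. Again, this assumes a single accuracy figure applied to both groups, and the function is purely illustrative.

```python
def share_of_wrong_predictions(prevalence, accuracy):
    """Fraction of positive predictions that turn out to be false."""
    true_positives = prevalence * accuracy
    false_positives = (1 - prevalence) * (1 - accuracy)
    return false_positives / (true_positives + false_positives)

for prevalence in (0.01, 0.001):           # looking for 1 in 100, then 1 in 1,000
    for accuracy in (0.99, 0.999, 0.9999):
        wrong = share_of_wrong_predictions(prevalence, accuracy)
        print(f"prevalence {prevalence:.1%}, accuracy {accuracy:.2%}: "
              f"{wrong:.0%} of positive predictions are wrong")
```

With a 1-in-1,000 target and a 99% accurate test, roughly nine out of ten positive predictions are wrong; only once the accuracy climbs above 99.9% do the genuine cases start to outnumber the false alarms.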
What makes a good predictive model?
So let’s imagine that we want to apply Big Data techniques
to find some needles in a large haystack of data, and we are expecting these to
occur at around 0.1% of our total population. Bayes' theorem tells us that to
be useful, our predictive algorithm needs to be more than 99.9% accurate. There
are two ways of improving the accuracy, either by using the best possible
algorithm or by using the best possible data available.
There’s been a lot of work done on algorithms and nearly all
of it is available. Market-leading analytical software is affordable, so if you
really are looking for needles in haystacks and there is value in it, the tools
are available to you.
What’s less obvious is the quality of the input data. While
algorithms are reproducible, good data isn't. Each new problem that someone
tackles needs its own dataset. Each organisation has its own data for
solving its own problems, and just because one bank has good data, doesn't mean
that all banks have good data.
The chances are that if you are looking for needles in
haystacks, then what you’re finding probably aren't the needles you were
looking for at all. If you haven't assessed the reliability of your predictive model,
then you may even be wildly overconfident in the results. While you can invest
in better algorithms, if you really want better results you will probably only
get them by using better data.