Tuesday 4 March 2014

Why Bayes’ Theorem means that Data Quality really matters if you are looking for Needles in Haystacks

Introduction
In this blog I'm going to explain why data quality and data lineage are so important for anybody looking for needles in haystacks using advanced analytics or big data techniques. This is especially relevant if you are trawling through large datasets looking for a rarely occurring phenomenon, and the impact of getting it wrong matters. Ultimately it comes down to this: the error rate in your end result needs to be smaller than the proportion of the population that you are looking for. If it isn't, your results will contain more errors than valid findings. Understanding Bayes' theorem is the key to understanding why.


I discovered this phenomenon last year when I applied Business Analytics to a particular problem that I thought would yield real results. The idea was to predict a certain kind of (undesirable) customer behaviour as early as possible and so prevent it before it happened. While the predictive models that came out of this exercise were good, they were ultimately unusable in practice because the number of false positives was unacceptably high. What surprised me most was that the false positives outnumbered the true positives. In other words, when the algorithm predicted the undesirable outcome we were looking for, the chances were that it was wrong… and acting on it would have effectively meant unacceptable discrimination against customers. I was surprised and disappointed, because this was a good algorithm with a good lift factor, but it was ultimately unusable.

Bayes' Theorem
At the same time I was reading a book called Super Crunchers, by Ian Ayres, which by the way I thoroughly recommend, especially if you are looking for a readable account of what can and can’t be done through number crunching. Towards the end of the book is an explanation of Bayes' theorem and how to apply it when interpreting any kind of predictive algorithm or test. Now, I learned Bayes' theorem when I was at school, but this chapter was a really useful reminder. The theorem allows you to infer a hidden probability from known, measured probabilities. When I applied it to the problem I described above, it made a lot of sense.
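For reference, the theorem in its standard form says that the probability of A given B can be worked out from the probability of B given A:

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}

In the credit example below, A would be "the customer defaults" and B would be "the screening flags the customer as a risk".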


How 99% accuracy can lead to a 50% error rate
What I learnt was this… if you are applying an algorithm to a large population in order to find something that only affects a small minority of that population, then you need to have a good idea of how reliable your end result is in order to work out how many of its predictions are wrong. That reliability depends heavily on the quality of the data you start with, the quality of the analytics algorithms you use, and any intermediate processing that you do along the way. What’s more, if the error rate is about the same as the size of the minority you are looking for, then around half of the positive predictions will be false. So, if you are looking for something that affects 1% of your population and you have an algorithm that is 99% accurate, then half of its positive predictions will be wrong.
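To see why, write p for the proportion of the population you are looking for and e for the error rate, and assume for simplicity (my assumption, to keep the algebra short) that the test is wrong at the same rate e for both groups. Bayes' theorem then gives the chance that a flagged case is genuine as:

P(\text{genuine} \mid \text{flagged}) = \frac{p\,(1 - e)}{p\,(1 - e) + (1 - p)\,e}

When e is equal to p, the two terms in the denominator are equal, so the probability is exactly a half: every second flagged case is a false alarm.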
To demonstrate this, I’ll use a hypothetical credit screening scenario. Imagine that a bank has a screening test for assessing creditworthiness that is 99% accurate, and it applies the test to all customers who apply for loans. The bank knows from experience that 1% of its customers will default. The question then is: of those customers identified as at risk of default, how many will actually default? This is exactly the sort of problem that Bayes' theorem answers.
Let’s see how this would work for screening 10,000 applicants.
Of the 10,000 customers, 100 will default on their loan. The test is 99% accurate, so it will make 1 mistake in this group. One customer will pass the screening and still default later. The other 99 will receive a correct result and be identified as future defaulters. This would look pretty attractive to anyone trying to reduce default rates.
In the group of 10,000 applicants, 9,900 will not default. The test is 99% accurate, and so 99 of those customers will wrongly be identified as credit risks. This would look unattractive to anyone who is paid a commission on selling loans to customers.
So, we have a total of 198 customers identified as credit risks, of which 99 will default and 99 will not. In other words, if you are identified as a credit risk, you still have a 50% chance of being a good payer… and that’s with a test that is 99% accurate. Incidentally, the chances of a customer passing the credit check and then defaulting are now down to about 1 in 10,000 (1 of the 9,802 who pass).

Accuracy = 99%

                              Default predicted
Customer Defaults           True      False      Total
True                          99          1        100
False                         99      9,801      9,900
Total                        198      9,802     10,000
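
The same arithmetic is easy to reproduce in a few lines of code. Here is a minimal sketch in Python; the function name is mine, and it assumes a single accuracy figure that applies equally to defaulters and non-defaulters:

def screening_outcomes(population, prevalence, accuracy):
    """Split a screened population into the four outcome groups, assuming
    the same accuracy applies to defaulters and non-defaulters alike."""
    will_default = population * prevalence            # customers who will default
    will_pay = population - will_default              # customers who will not
    true_positives = will_default * accuracy          # defaulters correctly flagged
    false_negatives = will_default - true_positives   # defaulters who slip through
    false_positives = will_pay * (1 - accuracy)       # good payers wrongly flagged
    true_negatives = will_pay - false_positives       # good payers correctly passed
    flagged = true_positives + false_positives
    return true_positives, false_positives, false_negatives, true_negatives, true_positives / flagged

tp, fp, fn, tn, p_default_given_flagged = screening_outcomes(10_000, 0.01, 0.99)
print(tp, fp, fn, tn)             # approximately 99, 99, 1 and 9,801 - the table above
print(p_default_given_flagged)    # approximately 0.5 - a flagged customer is as likely to pay as to default

Swap in your own population, prevalence and accuracy figures to see how quickly the picture changes.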

It occurred to me that this logic is valid for any situation where you are looking for needles in haystacks, and that is exactly the kind of thing that people are doing today under the banner of Big Data – trawling through large volumes of data to find “hidden gems”. Other examples include mass screening of DNA samples to find people susceptible to cancer or heart disease, or trawling through emails looking for evidence of criminal behaviour.
Now, in the example I gave, I deliberately used clean numbers where the size of the minority we’re looking for (1%) was equal to the error rate. In reality they are unlikely to be equal. The graphic below shows how the rate of false positives varies as the size of the minority and the overall error rate change. What this graph shows very clearly is that, to be useful, your prediction algorithm needs to generate fewer errors in total than the size of the target population you are trying to find. Furthermore, if you are looking for small populations in large datasets, then you need to know how reliable your prediction is, and that depends heavily on the reliability of the data you are starting with. If you are looking for 1 in 1,000, then 99% accuracy isn't good enough, because around 90% of your results will be wrong.
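If you want to explore that trade-off yourself rather than read it off a chart, the sketch below (again mine, with the same single-error-rate simplification) prints the proportion of false alarms for a range of prevalences and accuracies:

def false_alarm_share(prevalence, accuracy):
    """Proportion of flagged cases that are false alarms, assuming the test
    is wrong at the same rate for genuine cases and for everything else."""
    true_positives = prevalence * accuracy
    false_positives = (1 - prevalence) * (1 - accuracy)
    return false_positives / (true_positives + false_positives)

for prevalence in (0.01, 0.001, 0.0001):
    for accuracy in (0.99, 0.999, 0.9999):
        share = false_alarm_share(prevalence, accuracy)
        print(f"looking for {prevalence:.2%} with {accuracy:.2%} accuracy: "
              f"{share:.0%} of hits are false alarms")

At a prevalence of 0.1% and 99% accuracy this prints a false alarm rate of roughly 91%, which is where the "around 90% wrong" figure above comes from.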

What makes a good predictive model?
So let’s imagine that we want to apply Big Data techniques to find some needles in a large haystack of data, and we expect these to occur in around 0.1% of our total population. Bayes' theorem tells us that, to be useful, our predictive algorithm needs to be more than 99.9% accurate. There are two ways of improving the accuracy: use the best possible algorithm, or use the best possible data available.
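Plugging those numbers into the false_alarm_share sketch above makes the point concrete:

print(false_alarm_share(0.001, 0.999))    # approximately 0.5 - even at 99.9% accuracy, half the hits are false alarms
print(false_alarm_share(0.001, 0.9999))   # approximately 0.09 - at 99.99%, most hits are genuine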
There’s been a lot of work done on algorithms, and nearly all of it is available. Market-leading analytical software is affordable, so if you really are looking for needles in haystacks and there is value in it, the tools are available to you.
What’s less obvious is the quality of the input data. While algorithms are reproducible, good data isn't. Each new problem someone needs to tackle requires its own dataset. Each organisation has its own data for solving its own problems, and just because one bank has good data doesn't mean that all banks have good data.

The chances are that if you are looking for needles in haystacks, then much of what you’re finding isn't the needles you were looking for at all. If you haven't assessed the reliability of your predictive model, you may even be wildly overconfident in the results. While you can invest in better algorithms, if you really want better results you will probably only get them by using better data.