Tuesday, 1 April 2014

The Human Side of Data Quality

In this blog I will argue that the human side of data quality is not just necessary, but that the work people perform is actually far more valuable than the work done by systems and tools.


Starting with a tools-based approach
First, though, I should admit that I didn't always hold this view. When I first started working seriously with Data Quality, I had my feet firmly in the IT camp: I believed that applying the right tools was the key to fixing Data Quality. I had years of experience behind me of moving data around between systems and struggling to get it to fit. I thought I knew what the solution should look like. As a self-confessed data geek, I was open to the claims of tool vendors that their technology, based on smart matching algorithms and expert rules, could fix just about any Data Quality problem. After all, they had demonstrations to show how good they were, and references to back it up.


So, I took my first steps into applying technology to solve Data Quality problems. I started with addresses and de-duplication of customer records. As I was working in the household energy supply business at the time, the business case was based on reducing the total number of bills sent out and maximising the number that ended up at the right address. Initial results were impressive, and the scale of the work involved meant that it could only be achieved by automation. Making tens or hundreds of thousands of corrections over a weekend was not uncommon.

The grey zone
As we went further, though, an uncomfortable truth started to emerge. There was always a grey zone where you couldn't really be sure if the tools were right or wrong. At the heart of any algorithmic or expert-based approach is the technique of scoring possible outcomes. Based on the input data, the tool will suggest a result along with a score. If the score is high, then you can be pretty confident that the suggestion is correct; if it is too low, then it should be discarded, because it's probably wrong. The problem is where to define the cut-off point. How high does the score need to be for you to trust the result? Set the threshold too low and you will end up introducing errors, which is ironic when you are trying to eliminate them. Set it too high and you will discard perfectly good corrections.
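To make the mechanics concrete, here is a minimal sketch of the thresholding idea in Python. The scores, thresholds and suggestions are invented for illustration; real tools use far richer scoring models, but the three-way split is the same:

```python
# A minimal sketch of score-based triage with a grey zone.
# Thresholds and example scores are hypothetical.

ACCEPT_THRESHOLD = 90   # at or above: apply the correction automatically
REJECT_THRESHOLD = 60   # below: discard, the suggestion is probably wrong

def triage(score):
    """Classify a tool's suggested correction by its confidence score."""
    if score >= ACCEPT_THRESHOLD:
        return "auto-accept"
    if score < REJECT_THRESHOLD:
        return "reject"
    return "grey zone"   # neither confident nor dismissible

suggestions = [
    ("correct 'Groove Street' to 'Grove Street'", 95),
    ("merge customer records 1041 and 2087", 72),
    ("correct postcode BA1 2HX to BA1 1HE", 41),
]

for description, score in suggestions:
    print(f"{score:3d}  {triage(score):11s}  {description}")
```

Note that moving either threshold only relabels the grey zone; it never removes it.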


At first this wasn't too much of a problem. When you are making tens or hundreds of thousands of corrections, you don't need to worry about three or four hundred dubious results. The temptation is to err on the side of caution, set the threshold comfortably high and avoid the risk of making anything worse than it is. The grey zone can wait, but only for so long. Once you have cleaned the bulk of your data and are performing regular cleaning runs, the number of automated corrections you can make falls, while the number of near misses stays constant or slowly increases. Eventually you have to look into this in detail.
My initial observation was that the tools were making a lot of good suggestions in this grey zone, but simply lowering the threshold to accept lower scores would introduce errors; not many, but enough. I spent a lot of time trawling through examples looking for patterns in this space. Two customers with the same family name in the same street, one called Tim living at number 25, the other called Tom living at number 52… is it the same person or two different people? Are both addresses correct, or has one been accidentally transposed? It's a frustrating problem, but I found that most of the time I could find the right answer if I looked beyond the dataset being fed into the tools. Reluctantly, I admitted defeat and started lobbying for a small team of customer service agents to go through these lists and clean them up by hand, and that worked well.
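To see why the scores alone cannot decide these cases, consider how a matching algorithm sees the Tim/Tom pair. A rough sketch using Python's standard difflib (the records are invented):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

record_a = "Tim Jones, 25 Grove Street, Bath"
record_b = "Tom Jones, 52 Grove Street, Bath"

# One changed letter and two transposed digits leave the records
# near-identical to the algorithm; whether they are one person or
# two can only be settled with context from outside the dataset.
print(f"similarity = {similarity(record_a, record_b):.2f}")  # about 0.9
```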


Adding people to the mix
There was an initial feeling of disappointment, though, and a sense that the technology had failed. To support the new team, we built a layer on top of the tools that presented the agents with the source data as it was, the suggestions of the tools, details of the scoring rules and colour coding to indicate degrees of uncertainty. Next to that we added links to popular external sources like the CRM system and Google Maps.
As this work progressed, the approach was extended to cover different problems. Slowly it dawned on me that this was a structural issue, and that the human element was necessary. People are able to add context and to look at other available sources, including communication histories with customers, which enables them to come to the right answer on a case-by-case basis. Furthermore, it eventually became apparent that the corrections they were making had a much higher value, case for case, than the automated changes. Small errors are easy to correct, but also less likely to cause real confusion. If an address is incorrectly recorded as Groove Street, Bath, BA1 5LR, the postman will know that it is destined for Grove Street, because it's a one-letter difference that matches the postcode and there are no other alternatives. Post in this case would still be delivered to the right address. An address recorded as Queen Street, Bath, BA1 2HX is more problematic, though, because that postcode refers to Queen Square, while the Queen Street in Bath has the postcode BA1 1HE. Knowing whether the street name or the postcode is wrong would need more information. In this case post may very well be going to the wrong address, which makes correcting this error far more valuable than correcting the first one.
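One way to separate the harmless slips from the harmful conflicts is to cross-check each address against reference data. The sketch below uses a tiny hypothetical lookup table as a stand-in for a real address file (such as a licensed postcode database):

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for a real postcode reference file.
POSTCODE_LOOKUP = {
    "BA1 5LR": "Grove Street",
    "BA1 2HX": "Queen Square",
    "BA1 1HE": "Queen Street",
}

def classify(street, postcode):
    expected = POSTCODE_LOOKUP.get(postcode)
    if expected is None:
        return "unknown postcode: route to a person"
    if street == expected:
        return "consistent"
    if SequenceMatcher(None, street, expected).ratio() > 0.85:
        # A near miss like 'Groove'/'Grove' is safe to fix automatically.
        return f"auto-correct to '{expected}'"
    # Street and postcode point at different places; either could be
    # wrong, so this is the valuable case that needs a human.
    return f"conflict with '{expected}': route to a person"

print(classify("Groove Street", "BA1 5LR"))  # auto-correct
print(classify("Queen Street", "BA1 2HX"))   # conflict
```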
This is a pattern that occurs again and again in Data Quality. Corrections that are easy to make with technology and tools are usually high in frequency but low in impact; many of these errors make little real difference. Corrections that are hard to make with technology usually do matter, and it's worth taking the time to examine those cases in detail, because doing so will make a difference.

Technology and people working together
Ultimately it's worth combining people with technology, because they complement each other. What the technology does for you is clear up the high volume of easy-to-fix errors; dedicating people to this would be expensive, slow and of little effect. The fact that the technology can do this and then identify the grey zone is hugely valuable. The volumes there are much lower, but on a case-by-case basis they really do matter. Routing these cases to a team of people who can investigate them and make the necessary corrections is very worthwhile and has a high return on the investment. Good Data Quality tools today come with workflow extensions precisely for this, but unless you understand why, it is easy to overlook their importance.


If you have problems with Data Quality and you are evaluating tools to help you, then I advise you to be sceptical, but in a positive way. Don't expect any tool to solve all your problems, and beware of any vendor claiming that theirs can. Dig a bit deeper and ask what happens when the tools aren't sure. How easily can these cases be passed back to a team of Data Stewards? Does the tool come with a workflow engine that can route these cases to the right people? More importantly, think about preparing your organisation for this. The tools will only get you so far; to get the most value from them, your own people will make the biggest difference.

Tuesday, 4 March 2014

Why Bayes’ Theorem means that Data Quality really matters if you are looking for Needles in Haystacks

Introduction
In this blog I'm going to explain why data quality and data lineage are so important for anybody looking for needles in haystacks using advanced analytics or big data techniques. This is especially relevant if you are trawling through large datasets looking for a rarely occurring phenomenon, and the impact of getting it wrong matters. Ultimately it comes down to this: the error rate in your end result needs to be smaller than the prevalence of whatever you are looking for. If it isn't, your results will contain more errors than valid findings. Understanding Bayes' theorem is the key to understanding why.


I discovered this phenomenon last year when I applied Business Analytics to a problem that I thought would yield real results. The idea was to predict a certain kind of (undesirable) customer behaviour as early as possible, and so prevent it before it happened. While the predictive models that came out of this exercise were good, they were ultimately unusable in practice because the number of false positives was unacceptably high. What surprised me most was that the false positives outnumbered the true positives. In other words, when the algorithm predicted the undesirable outcome we were looking for, the chances were that it was wrong… and acting on it would have effectively meant unacceptable discrimination against customers. I was surprised and disappointed, because this was a good algorithm with a good lift factor, but ultimately unusable.

Bayes' Theorem
At the same time I was reading a book called Super Crunchers, by Ian Ayres, which by the way I thoroughly recommend, especially if you are looking for a readable account of what can and can't be done through number crunching. Towards the end of the book is an explanation of Bayes' theorem and how to apply it when interpreting any kind of predictive algorithm or test. Now, I learned Bayes' theorem when I was at school, but this chapter was a really useful reminder. The theorem allows you to infer a hidden probability from known, measured probabilities. When I applied it to the problem I described above, it made a lot of sense.
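For reference, the theorem itself is compact. Writing D for the event we want to detect (say, a customer defaulting) and T for the event that the test flags it:

```latex
P(D \mid T) = \frac{P(T \mid D)\,P(D)}{P(T)}
            = \frac{P(T \mid D)\,P(D)}{P(T \mid D)\,P(D) + P(T \mid \neg D)\,P(\neg D)}
```

The hidden probability on the left, the chance that a flagged case is genuine, follows entirely from the measured quantities on the right.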


How 99% accuracy can lead to a 50% error rate
What I learnt was this: if you are applying an algorithm to a large population in order to find something that affects only a small minority of that population, then you need a good idea of how reliable your end result is in order to work out how many of its predictions are wrong. That reliability depends heavily on the quality of the data you start with, the quality of the analytics algorithms you use, and any intermediate processing you do. What's more, if the error rate is about the same as the size of the population you are looking for, then around half of the predictions will be false. So, if you are looking for something that affects 1% of your population and you have an algorithm that is 99% accurate, then half of its predictions will be wrong.
To demonstrate this, I'll use a hypothetical credit screening scenario. Imagine that a bank has a screening test for assessing creditworthiness that is 99% accurate, and it applies the test to all customers who apply for loans. The bank knows from experience that 1% of its customers will default. The question then is: of the customers identified as at risk of default, how many will actually default? This is exactly the sort of problem that Bayes' theorem answers.
Let's see how this would work for screening 10,000 applicants.
Of the 10,000 customers, 100 will default on their loans. The test is 99% accurate, so it will make one mistake in this group: one customer will pass the screening and still default later. The other 99 will receive a correct result and be identified as future defaulters. This would look pretty attractive to anyone trying to reduce default rates.
In the group of 10,000 applicants, 9,900 will not default. The test is 99% accurate, so 99 of those customers will be wrongly identified as credit risks. This would look unattractive to anyone who is paid commission on selling loans to customers.
So, we have a total of 198 customers identified as credit risks, of which 99 will default and 99 will not. In this case, if you are identified as a credit risk, you still have a 50% chance of being a good payer… and that's with a test that is 99% accurate. Incidentally, the chances of a customer passing the credit check and then defaulting are now down to about 1 in 10,000.

Accuracy = 99%
                          Default predicted
Customer defaults       True       False       Total
True                      99           1         100
False                     99       9,801       9,900
Total                    198       9,802      10,000
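The numbers in the table take only a few lines of arithmetic to reproduce, applying the same 99% accuracy to both groups as in the example above:

```python
population = 10_000
prevalence = 0.01   # 1% of customers default
accuracy   = 0.99   # the screening test is right 99% of the time

defaulters     = population * prevalence            # 100
non_defaulters = population - defaulters            # 9,900

true_positives  = defaulters * accuracy             # 99 caught
false_negatives = defaulters * (1 - accuracy)       # 1 missed
false_positives = non_defaulters * (1 - accuracy)   # 99 wrongly flagged

flagged = true_positives + false_positives          # 198
print(f"P(default | flagged) = {true_positives / flagged:.0%}")  # 50%
```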

It occurred to me that this logic applies to any situation where you are looking for needles in haystacks, and that is exactly what people are doing today under the banner of Big Data: trawling through large volumes of data to find "hidden gems". Other examples include mass screening of DNA samples to find people susceptible to cancer or heart disease, or trawling through emails looking for evidence of criminal behaviour.
Now, in the example I gave, I deliberately used clean numbers where the size of the minority we're looking for (1%) was equal to the error rate. In reality they are unlikely to be equal. The graphic below shows how the rate of false positives varies as the size of the minority and the overall error rate change. What this graph shows very clearly is that, to be useful, your prediction algorithm needs to generate fewer errors in total than the size of the target population you are trying to find. Furthermore, if you are looking for small populations in large data sets, then you need to know how reliable your prediction is, and that depends heavily on the reliability of the data you start with. If you are looking for 1 in 1,000, then 99% accuracy isn't good enough, because around 90% of your results will be wrong.
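The shape of that relationship is easy to tabulate. This short sketch computes the share of positive predictions that turn out to be false for a few combinations of prevalence and accuracy:

```python
def false_share(prevalence, accuracy):
    """Fraction of positive predictions that are wrong (1 - precision)."""
    true_pos  = prevalence * accuracy
    false_pos = (1 - prevalence) * (1 - accuracy)
    return false_pos / (true_pos + false_pos)

for prevalence in (0.01, 0.001):
    for accuracy in (0.99, 0.999, 0.9999):
        print(f"prevalence {prevalence:.1%}, accuracy {accuracy:.2%}: "
              f"{false_share(prevalence, accuracy):.0%} of hits are false")
```

At a prevalence of 0.1% and 99% accuracy, roughly 91% of the hits are false, which is the "90% wrong" case above.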

What makes a good predictive model?
So let's imagine that we want to apply Big Data techniques to find some needles in a large haystack of data, and we expect them to occur in around 0.1% of our total population. Bayes' theorem tells us that to be useful, our predictive algorithm needs to be more than 99.9% accurate. There are two ways of improving accuracy: using the best possible algorithm, and using the best possible data.
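In Bayes' terms, with prevalence p and accuracy a (applied to both groups, as in the worked example), the share of flagged cases that are genuine is:

```latex
P(D \mid T) = \frac{p\,a}{p\,a + (1 - p)(1 - a)}
```

The two terms in the denominator are equal exactly when the error rate 1 - a equals the prevalence p, which is why an algorithm whose error rate matches the rarity of its target is wrong half the time.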
There's been a lot of work done on algorithms, and nearly all of it is available. Market-leading analytical software is affordable, so if you really are looking for needles in haystacks and there is value in it, the tools are available to you.
What's less obvious is the quality of the input data. While algorithms are reproducible, good data isn't. Each new problem needs its own dataset. Each organisation has its own data for solving its own problems, and just because one bank has good data doesn't mean that all banks do.

The chances are that if you are looking for needles in haystacks, then what you're finding probably aren't the needles you were looking for at all. If you haven't assessed the reliability of your predictive model, then you may even be wildly overconfident in the results. While you can invest in better algorithms, if you really want better results you will probably only get them by using better data.

Friday, 31 January 2014

The Benefits of a Single Customer View

In previous roles, I have delivered Single Customer View projects twice. In each case I prepared a very positive business case before getting approval for the project, and in each case the benefits were evaluated after implementation. Unlike other projects that I have delivered, the business case at the end actually turned out to be more positive than originally forecast.
In this blog I would like to share with you how a well implemented Single Customer View can deliver benefits far beyond the associated costs.


What is a Single Customer View, and why would you want it?

Put simply, it's about having a single "golden" record for each of your customers, so that whenever you refer to a customer, you're using the most up-to-date details, and everything you know about that customer is correctly linked to them. It means that when a customer calls your call centre, you can find their details quickly and see their history. It means that you know how many customers you actually have, and what they are worth to you. It means that customers are central to your business model, rather than accounts, subscriptions or orders.
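As a rough illustration of what a "golden" record means in practice, here is a minimal merge sketch in Python. The survivorship rule, keeping the newest non-empty value per field, is just one common choice, and the records are invented; real Single Customer View implementations also weigh the reliability of each source system:

```python
from datetime import date

# Hypothetical duplicates for one customer, held in different systems.
records = [
    {"source": "billing", "updated": date(2013, 3, 1),
     "name": "J. Smith",   "phone": "01225 000000", "email": None},
    {"source": "crm",     "updated": date(2013, 11, 15),
     "name": "Jane Smith", "phone": None, "email": "jane@example.com"},
]

def golden_record(duplicates):
    """Merge duplicates field by field: newest non-empty value wins."""
    merged = {}
    for rec in sorted(duplicates, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if field not in ("source", "updated") and value is not None:
                merged[field] = value   # later records overwrite earlier ones
    return merged

print(golden_record(records))
# {'name': 'Jane Smith', 'phone': '01225 000000', 'email': 'jane@example.com'}
```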

Single Customer View as a requirement for compliance

If a government or regulator mandates that you strive to achieve a Single View of your customers, then you don't need a business case. Either you do it, or you lose your licence to operate. An example is the Know Your Customer rules that apply to banks, which need to identify their customers successfully in order to avoid being used for money laundering activities.
However, I would argue that the other benefits that come from a Single View are so important that you should consider them too. If it is mandatory, there is a temptation to stop there and do it because you have to. I would encourage you to keep reading. If you have to implement a Single Customer View, then you may as well get as much value out of it as possible.

Cost reduction

If you have a large number of customers and communication with them is a major cost for your business, then you can make surprisingly big savings by implementing a Single Customer View. This would typically be an important part of the business case for energy suppliers, telecoms, insurers, banks or local government, where regular postal communication with all of your customers is a major cost. In some cases it may be worth going one step further and thinking in terms of a Single Household View, thereby avoiding the needless cost of sending duplicate communications to everyone in a household.

Risk Management

One of the reasons why governments have mandated Know Your Customer practices for banks is to facilitate better risk management. Each customer has their own risk profile, and if you are managing that risk profile, then it will be a lot more accurate if you make the effort to link everything that you know about a customer via a single customer record. The risk profile for a customer will be much more complete if you can link their borrowing to their savings for example. If you can identify a returning customer who has been with your competition for a while, then you can assess their risk profile much more accurately than if you treated them like a new and unknown customer.
The more accurately you can assess your total risk, the less provision you need to make to cover unknown risks. This in turn releases capital for other opportunities or further investments.

Customer Loyalty


It often comes as a surprise, but most customers actually like it when the companies they deal with have a complete picture of them. It boosts confidence that they are treated as an individual and not just as an account. Admittedly it’s not universal, and some customers value their anonymity, but in my experience the overwhelming majority of customers expect you to know them. When customers hold multiple accounts with you, they expect you to join up the dots, and treat them as individuals. Personally, I like it when Amazon suggests books to me that I might like but I get annoyed when my bank tries to sell me a credit card that I already have. One is showing me that they know me, and the other is showing me that they don’t care about me.

Fraud

If combating fraud is important to your business, then a Single Customer View can be an invaluable weapon in your armoury. This is particularly true if you are operating on tight margins. Losses due to fraud typically scale with turnover, but they go straight to the bottom line. So, if you are losing 2% of your turnover to fraud, this could easily be 20% of your net profits.
Fraudsters benefit from anonymity and multiple accounts, but many of them don’t apply particularly sophisticated techniques. Significant fraud can be avoided simply by matching small changes in names or dates of birth. Even more can be avoided if you can identify members of the same household who take it in turns to run up debts they have no intention of repaying.
I have seen business cases where fraud reduction was the biggest single benefit for the Single Customer View, and yet it is often overlooked because fraud prevention is not considered as part of the core business.

Business Analytics

The benefit in terms of Business Analytics is one of the least obvious benefits, and also the most difficult to quantify in advance. Nevertheless, I have seen the difference that it makes, and it can be substantial if your business analytics are aimed at understanding customer behaviour in order to be able to anticipate it. Examples are churn prediction or credit scoring.
In these scenarios the benefit comes from improving the quality of the data going into your analytics. Practitioners of business analytics generally agree that improvements in the quality of input data have a far greater effect than using the latest algorithms. It's not accounts that decide to leave your company for the competition, it's customers. If a customer is having problems on one of his four accounts, he won't just take that one account to the competition; he will take all four. But if you don't know the accounts are linked, you won't see it coming. Similarly, by linking all of a customer's accounts you will be able to assess their credit rating accurately when they place a large order that you don't want to lose but can't afford to give away.



Conclusion

I have highlighted six areas that I think you should consider when evaluating the benefits side of the business case for the Single Customer View. If you are considering implementing a Single Customer View, the chances are that you are championing one of these benefits as the case for action. I would strongly recommend that you consider the others. For most companies, I would expect the benefits to be at least twice the costs. If you time it right, it can even be possible to achieve payback in the same financial year in which you launch the project.

Monday, 6 January 2014

How to get to 99% and beyond... it's all about managing exceptions

In my last blog I made the case that getting your business to 99% and beyond can be a source of sustainable competitive advantage and so it should be a priority for business leaders. In this article, I will explain how to get to 99% and beyond by managing exceptions.

That's not what we expected!

Most businesses are structured around processes. Although there are exceptions, such as companies that only work on unique projects, most companies do a lot of repetitive work, and their activities are defined by processes that are triggered by events and lead to predictable results. Whether you are buying a burger on the high street or an exotic holiday online, the decision to purchase triggers processes that deliver what you ordered, ensure that you pay for it, and line everything up for the next customer. Businesses aim for consistent processes to ensure that the end result is the same regardless of who takes the order or what day of the week it is.

However, real life has a way of throwing up surprises that these processes do not foresee. Even if you get your customer service agents to follow a consistent script, you can't always expect your customers to follow the same script. Processes fail because things happen in the world outside your business that weren't foreseen when the processes were designed.

When I lived in Germany, for example, I opened a joint bank account with my wife. At that point we generated an exception because my wife chose to keep her maiden name when we married. In those days German banks assumed that married couples had one surname; separate surnames on a joint account were not foreseen. In order to open the account my wife was known as Mrs Humphries. While this annoyed her, she could live with it, until the day when she triggered another process that required authentication of ID. Of course, she couldn't produce ID to prove that she was Mrs Humphries, because she wasn't. Luckily, the man who opened the bank account for us was available, and he was able to confirm her identity.

This is an example of an exception, and it’s something that happens daily in every business. In this case an employee took the initiative and saved the situation. His intervention was dependent on his personal knowledge, though. Had he not been there, the bank would have had an angry customer on their hands. Just as important is this: had there been another bank that allowed husbands and wives to have different family names then we would have changed banks.

If my example doesn't convince you, think about the frustrations that you have had as a customer because you are asking for something that falls outside of the mainstream. The simple act of moving house will give you an insight into the number of companies that assume people only have one address and a credit history linked to that address. After all it’s an assumption that works most of the time.

Understanding the nature of these exceptions and dealing with them will take you from getting it right 95% of the time to getting it right 99% of the time and beyond. For the record, 100% is not achievable; how many 9s you can achieve is what matters. The thinner your margins, the more 9s you need.

Exceptions arise because process designers cannot foresee everything that is going to happen. That's understandable and normal; reality is stranger than most of us realise until we are presented with the evidence. Process designers typically start with a number of use cases and work them through. A use case is an example of a scenario that triggers a process, and designers use them to explore the possibilities that seem reasonable in the light of their experience. What they don't do at this point is think about all the possible exceptions that might occur. If they did, they would probably go mad, or embark on a task with no end. In fact it's a good thing that they don't attempt to foresee every possible exception, because they would waste a lot of time preparing for exceptions that never happen. One aspect of exceptions that has surprised me over the years is how many potential exceptions never actually occur. I have seen process analysts tying themselves in knots trying to foresee every single possible problem, only to end up with unwieldy processes that are still caught out because something else happened that they didn't foresee.

The good news is that managing exceptions is actually not that complicated. But first you have to acknowledge that exceptions are an unavoidable fact of life, and that they are important enough to deal with; it is not possible to engineer processes so that exceptions don't occur. If you can't accept this, then you should stop reading now, because the rest of this blog assumes these points to be true.
Are we dealing with apples or with fruit?

I have used three techniques which are relatively simple, but together they are extremely effective.

The first technique is to build exception handling into your processes and your organisation. You don't have to know exactly what is going to happen, but you do need to know what conditions should be met in order to proceed to the next step. When those conditions are not met, you foresee a step to handle the exception, you foresee people whose job it is to handle it, and you foresee a consistent means of delivering those exceptions to them. What you don't do is try to describe up front exactly what the nature of the exceptions might be and what those people should do in response. Instead, you choose people who are good problem solvers and give them the authority to use their initiative. It's also worth considering giving them the option to overrule a control. A simple example that most of us have experienced is when an item in a supermarket is wrongly priced. The cashier cannot overrule the till and change the price, but he or she can call the supervisor, and if the supervisor is happy, the price can be changed: the supervisor has the authority to overrule the process. So the processes foresee exceptions without prescribing a solution, and provide a means of passing those exceptions on to people who can deal with them. The organisation foresees people who don't work in the mainstream processes but parallel to them, dealing with all the exceptions that the process can't handle and putting things back on track.
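Here is a minimal sketch of what building exception handling in can look like, with an invented process step loosely based on the joint account story. The essential point is that the step knows its preconditions and has somewhere consistent to send the cases that fail them:

```python
# Hedged sketch: a process step with an explicit exception route.
# The step, condition and queue are invented for illustration.
exception_queue = []   # in practice: a workflow inbox for problem solvers

def open_joint_account(application):
    # The precondition the designers assumed; reality will violate it.
    if application["surname_1"] != application["surname_2"]:
        exception_queue.append(("surnames differ", application))
        return "routed to exception handler"
    return "account opened"

print(open_joint_account({"surname_1": "Humphries", "surname_2": "Humphries"}))
print(open_joint_account({"surname_1": "Humphries", "surname_2": "Schmidt"}))
print(f"{len(exception_queue)} case(s) waiting for a human decision")
```

Note that the code does not try to decide what the right answer is; it only recognises that its assumption failed and hands the case to someone with the authority to resolve it.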

When these exceptions happen, it's important that the process captures this information. One reason for doing so is to ensure that there is a trace of who is overruling your validations and how often; while it's probably the right thing to do, it could also be fraud. More importantly, logging each of these exceptions enables the second technique, which is simply tracking the exceptions and performing root cause analysis to understand why they are happening in the first place. This will enable you to improve your processes and eliminate the exceptions with appropriate solutions. The technique is powerful because it is fact-driven: it does not depend on the imagination and insight of process analysts, you focus your efforts on problems that really do happen rather than problems that might, and you solve the problems that matter most because they happen most often.

Possibly the richest technique I have used, though, is to delve into data quality problems and feed them back into process improvement initiatives. This is less obvious than the first two suggestions, but often even more effective. Data and process are interdependent: processes consume, modify and generate data, which in turn steers the processes. Process designers make assumptions about the data that drives their processes. If you can define these assumptions as rules, and then search for data that doesn't respect them, you win twice. Firstly, you can identify data that will cause process exceptions because it doesn't respect the business rules, and if you can identify the problematic data, you can fix the problem before it happens. Even more valuable, though, is that data which doesn't fit the rules gives you an invaluable insight into how reality really is, rather than how you thought it was. By finding and analysing data that doesn't fit your rules, you get to understand the environment in which your business really operates, and how it differs from the way you assumed it works. Feed this back into your processes, again focusing on the problems that really occur rather than those that might, and you have a really powerful technique.
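This third technique is straightforward to prototype: write the designers' assumptions down as explicit rules, then scan the data for records that break them. The rules and records below are invented examples of the idea:

```python
# Hedged sketch: business rules expressed as predicates over records.
rules = {
    "joint accounts share a surname":
        lambda r: not r["joint"] or r["surname_1"] == r["surname_2"],
    "every customer has a postcode":
        lambda r: bool(r.get("postcode")),
}

customers = [
    {"joint": True,  "surname_1": "Humphries", "surname_2": "Schmidt",
     "postcode": "BA1 5LR"},
    {"joint": False, "surname_1": "Jones", "surname_2": None,
     "postcode": ""},
]

# Every violation is both a process exception waiting to happen and
# a data point about how reality differs from the design assumptions.
for rule_name, holds in rules.items():
    violations = [c for c in customers if not holds(c)]
    print(f"{rule_name}: {len(violations)} violation(s)")
```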

All animals on a sheep farm are sheep. All sheep are white.

So that's it. The difference between getting it right 95% of the time and getting it right 99% or more of the time comes down to managing exceptions. If you foresee these exceptions rather than fight them, and accept them for what they are, you can reduce your waste, increase your customer satisfaction and increase your competitiveness. Furthermore, it's a virtuous circle: once you allow for exceptions and let them drive your process improvement initiatives, you can eliminate the most important exceptions by building them into your processes.

Friday, 29 November 2013

Why 99% isn't enough any more


The world has changed, and even as world economies emerge from the financial crisis, nothing is quite as it was before. In most markets today the barriers to entry are low, customer expectations are high and competition is fierce.

Whether you are selling products or services, B2B or B2C, there is no shortage of competition. The customer has never had so much choice. Furthermore, it is easier than ever for consumers to share their thoughts about the quality of their latest purchases and the after-sales customer care. Your brand is under constant scrutiny.

This is all driving margins relentlessly down. Businesses operating on 40% gross margins have become the exception, not the rule. More and more businesses are forced to operate as utility businesses rather than high-end, added-value ones. Quality cannot suffer, though: if the price is too high or the quality too low, your customers will look to your competitors, because somebody somewhere will be offering higher quality at lower prices. This is the new reality.

Operating in this new reality means that nothing can be wasted. The maths is simple. A company turning over £50 million with a gross margin of 40% can afford to waste 1% of its turnover: that translates into £500 thousand lost from £20 million of gross margin. When operating at just 10% gross margin, the same wastage is still £500 thousand, but now from a gross margin of just £5 million. In other words, 1% of revenue wasted becomes 10% of gross margin, and that can easily be 50% of net profit. Suddenly 99% is no longer good enough. Think about that for a moment: getting it right for 99 customers out of 100 is not good enough any more.
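Written out, with turnover T and gross margin rate m, waste of 1% of turnover consumes a share of the gross margin that grows as the margin shrinks:

```latex
\frac{0.01\,T}{m\,T} = \frac{0.01}{m}
\qquad\Rightarrow\qquad
\frac{0.01}{0.40} = 2.5\%\ \text{of margin},
\qquad
\frac{0.01}{0.10} = 10\%\ \text{of margin}
```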

 
Getting it right, on time, every time, for every customer is the new normal. Getting the right product or service to the right place at the right time for the right price is what is needed for success. The wrong product to the right customer is wastage. The right service at the wrong time is wastage. The right product at the wrong price is wastage. Granting credit to a customer who is not creditworthy is wastage as is refusing credit to a customer who is creditworthy. Each mistake eats relentlessly into the bottom line.

Getting it all right means knowing your customers, suppliers, products, inventory levels, sales channels and getting them all synchronised. It means getting all your processes, information systems and data as good as they can be.

Your business processes are the means of adding value by delivering your products or services to your customers when and where they want them at the right price and with a minimum of fuss. Your data is your picture of reality, on which those business processes operate. If either process or data is wrong, you will make mistakes, and those mistakes will eat into your margins and reduce your ability to compete. Make enough mistakes and your competitors will be happy to satisfy your customers.

On the other hand, if your business processes are highly optimised and can deal with the exceptions as well as the "happy flow" that accounts for 90% of your business, and if your data quality is high, so that you know who your customers are, how much stock you hold, what the delivery timing is and so on, then you will have a competitive advantage.

In this new reality, with low margins and low customer loyalty, competitive advantage is to be found in optimised processes working on high quality data. This combination of process and data quality allows the best businesses to operate on gross margins that their competitors cannot afford to follow. That is the opportunity offered by the new reality.