Tuesday, 1 April 2014

The Human Side of Data Quality

In this post I will argue that the human side of data quality is not just needed, but that the work people perform is actually far more valuable than the work done by systems and tools.


Starting with a tools-based approach
First, though, I should admit that I didn't always hold this view. When I first started working seriously with Data Quality, I had my feet firmly in the IT camp: I believed that applying the right tools was the key to fixing Data Quality. I had years of experience behind me of moving data between systems and struggling to get it to fit. I thought I knew what the solution should look like. As a self-confessed data geek, I was open to the claims of tool vendors that their technology, based on smart matching algorithms and expert-based rules, could fix just about any Data Quality problem. After all, they had demonstrations to show how good their tools were and references to back them up.


So I took my first steps in applying technology to solve Data Quality problems. I started with addresses and de-duplication of customer records. As I was working in the household energy supply business at the time, the business case was based on reducing the total number of bills sent out and maximising the number that ended up at the right address. Initial results were impressive, and the scale of the work involved meant that it could only be achieved by automation. Tens or even hundreds of thousands of corrections over a single weekend were not uncommon.

The grey zone
As we went further, though, an uncomfortable truth started to emerge. There was always a grey zone where you couldn't really be sure if the tools were right or wrong. At the heart of any algorithmic or expert-based approach is the technique of scoring possible outcomes. Based on the input data, the tool will suggest a result along with a score. If the score is high, then you can be pretty confident that the result is correct, and if it is too low then the result should be discarded, because it’s probably wrong. The problem is where to define the cut-off point. How high does the score need to be for you to trust the result? Set the threshold too low and you will end up introducing errors, which is ironic since you are trying to eliminate them. Set it too high and you will discard perfectly good corrections.
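The trade-off can be made concrete with a small sketch. Everything in it is an assumption for illustration: the scores are taken to lie between 0 and 1, the six suggestions are invented, and the `is_correct` flag is something you would never actually know up front. The names `Suggestion` and `cutoff_effect` are hypothetical and not taken from any particular tool.

```python
from typing import NamedTuple


class Suggestion(NamedTuple):
    record_id: str
    score: float      # tool's confidence, assumed here to lie between 0 and 1
    is_correct: bool  # known only in this toy example, never in practice


# Invented values, purely for illustration.
SUGGESTIONS = [
    Suggestion("A", 0.97, True),
    Suggestion("B", 0.91, True),
    Suggestion("C", 0.84, True),
    Suggestion("D", 0.80, False),
    Suggestion("E", 0.72, True),
    Suggestion("F", 0.55, False),
]


def cutoff_effect(suggestions, cutoff):
    """Return (errors introduced, good corrections discarded) for one cut-off."""
    errors_introduced = sum(
        1 for s in suggestions if s.score >= cutoff and not s.is_correct
    )
    good_discarded = sum(
        1 for s in suggestions if s.score < cutoff and s.is_correct
    )
    return errors_introduced, good_discarded


# Lower cut-offs let errors in; higher cut-offs throw good corrections away.
for cutoff in (0.5, 0.75, 0.9):
    print(cutoff, cutoff_effect(SUGGESTIONS, cutoff))
```

With these made-up numbers, no single cut-off gives zero errors and zero discarded corrections at the same time, which is the grey zone in miniature.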


At first this wasn't too much of a problem. When you are making tens or hundreds of thousands of corrections you don’t need to worry about three or four hundred dubious results. The temptation is to err on the side of caution, set the threshold comfortably high and avoid the risk of making anything worse than it is. The grey zone can wait, but only for so long. Once you have cleaned the bulk of your data and you perform regular cleaning runs, the number of automated corrections that you can make falls, but the number of near misses stays constant or slowly increases. Eventually you have to look into this in detail.

My initial observation was that the tools were making a lot of good suggestions in this grey zone, but that simply lowering the threshold and accepting lower scores would introduce errors; not many, but enough. I spent a lot of time trawling through examples looking for patterns in this space. Two customers with the same family name in the same street, one called Tim, living at number 25, the other called Tom, living at number 52… is it the same person or two different people? Are both addresses correct or has one been accidentally transposed? It’s a frustrating problem, but I found that most of the time I could find the right answer if I looked beyond the dataset that was being fed into the tools. Reluctantly I admitted defeat and started lobbying for a small team of customer service agents to go through these lists and clean them up by hand, and that worked well.


Adding people to the mix
There was an initial feeling of disappointment, though, and a sense that the technology was failing. To support these agents, we built a layer on top of the tools that presented them with the source data as it was, the suggestions of the tools, details of the scoring rules and colour coding to indicate degrees of uncertainty. Next to that we added links to popular external sources like the CRM system or Google Maps.

As this work progressed the approach was extended to cover different problems. Slowly it dawned on me that this was a structural problem, and that the human element was necessary. People are able to add context, and to look at other available sources including communication histories with customers, which enables them to come to the right answer on a case by case basis. Furthermore, it eventually became apparent that the corrections that they were making had a much higher value on a case by case basis than the automated changes.

Small errors are easy to correct, but also less likely to cause real confusion. If an address is incorrectly recorded as Groove Street, Bath, BA1 5LR, the postman will know that it is destined for Grove Street because it’s a one letter difference that matches with the postcode and there are no other alternatives. Post in this case would still be delivered to the right address. Another address recorded as Queen Street, Bath, BA1 2HX is more problematic, though, because that postcode refers to Queen Square, but there is also a Queen Street in Bath with the postcode BA1 1HE. To know whether it is the street name or the postcode that is wrong, you would need more information. In this case post may very well be going to the wrong address. That means that correcting this error is more valuable than correcting the first one.

This is a pattern that occurs again and again with Data Quality. Corrections that are easy to make using technology and tools are usually high in frequency, but actually low in impact. Many of these errors do not make much real difference. Corrections that are not easy to make with technology, though, usually do matter and it’s worthwhile taking the time to look at these cases in detail, because it will make a difference.
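
The two Bath addresses above can be turned into a small sketch of why one case is safe to automate and the other is not. This is an illustration under stated assumptions, not how any particular tool works: the reference table contains only the three street and postcode pairs mentioned in this post, the similarity measure is Python's standard `difflib.SequenceMatcher`, and the 0.9 similarity threshold and the function name `check_address` are hypothetical.

```python
from difflib import SequenceMatcher

# Tiny reference table built only from the examples in this post; a real
# deployment would use a full postal address file.
REFERENCE = {
    "BA1 5LR": "Grove Street",
    "BA1 2HX": "Queen Square",
    "BA1 1HE": "Queen Street",
}


def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0 to 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def check_address(street, postcode, min_similarity=0.9):
    """Classify a recorded street/postcode pair.

    Returns ("ok", street), ("auto_fix", corrected_street) or
    ("needs_review", None).
    """
    street_at_postcode = REFERENCE.get(postcode)
    if street_at_postcode is None:
        return ("needs_review", None)  # unknown postcode: nothing to compare
    if street_at_postcode == street:
        return ("ok", street)
    # If the recorded street really exists under another postcode, either
    # field could be the wrong one, so a person has to decide.
    exists_elsewhere = any(
        s == street and p != postcode for p, s in REFERENCE.items()
    )
    if exists_elsewhere:
        return ("needs_review", None)
    # Otherwise a near-identical spelling with no alternative is a safe fix.
    if similarity(street, street_at_postcode) >= min_similarity:
        return ("auto_fix", street_at_postcode)
    return ("needs_review", None)


print(check_address("Groove Street", "BA1 5LR"))  # ('auto_fix', 'Grove Street')
print(check_address("Queen Street", "BA1 2HX"))   # ('needs_review', None)
```

The high-frequency, low-impact typo is fixed automatically, while the rarer, higher-impact conflict is handed to a person who can look at other sources.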

Technology and people working together
Ultimately it’s worth combining people with technology, because they complement each other. What the technology does for you is to clear up the high volume of easy-to-fix errors. Dedicating people to this would be expensive and slow, with potentially little effect. The fact that the technology can do this and then identify the grey zone, though, is hugely valuable. The volumes here are much lower, but on a case by case basis they really do matter. Routing these cases to a team of people who can investigate them and make the necessary corrections is very worthwhile and gives a high return on the investment. Good Data Quality tools today come with workflow extensions precisely for this, but unless you understand why, it is easy to overlook their importance.
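
As a rough sketch of that division of labour, the routing can be expressed with two thresholds instead of one: everything above the upper threshold is applied automatically, everything below the lower one is discarded, and the band in between goes to people. The threshold values, the 0 to 1 score range and the names `route` and `split_batch` are assumptions made for this example; a real workflow engine will have its own configuration.

```python
def route(score, accept_at=0.9, discard_below=0.6):
    """Decide what happens to one suggested correction, based on its score."""
    if score >= accept_at:
        return "auto_apply"
    if score < discard_below:
        return "discard"
    return "human_review"  # the grey zone


def split_batch(scored):
    """Split (record_id, score) pairs into the three queues."""
    queues = {"auto_apply": [], "human_review": [], "discard": []}
    for record_id, score in scored:
        queues[route(score)].append(record_id)
    return queues


batch = [("A", 0.95), ("B", 0.75), ("C", 0.30)]
print(split_batch(batch))
```

The width of the middle band is a staffing decision as much as a technical one: it determines how many cases the review team sees after each cleaning run.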


If you have problems with Data Quality and you are evaluating tools to help you, then I advise you to be sceptical, but in a positive way. Don’t expect any tool to solve all your problems, and beware of any vendor claiming that their tools can. Dig a bit deeper and ask what happens when the tools aren't sure. How easily can these cases be passed back to a team of Data Stewards? Does the tool come with a workflow engine that can route these cases to the right people? More importantly, think about preparing your organisation for this. The tools will only get you so far; to get the most value from them, your own people will make the biggest difference.
