In this blog I will argue that the human side of data quality is not just necessary, but that the work people perform is actually far more valuable than the work done by systems and tools.
Starting with a tools-based approach
First though, I should admit that I didn't always hold this view. When I first started working seriously with Data Quality, I had my feet firmly in the IT camp: I believed that applying the right tools was the key to fixing Data Quality. I had years of experience behind me of moving data between systems and struggling to get it to fit. I thought I knew what the solution should look like. As a self-confessed data geek, I was open to the claims of tool vendors that their technology, based on smart matching algorithms and expert-based rules, could fix just about any Data Quality problem. After all, they had demonstrations to show how good they were and references to back it up.
The grey zone
As we went further, though, an uncomfortable truth started
to emerge. There was always a grey zone where you couldn't really be sure if
the tools were right or wrong. At the heart of any algorithmic or expert-based approach is the technique of scoring possible outcomes. Based on the input data, the tool will suggest a result along with a score. If the score is high, then you can be pretty confident that it is correct, and if it is too low then it should be discarded, because it is probably wrong. The problem is where to define the cut-off point. How high does the score need to be for you to trust
the result? Set it too low and you will end up introducing errors, which is
ironic since you are trying to eliminate errors. Set the threshold too high and
you will discard perfectly good corrections.
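To make the trade-off concrete, here is a minimal sketch in Python of how a scored suggestion might be triaged; the thresholds and labels are illustrative assumptions, not taken from any particular tool.

```python
# Minimal sketch of threshold-based triage of tool suggestions.
# The 0.95 and 0.60 cut-offs are invented for illustration.

def triage(score: float, accept_at: float = 0.95, discard_below: float = 0.60) -> str:
    """Decide what to do with a suggested correction based on its score."""
    if score >= accept_at:
        return "auto-apply"   # high confidence: apply the correction automatically
    if score < discard_below:
        return "discard"      # low confidence: probably wrong, throw it away
    return "grey zone"        # neither safe to apply nor safe to ignore

print(triage(0.98))  # auto-apply
print(triage(0.45))  # discard
print(triage(0.78))  # grey zone
```

Raising the acceptance threshold shrinks the number of errors you introduce but grows the pile of good corrections you never make; lowering it does the opposite, which is exactly the dilemma described above.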
At first this wasn't too much of a problem. When you are
making tens or hundreds of thousands of corrections you don't need to worry about three or four hundred dubious results. The temptation is to err on the
side of caution, set the threshold comfortably high and avoid the risk of
making anything worse than it is. The grey zone can wait, but only for so long.
Once you have cleaned the bulk of your data, and you perform regular cleaning
runs, the number of automated corrections that you can make reduces, but the
number of near misses stays constant or slowly increases. Eventually you have
to look into this in detail.
My initial observation was that the tools were making a lot
of good suggestions in this grey zone, but to just lower the threshold and
accept lower scores would introduce errors; not many, but enough. I spent a lot
of time trawling through examples looking for patterns in this space. Two
customers with the same family name in the same street, one called Tim, living at number 25, the other called Tom, living at number 52… is it the same person
or two different people? Are both addresses correct or has one been
accidentally transposed? It’s a frustrating problem, but I found that most of
the time, I could find the right answer if I looked beyond the dataset that was
being fed into the tools. Reluctantly I admitted defeat, and started lobbying
for a small team of customer service agents to go through these lists and clean
them up by hand, and that worked well.
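To give a feel for why a pair like Tim and Tom lands in the grey zone, here is a rough sketch using Python's difflib as a stand-in for whatever matching algorithm a real tool uses; the records themselves are invented for the example.

```python
# Two records that are almost, but not quite, the same.
from difflib import SequenceMatcher

record_a = "Tim Smith, 25 Grove Street, Bath"
record_b = "Tom Smith, 52 Grove Street, Bath"

score = SequenceMatcher(None, record_a, record_b).ratio()
print(f"similarity: {score:.2f}")  # around 0.94: too similar to ignore, too different to merge blindly
```

A score like that sits above any sensible discard threshold and below any sensible auto-merge threshold, so no setting of the cut-off resolves it; only someone looking at other sources can.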
Adding people to the mix
There was an initial feeling of disappointment, though, and a sense that the technology was failing. Nevertheless, we built a layer on top of the tools that presented these agents with the source data as it was, the suggestions of the tools, details of the scoring rules, and colour coding to indicate degrees of uncertainty. Next to that we added links to popular external sources like the CRM system or Google Maps.
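As a rough sketch, each item presented to an agent could be thought of as something like the structure below; the field names and colour bands are assumptions for illustration rather than a description of the actual implementation.

```python
# Illustrative shape of a single review item shown to a customer service agent.
from dataclasses import dataclass, field

@dataclass
class ReviewItem:
    source_record: dict      # the data exactly as it sits in the source system
    suggestion: dict         # the correction proposed by the tool
    score: float             # the tool's confidence score
    rules_fired: list        # which matching rules contributed to the score
    external_links: dict = field(default_factory=dict)  # e.g. CRM record, Google Maps lookup

    @property
    def colour(self) -> str:
        """Colour coding to signal the degree of uncertainty to the agent."""
        if self.score >= 0.90:
            return "green"
        if self.score >= 0.75:
            return "amber"
        return "red"
```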
As this work progressed, the approach was extended to cover different problems. Slowly it dawned on me that this was a structural problem, and that the human element was necessary. People are able to add context and to look at other available sources, including communication histories with customers, which enables them to come to the right answer on a case-by-case basis. Furthermore, it eventually became apparent that the corrections they were making had a much higher value on a case-by-case basis than the automated changes. Small errors are easy to correct, but they are also less likely to
cause real confusion. If an address is incorrectly recorded as Groove Street, Bath, BA1 5LR, the postman will know that it is destined for Grove Street, because it's a one-letter difference that matches the postcode and there are no other alternatives. Post in this case would still be delivered to the right address. Another address recorded as Queen Street, Bath, BA1 2HX is more problematic, though, because that postcode refers to Queen Square, but there is a Queen Street in Bath with the postcode BA1 1HE. To know whether the street name or the postcode is wrong would need more information. In this case post
may very well be going to the wrong address. That means that correcting this
error is more valuable than correcting the first error.
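A toy consistency check makes the difference between the two errors visible; the lookup table below is an invented stand-in for a real address reference source, and the similarity threshold is only indicative.

```python
# Checking a street name against what the postcode says it should be.
from difflib import SequenceMatcher

REFERENCE = {                      # invented stand-in for a real postcode reference file
    "BA1 5LR": "Grove Street",
    "BA1 2HX": "Queen Square",
    "BA1 1HE": "Queen Street",
}

def check(street: str, postcode: str) -> str:
    expected = REFERENCE.get(postcode)
    if expected is None:
        return "unknown postcode - route to a person"
    similarity = SequenceMatcher(None, street, expected).ratio()
    if similarity == 1.0:
        return "consistent"
    if similarity >= 0.9:
        # Near miss such as Groove/Grove: post still arrives, safe to auto-correct
        return f"minor mismatch - auto-correct to '{expected}'"
    # Street and postcode point to different places: which one is wrong?
    return "conflict - needs a person to investigate"

print(check("Groove Street", "BA1 5LR"))  # minor mismatch - auto-correct to 'Grove Street'
print(check("Queen Street", "BA1 2HX"))   # conflict - needs a person to investigate
```

The Groove Street record can be fixed automatically, while the Queen Street record has to go to someone who can find out whether the street or the postcode is wrong.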
This is a pattern that occurs again and again with Data
Quality. Corrections that are easy to make using technology and tools are
usually high in frequency, but actually low in impact. Many of these errors do
not make much real difference. Corrections that are not easy to make with
technology, though, usually do matter and it’s worthwhile taking the time to
look at these cases in detail, because it will make a difference.
Technology and people working together
Ultimately it’s worth combining people with technology,
because they complement each other. What the technology does for you is to
clear up the high volume of easy-to-fix errors. Dedicating people to this would be expensive, slow and would probably have little effect. The fact that the technology can do this, and then identify the grey zone, though, is hugely valuable. The volumes here are much lower, but on a case-by-case basis they really do matter. Routing these cases to a team of people who can investigate them and make the necessary corrections is very worthwhile and has a high return on the investment. Good Data Quality tools today come with workflow extensions precisely for this, but unless you understand why, it is easy to overlook their importance.
If you have problems with Data Quality and you are
evaluating tools to help you, then I advise you to be sceptical, but in a
positive way. Don’t expect any tool to solve all your problems, and beware of
any vendor claiming that their tools can. Dig a bit deeper and ask what happens when the tools aren't sure. How easily can these cases be passed back to a team of Data Stewards? Does the tool come with a workflow engine that can route these cases to the right people? More importantly, though, think about preparing your organisation for this. The tools will only get you so far; to get the most value from them, your own people will make the biggest difference.