Mark Humphries - Thoughts on Information Management

Wednesday, 11 February 2015

Information Management for Energy Suppliers - part 1

Unless you’re Google or IBM then Information Management is not your core business, which relegates it to a supporting function like HR or Accounting, and so Information Management only makes sense if it supports your core business and most importantly if the benefits actually exceed the costs. When I advise clients, I spend time first understanding their business model before prescribing an Information Management Strategy.

In the first part of this two-part blog, I am going to give a practical example of this by looking at the business of retail energy supply. Retail energy supply is characterised by high volumes, low margins, low customer loyalty and intense competition. These factors shape the business, so the logic is valid for other businesses with similar characteristics, like retail insurance. In my next blog, I will present an IM Strategy for such a business model.

A simplified business model for energy supply

First of all, let’s look at the financial model. A UK energy supplier will buy gas and electricity on the wholesale market and look to re-sell it to retail customers with a target gross margin of, say, 23%. On top of the energy, the customer also pays transport and distribution fees for the delivery of the energy to their home as well as VAT and other taxes. The energy supplier collects these fees and passes them on to the appropriate companies and the treasury without adding any mark-up. In theory these charges are a zero-sum game; the supplier collects the money from the end customer and passes it on to the third party. In practice, zero is the best that the supplier can ever hope to achieve as we shall see.

Where the money goes

The supplier also has operating costs to consider. Because the business model is simple, and many processes can be fully automated, most of these costs are fixed costs independent of the size of the business, and a large part of the fixed costs is IT. Variable costs include the cost of sending bills and collecting payments and the cost of running a call centre to handle changes, questions and complaints.

As you can see from the above breakdown, a 23% gross margin on traded energy quickly ends up becoming a 2% net profit for electricity and 7% for gas. A quick check on a comparison website will show that the differences between the suppliers are more than their net profits. This indicates how competitive the market is; the cheapest suppliers are offering prices that are below the costs of their more expensive competitors. How do they do this?

There are few options available to the energy supplier. The obvious ones are reducing the only input costs over which the supplier has control, namely the wholesale price and the operating costs.

In the wholesale market, volume, accuracy and a good hedging strategy are the best way to get the best price. Accuracy is important because energy is ordered in advance on an hourly basis. The volumes ordered are passed on to power stations and gas shippers who will ensure that it is available on the grid. The price paid depends on both the energy ordered and the energy actually consumed by your customers, and it has two components. The contracted price is paid for the ordered volume, and the imbalance price is paid for the difference between ordered and consumed volumes – it is effectively a penalty for ordering too much or too little energy. To get the best price, you need to minimise the imbalance between what you order and what your customers consume. You also get a better price if you order very high volumes through having a large market share.
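To make the two price components concrete, here is a minimal sketch of the calculation. It rests on simplifying assumptions of my own: a single imbalance price charged on the absolute difference between ordered and consumed volumes, and invented prices and volumes. Real settlement rules are more involved.

```python
def energy_cost(ordered_mwh, consumed_mwh, contracted_price, imbalance_price):
    """Illustrative energy cost for one settlement period.

    Assumption: one imbalance price is charged on the absolute difference
    between the ordered and the consumed volume.
    """
    contracted_cost = ordered_mwh * contracted_price
    imbalance_cost = abs(ordered_mwh - consumed_mwh) * imbalance_price
    return contracted_cost + imbalance_cost


# Hypothetical figures: 100 MWh ordered at 50 per MWh, imbalance charged at 80 per MWh.
accurate = energy_cost(100, 100, 50, 80)       # customers consume exactly what was ordered
over_ordered = energy_cost(100, 90, 50, 80)    # customers consume less than ordered
under_ordered = energy_cost(100, 110, 50, 80)  # customers consume more than ordered

# Cost per MWh actually consumed, which is what has to be recovered from customers.
for label, total, consumed in [
    ("accurate", accurate, 100),
    ("over-ordered", over_ordered, 90),
    ("under-ordered", under_ordered, 110),
]:
    print(f"{label}: {total / consumed:.2f} per MWh consumed")
```

With these made-up numbers, both forecasting errors push the cost per consumed MWh above the contracted price, which is why forecast accuracy matters.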

For operating costs, volume is important again since the fixed costs are so high, but so is customer satisfaction, since happy customers don’t need to call the call centre. Next to that big savings can be made by sending bills electronically and by serving your customers with a good website. Intuitively it would seem that one way of increasing volume and keeping customers happy is to reduce your selling price. After all, rational customers will choose the supplier with the lowest price, and knowing that they are paying the lowest price will make them happy. In practice it’s not that simple. Customers expect predictability and customer service just as much as they expect a fair price. There is a saying in the market that "customers come for the price, but they stay for the service." Replacing lost customers with new customers is very expensive; it costs less to keep the customers you have happy. Getting the balance right enables you to offer a competitive price, retain customers and still make a small margin.

Happy customers are loyal customers

However, there is a bigger problem that needs to be addressed – unpaid consumption. When consumption is not paid for, the supplier is still liable for the wholesale energy costs, transport and distribution costs and the taxes. When net margins are so slim, they are quickly eroded by unpaid consumption. For every £100 of electricity that is not paid for, the supplier needs to bill and collect £5,000 of electricity consumption just to recover the losses.
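The arithmetic behind that figure is simply the unpaid amount divided by the net margin. A small sketch, using the 2% and 7% net profit figures quoted earlier (the function name is my own):

```python
def revenue_to_recover(unpaid_amount, net_margin_percent):
    """Revenue that must be billed and collected to earn back an unpaid amount."""
    if net_margin_percent <= 0:
        raise ValueError("net_margin_percent must be positive")
    return unpaid_amount * 100 / net_margin_percent


electricity = revenue_to_recover(100, 2)  # 2% net profit on electricity
gas = revenue_to_recover(100, 7)          # 7% net profit on gas

print(f"Electricity: £{electricity:,.0f}")  # Electricity: £5,000
print(f"Gas: £{gas:,.0f}")                  # Gas: £1,429
```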

So optimising the business of energy supply means:

Ensuring that the energy that you supply is paid for
Keeping operating costs low by having a large market share and happy customers
Keeping wholesale energy costs low by having a large market share and accurate forecasts

I’ll come back to those aspects and how Information Management can help them in my next blog.

Tuesday, 6 January 2015

New Year's Resolution for Data Management : Stop Using Analogies

I recently found myself in a surreal discussion in which it was suggested to me that meta-data was like poetry. I was asked to consider the line "I wandered lonely as a cloud" and it was suggested to me that "I" am the data, "lonely" is meta-data, and "as a cloud" is meta-meta-data. My interlocutor beamed at me with pride, and waited for my confirmation of their brilliance.

It's poetry, not meta-data

My heart sank as I realised that analogies are really not very helpful. Like many in Information Management, I have been guilty of using them far too much. I have now resolved to stop using analogies, and instead to make the effort to understand how others see the world, and to explain Information Management in terms that they understand about data that they really use.

Why do we use analogies?

A lot of the material in Information Management is pretty abstract, and it only really makes sense once you have already done it a couple of times. Anyone working in Information Management will be familiar with the challenge of explaining what it is and why you should do it. Meta-data is a great example, and the simple definition of "data about data" doesn't really help anyone new to the topic. So there's a temptation to introduce analogies. As well as poetry, a recent engagement threw up analogies including finger prints, traffic rules and photography. We use them to try and explain concepts that we understand to someone who doesn't.

Why analogies don't help

There are two problems with analogies, though. The first is that they eventually break down. Once you have introduced an analogy, the discussion inevitably explores it further and you end up debating where it is valid and where not. The second problem is that an analogy only really works for the person who came up with it. One of my favourites is weeding a garden as an analogy for data quality. To me it makes sense, because the weeds will always come back, whatever I do. So although I can aim for a weed free garden, I know that I will never achieve it. It's the same with Data Quality: while you may aim for zero defects, you can never actually achieve it. For those who don't garden, it doesn't help. For those who do garden, the discussion moves on quickly to dandelions, moss and bindweed. Either way, we're not getting very far with Data Quality, and the analogy breaks down because no one ever creates a weeding dashboard or assigns gardening stewards.

It's weeding, not Data Cleansing

From the specific to the general and back again

One of the problems comes from the way that we in Information Management think. As a group we tend to look for patterns and we are constantly seeking general abstractions in a sea of specific examples. A good example is the party data model which was born from the observation that customers, suppliers, employees, representatives and so on have common attributes and that they can be generalised as persons or organisations, that they can be related and so on. There are other generalised data models that have been developed over years. It's what we do, we can't help it. That's why we ended up in Information Management in the first place.

The problem is that the rest of the world doesn't think like this. Most people consider customers and suppliers to be fundamentally different, and a party is an event where people celebrate. We need to get back to specifics that are relevant to our stakeholders and give them concrete examples of what we are talking about.

My kind of party

From poetry to SOAP

In order to move the conversation away from poetry and onto something more useful, I dug a little deeper to find something more specific and relevant to someone seeking to understand meta-data. The guy that I was talking to had experience of integrating systems, so we talked about exchanging data between two or more systems. He suggested SOAP (www.w3.org/TR/soap12-part1) as his preferred protocol. Then we discussed how a SOAP message is specified as an XML Information Set, and that this is an example of meta-data. As we were on familiar ground, I could explain to him why such an Information Set would need to be owned, why it should be approved, why changes should be carefully managed, and what the risks of an incomplete definition would be. From this example, which he understood, the definition of "data about data" made sense, the need for formally managing it made sense, and he could begin to understand that these principles would apply in other scenarios.

And finally

If you must use poetry as an analogy for meta-data, then I suggest commentary on poetry is better. Lewis Carroll's Jabberwocky is one of my favourite poems. The Wikipedia entry for it is a commentary, and so it is writing about writing (en.wikipedia.org/wiki/Jabberwocky).

Oh dear, I've just broken my New Year's Resolution already.

Monday, 20 October 2014

Engaging Customers and Credit Controllers to Build a Single Customer View

In this blog I will discuss the advantages of actively engaging both customers and credit controllers to help create the Single Customer View and how this can give you the best of both worlds in terms of satisfying the customers that you want and protecting your business against fraud.

What does the customer think about a Single Customer View?

When developing a business case for the Single Customer View, businesses typically focus on what's in it for them in terms of reduced cost, increased insight into customer behaviour and cross and upsell opportunities. One of the things that I have learned is that there is a benefit for the customer too, and this is supported by research conducted by marketing agencies. Today’s consumer expects to be recognised and doesn't like being treated as a number. Most consumers get frustrated with having to maintain multiple accounts.

There are exceptions, of course. There are always exceptions. A very important group of customers actively avoids recognition and will deliberately create multiple accounts. These are the fraudsters who like your products and services, but don’t like paying for them. Luckily they are in the minority, but they form a very important minority and it’s important to protect yourself against them.

Understanding these two groups and foreseeing processes that allow for both of them is the key to establishing a robust Single Customer View and maximising the Return on Investment.

Self Service Single View

One of the biggest challenges in the Single Customer View is the creation of the so-called Golden Record, or combining all the available information and selecting the data from all the various records. Although there are good techniques and tools for automating this, there are always cases where you just can’t decide. Is the name Marc or Mark? Is the date of birth 7th February or 2nd July? When it comes to the Single Customer View an option that is easy to overlook is to engage the customer directly by giving him or her the possibility to edit the data themselves, to confirm and merge duplicate accounts and to create the golden record for you. Whenever I have proposed this option there have always been doubts raised about whether customers care enough to do this, or whether it will annoy them that they are being asked to “help”, or whether they might abuse it. Yet whenever I have implemented a solution whereby the customers are actively engaged in this way, the response is overwhelmingly positive.

The reality is that the majority of your customers do want you to know them, they do want you to have accurate data on them and they are more than happy to manage their data for you. After all it's their data and not yours. So engage them.

Beware the fraudsters

There is a legitimate argument for not allowing customers unrestricted access to their own data. This argument is typically raised by credit controllers. They work constantly with customers who can't or won't pay, and they know all the tricks that the fraudsters employ to avoid having to pay. Anonymity is one of the fraudster's most useful tricks, and so they deliberately create multiple accounts, maybe using slightly different names, dates of birth or addresses. One of the great advantages of creating a Single Customer View is the ability it gives you to unify these customers' accounts so that you get a unified view of their debt or risk and can then manage it better. Obviously if you identify multiple accounts as candidates for merging and you ask these customers if they are the same, then they will probably say no, and then maybe make more changes to increase the differences.

Two simple, but effective techniques

So, if it's a good idea to engage the customer most of the time, but you don’t want to give the fraudsters any more help, what's the best approach?

There are two techniques that you can apply that together will give you the best of both worlds. The first trick is to add a flag to the customer record that allows you to open or close their access to edit their data. The default value is open – the customer may edit their own data. If for any reason, you suspect that a customer may abuse this access, then you set it to closed, and the option no longer appears to them. It may be possible to develop rules so that the flag can be automatically opened or closed, but it must certainly be possible for the credit controllers to modify it by hand and block suspect customers. The rules for opening it again should be carefully defined. First line call centre agents, for example, are used to dealing with the majority of good customers, and you want them to be customer friendly. Giving them the right to open the editing rights on request may not be the best option – an escalation to the credit controllers in the back office may be a better idea. Whatever you decide, the rules for opening and closing access are important and need to be clear.
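As a minimal sketch, the flag and its rules could look something like this. The class, field and role names are hypothetical, and the rule that only credit controllers may change the flag is just one possible policy, chosen here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    CALL_CENTRE_AGENT = "call_centre_agent"
    CREDIT_CONTROLLER = "credit_controller"


@dataclass
class CustomerRecord:
    customer_id: str
    name: str
    edit_access_open: bool = True  # default: the customer may edit their own data


def customer_may_edit(record):
    """The self-service edit option is only shown while the flag is open."""
    return record.edit_access_open


def set_edit_access(record, open_access, requested_by):
    """Open or close a customer's edit access.

    Assumed policy: only credit controllers may change the flag. Anyone else
    gets False back and should escalate the request to the back office.
    """
    if requested_by is not Role.CREDIT_CONTROLLER:
        return False
    record.edit_access_open = open_access
    return True


record = CustomerRecord(customer_id="C-001", name="Example Customer")
set_edit_access(record, False, Role.CREDIT_CONTROLLER)          # suspect customer is blocked
reopened = set_edit_access(record, True, Role.CALL_CENTRE_AGENT)  # refused, needs escalation
```

Whether first line agents should also be allowed to close access is a policy decision; the point is that the rule lives in one clearly defined place.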

The second technique is to engage the credit controllers in the de-duplication process. Fraudsters will deliberately create multiple accounts with small differences. As a result their duplicate accounts are not always easy to spot, and the matching scores are likely to be ambiguous. By using other criteria such as outstanding debt you can route the de-duplication to specialist credit controllers for the final decision on whether or not two accounts should be merged and what the golden record would look like. They will be more than happy to do this, as it gives them a Single View of Debt. At this point, they would probably also want to close the edit option for such a client.
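The routing rule can be sketched in a few lines. Everything specific here is an assumption for illustration: the score scale from 0 to 1, the two score thresholds, the debt threshold and the queue names.

```python
from dataclasses import dataclass


@dataclass
class MatchCandidate:
    account_a: str
    account_b: str
    match_score: float       # 0.0 to 1.0, as produced by a matching tool (assumed scale)
    outstanding_debt: float  # combined debt across both accounts


def route_candidate(candidate, auto_merge_score=0.95, discard_score=0.60, debt_threshold=0.0):
    """Decide who takes the final decision on a possible duplicate."""
    if candidate.match_score < discard_score:
        return "no_action"          # too unlikely to be the same customer
    if candidate.outstanding_debt > debt_threshold:
        return "credit_controller"  # a specialist decides and may close edit access
    if candidate.match_score >= auto_merge_score:
        return "auto_merge"
    return "ask_customer"           # ambiguous, no debt: let the customer confirm


candidates = [
    MatchCandidate("A1", "A2", match_score=0.97, outstanding_debt=0.0),
    MatchCandidate("B1", "B2", match_score=0.75, outstanding_debt=0.0),
    MatchCandidate("C1", "C2", match_score=0.75, outstanding_debt=430.0),
    MatchCandidate("D1", "D2", match_score=0.40, outstanding_debt=120.0),
]
queues = [route_candidate(c) for c in candidates]
```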

Conclusion

It is possible and desirable to actively engage the customer when creating a Single View, but doing so can open you up to the fraudsters. That should be countered by adding a flag to open or close a customer's access to editing their own data. Furthermore extending the de-duplication processes to include credit controllers will mean that the Single View actively helps you better manage the fraudsters rather than allowing them to manage you.

Monday, 12 May 2014

Are you measuring what's important, or what's easy to measure?

In this blog I am looking into Key Performance Indicators or KPIs. Often I see companies tracking and reporting lots of KPIs only to find that they are only measuring those things that are easy to measure and not measuring the things that really matter. My advice is to reduce the number of KPIs that you track, and only track those that will tell you how your business is really performing, but accept that these will be harder to define and measure.

It looks nice, but is it useful?

The importance of KPIs

In order to understand what's going on in your business you need to define, measure and track Key Performance Indicators (KPIs). A good set of KPIs will enable you to understand if your business is meeting the targets that you have set and all is well, or whether you are off track and need to take corrective action.

It is easy to draw analogies. A captain in charge of a ship needs to know his position and course as well as the fuel and other supplies that he has on board in order to know that he will reach his destination on time. At an annual check up your doctor may measure your heart rate, height and weight to check that you are healthy. He may also take a blood sample to measure useful KPIs in your blood like cholesterol, glucose level and so on. Both the ship's captain and the doctor are looking for irregularities in the KPIs. If they are out of tolerance they will probably monitor them more closely, investigate the root cause and take corrective action to bring the KPIs back within tolerance.

How it should work

It's exactly the same for businesses, and ideally you will have a hierarchy of KPIs that all link together in a coherent pyramid. At the highest level you will be tracking things like gross margin, Net Profit, Days Sales Outstanding, Customer loyalty and so on. These should be aligned with the stated goals of your business as agreed with shareholders. During the yearly cycle these KPIs will show whether you are on target to meet your objectives for the year or whether corrective action needs to be taken.

At lower levels, though, your organisation will be divided into business units, functions, teams and so on. Each of these groups will have objectives defined, which if met should make the necessary contribution to the whole. They will also be required to report the detailed figures that feed into higher level KPIs.

In an ideal world, the pyramid of objectives and measures is completely aligned. The highest level objectives break down all the way to individual performance measures for each employee and supplier and if everyone meets their targets, the business as a whole meets its target. If anything is going wrong, then drilling down will enable you to quickly pinpoint where exactly the problem is so that you can take corrective action.

How it usually works

In practice this seldom, if ever, happens. Defining such a pyramid of objectives and measures is a huge task and the world changes. On the one hand, it is human nature for us to isolate those things that we can control, especially if we are being measured on them. So a manager will want to define his KPIs based on his department's performance isolated from the rest of the business. On the other hand it can be complex and expensive to gather and process all the necessary data. Objectives change from year to year, and there simply isn't enough time to build the perfect pyramid of KPIs for them to be useful in any meaningful sense. The pressure to find shortcuts is enormous, and the question often asked is “what data do we already have that will tell us more or less what we need to know?”

For an example, let's return to our doctor and the annual check up. A widely used KPI is the Body Mass Index or BMI. This is easy to define and measure. It is simply your weight (in kg) divided by the square of your height (in m). In my case I weigh 78kg and I'm 1.73m tall, so my BMI is 26.1 which makes me overweight, so I should take corrective action. I should aim for roughly 75kg or less. Although it is simple, the BMI has come in for some well deserved criticism, because it doesn't always tell you what you need to know to establish if someone is overweight, underweight or just right. Since muscle is denser than fat, athletes often have high BMIs. Methods for accurately measuring body fat include calipers, electrical resistance or a full body X-ray scan.
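The calculation is short enough to write down. In this sketch the function names are my own, and 25 is the commonly used threshold above which a BMI counts as overweight.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kg divided by the square of height in m."""
    return weight_kg / height_m ** 2


def max_weight_for_bmi(target_bmi, height_m):
    """Heaviest weight in kg that stays at or below a target BMI."""
    return target_bmi * height_m ** 2


my_bmi = bmi(78, 1.73)                      # about 26.06
my_limit = max_weight_for_bmi(25, 1.73)     # about 74.8 kg

print(f"BMI: {my_bmi:.2f}, weight for a BMI of 25: {my_limit:.1f} kg")
```

It takes two numbers and one line of arithmetic, which is exactly why it is so widely used and also why it misses so much.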

BMI is an easy KPI to measure, but is not always useful

It's the same with the KPIs in your business. Many of them have been defined not because they give you an accurate measure of the health of your business, but because they are easy to measure and are related to the things that you want to know. A good example that many businesses measure is headcount. This is easy to measure, and if combined with payroll data it's easy to turn into money terms. So teams, departments and so on are set objectives in terms of staff numbers or staff costs. It's easy to measure and also relatively easy to control. As a result these objectives are met in most companies. Next to this, a team or a department will often be set a target in terms of output. By combining the two, you get an idea of productivity in terms of output per unit cost. The chances are though, that the output is defined based on what's easy to measure and not what's important. Just because a team or department is meeting their targets, it doesn't necessarily mean that they are adding value to the business.

How poorly defined KPIs can encourage a silo mentality

Consider the following scenario. A retail insurance company consists of a sales and marketing department, a claims department and a call centre that serves as the first line for both.

The sales and marketing department has a fixed budget for the year and a target for new contracts. For simplicity all calls to the call centre that are logged as “new contract” are counted towards this target. The claims department has a fixed budget for the year and a target for processing claims within a given timeframe.

The call centre also has a fixed budget and a target of average handling time for inbound calls.

Based on experience, the sales and marketing department have determined that the most cost efficient way of generating leads is to launch a small number of high profile campaigns, so they plan four saturation campaigns spaced over the year. The call centre manager meanwhile calculates that the most cost effective way of staffing her call centre is to keep the headcount level and avoid overtime as much as possible. She knows from experience that she can manage the peaks by telling customers that they will be called back later if the call centre is busy. This even helps boost her average handling time KPI because both the initial call and the callback are counted, so what could have been a single call is now counted twice, and the average handling time is halved. What she cannot afford to do is to authorise overtime or hire in temporary staff to handle the peaks. That just increases her staff costs, and leaves her overstaffed during the quiet periods (and when it's quiet her agents spend longer with each customer, so the average handling time goes up).
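The effect of the callback on the KPI is easy to show with made-up numbers. In this sketch a customer request needs ten minutes of agent time in total; the only thing that changes is how many calls it is counted as.

```python
def average_handling_time(call_minutes):
    """Average handling time: total minutes on calls divided by the number of calls."""
    if not call_minutes:
        raise ValueError("at least one call is needed")
    return sum(call_minutes) / len(call_minutes)


# Hypothetical request that needs 10 minutes of agent time in total.
handled_in_one_call = [10]
logged_then_called_back = [5, 5]  # same work, but counted as two calls

print(average_handling_time(handled_in_one_call))      # 10.0
print(average_handling_time(logged_then_called_back))  # 5.0
```

The total effort is identical, yet the KPI looks twice as good.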

So, what happens when the sales and marketing launch a saturation campaign? New customers suddenly start ringing in asking to sign up or wanting more information. The call centre can't cope and so does the bare minimum by logging the call and promising to call back later. They keep their KPIs under control. Critically each call is logged as a sales call, which feeds into the sales and marketing KPI, so they also meet their targets. What happens though when the customers are called back? Do they still want the insurance policy now that they know that the call centre hasn't got time for their call? Are they actually too busy right now doing something else, after all they originally called at a time that suited them? Have they checked out the competition and chosen their product? Whatever happens, some of them who would have signed up for the policy, now won't.

The point here is that each group is hitting their targets, and can cite their KPIs to prove it. It is clear though, that this company could perform better. Objectives and KPIs could also be defined that are more relevant and which encourage and reward better behaviour. For the sales and marketing department, they should be aiming at total number of new signed contracts. It's not the call that counts, but the signed contract. The call centre shouldn't be measured on average handling time, but on the notion of “first time resolution” i.e. did they successfully respond to the customer's request in a single call.

When you start defining KPIs like this, though, things start to become more complicated. Firstly the definition becomes more complicated as does the means of measuring it. Instead of just counting call logs you need to count new contracts, and then distinguish between new contracts and extensions, but what if a customer upgrades their contract, does that count as a sale or not? Defining first time resolution is also not easy. How do you distinguish between a customer who calls twice because the initial enquiry was not adequately dealt with and a customer who calls again for an unrelated matter?
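To show how quickly the definition picks up assumptions, here is one possible sketch of first time resolution. It assumes that every call is logged with a customer, a topic and a day number, and that a repeat call on the same topic within seven days means the earlier call did not resolve the request. Both the topic tagging and the seven day window are my assumptions, and each is debatable.

```python
from collections import defaultdict


def first_time_resolution_rate(calls, window_days=7):
    """Share of requests resolved in a single call.

    calls: iterable of (customer_id, topic, day_number) tuples.
    Calls by the same customer on the same topic that follow each other
    within window_days are treated as one unresolved-at-first request.
    """
    days_by_request_key = defaultdict(list)
    for customer_id, topic, day in calls:
        days_by_request_key[(customer_id, topic)].append(day)

    requests = 0
    resolved_first_time = 0
    for days in days_by_request_key.values():
        days.sort()
        i = 0
        while i < len(days):
            # Start of a new request; absorb follow-up calls inside the window.
            j = i
            while j + 1 < len(days) and days[j + 1] - days[j] <= window_days:
                j += 1
            requests += 1
            if j == i:
                resolved_first_time += 1
            i = j + 1
    return resolved_first_time / requests if requests else 0.0


calls = [
    ("c1", "new_contract", 1),
    ("c1", "new_contract", 3),   # callback on the same request
    ("c2", "claim", 2),
    ("c2", "billing", 4),        # unrelated matter, counted separately
    ("c3", "claim", 1),
    ("c3", "claim", 30),         # same topic, but long after the window
]
rate = first_time_resolution_rate(calls)  # 4 of 5 requests resolved first time
```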

Things also start to get more complicated because managers are forced to recognise their interdependencies. If the sales and marketing manager is now measured on converted contracts, then he is dependent on the call centre manager for ensuring that the calls generated by his TV advertising campaign are converted into signed contracts by the call centre. The call centre manager is also dependent on the sales and marketing manager because her team will be swamped if there is a blanket marketing campaign that generates large numbers of calls. A peak of calls will result in a dip in first contact resolution or an increase in headcount. This is a good complication, because it is clearly true. These two managers and their respective teams are interdependent, and if they are both to succeed, then they need to talk to each other and work out a plan where they both win. Well defined KPIs should encourage these discussions.

How to define intelligent KPIs

First of all, start with the assumption that the KPIs that you currently have are probably sub-optimal, that they are based on what's easy to measure rather than what's important to measure and that they are encouraging departments and teams to work in silos rather than to cooperate. However, for now, they are the best you have, so don't throw them away until you have something better.

Before going any further take a good look at your business and understand what really drives it. What drives your revenues, what drives your costs and what are your most important risks? What drives your competitive advantage and how important are customer loyalty or fraud to you? What are the factors that you can't control or which are the same for all competitors in your market? (If it's the same for everyone, there's no need to worry about it.)

Keep it simple

Once you have an understanding of the most important drivers, define a small number of KPIs that meaningfully represent these. Recognise that they will not tell you everything, but they will tell you the most important things. They will probably not be easy to measure, and neither will they align closely with your organisational boundaries. For both of these reasons, the discussions that you need to have will be difficult, but having those discussions is a valuable process in its own right – those discussions help to break down the silos and expose the inter-dependencies. Once you have defined these KPIs, then implement them. The data may be hard to find, or the processing to turn raw data into meaningful measures may be complex – that's one reason why you should limit the number of KPIs. Once you start measuring them, reporting on them and managing against them, you will discover new complications. Be prepared for some degree of iteration before the definitions and the targets stabilise.

Tuesday, 1 April 2014

The Human Side of Data Quality

In this blog I will be arguing that the human side of data quality is not just needed, but that actually the work that people perform is far more valuable than the work done by systems and tools.

Starting with a tools based approach

First though, I should admit that I didn't always hold this view. When I first started working seriously with Data Quality, I had my feet firmly in the IT camp: I believed that applying the right tools was the key to fixing Data Quality. I had years of experience behind me of moving data around between systems and struggling with trying to get it to fit. I thought I knew what the solution should look like. As a self-confessed data geek, I was open to the claims of tool vendors that their technology based on smart matching algorithms and expert based rules could fix just about any Data Quality problem. After all they had demonstrations to show how good they were and references to back it up.

So, I took my first steps into applying technology to solve Data Quality problems. I started with addresses and de-duplication of customer records. As I was working in the household energy supply business at the time, the business case was based on reducing the total number of bills sent out, and maximising the number that ended up at the right address. Initial results were impressive, and the scale of the work involved meant that it could only be achieved by automation. Tens or hundreds of thousands of corrections made over a weekend was not uncommon.

The grey zone

As we went further, though, an uncomfortable truth started to emerge. There was always a grey zone where you couldn't really be sure if the tools were right or wrong. At the heart of any algorithmic or expert based approach is the technique of scoring possible outcomes. Based on the input data the tool will suggest a result along with a score. If the score is high, then you can be pretty confident that it is correct, and if it is too low then it should be discarded, because it’s probably wrong. The problem is where to define the cut off point. How high does the score need to be for you to trust the result? Set it too low and you will end up introducing errors, which is ironic since you are trying to eliminate errors. Set the threshold too high and you will discard perfectly good corrections.

At first this wasn't too much of a problem. When you are making tens or hundreds of thousands of corrections you don’t need to worry about three or four hundred dubious results. The temptation is to err on the side of caution, set the threshold comfortably high and avoid the risk of making anything worse than it is. The grey zone can wait, but only for so long. Once you have cleaned the bulk of your data, and you perform regular cleaning runs, the number of automated corrections that you can make reduces, but the number of near misses stays constant or slowly increases. Eventually you have to look into this in detail.

My initial observation was that the tools were making a lot of good suggestions in this grey zone, but to just lower the threshold and accept lower scores would introduce errors; not many, but enough. I spent a lot of time trawling through examples looking for patterns in this space. Two customers with the same family name in the same street, one called Tim, living at number 25, the other called Tom living at number 52… is it the same person or two different people? Are both addresses correct or has one been accidentally transposed? It’s a frustrating problem, but I found that most of the time, I could find the right answer if I looked beyond the dataset that was being fed into the tools. Reluctantly I admitted defeat, and started lobbying for a small team of customer service agents to go through these lists and clean them up by hand, and that worked well.
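In code, the approach amounts to replacing the single cut off point with two thresholds and a manual queue in between. This is a sketch only: the score scale from 0 to 1 and both threshold values are assumptions, not the settings of any particular tool.

```python
from collections import Counter


def triage(score, accept_threshold=0.90, discard_threshold=0.50):
    """Sort a suggested correction into one of three buckets by its score."""
    if discard_threshold > accept_threshold:
        raise ValueError("discard_threshold must not exceed accept_threshold")
    if score >= accept_threshold:
        return "apply_automatically"
    if score < discard_threshold:
        return "discard"
    return "manual_review"  # the grey zone: a person looks at the wider context


# Hypothetical scores for eight suggested corrections.
scores = [0.98, 0.95, 0.91, 0.88, 0.72, 0.55, 0.41, 0.30]
workload = Counter(triage(score) for score in scores)
```

Moving the two thresholds closer together shrinks the manual queue, but at the price of either more errors or more discarded good corrections.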

Adding people to the mix

There was an initial feeling of disappointment, though, and a sense that the technology was failing. Once it became clear that this was a structural problem, we built a layer on top of the tools that presented these agents with the source data as it was, the suggestions of the tools, details of the scoring rules and colour coding to indicate degrees of uncertainty. Next to that we added links to popular external sources like the CRM system or Google Maps.

As this work progressed the approach was extended to cover different problems. Slowly it dawned on me that this was a structural problem, and that the human element was necessary. People are able to add context, and to look at other available sources including communication histories with customers, which enables them to come to the right answer on a case by case basis. Furthermore, it eventually became apparent that the corrections that they were making had a much higher value on a case by case basis than the automated changes. Small changes are easy to correct, but also less likely to cause real confusion. If an address is incorrectly recorded as Groove Street, Bath, BA1 5LR, the postman will know that it is destined for Grove Street because it’s a one letter difference that matches with the postcode and there are no other alternatives. Post in this case would still be delivered to the right address. Another address recorded as Queen Street, Bath, BA1 2HX is more problematic though, because that postcode refers to Queen Square but there is also a Queen Street in Bath with the postcode BA1 1HE. To know whether it is the street name or the postcode that is wrong, you would need more information. In this case post may very well be going to the wrong address. That means that correcting this error is more valuable than correcting the first error.
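The two Bath addresses can be used to sketch how a tool might separate a safe automatic correction from a case that needs a person. The Python below is only a minimal illustration: the lookup table contains just the three postcodes mentioned above, and the function name check_address and the 0.9 similarity threshold are assumptions made for the example rather than features of any particular Data Quality product.

```python
from difflib import SequenceMatcher

# Hypothetical reference data: postcode -> official street name.
# Only the three postcodes mentioned in the text are included.
POSTCODE_TO_STREET = {
    "BA1 5LR": "Grove Street",
    "BA1 2HX": "Queen Square",
    "BA1 1HE": "Queen Street",
}


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def check_address(street, postcode, auto_threshold=0.9):
    """Return an (action, detail) pair for a recorded street and postcode."""
    expected = POSTCODE_TO_STREET.get(postcode)
    if expected is None:
        return ("review", "postcode not found in reference data")
    if street == expected:
        return ("ok", expected)
    # The recorded street is a real street under another postcode, so either
    # the street or the postcode could be the error: a person has to decide.
    if street in POSTCODE_TO_STREET.values():
        return ("review", "street and postcode are both valid but do not match")
    # Near miss with no competing candidate: safe to correct automatically.
    if similarity(street, expected) >= auto_threshold:
        return ("auto_correct", expected)
    return ("review", "street does not resemble the street for this postcode")


groove = check_address("Groove Street", "BA1 5LR")
queen = check_address("Queen Street", "BA1 2HX")
print(groove)
print(queen)
```

In this sketch the Groove Street record is corrected automatically, while the Queen Street record is passed to a person, because nothing in the reference data says which of the two fields is wrong.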

This is a pattern that occurs again and again with Data Quality. Corrections that are easy to make using technology and tools are usually high in frequency, but actually low in impact. Many of these errors do not make much real difference. Corrections that are not easy to make with technology, though, usually do matter and it’s worthwhile taking the time to look at these cases in detail, because it will make a difference.

Technology and people working together

Ultimately it’s worth combining people with technology, because they complement each other. What the technology does for you is to clear up the high volume of easy to fix errors. Dedicating people to this would be expensive and slow, and would potentially have little effect. The fact that the technology can do this and then identify the grey zone, though, is hugely valuable. The volumes here are much lower, but on a case by case basis they really do matter. Routing these cases to a team of people who can investigate them and make the necessary corrections is very worthwhile and has a high return for the investment. Good Data Quality tools today come with workflow extensions precisely for this, but unless you understand why, it is easy to overlook the importance of it.
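The routing idea can be expressed as two thresholds instead of one. The sketch below is a simplified illustration and not the workflow of any specific tool: the function name triage, the 0 to 1 score scale and the cut-off values 0.90 and 0.60 are all assumptions chosen for the example.

```python
def triage(score, accept_at=0.90, discard_below=0.60):
    """Decide what to do with a suggested correction based on its score.

    The score scale and both cut-offs are hypothetical; in practice they
    would be tuned by sampling the results on either side of each boundary.
    """
    if score >= accept_at:
        return "auto_apply"      # high confidence: let the tool make the change
    if score < discard_below:
        return "discard"         # low confidence: the suggestion is probably wrong
    return "manual_review"       # grey zone: route to a data steward


# Hypothetical suggestions produced by a matching tool: (record id, score).
suggestions = [("A-1001", 0.97), ("A-1002", 0.74), ("A-1003", 0.35)]

queues = {"auto_apply": [], "manual_review": [], "discard": []}
for record_id, score in suggestions:
    queues[triage(score)].append(record_id)

print(queues)
```

Widening or narrowing the band between the two cut-offs is then a staffing decision as much as a technical one, since it determines how many cases land on the manual review queue.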

If you have problems with Data Quality and you are evaluating tools to help you, then I advise you to be sceptical, but in a positive way. Don’t expect any tool to solve all your problems, and beware of any vendor claiming that their tools can. Dig a bit deeper and ask what happens when the tools aren't sure. How easily can these cases be passed back to a team of Data Stewards? Does the tool come with a workflow engine that can route these cases to the right people? But, more importantly, think about preparing your organisation for this. The tools will only get you so far; to get the most value from them, your own people will make the biggest difference.

Tuesday, 4 March 2014

Why Bayes’ Theorem means that Data Quality really matters if you are looking for Needles in Haystacks

Introduction

In this blog I'm going to explain why data quality and data lineage are so important for anybody looking for needles in haystacks using advanced analytics or big data techniques. This is especially relevant if you are trawling through large datasets looking for a rarely occurring phenomenon, and the impact of getting it wrong matters. Ultimately it comes down to this: the error rate in your end result needs to be smaller than the share of the population that you are looking for. If it isn't then your results will have more errors in them than valid results. Understanding Bayes' theorem is the key to understanding why.

I discovered this phenomenon last year when I applied Business Analytics to a particular problem that I thought would yield real results. The idea was to try to predict a certain kind of (undesirable) customer behaviour as early as possible and so try to prevent it before it happened. While the predictive models that came out of this exercise were good, they were ultimately unusable in practice because the number of false positives was unacceptably high. What surprised me most was the fact that the false positives outnumbered the true positives. In other words, when the algorithm predicted the undesirable outcome that we were looking for, the chances were that it was wrong… and using it would effectively have meant unacceptable discrimination against customers. I was surprised and disappointed because this was a good algorithm with a good lift factor, but ultimately unusable.

Bayes' Theorem

At the same time I was reading a book called Super Crunchers, by Ian Ayres, which by the way I thoroughly recommend, especially if you are looking for a readable account of what can and can’t be done through number crunching. Towards the end of the book is an explanation of Bayes' theorem and how to apply it when interpreting any kind of predictive algorithm or test. Now, I learned Bayes' theorem when I was at school, but this chapter was a really useful reminder. The theorem allows you to infer a hidden probability based on known, measured probabilities. When I applied it to the problem I just described above, it made a lot of sense.

How 99% accuracy can lead to a 50% error rate

What I learnt was this… if you are applying an algorithm to a large population in order to find something that only affects a small minority of that population then you need to have a good idea of how reliable your end result is in order to work out how many of its predictions are wrong. The end result is highly dependent on the quality of the data that you start with, and then the quality of the analytics algorithms you use as well as any intermediate processing that you do on it. What’s more, if the error rate is about the same as the size of the population you are looking for, then around half of the predictions will be false. So, if you are looking for something that affects 1% of your population and you have an algorithm that is 99% accurate, then half of its positive predictions will be wrong.

To demonstrate this, I’ll use a hypothetical credit screening scenario. Imagine that a bank has a screening test for assessing creditworthiness that is 99% accurate, and they apply it to all customers who apply for loans. The bank knows from experience that 1% of their customers will default. The question then is, of those customers identified as at risk of default, how many will actually default? This is exactly the sort of problem that Bayes' theorem answers.

Let’s see how this would work for a screening of 10,000 applicants.

Of the 10,000 customers, 100 will default on their loan. The test is 99% accurate, so it will make 1 mistake in this group. One customer will pass the screening and still default later. The other 99 will receive a correct result and be identified as future defaulters. This would look pretty attractive to anyone trying to reduce default rates.

In the group of 10,000 applicants, 9,900 will not default. The test is 99% accurate, and so 99 of those customers will wrongly be identified as credit risks. This would look unattractive to anyone who is paid a commission on selling loans to customers.

So, we have a total of 198 customers identified as credit risks, of which 99 will default and 99 will not. So in this case, if you are identified as a credit risk, then you still have a 50% chance of being a good payer… and that’s with a test that is 99% accurate. Incidentally, the chances of a customer passing the credit check and then defaulting are now down to roughly 1 in 10,000 (1 of the 9,802 customers who pass).

Accuracy = 99%	Default predicted: True	Default predicted: False	Total
Customer defaults: True	99	1	100
Customer defaults: False	99	9,801	9,900
Total	198	9,802	10,000
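The figures in the table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the worked example: the function name screening_counts is made up for the illustration, and it assumes, as the example does, that the test is equally accurate for defaulters and for good payers.

```python
def screening_counts(applicants, default_rate, accuracy):
    """Split a screened population into the four cells of the table.

    Assumes the same accuracy applies to defaulters and to good payers.
    """
    defaulters = round(applicants * default_rate)
    good_payers = applicants - defaulters

    caught = round(defaulters * accuracy)                   # defaulters flagged as risks
    missed = defaulters - caught                            # defaulters who pass the test
    wrongly_flagged = round(good_payers * (1 - accuracy))   # good payers flagged as risks
    cleared = good_payers - wrongly_flagged                 # good payers who pass the test

    return {
        "caught": caught,
        "missed": missed,
        "wrongly_flagged": wrongly_flagged,
        "cleared": cleared,
    }


counts = screening_counts(10_000, 0.01, 0.99)
flagged = counts["caught"] + counts["wrongly_flagged"]
share_genuine = counts["caught"] / flagged

print(counts)
print(f"{flagged} customers flagged, {share_genuine:.0%} of them will actually default")
```

Changing the default rate or the accuracy in the call shows how quickly the balance between genuine and false alarms shifts.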

It occurred to me that this logic is valid for any situation where you are looking for needles in haystacks, and that is the kind of thing that people are doing today under the banner of Big Data – trawling through large volumes of data to find “hidden gems”. Other examples would include mass screening of DNA samples to find people susceptible to cancer or heart disease or trawling through emails looking for evidence of criminal behaviour.

Now, in the example I gave, I deliberately used clean numbers where the size of the minority we’re looking for (1%) was equal to the error rate. In reality they are unlikely to be equal. The graphic below shows how the rate of false positives varies as the size of the minority and the overall error rate change. What this graph shows very clearly is that to be useful, the end result of your prediction algorithm needs to generate fewer errors in total than the size of the target population that you are trying to find. Furthermore, if you are looking for small populations in large data sets, then you need to know how reliable your prediction is, and that is highly dependent on the reliability of the data you are starting with. If you are looking for 1 in 1,000 then 99% accuracy isn't good enough, because then around 90% of your results will be wrong.
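The relationship between the size of the target group, the accuracy and the share of wrong results can be tabulated directly from Bayes' theorem. The sketch below is an illustration under the same simplifying assumption as the credit example, namely that the test is equally accurate for members and non-members of the target group. The function name false_positive_share and the chosen grid of values are assumptions for the example.

```python
def false_positive_share(prevalence, accuracy):
    """Share of positive predictions that are wrong.

    prevalence: fraction of the population that really is a "needle".
    accuracy:   fraction of cases the test gets right, assumed to be the
                same for needles and for the rest of the haystack.
    """
    true_positives = prevalence * accuracy
    false_positives = (1 - prevalence) * (1 - accuracy)
    return false_positives / (true_positives + false_positives)


for prevalence in (0.1, 0.01, 0.001):
    for accuracy in (0.9, 0.99, 0.999):
        share = false_positive_share(prevalence, accuracy)
        print(f"target group {prevalence:.1%}, accuracy {accuracy:.1%}: "
              f"{share:.0%} of positive results are wrong")
```

Whenever the error rate equals the size of the target group the share comes out at one half, and it rises steeply as the target group gets smaller than the error rate.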

What makes a good predictive model?

So let’s imagine that we want to apply Big Data techniques to find some needles in a large haystack of data, and we are expecting these to occur at around 0.1% of our total population. Bayes' theorem tells us that to be useful, our predictive algorithm needs to be more than 99.9% accurate. There are two ways of improving the accuracy, either by using the best possible algorithm or by using the best possible data available.

There’s been a lot of work done on algorithms and nearly all of it is available. Market leading analytical software is affordable, so if you really are looking for needles in haystacks and there is value in it, the tools are available to you.

What’s less obvious is the quality of the input data. While algorithms are reproducible, good data isn't. Each new problem that someone needs to tackle needs its own dataset. Each organisation has its own data for solving its own problems, and just because one bank has good data, doesn't mean that all banks have good data.

The chances are that if you are looking for needles in haystacks, then what you’re finding probably aren't the needles you were looking for at all. If you haven't assessed the reliability of your predictive model then you may even be wildly overconfident in the results. While you can invest in better algorithms, if you really want better results you will probably only get them by using better data.

Friday, 31 January 2014

The Benefits of a Single Customer View

In previous roles, I have delivered Single Customer View projects twice. In each case I prepared a very positive business case before getting approval for the project, and in each case the benefits were evaluated after implementation. Unlike other projects that I have delivered, the business case at the end actually turned out to be more positive than originally forecast.

In this blog I would like to share with you how a well implemented Single Customer View can deliver benefits far beyond the associated costs.

What is a Single Customer View, and why would you want it?

Put simply it’s about having a single “golden” record for each of your customers, so that whenever you refer to a customer, you’re using the most up to date details, and everything you know about that customer is correctly linked to them. It means that when a customer calls your call centre, you can find their details quickly, and can see their history. It means that you know how many customers you actually have, and what they are worth to you. It means that customers are central to your business model rather than accounts, subscriptions or orders.

Single Customer View as a requirement for compliance

If a government or regulator mandates that you strive to achieve a Single View of your customers, then you don’t need a business case. Either you do it, or you lose your licence to operate. An example of this is the Know Your Customer rules that apply to banks, who need to successfully identify their customers in order to avoid being used for money laundering activities.

However, I would argue that the other benefits that come from a Single View are so important that you should consider them too. If it is mandatory, there is a temptation to stop there and do it because you have to. I would encourage you to keep reading. If you have to implement a Single Customer View, then you may as well get as much value out of it as possible.

Cost reduction

If you have a large number of customers and communication with them is a major cost for your business then you can make surprisingly big savings by implementing a Single Customer View. This would typically be an important part of the business case for energy suppliers, telecoms, insurers, banks or local government where regular postal communication to all of your customers is a major cost. In some cases, it may be worth going one step further and thinking in terms of a Single Household view, and thereby avoiding the needless cost of sending duplicate communication to everyone in a household.

Risk Management

One of the reasons why governments have mandated Know Your Customer practices for banks is to facilitate better risk management. Each customer has their own risk profile, and if you are managing that risk profile, then it will be a lot more accurate if you make the effort to link everything that you know about a customer via a single customer record. The risk profile for a customer will be much more complete if you can link their borrowing to their savings for example. If you can identify a returning customer who has been with your competition for a while, then you can assess their risk profile much more accurately than if you treated them like a new and unknown customer.

The more accurately you can assess your total risk, the less provision you need to make to cover unknown risks. This in turn releases capital for other opportunities or further investments.

Customer Loyalty

It often comes as a surprise, but most customers actually like it when the companies they deal with have a complete picture of them. It boosts confidence that they are treated as an individual and not just as an account. Admittedly it’s not universal, and some customers value their anonymity, but in my experience the overwhelming majority of customers expect you to know them. When customers hold multiple accounts with you, they expect you to join up the dots, and treat them as individuals. Personally, I like it when Amazon suggests books to me that I might like but I get annoyed when my bank tries to sell me a credit card that I already have. One is showing me that they know me, and the other is showing me that they don’t care about me.

Fraud

If combating fraud is important to your business, then a Single Customer View can be an invaluable weapon in your armoury. This is particularly true if you are operating on tight margins. Losses due to fraud typically scale with turnover, but they go all the way to the bottom line. So, if you are losing 2% of your turnover to fraud and your net margin is around 10% of turnover, this could easily be 20% of your net profits.

Fraudsters benefit from anonymity and multiple accounts, but many of them don’t apply particularly sophisticated techniques. Significant fraud can be avoided simply by matching small changes in names or dates of birth. Even more can be avoided if you can identify members of the same household who take it in turns to run up debts they have no intention of repaying.
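Matching small changes in names or dates of birth can be sketched with the Python standard library. The example below is a deliberately simple illustration, not a production matching engine: the customer records are invented, and the function names and the two thresholds (0.85 for name similarity, at most two differing characters in the date of birth) are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dob_differences(a, b):
    """Count the character positions where two YYYY-MM-DD dates differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def likely_same_person(r1, r2, name_threshold=0.85, max_dob_diff=2):
    """Flag two records whose name and date of birth are nearly identical."""
    return (name_similarity(r1["name"], r2["name"]) >= name_threshold
            and dob_differences(r1["dob"], r2["dob"]) <= max_dob_diff)


# Invented example records.
customers = [
    {"id": 1, "name": "Jonathan Smith", "dob": "1980-03-12"},
    {"id": 2, "name": "Jonathon Smith", "dob": "1980-03-21"},
    {"id": 3, "name": "Mary Jones", "dob": "1975-11-02"},
]

# Compare every pair of records and keep the suspicious ones.
suspects = [(a["id"], b["id"])
            for a, b in combinations(customers, 2)
            if likely_same_person(a, b)]
print(suspects)
```

Pairs flagged this way are candidates rather than proof, so they are best treated as cases for a person to investigate before any action is taken against a customer.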

I have seen business cases where fraud reduction was the biggest single benefit for the Single Customer View, and yet it is often overlooked because fraud prevention is not considered as part of the core business.

Business Analytics

The benefit in terms of Business Analytics is one of the least obvious benefits, and also the most difficult to quantify in advance. Nevertheless, I have seen the difference that it makes, and it can be substantial if your business analytics are aimed at understanding customer behaviour in order to be able to anticipate it. Examples are churn prediction or credit scoring.

In these scenarios the benefit comes from improving the quality of the data going into your analytics. Practitioners of business analytics generally agree that improvements in the quality of input data have a far greater effect than using the latest algorithms. It’s not accounts that decide to leave your company for the competition, it’s customers. If one customer is having problems on one of his four accounts, then he won’t just take that one account to the competition. He will take all four, but if you don’t know that they are linked, you won’t be able to see it coming. Similarly, by linking all of a customer’s accounts you will be able to assess their credit rating accurately when they place a large order that you don’t want to lose but can’t afford to give away.

Conclusion

I have highlighted six areas that I think you should consider when evaluating the benefits side of the business case for the Single Customer View. If you are considering implementing a Single Customer View, the chances are that you are championing one of these benefits as the case for action. I would strongly recommend that you consider the others. For most companies, I would expect the benefits to be at least twice the costs. If you time it right it can even be possible to achieve payback in the same financial year that you launch the project.