Statistical Sampling for Audit

Introduction

After 27 years in the business process outsourcing arena, I’ve seen many deals end in disagreement over Service Level Agreement (SLA) results. It usually comes down to how each group is measuring the data. Too often I see small, non-random samples being used to infer the quality of the whole population. Let's look at what a statistically valid sample is and how to use it.

Why It Goes Wrong

When times are good, everything is easy. Deliverables are being met and the client is happy. Things get more challenging when SLAs are not met. In a good customer/provider relationship, both parties will discuss what happened and what the corrective action plans should be. The provider can then implement the plan and review the results with the customer.

Things can go wrong when the customer's first impulse is to micromanage. If you aren’t hitting your quality SLA, the customer may say they want a 100% audit of the work. This can actually have the opposite effect and further hurt your quality. Perhaps it is a select few staff that are struggling and bringing down the team’s average. If so, it would be best to focus on those individuals. Putting everyone at 100% audit significantly increases the work effort of your QA team. Since they must review all the work, they can’t spend time working with the truly poor performers. It takes longer to get improved results because they are focusing on the wrong spot.

Another common error is micromanaging down to the individual level. You outsourced in part for the provider's expertise and to get away from the daily management of individual staff members. Don’t let past experiences with your internal team pull you back to that mindset. It’s at this level of oversight that I often see the misuse of statistics come into play.

Customer: “We audited two widgets that Johnny built and one of them was incorrect. Therefore, Johnny’s quality is 50%. We want him off the project.”

This might get a bit technical, so let’s give statistics its own section…

Statistics

The hypothetical example given above is meant to show that you need a statistically valid sample size before drawing any conclusions. The only true way to know the exact quality of a process is to measure every single output. That isn’t cost effective for anyone, so instead we need a sample that reflects the quality of the overall population. You are probably familiar with the Nielsen ratings for TV. They sample the viewing habits of a set of families and project them onto the total U.S. population. Similarly, in an outsourcing environment, we want to review a sample large enough to represent the overall quality.
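To put a number on the "one wrong out of two" example, here is a minimal sketch of a Wilson score interval, one common way to attach a confidence range to a measured proportion. The choice of the Wilson method and the helper name `wilson_interval` are assumptions made for illustration; they are not part of the original example.

```python
import math

def wilson_interval(correct, audited, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    p = correct / audited
    denom = 1 + z**2 / audited
    center = (p + z**2 / (2 * audited)) / denom
    half = z * math.sqrt(p * (1 - p) / audited + z**2 / (4 * audited**2)) / denom
    return center - half, center + half

# Johnny's audit: two widgets reviewed, one correct.
low, high = wilson_interval(correct=1, audited=2)
print(f"1 of 2 correct: quality is plausibly anywhere from {low:.0%} to {high:.0%}")
```

With only two units audited, the plausible range runs from roughly 9% to 91%, which is another way of saying the audit tells us almost nothing about Johnny's real quality.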

We don’t have the space in this paper to dive into the technical details of what makes a sample statistically valid, so to simplify, we can rely on one of the many online calculators that compute the sample size for us. We will focus on the results.

Besides the size of the population, the calculation needs two inputs: the confidence level and the margin of error. The confidence level is how often the sampling method would produce a range that contains the true value. Typically, a 95% confidence level is used. Don’t confuse this with an SLA of 95% quality. The confidence level only states how confident you are in the quality results. It doesn’t tell you what the quality is. That comes from your measurement of the sample. The margin of error is how far the sample result may differ from the true population value because of random sampling. The smaller the margin of error, the more precise the estimate. A typical value for margin of error is 2%. You’ve seen this in election polls, where John Q. Congressman is projected to win with 75% of the votes +/– 3%.

Let’s look at an example to illustrate this. Assume you have 30,000 widgets a month to outsource and you want to know how many to audit to measure the quality of the population. We want a 95% confidence level and a margin of error of 2%. Plugging into our calculator, we get a sample size of 2,224. For those that want the math behind this:

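The calculator's results match the standard sample-size formula for a proportion, with a finite population correction:

$$n_0 = \frac{z^2 \, p(1-p)}{e^2}, \qquad n = \frac{n_0}{1 + \frac{n_0 - 1}{N}}$$

Here N is the population size (30,000), e is the margin of error (0.02), z is the z-score for the confidence level (1.96 for 95%), and p is the expected proportion, set to 0.5 as the most conservative choice. This gives n_0 = 2,401 and n of about 2,223.1, which rounds up to 2,224.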

A sample of 2,224 units is about 7.4% of the 30,000 widgets, but that isn’t the important number. If our monthly volumes increased to 120,000 widgets, would we still need to audit 7.4% to get a statistically valid result? Let’s plug it in and look…

Results
Your recommended sample size is: 2354

As you can see, at four times the volume, we need to review less than 2% of the population. The required sample grew by only 130 units.
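The two calculator results above can be reproduced with a short sketch. It assumes the calculator uses the standard formula for a proportion with a finite population correction, an expected proportion of 0.5, and the rounded z-scores 1.96 and 2.576. The article does not say which calculator was used, so treat this as an assumption that happens to match its numbers. The name `sample_size` is hypothetical.

```python
import math

# Rounded z-scores as commonly tabulated (assumed to match the calculator).
Z_SCORES = {0.95: 1.96, 0.99: 2.576}

def sample_size(population, confidence=0.95, margin=0.02, p=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    z = Z_SCORES[confidence]
    n0 = z**2 * p * (1 - p) / margin**2        # size for an unlimited population
    n = n0 / (1 + (n0 - 1) / population)       # shrink for a finite population
    return math.ceil(n)                        # always round up

for volume in (30_000, 120_000):
    n = sample_size(volume)
    print(f"{volume:>7,} widgets -> audit {n:,} ({n / volume:.2%})")
```

The same function also accepts other confidence levels and margins, for example `sample_size(120_000, confidence=0.99, margin=0.01)`.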

In this example, if we review 2,354 units and find that the quality of those is 96%, what does this tell us? Does it guarantee that our total population is 96% accurate? No! What it tells us is that we are 95% confident (confidence level) that the true population average is somewhere between 94% and 98% (margin of error). To explain the confidence level better: if we were to draw 100 separate random samples of 2,354 units from the 120,000 widgets, we would expect about 95 of them to produce a result within 2 points of the true quality. The other 5 or so would miss by more than that.
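The 2% margin that went into the calculator is a planning figure. As a hedged sketch, assuming the usual normal approximation for a proportion with a finite population correction (the article does not state what the calculator does internally), the margin actually achieved can be recomputed from the observed quality. `achieved_margin` is a hypothetical helper name.

```python
import math

def achieved_margin(observed, sample, population, z=1.96):
    """Margin of error for an observed proportion, with finite population correction."""
    standard_error = math.sqrt(observed * (1 - observed) / sample)
    fpc = math.sqrt((population - sample) / (population - 1))
    return z * standard_error * fpc

worst_case = achieved_margin(0.50, 2354, 120_000)
at_96 = achieved_margin(0.96, 2354, 120_000)
print(f"worst case (50% quality): +/- {worst_case:.2%}")
print(f"observed 96% quality:     +/- {at_96:.2%}")
```

Under these assumptions, the margin reaches 2% only in the worst case of 50% quality. At an observed 96% it comes out under 1%, so the 94% to 98% range quoted above is on the conservative side.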

Increasing Confidence

What can we do to increase our confidence in our quality review? Using our 120,000 units, we can play with the two variables from above: confidence level and margin of error.

Instead of a 95% confidence level, what if we wanted it to be 99%? What is our new sample size?

Results
Your recommended sample size is: 4009

Increasing the confidence level raises the work effort by about 70%. But this still doesn’t tell us everything we want to know. Again, this only tells us that we are 99% confident that the true population average would be anywhere between 94% and 98%.

Let’s go back to the 95% confidence level and change up the margin of error to 1% and see what that does.

Results
Your recommended sample size is: 8893

You can see that adjusting the margin of error has a bigger impact on the sample size. Now we can state we are 95% confident the true population average would be anywhere between 95% and 97%. The amount of work needed is the tradeoff, as we have increased the effort by almost a factor of 4.
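A short piece of arithmetic shows why the margin of error is the stronger lever. It assumes the standard sample-size formula for a proportion, n_0 = z^2 p(1-p) / e^2, before any finite population correction, with the rounded z-scores 1.96 (95%) and 2.576 (99%). The sample size grows with the square of z and with the inverse square of the margin e:

$$\frac{n_0(e = 0.01)}{n_0(e = 0.02)} = \left(\frac{0.02}{0.01}\right)^2 = 4, \qquad \frac{n_0(z = 2.576)}{n_0(z = 1.96)} = \left(\frac{2.576}{1.96}\right)^2 \approx 1.73$$

The finite population correction pulls both ratios down slightly for 120,000 units, which is why the calculator results work out to 8,893 / 2,354 ≈ 3.78 and 4,009 / 2,354 ≈ 1.70.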

One more time, let’s adjust both levers - a 99% confidence level with a 1% margin of error.

Results
Your recommended sample size is: 14575

Our work effort is now about 12% of the population, or roughly 6x the sample from our first test at this volume. The tradeoff is that we can now state we are 99% confident the true population average would be anywhere between 95% and 97%.

Is More Better?

Is there value in auditing more? Maybe. If you are building components for an airplane, I don’t think anyone would be happy with a 95% confidence level or a margin of error of 2%. In that case, much more audit needs to be done as you will want 99.9999…% accuracy. However, if you are testing the shape of styrofoam packing peanuts, you are probably fine with getting close.

The tradeoff really comes down to the gain you receive for the added expense. Auditing is not cheap, especially if it is a manual process, as the cost of labor worldwide is increasing. For most back-office work, a statistical sampling based on a 95% confidence level and a 2% margin of error should be sufficient to measure your output. It gives you a “good enough” idea of where your quality falls and will raise red flags if you are outside of your SLA.

Random Sampling

One key point to remember on the sample sizes described above is each must be a random sample. That means each member of the population has exactly the same chance of being selected. This ensures your sample represents the whole population. Here is another area where I see statistics being used incorrectly.
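As a minimal sketch of what "exactly the same chance of being selected" can look like in practice, the snippet below draws a simple random sample without replacement using Python's standard library. The unit IDs, the seed value, and the helper name `draw_audit_sample` are hypothetical and only there for illustration.

```python
import random

def draw_audit_sample(unit_ids, sample_size, seed=None):
    """Simple random sample without replacement: every unit is equally likely."""
    rng = random.Random(seed)  # a recorded seed lets both parties reproduce the draw
    return rng.sample(unit_ids, sample_size)

# Hypothetical unit IDs for one month's volume of 30,000 widgets.
unit_ids = list(range(1, 30_001))
audit_sample = draw_audit_sample(unit_ids, 2_224, seed=2024)
print(f"selected {len(audit_sample):,} of {len(unit_ids):,} units")
```

Recording the seed is one possible design choice: it lets the customer and the provider regenerate the same selection and confirm that nobody hand-picked the units.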

Sometimes a customer knows that Tim, Frank, and Sue are poor performers. So they want to audit more of their work. It makes perfect sense for the provider to review higher amounts of the work by the poor performers and then work with them to improve. However, focusing on the poor performers as part of the final quality audit will negatively skew your results.

A larger portion of your sample now comes from poor performers, so it is no longer a random sample. Auditing the poor performers only tells you the quality of the poor performers, not the quality of the total population. At the same time, make sure your provider isn’t skewing the data the other way by auditing only the good performers. The customer and provider should agree at the start how the measurements will be conducted so both parties can ensure a valid result.
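A small worked example shows the size of the skew. All figures here are hypothetical: the poor performers are assumed to produce 10% of the work at 80% quality, and everyone else 90% of the work at 98% quality. `blended_quality` is an illustrative helper name.

```python
def blended_quality(groups):
    """Weighted average quality; groups is a list of (share_of_units, quality)."""
    total_share = sum(share for share, _ in groups)
    return sum(share * quality for share, quality in groups) / total_share

# What a random sample would measure on average: shares match the real workload.
true_quality = blended_quality([(0.10, 0.80), (0.90, 0.98)])

# Targeted audit: half of the audited units come from the poor performers.
targeted_audit = blended_quality([(0.50, 0.80), (0.50, 0.98)])

print(f"true quality:   {true_quality:.1%}")
print(f"targeted audit: {targeted_audit:.1%}")
```

In this made-up scenario the team is really at about 96%, but the targeted audit reports 89%, which could be enough to turn a passing SLA into a failing one.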

Conclusion

Know that 32.4% of statistics are made up on the spot. Just like my made-up stat, make sure you understand what is behind the data you are looking at. As shown in the examples above, statistics, if not validated, can lead you in the wrong direction. If you look at two units for a person and one is incorrect, can you really say that all their work is only 50% accurate?

When someone tells you their quality, always ask what is behind the data. The details above should arm you with enough background to know if the information they are providing is reliable.
