Muffin or Chihuahua: Confusion Matrix and the Base Rate Fallacy

Muffin or chihuahua?

Poodle or fried chicken?

Puppy or bagel?

Shar-pei or towel?

Sheepdog or mop?

Shar-pei or croissant?

Dog or teddy bear?

You may have come across these images if you work in a technical field. These mix-ups are a fun way of testing classification accuracy.

But what exactly does it mean when one says accuracy?

This post highlights that ‘accuracy’ is surprisingly nuanced, and is often susceptible to the Base Rate Fallacy. This frequently overlooked math error fools 80% of doctors and, I can confirm from firsthand interaction, a surprising number of senior corporate executives too.

Suppose there’s a photo collage of 10 chihuahuas and 90 blueberry muffins.

A robot armed with a powerful AI algorithm is tasked with tagging the ones that are chihuahuas.

It selects 9 chihuahuas, as well as another 27 muffins it believes are chihuahuas.

Question: What was the robot’s accuracy?

A) 90% since it correctly tagged 9 of 10 chihuahuas.

B) 72% since it correctly classified muffins as muffins and chihuahuas as chihuahuas 72 times out of 100.

C) 25% since only 9 out of the 36 (= 9 + 27) tagged sample were actually chihuahuas.

Somewhat unintuitively, the correct answer is B) 72%.

Most people use the term ‘accuracy’ without understanding its proper statistical definition, and without realizing that ‘accuracy’ comes in various flavours.

The confusion matrix shows the 5 variants of accuracy: sensitivity (aka recall), specificity, precision, negative predictive value, and accuracy. Understanding this distinction is important because each of these numbers can be completely different.

The confusion matrix is confusing, right?

Let’s walk through these terms one at a time.

A true positive (TP) is a relevant item that was tagged as relevant. Here ‘relevant’ means being a chihuahua (the thing the robot was asked to pick out), and ‘item’ refers to each individual chihuahua-or-muffin image. These are correct tags. In our example, the robot tagged 9 chihuahuas as chihuahuas. Therefore, TP = 9.

A false positive (FP) is an irrelevant item that was tagged as relevant. These are incorrect tags. In our example, the robot tagged 27 muffins as chihuahuas. Therefore, FP = 27. These are also known as Type 1 Errors. False positives are false alarms.

A true negative (TN) is an irrelevant item that was tagged as irrelevant. These are correct tags. In our example, the robot tagged 63 muffins as muffins. Therefore, TN = 63.

A false negative (FN) is a relevant item that was tagged as irrelevant. These are incorrect tags. In our example, the robot tagged 1 chihuahua as a muffin. Therefore, FN = 1. These are also known as Type 2 Errors. False negatives are alarm malfunctions: the alarm didn’t go off when it should have.
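To make the four boxes concrete, here is a minimal Python sketch that rebuilds the collage from the counts in the text and tallies each category. The `(actual, tagged)` pair representation is just my assumption for illustration, not how any particular robot stores its results.

```python
# Rebuild the 100-image collage as (actual, tagged) pairs.
# True means "chihuahua", False means "muffin".
items = (
    [(True, True)] * 9         # chihuahuas tagged as chihuahuas
    + [(True, False)] * 1      # the one chihuahua the robot missed
    + [(False, True)] * 27     # muffins wrongly tagged as chihuahuas
    + [(False, False)] * 63    # muffins correctly left untagged
)

# Tally each cell of the confusion matrix.
tp = sum(1 for actual, tagged in items if actual and tagged)
fp = sum(1 for actual, tagged in items if not actual and tagged)
tn = sum(1 for actual, tagged in items if not actual and not tagged)
fn = sum(1 for actual, tagged in items if actual and not tagged)

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")  # TP=9 FP=27 TN=63 FN=1
```

Note that the four counts always add up to the total number of items, 100 in this case.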

In many settings, Type 2 errors are more dangerous than Type 1. If a medical test says you have cancer but you actually don’t (false positive), the damage is usually limited to the stress and nuisance of having to do more exams. But if you actually have cancer and the medical test says otherwise (false negative), you’re in deep trouble.

In general, think of positive and negative as the robot’s answer. Robot says “Positive: these are chihuahuas. Negative: these are not.”

Conversely, think of true and false as an objective human examiner’s verdict on that answer: “Good job, this positive is true. But that positive is actually false.”

The confusion matrix should start making more sense now. Now we go into the 5 ‘accuracy flavours’.

Tip: the numerator is always one of the ‘true’ quantities (TP, TN, or TP+TN).

Sensitivity is the portion of relevant items that were (correctly) tagged as relevant. Also known as recall. In our example, of 10 chihuahuas in total, 9 were tagged as chihuahuas. Therefore, sensitivity = 9/10 = 90%.

Specificity is the portion of irrelevant items that were (correctly) not tagged. In our example, of 90 muffins, 63 were not tagged as chihuahuas (ie muffins correctly classified as muffins). Therefore, specificity = 63/90 = 70%.

Precision is the portion of all tagged items that were (correctly) relevant. Also known as Positive Predictive Value. In our example, of 36 tagged items (which the robot believes are all chihuahuas), only 9 were actually chihuahuas. Therefore, precision = 9/36 = 25%.

Negative Predictive Value is the portion of untagged items that were (correctly) irrelevant. In our example, of 64 untagged items (which the robot believes are all muffins), 63 were actually muffins. Therefore, negative predictive value = 63/64 = 98.4%.

Accuracy is the portion of all items that were correctly tagged. In our example, of 100 items, 9 chihuahuas and 63 muffins were correctly tagged. Therefore, accuracy = (9+63)/100 = 72%.

This is why the correct answer to the accuracy multiple choice question earlier was B) 72%.
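If you prefer code to fractions, the five flavours can be sketched in a few lines of Python. The function name `confusion_metrics` and the dictionary keys are my own labels for illustration; the formulas are the ones described above.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return the five 'accuracy flavours' as fractions between 0 and 1."""
    return {
        "sensitivity": tp / (tp + fn),                # of all chihuahuas, how many were tagged
        "specificity": tn / (tn + fp),                # of all muffins, how many were left untagged
        "precision": tp / (tp + fp),                  # of all tagged items, how many were chihuahuas
        "negative_predictive_value": tn / (tn + fn),  # of all untagged items, how many were muffins
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # of all items, how many were classified correctly
    }

# Counts from the muffin-or-chihuahua example.
metrics = confusion_metrics(tp=9, fp=27, tn=63, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Running this prints 90.0%, 70.0%, 25.0%, 98.4% and 72.0%, so answers A, C and B from the quiz are sensitivity, precision and accuracy respectively.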

Now that we’ve wrapped our heads around the confusion matrix, let’s delve into a little-known cognitive bias called the base rate fallacy.

Here’s a question that 80% of doctors get wrong.

Suppose 1% of women have breast cancer, and the screening test has 90% sensitivity and 91% specificity. If a woman tests positive, what are the chances she actually has breast cancer?

Recall, sensitivity is the “of 10 chihuahuas in total, 9 were tagged as chihuahuas”. And specificity is the “of 90 muffins, 63 were not tagged as chihuahuas.” Translating the previous example into this one, think of having breast cancer as a chihuahua.

Have a go before you scroll down…

Solution:

“Women who test positive” is TP + FP
Within this subset, “how many actually have cancer” is TP.
So basically, the question is asking “what is TP/(TP+FP)?”, i.e. what is the precision?
Let’s say 1,000 women got tested
1% prevalence means, of 1,000 total women, those that have breast cancer (TP + FN) = 1% of 1,000 = 10
This means ‘healthy’ women who don’t have breast cancer (TN + FP) = 1,000 – 10 = 990
91% specificity means, of 990 ‘healthy’ women, those that got a true negative (TN) = 91% of 990 = 900.9
This means ‘healthy’ women that freak out and receive a false positive (FP) = 990 – 900.9 = 89.1
90% sensitivity means, of 10 women that have breast cancer, those that got a true positive (TP) = 90% of 10 = 9 (phew! detect and treat)
This means women that have breast cancer but have a false sense of security due to getting a negative test (FN) = 10 – 9 = 1 (oh no! she’ll live on undetected thinking she’s healthy when she’s not)
So, the number of women who tested positive (TP+FP) is 9 + 89.1 = 98.1
Of these, 9 actually have cancer (TP)
Therefore, the answer is 9/98.1 ≈ 9.2%, which is closest to C) 1 in 10.
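The same bookkeeping can be sketched in Python, which makes it easy to try other prevalence or test figures. The helper name `screening_counts` is hypothetical, and the fractional head counts (900.9 women) are simply expected values, as in the walkthrough.

```python
def screening_counts(population, prevalence, sensitivity, specificity):
    """Split a tested population into expected TP, FP, TN and FN counts."""
    sick = population * prevalence
    healthy = population - sick
    tp = sick * sensitivity       # sick and correctly flagged
    fn = sick - tp                # sick but missed
    tn = healthy * specificity    # healthy and correctly cleared
    fp = healthy - tn             # healthy but flagged anyway
    return tp, fp, tn, fn

tp, fp, tn, fn = screening_counts(1000, prevalence=0.01, sensitivity=0.90, specificity=0.91)
precision = tp / (tp + fp)

print(f"TP={tp:.1f} FP={fp:.1f} TN={tn:.1f} FN={fn:.1f}")
print(f"Chance of cancer given a positive test: {precision:.1%}")
```

With the figures from the question, the positives are made up of about 9 true positives and about 89 false positives, so precision comes out a little above 9%.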

About half of doctors answered A) 9 in 10. Another 30% chose other incorrect answers. And only 20% chose the correct answer.

Imagine this. You’ve just got your breast cancer test result and it comes out positive. “You have breast cancer”. You freak out and ask your doctor “how accurate is this test?!”

Looking at the test accuracy statistics (prevalence 1%, sensitivity 90%, specificity 91%), half of the doctors would reply “I’m sorry, the test is pretty accurate. About 90% accurate. You’ll have to seek additional medical attention and undergo further tests.”

But the truth is, only 1 out of 10 women that tested positive actually had cancer! Not 9 out of 10.

How could so many doctors (who are supposed to be quite intelligent) get the answer so wrong?

What makes the answer deceptive at first, but obvious in hindsight?

People underestimate the effect of false positives!

Even though 90% sensitivity and 91% specificity rates seem pretty good, the impact of the 9% false positive rate is grossly underestimated.

This is an example of the Medical Test Paradox, which is a type of Base Rate Fallacy.

When we think about this more deeply, we start to understand that:

Tests do not determine whether you have a disease.
Tests don’t even determine your chances of having a disease.
Rather, tests UPDATE your chances of having a disease.

This means that given 1% prevalence, 90% sensitivity, 91% specificity…

Before being tested, a woman’s chance she has breast cancer is 1%
If she receives a positive test, her chances change from 1% to about 10% (as we saw earlier, precision is about 1 in 10)
Let this sink in again. A woman receiving a positive test does not mean she has cancer. Rather, her chances of having cancer went up from 1% to about 10%.
Conversely, if a woman receives a negative test, her chances change from 1% to about 0.1% (FN / (FN + TN) = 1 / 901.9). It’s still not zero. Because sensitivity is only 90%, she could be that 1 unlucky woman in 1,000.

Whenever we see a probability needing to be updated based on new information, we enter Bayesian probability territory. We won’t go into this here, but I highly recommend this video.
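For readers who just want the arithmetic of the update, here is a minimal sketch of Bayes’ rule applied to a single test result. The function name `update_probability` and its parameters are my own naming, and it assumes sensitivity and specificity are the same for everyone being tested.

```python
def update_probability(prior, sensitivity, specificity, test_positive):
    """Bayes' rule: probability of disease after seeing one test result."""
    if test_positive:
        p_result_if_sick = sensitivity          # true positive rate
        p_result_if_healthy = 1 - specificity   # false positive rate
    else:
        p_result_if_sick = 1 - sensitivity      # false negative rate
        p_result_if_healthy = specificity       # true negative rate
    numerator = p_result_if_sick * prior
    return numerator / (numerator + p_result_if_healthy * (1 - prior))

after_positive = update_probability(0.01, 0.90, 0.91, test_positive=True)
after_negative = update_probability(0.01, 0.90, 0.91, test_positive=False)

print(f"After a positive test: {after_positive:.1%}")   # about 9.2%
print(f"After a negative test: {after_negative:.2%}")   # about 0.11%
```

The prior matters a lot here. Feeding in a prior of 10% instead of 1% (say, for a higher-risk group) with the same test pushes the post-positive figure to roughly 53%, which is the base rate doing its work.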

Example 3: Speech Analytics

I work at a B2B AI voice recognition software company that analyses customer conversations to extract useful business insights (e.g. identify upsell opportunities, predict churn, detect potential compliance breaches, check for presence of financial product disclosure statements etc).

A common question I get asked by potential clients is: “what is your solution’s accuracy?“

In practice I’d answer with something like: “between 80-95% depending on a number of factors… varies widely for every customer… why don’t you give us some of your calls and let us show you in a Proof of Concept?”

Suppose we did the Proof of Concept, and the results are as follows.

Note I’ve used the same numbers as the muffin-or-chihuahua example so we don’t have to repeat the math.

So accuracy, in a statistical sense, is 72%.

Then my conversation with them would go something like this.

Me: “Based on this Proof of Concept, the accuracy was 72%. The recording quality of your calls was quite poor, most likely because your contact center agents were not equipped with proper headsets. This is an easy fix though. Once they have headsets, we expect to see accuracy go up to about 90%.”

Exec: “72%. No, I don’t think that’s right. Your system flagged a number of calls as having a potential upsell opportunity. I listened to 36 of them and only 9 actually had upsell opportunities in them. So accuracy is not 72%. It’s more like 25%. Your solution is not very accurate.”

Me: “You’re right in saying that only 25% of the calls you listened to were true positives. And yes, having 75% of flagged calls turn out to be false positives is certainly an inconvenience, as 3 in 4 calls are false alarms. Yes, this will cost your sales agents additional time as they have to go through 4 calls in order to find the 1 with a sales opportunity. However, what matters is that we managed to find 9 of the 10 actual sales opportunities. Doesn’t this tangible dollar value more than offset the inconvenience of false positives?”

Exec: “Yes finding the 9 sales opportunities was good. But still, there’s just too many false positives. Can we tweak the settings so that there are less false positives?”

Me: “We certainly can. However I’d advise against it. Here’s why: if we configure the settings to show fewer false positives, it will likely also show fewer true positives. That is, more sales opportunities will slip through the cracks as they won’t be detected by our system (false negatives). We always want to balance false positives and false negatives.”

Exec: “Can we just see what this looks like when you configure the system rules to reduce false positives.”

Me: “Sure. We’ve updated the configuration. Here are the new results. Accuracy is now 87%.”

Exec: “Yes 87% sounds better than 72%. Also, I looked through the 15 flagged calls and there were only 9 false positives. False positive rate went down from 75% to 60%. This configuration is better. Let’s keep these settings.”

Me: *facepalm…

The 3 additional sales opportunities identified under the original configuration bring value to the business that more than offsets the inconvenience of having more false positives to work through. Some clients who are more familiar with technology-assisted business processes see false positive rates (in the sense used above: the share of flagged items that turn out to be false alarms) of 95%, even 99%, as the healthy norm. In fact, they want them to be this high, especially when monitoring for compliance, because false negatives are unacceptably risky for the business.

In this example, the higher 87% accuracy actually yields an inferior business outcome to the lower 72% accuracy case, assuming that the additional sales opportunities identified are worth more than the labour cost of processing the additional false positives.
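Here is a rough sketch of that trade-off in Python. The counts for the tweaked configuration (TP = 6, FP = 9, FN = 4, TN = 81) are inferred from the conversation: 15 flagged calls, 9 of them false positives, and the same 10 real opportunities in 100 calls. The dollar figures are entirely made up for illustration; plug in your own.

```python
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def accuracy(self):
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    def net_value(self, value_per_opportunity, cost_per_false_positive):
        """Value of opportunities found minus labour spent on false alarms."""
        return self.tp * value_per_opportunity - self.fp * cost_per_false_positive

original = Config("original", tp=9, fp=27, tn=63, fn=1)
tweaked = Config("tweaked", tp=6, fp=9, tn=81, fn=4)   # inferred from the dialogue

VALUE_PER_OPPORTUNITY = 500.0    # hypothetical dollars per sales opportunity found
COST_PER_FALSE_POSITIVE = 10.0   # hypothetical dollars of agent time per false alarm

for cfg in (original, tweaked):
    value = cfg.net_value(VALUE_PER_OPPORTUNITY, COST_PER_FALSE_POSITIVE)
    print(f"{cfg.name}: accuracy {cfg.accuracy:.0%}, net value ${value:,.0f}")
```

With these made-up figures the 72% configuration nets $4,230 against $2,910 for the 87% one. More generally, the original configuration finds 3 more opportunities at the price of 18 more false positives, so it wins whenever one opportunity is worth more than 6 false-positive reviews.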

Simply put, one needs to be cautious when trading off false negatives and false positives.

So in this particular case, looking only at statistical accuracy is not just misleading, it’s analytically fallacious.

While this conversation was purely hypothetical, you may be surprised to learn how many senior executives do not understand this math. I can’t blame them though.

The confusion matrix is confusing, and the base rate fallacy is even more confusing.

This is why I advocate gaining comfort around mathematics.

It makes you a better thinker and decision maker. Who knows, perhaps one day you’ll be put in a position to make decisions with millions of dollars, or even millions of lives on the line.

Appendix:

A common alternative to statistical accuracy is the F1 score, which is the harmonic mean of precision and recall. Many data scientists prefer it on the grounds that it is a more holistic measure of how good a prediction model is.
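As a quick sketch, the F1 score for the muffin-or-chihuahua robot can be computed like this (the function name `f1_score` here is just a local helper, not a reference to any particular library):

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

robot_f1 = f1_score(tp=9, fp=27, fn=1)
print(f"F1 = {robot_f1:.3f}")  # F1 = 0.391
```

Notice that true negatives do not appear anywhere in the formula, so the 63 correctly ignored muffins cannot prop the score up the way they do for accuracy.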

Another alternative is AUROC: the Area Under the Receiver Operating Characteristic curve. Unlike statistical accuracy, AUROC is not affected by changes in the class distribution of the underlying ‘items’ you’re trying to classify.
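To illustrate that last point, here is a small sketch that computes AUROC as the probability that a randomly chosen chihuahua gets a higher score than a randomly chosen muffin. The scores are invented for illustration, and the comparison assumes the muffins keep the same score distribution when there are more of them.

```python
def auroc(positive_scores, negative_scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

def accuracy_at(threshold, positive_scores, negative_scores):
    """Accuracy when everything scoring at or above the threshold is tagged."""
    tp = sum(1 for s in positive_scores if s >= threshold)
    tn = sum(1 for s in negative_scores if s < threshold)
    return (tp + tn) / (len(positive_scores) + len(negative_scores))

# Hypothetical "how chihuahua-like is this image" scores.
chihuahua_scores = [0.9, 0.8, 0.6, 0.4]
muffin_scores = [0.7, 0.5, 0.3, 0.2, 0.1]
many_muffins = muffin_scores * 10   # same score mix, ten times as many muffins

print(auroc(chihuahua_scores, muffin_scores))   # 0.85
print(auroc(chihuahua_scores, many_muffins))    # 0.85
print(f"{accuracy_at(0.55, chihuahua_scores, muffin_scores):.1%}")  # 77.8%
print(f"{accuracy_at(0.55, chihuahua_scores, many_muffins):.1%}")   # 79.6%
```

AUROC stays at 0.85 in both collages, while accuracy at a fixed threshold drifts simply because the mix of muffins and chihuahuas changed.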

Thanks for reading!
