Muffin or chihuahua?
Poodle or fried chicken?
Puppy or bagel?
Shar-pei or towel?
Sheepdog or mop?
Shar-pei or croissant?
Dog or teddy bear?
Damn cute right?
You may have come across these images if you work in the AI field – especially in computer vision. They’re a fun way of testing how accurate an AI algorithm is at classifying similarly looking objects.
But what exactly does it mean when one says accuracy?
We will soon reveal that accuracy is rather nuanced, and this gives rise to the Base Rate Fallacy: an overlooked math error that apparently fools 80% of doctors and, as I’ve witnessed from firsthand interaction, fools a surprisingly large number of corporate senior executives too.
In a similar spirit to my post on ergodicity, this is another one of those many-are-making-a-math-mistake type posts. Because fewer analytical fallacies mean better thinking, better decisions, and better work and life.
Let’s start simple with a light-hearted muffin-or-chihuahua example.
Example 1: Muffin or Chihuahua
A boat is carrying 10 chihuahuas and 90 blueberry muffins.
It capsizes. A robot is sent to identify and rescue the chihuahuas.
But the robot’s visual classification (computer vision) capability is far from perfect.
The robot returns with 9 chihuahuas, as well as 27 muffins it thought were chihuahuas.
What is the robot’s accuracy?
A) 90% since it correctly saved 9 of 10 chihuahuas.
B) 72% since the robot correctly classified muffin vs chihuahua 72 times out of 100.
C) 25% since only 9 of the 36 (9 + 27) units rescued were actually chihuahuas.
Intuitively, I’d guess most would pick A) 90%. However, the correct answer is B) 72%. Herein lies the challenge.
When most people say accuracy, they are not aware of the statistical definition of accuracy, nor do they know that there are many ‘flavours’ of accuracy.
The Confusion Matrix
This is the confusion matrix. Its terms are incredibly important to wrap your head around before we dive any deeper.
Confusing huh? Let’s go one by one with our muffin-or-chihuahua example.
True positive (TP): Correctly classified relevant thing as relevant. e.g. robot said it’s a chihuahua, and it really was a chihuahua (9 correct).
False positive (FP): Incorrectly classified an irrelevant thing as relevant, i.e. false alarm. Also known as a Type 1 Error. e.g. robot said it’s a chihuahua, but it was really a muffin (27 incorrect).
True negative (TN): Correctly classified irrelevant thing as irrelevant. e.g. robot said it’s a muffin, and it was really a muffin (63 correct).
False negative (FN): Incorrectly classified relevant thing as irrelevant, i.e. alarm didn’t go off when it should have. Also known as a Type 2 Error. This is the worst and most dangerous outcome. e.g. robot said it’s a muffin, but it was really a chihuahua that needed to be saved (1 incorrect).
If still confused, here you go:
Recognise that Type 1 Errors are more tolerable than Type 2 Errors. If a medical test says you have cancer but you actually don’t (false positive), worst case is the nuisance of the additional examinations you’d have to go through. But if you really did have cancer, but the medical test said otherwise (false negative) you’re in deep shit.
Back to the confusion matrix.
Sensitivity is the portion of relevant items that were selected. Also referred to as Recall.
– Of 10 chihuahuas, 9 were correctly selected = 90%
Specificity is the portion of irrelevant items that were (rightly so) not selected.
– Of 90 muffins, 63 were rightly not selected, i.e. correctly classified as muffins = 70%
Precision is the portion of selected items that were relevant. Also referred to as Positive Predictive Value.
– Of 36 selected items (supposedly chihuahuas), only 9 were actually chihuahuas = 25%
Negative Predictive Value is the portion of unselected items that were correctly irrelevant.
– Of 64 items not selected (supposedly muffins), 63 were correctly classified as muffins = 98%
Accuracy is the portion of all items that were correctly classified.
– Of 100 items, 9 chihuahuas and 63 muffins were correctly classified = 72%
This is why the correct answer to the accuracy multiple choice question earlier was B) 72%.
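The five definitions above can be sketched in a few lines of Python, using the counts from the rescue (9 true positives, 27 false positives, 1 false negative, 63 true negatives):

```python
# Confusion-matrix counts from the muffin-or-chihuahua example
TP, FP, FN, TN = 9, 27, 1, 63

sensitivity = TP / (TP + FN)                   # 9 / 10  = 0.90 (recall)
specificity = TN / (TN + FP)                   # 63 / 90 = 0.70
precision   = TP / (TP + FP)                   # 9 / 36  = 0.25
npv         = TN / (TN + FN)                   # 63 / 64 ≈ 0.98
accuracy    = (TP + TN) / (TP + FP + FN + TN)  # 72 / 100 = 0.72

print(sensitivity, specificity, precision, npv, accuracy)
```

Swap in your own four counts and the same five lines give you every metric at once.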
Note that data scientists also use a thing called the F1 score as an alternative to statistical accuracy, on grounds that it is a better measure of how good a prediction model is.
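The F1 score is the harmonic mean of precision and recall, so a model can’t score well on F1 by being good at only one of the two. For our example (precision 0.25, recall 0.90) it comes out far below the 72% accuracy:

```python
# Precision and recall from the rescue example above
precision, recall = 0.25, 0.90

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # ≈ 0.39, much less flattering than 0.72 accuracy
```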
Example 2: Medical Test
Armed with these concepts, we now approach a more practical example.
Here’s a question that 80% of doctors got wrong.
Suppose breast cancer is prevalent among 1% of women, and a test for it has 90% sensitivity and 91% specificity. If a woman tests positive, what are the chances she actually has breast cancer?
Have a go before you scroll down…
- Let’s say there’s 1,000 women.
- 1% prevalence means 10 women have breast cancer.
- 90% sensitivity means that 9 of the 10 women with breast cancer were detected by the test (remember: sensitivity is portion of relevant items selected).
- 91% specificity means that about 901 of the 990 healthy (no breast cancer) women received a true negative test result (remember: specificity is the portion of irrelevant items rightly not selected). The remaining ~89 cancer-free women received a false positive.
- So the total number of women who received a positive test result = 9 true positives + 89 false positives = 98.
- So the number of women who tested positive that actually have cancer is 9 in 98, which is roughly 1 in 11, i.e. close to 1 in 10.
- Note that this question is just asking for the precision (remember: portion of selected items that are relevant).
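The counting argument above can be checked directly in a few lines (using the exact 990 healthy women rather than rounding):

```python
women = 1000
prevalence, sensitivity, specificity = 0.01, 0.90, 0.91

sick    = women * prevalence        # 10 women with breast cancer
healthy = women - sick              # 990 women without
tp = sick * sensitivity             # 9 true positives
fp = healthy * (1 - specificity)    # ≈ 89 false positives

precision = tp / (tp + fp)
print(round(precision, 3))  # ≈ 0.092, i.e. roughly 1 in 11
```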
Half of the doctors answered 9 in 10. Only 20% chose the correct answer.
Why is it that so many were wrong? What makes the answer deceptive at first, but obvious in hindsight?
People underestimate the effect of false positives!
Even though 90% sensitivity and 91% specificity sound pretty good, the effect that the 9% false positive rate has on precision is intuitively underestimated.
This is an example of the Medical Test Paradox, which is a type of Base Rate Fallacy.
When we think about this more deeply, we should start to understand that:
- Tests do not determine whether you have a disease.
- Tests don’t even determine your chances of having a disease.
- Rather, tests update your chances of having a disease.
So this example above can be thought of as:
- 1% prevalence, 90% sensitivity, 91% specificity.
- With no test information, a woman has a 1% chance of having breast cancer.
- Given that she receives a positive test, how does this update her probability of having breast cancer?
- Whenever we see a probability needing to be updated based on new information, we enter Bayesian probability territory.
- We won’t go into this here, but I highly recommend this video. It’s hands down the best explanation of Bayes’ Theorem that I have come across.
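The “tests update your chances” framing is exactly Bayes’ theorem. A minimal sketch of the update for a positive test result:

```python
def update_on_positive(prior, sensitivity, specificity):
    """Posterior probability of disease given a positive test (Bayes' theorem)."""
    p_pos_given_sick = sensitivity
    p_pos_given_healthy = 1 - specificity  # the false positive rate
    # Total probability of testing positive, sick or not
    p_pos = p_pos_given_sick * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_sick * prior / p_pos

# The breast cancer example: 1% prior updated to ≈ 9.2%
print(round(update_on_positive(0.01, 0.90, 0.91), 3))  # ≈ 0.092
```

Note the positive test raised the probability ninefold, from 1% to ~9%, yet the absolute chance is still low because the prior (base rate) was so small.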
Example 3: Speech Analytics
I work at an AI software company that analyses customer conversations to produce actionable business insights (e.g. identify sales opportunities, increase sales conversion, reduce compliance risks etc).
A common question I get asked by potential clients is: “what is your solution’s accuracy?“
In practice I’d answer with something like: “between 80-95% depending on a number of factors… it varies widely for every customer… why don’t you give us some of your calls and let us show you in a Proof of Concept?”
Suppose we did the Proof of Concept, and the results were as follows.
Note I’ve used the same numbers as muffin-or-chihuahua example so we don’t have to repeat the math.
So the answer to the accuracy question is technically 72%.
The following conversation is hypothetical, but similar situations have really occurred.
Me: “Based on this Proof of Concept, the accuracy was 72%.”
Exec: “I looked through the 36 calls your system flagged. Only 9 actually had sales opportunities in them. So accuracy is not 72%. It’s 25%. This solution is not very accurate.”
Me: “You’re right in saying that only 25% of the flagged calls were true positives. A 75% false positive rate may at first seem inconvenient, as 3 in 4 flagged calls would be false alarms. However, we still managed to find 9 of the 10 actual sales opportunities. Doesn’t this tangible value more than offset the inconvenience of false positives?”
Exec: “Yes, finding the 9 sales opportunities was good. But still there are too many false positives. Can we tweak the solution so that there are fewer false positives?”
Me: “We can. However, there’s always a balancing act between false positives and false negatives. Generally, we care more about reducing false negatives, even if this means we have more false positives.”
Exec: “Can we see what this looks like when you configure the system rules to reduce false positives?”
Me: “Sure. We’ve updated the configuration. Here are the new results. Accuracy is now 87%.”
Exec: “Yes, 87% sounds better than 72%. Also, I looked through the 15 flagged calls and there were only 9 false positives. The false positive rate went down from 75% to 60%. This configuration is better.”
In this example, the higher 87% accuracy actually yields an inferior business outcome to the lower 72% accuracy case: the reconfigured system flagged 15 calls with 9 false positives, meaning it found only 6 of the 10 real sales opportunities instead of 9. One needs to be cautious when trading off false negatives and false positives.
The 3 additional sales opportunities identified by the “less accurate” configuration bring value to the business that more than offsets the inconvenience of the extra false positives to work through. Some clients more familiar with technology-assisted business processes see false positive rates of 95%, even 99%, as the healthy norm. In fact, they want false positive rates to be this high, especially when monitoring for compliance, because false negatives are unacceptably risky for the business.
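The trade-off can be made concrete with the counts implied by the two hypothetical configurations (in the second, 15 flagged calls with 9 false positives imply 6 true positives and 4 missed opportunities):

```python
# (TP, FP, FN, TN) for the two hypothetical Proof-of-Concept configurations
configs = {
    "original":     (9, 27, 1, 63),
    "reconfigured": (6, 9, 4, 81),
}

for name, (tp, fp, fn, tn) in configs.items():
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)  # share of real sales opportunities found
    print(f"{name}: accuracy={accuracy:.0%}, found {tp}/10 opportunities, recall={recall:.0%}")
```

Accuracy goes up from 72% to 87%, while recall, the metric the business actually cares about here, drops from 90% to 60%.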
So in this particular case, looking only at statistical accuracy is not just misleading, it’s analytically fallacious.
While this conversation was purely hypothetical, you may be surprised to learn how many senior executives do not understand this math. I don’t blame them though. The confusion matrix is confusing, and we’re all at the mercy of logical fallacies like the Base Rate Fallacy illustrated here.
This is why I advocate gaining comfort around mathematics. It makes you a better thinker, which will lead to better decisions. Who knows, maybe one day you’ll be put in a position to make decisions with millions of dollars, or even millions of lives on the line. And if not, it never hurts to become a better thinker.
Going further: Area under ROC curve
Perhaps this is a tad too technical for a general audience, but it’s worth a mention. An alternative to raw statistical accuracy is the Area Under the Receiver Operating Characteristic (AUROC) curve. Unlike statistical accuracy, AUROC is not affected by changes in the class distribution of the underlying ‘items’ you’re trying to classify.
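One way to see why AUROC ignores class balance: it equals the probability that a randomly chosen positive item is scored higher than a randomly chosen negative one, which depends only on the ranking of scores, not on how many items fall in each class. A toy sketch (the scores here are made up for illustration):

```python
from itertools import product

def auroc(pos_scores, neg_scores):
    """AUROC as a rank statistic: the probability that a random positive
    outscores a random negative (ties count as half a win)."""
    pairs = list(product(pos_scores, neg_scores))
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

# Hypothetical classifier scores: higher means "more likely a chihuahua"
print(auroc([0.9, 0.8], [0.3, 0.85]))  # 0.75: 3 of 4 pos/neg pairs ranked correctly
```

Duplicating every negative score leaves the result unchanged, which is exactly the class-distribution invariance mentioned above.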
Thanks for reading!
If you found this helpful, you can return the love by either:
- buying me a coffee
- purchasing books via the links in my book list
- signing up for a free Audible trial (great for walks and drives)
- signing up for a free Kindle unlimited trial (recommend investing in a Kindle if physical books aren’t for you)
- signing up for a free Amazon Prime trial to help reduce your book costs
- sharing this post with your friends
Drop your email below for exclusive subscriber-only TLDR summaries of new posts and more.