Base Rate Fallacy

Mon 10 March 2014

In an increasingly data-driven world, it is important to understand how to interpret numbers that are often quoted without much context. One such case is medical test statistics: practitioners often tell patients how sensitive a test is without mentioning the base rate of the ailment being diagnosed. Even when they provide that rate, they often fail to connect the two. I don't presume malice or incompetence in such cases; I attribute the failure to the difficulty of conveying statistical information to a diverse population, much of which may not be numerically inclined. Hopefully this post helps make the connection.

Say the sensitivity of a medical test for ailment X is 80%. That means that if 100 patients with ailment X were tested using this method, the test would correctly diagnose about 80 of them. Unfortunately, the other 20 would be falsely identified as negative for the ailment.

Now, let's say that about 0.5% of the general population suffers from ailment X. If you were identified as positive using the medical test above, what is the probability that you actually have ailment X?

This question trips up many people, because it's tempting to answer 80% (the test is 80% 'accurate', after all). But that would be wrong. The correct answer is: it depends.

What does it depend on?

It depends on the test's false positive rate, that is, how often it flags people who do not have the ailment. Let's assume the test treats a p-value of less than 0.05 (quite common in academic literature) as strong enough evidence to reject the null hypothesis that the patient is healthy. In other words, it accepts that about 5% of healthy people will be flagged as positive.

Now say the general population consists of 1,000 people and all of them are subjected to the medical test. About 5 of them will actually have ailment X (base rate = 0.5%). The medical test, being 80% sensitive, will correctly identify about 4 of them. Out of the remaining 995 who do not have the ailment, it will also identify about 50 (5% of 995) as positive. Why? Because the chosen p-value threshold lets about 5% of healthy people through as positives.
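The head count above can be reproduced with a few lines of Python. This is only a sketch: the function and parameter names are mine, and the 5% false positive rate is the assumption made earlier in this post, not a property of any real test.

```python
def expected_counts(population, base_rate, sensitivity, false_positive_rate):
    """Return the expected number of (true positives, false positives)."""
    sick = population * base_rate        # people who actually have ailment X
    healthy = population - sick          # everyone else
    true_positives = sick * sensitivity  # sick people the test catches
    false_positives = healthy * false_positive_rate  # healthy people flagged anyway
    return true_positives, false_positives


# Numbers from this post: 1,000 people, 0.5% base rate,
# 80% sensitivity, and an assumed 5% false positive rate.
tp, fp = expected_counts(1000, 0.005, 0.80, 0.05)
print(f"true positives:  {tp:.2f}")
print(f"false positives: {fp:.2f}")
```

Without rounding, the false positives come to 49.75, which is the "about 50" used in the text.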

Thus a total of about 54 people will be diagnosed as having ailment X, and only 4 of them actually have it. The probability that you have ailment X given a positive result is therefore about 7.4% (4/54). This is in stark contrast to the 80% on which most patients would anchor.
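The same calculation can be written without a head count, using Bayes' theorem: the share of positives that are true positives. The sketch below uses a hypothetical helper name of my own and the same assumed 5% false positive rate. It also shows how strongly the answer depends on the base rate.

```python
def prob_ailment_given_positive(base_rate, sensitivity, false_positive_rate):
    """P(ailment | positive test) via Bayes' theorem."""
    true_pos = sensitivity * base_rate                  # P(positive and sick)
    false_pos = false_positive_rate * (1 - base_rate)   # P(positive and healthy)
    return true_pos / (true_pos + false_pos)


# Same test (80% sensitivity, assumed 5% false positive rate),
# applied to populations with different base rates.
for base_rate in (0.005, 0.05, 0.5):
    p = prob_ailment_given_positive(base_rate, 0.80, 0.05)
    print(f"base rate {base_rate:.1%}: P(ailment | positive) = {p:.1%}")
```

For the 0.5% base rate this gives roughly 7.4%, in line with the 4/54 figure above (the small difference comes from rounding 49.75 up to 50). With a base rate of 50%, the very same test would be right about a positive result well over 90% of the time.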

The take-away from this post is that, as consumers, we need to be vigilant and constructively skeptical as data-driven insights become more and more prevalent. While it would be counter-productive to reject every result (or piece of advice) unless you've verified the experiment yourself, it is important to ask the right questions and to know that there's always room for error.

If you're interested in reading more about the topic, I highly recommend this article.