Consider the following situation:
You are looking for a hospital to treat an elderly member of your family. There are two prominent choices available, Hospital A and Hospital B. Of the last 1000 patients treated at each hospital, 86% survived at Hospital A, whereas 80% survived at Hospital B. So Hospital A seems to be the clear winner, right? Well, that may not be the case.
One shouldn’t ignore the fact that not all patients arrive at a hospital in the same state of health. For example, we can classify patients as being in good health or in poor health on arrival. Suppose the survival counts under this classification are as follows:
- Hospital A - 900 in good health out of which 830 survived, 100 in poor health out of which 30 survived
- Hospital B - 600 in good health out of which 590 survived, 400 in poor health out of which 210 survived
Interestingly, the survival rate of patients in poor health is 52.5% at Hospital B but only 30% at Hospital A. Likewise, the survival rate of patients in good health is 98.33% at Hospital B, whereas it is 92.22% at Hospital A. It turns out that Hospital B is the better choice for both groups, and convincingly so.
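To check the arithmetic yourself, here is a minimal Python sketch. The counts are copied from the two bullet points above; the names `hospitals`, `survival_rate`, and `summarize` are my own, chosen only for illustration. The aggregated rate is recomputed from the same per-group counts.

```python
# Counts copied from the two bullet points: (patients, survivors) per health group.
hospitals = {
    "A": {"good": (900, 830), "poor": (100, 30)},
    "B": {"good": (600, 590), "poor": (400, 210)},
}


def survival_rate(patients, survivors):
    """Return the survival rate as a percentage."""
    return 100 * survivors / patients


def summarize(groups):
    """Return the per-group rates plus the aggregated rate for one hospital."""
    rates = {name: survival_rate(n, s) for name, (n, s) in groups.items()}
    total_patients = sum(n for n, _ in groups.values())
    total_survivors = sum(s for _, s in groups.values())
    rates["overall"] = survival_rate(total_patients, total_survivors)
    return rates


summary = {name: summarize(groups) for name, groups in hospitals.items()}

for name, rates in summary.items():
    print(name, {key: round(value, 2) for key, value in rates.items()})
```

Running it shows Hospital A ahead on the aggregated rate while Hospital B is ahead within both health groups, which is exactly the reversal described here.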
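The above is an example of Simpson’s paradox, which occurs when aggregated data hides a conditional variable (sometimes called a lurking variable): a hidden additional factor that significantly influences the results. Here that factor is the patients’ health on arrival. Hospital A’s patients were mostly in good health (900 of 1000), while Hospital B treated four times as many patients in poor health (400 versus 100), which pulls its aggregated survival rate down.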
Statistics are persuasive. So much so that people, organizations, and whole countries base some of their most important decisions on organized data. But any set of statistics might have something lurking inside it that can turn the results completely upside down.
Simpson’s paradox isn’t hypothetical. You can watch Mark Liddell’s video for more examples.
This is one of the many things about statistics that I find fascinating.