This post is the second in the series on Causal Machine Learning. In the last blog post I gave a brief introduction to Causal Machine Learning; in this one we will uncover the WHY! This blog post is based on the work of Judea Pearl. I will discuss how naive statistics can fail us (spurious correlations, Simpson's paradox, and the asymmetry of causal inference). As always, I will try to keep things as simple as possible. So stay with me, and enjoy reading!
How (Naive) Statistics Can Fail Us !
1. Spurious Correlations
Correlation Is Not Causation !!!
I know you, like everyone else, are tired of hearing the oft-repeated saying “correlation does not imply causation”. In this post I will formally illustrate why it holds.
On the website “Spurious Correlations” by Tyler Vigen, we can explore a wide variety of statistical correlations (due to chance alone) with no causal implications. One of them is shown below.
We clearly know that the number of people who drowned in swimming pools has nothing to do with the power generated by US nuclear power plants! You can find more interesting spurious correlations on Tyler Vigen's website.
2. Simpson's Paradox
Spurious correlations are well known in statistics, so it is (somewhat!) easy to be on the lookout for them. Let's now look at a lesser-known paradox in statistics.
Simpson's paradox is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined. Let's look at an example of Simpson's paradox from Wikipedia itself.
In both 1995 and 1996, Justice had a higher batting average (in bold type) than Jeter did. However, when the two baseball seasons are combined, Jeter shows a higher batting average than Justice.
The same data give contradictory conclusions depending on how you look at them!
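We can check the batting-average reversal with a few lines of Python. The hits and at-bats figures below are the ones quoted in the Wikipedia article; treat the exact numbers as illustrative:

```python
# Batting data as (hits, at-bats) per season, quoted from the Wikipedia
# example on Simpson's paradox.
jeter = {"1995": (12, 48), "1996": (183, 582)}
justice = {"1995": (104, 411), "1996": (45, 140)}

def avg(hits, at_bats):
    return hits / at_bats

# Within each individual season, Justice has the higher batting average.
for year in ("1995", "1996"):
    assert avg(*justice[year]) > avg(*jeter[year])

# But over the combined seasons, Jeter comes out on top.
jeter_all = avg(sum(h for h, _ in jeter.values()),
                sum(a for _, a in jeter.values()))
justice_all = avg(sum(h for h, _ in justice.values()),
                  sum(a for _, a in justice.values()))
print(jeter_all, justice_all)  # Jeter's combined average is higher
```

The reversal happens because the two players' at-bats are distributed very unevenly across the seasons, so the pooled averages are weighted differently than the per-season ones.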
Simpson’s paradox highlights that how you partition your data matters. So the question becomes: how do we partition data? There is no standard statistical method for this, but causal inference provides a formalism to handle the problem. It all boils down to causal effects, which quantify the impact one variable has on another after adjusting for the appropriate confounders.
Let us look at an example from “The Book of Why: The New Science of Cause and Effect” by Judea Pearl, and a post the legend himself shared on his Twitter handle.
Consider the study below, which measures weekly exercise and cholesterol in various age groups. When we plot exercise on the X-axis and cholesterol on the Y-axis and segregate by age, as in the left side of the figure, we see a general downward trend in each group: the more young people exercise, the lower their cholesterol, and the same applies to middle-aged people and the elderly. If, however, we use the same scatter plot but don't segregate by age (as in the right side of the figure), we see a general upward trend: the more a person exercises, the higher their cholesterol.
Exercise appears to be beneficial in each age group but harmful in the population as a whole!
To resolve this problem, we once again turn to the story behind the data. If we know that older people, who are more likely to exercise, are also more likely to have high cholesterol regardless of exercise, then the reversal is easily explained and easily resolved. Age is a common cause of both treatment (exercise) and outcome (cholesterol). So we should look at the age-segregated data in order to compare people of the same age, thereby eliminating the possibility that the heavy exercisers in each group are more likely to have high cholesterol because of their age rather than because of exercising.
However, don't be misled: segregated data does not always give the correct answer.
Let's look at another example, from the book Causal Inference in Statistics: A Primer by Pearl.
In the classical example used by Simpson (1951), a group of sick patients are given the option to try a new drug. Among those who took the drug, a lower percentage recovered than among those who did not. However, when we partition by gender, we see that more men taking the drug recover than men not taking the drug, and more women taking the drug recover than women not taking the drug!
In other words, the drug appears to help men and women, but hurt the general population. It seems nonsensical, or even impossible—which is why, of course, it is considered a paradox. Some people find it hard to believe that numbers could even be combined in such a way.
The data seem to say that if we know the patient’s gender (male or female) we can prescribe the drug, but if the gender is unknown we should not! Obviously, that conclusion is ridiculous. If the drug helps men and women, it must help anyone; our lack of knowledge of the patient’s gender cannot make the drug harmful.
Given the results of this study, then, should a doctor prescribe the drug for a woman? A man? A patient of unknown gender? Or consider a policy maker who is evaluating the drug’s overall effectiveness on the population. Should he/she use the recovery rate for the general population? Or should he/she use the recovery rates for the gendered sub-populations?
The answer is nowhere to be found in simple statistics.
In order to decide whether the drug will harm or help a patient, we first have to understand the story behind the data: the causal mechanism that led to, or generated, the results we see. For instance, suppose we knew an additional fact: estrogen has a negative effect on recovery, so women are less likely to recover than men, regardless of the drug. In addition, as we can see from the data, women are significantly more likely to take the drug than men are. So the reason the drug appears to be harmful overall is that, if we select a drug user at random, that person is more likely to be a woman, and hence less likely to recover, than a random person who does not take the drug. Put differently, being a woman is a common cause of both drug taking and failure to recover. Therefore, to assess the drug's effectiveness, we need to compare subjects of the same gender, thereby ensuring that any difference in recovery rates between those who take the drug and those who do not is not ascribable to estrogen.
In fact, as statistics textbooks have traditionally (and correctly) warned students, correlation is not causation, so there is no statistical method that can determine the causal story from the data alone. Consequently, there is no statistical method that can aid in our decision.
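Once we do commit to the causal story (gender is a confounder), the paradox can be resolved numerically with the backdoor adjustment formula, P(recovery | do(treatment)) = Σ_z P(recovery | treatment, z) · P(z). Here is a sketch using recovery counts in the spirit of the Primer's drug example; if I have misremembered the exact figures, the qualitative pattern is what matters:

```python
# Recovery counts as (recovered, total), modeled on the Primer's drug
# example (the exact numbers are illustrative).
data = {
    ("men", "drug"): (81, 87),
    ("men", "no_drug"): (234, 270),
    ("women", "drug"): (192, 263),
    ("women", "no_drug"): (55, 80),
}
n_total = sum(n for _, n in data.values())

def pooled(treatment):
    """Naive recovery rate, pooling over gender."""
    rec = sum(r for (g, t), (r, n) in data.items() if t == treatment)
    tot = sum(n for (g, t), (r, n) in data.items() if t == treatment)
    return rec / tot

def adjusted(treatment):
    """Backdoor adjustment: sum_z P(recovery | treatment, z) * P(z)."""
    total = 0.0
    for g in ("men", "women"):
        n_g = sum(n for (gg, t), (r, n) in data.items() if gg == g)
        r, n = data[(g, treatment)]
        total += (r / n) * (n_g / n_total)
    return total

# Pooled data: the drug looks harmful.
print(pooled("drug"), pooled("no_drug"))
# After adjusting for the confounder, the drug looks beneficial,
# agreeing with both gender-specific subgroups.
print(adjusted("drug"), adjusted("no_drug"))
```

The pooled comparison mixes two populations with different baseline recovery rates and different propensities to take the drug; the adjustment reweights the gender-specific rates by the population share of each gender, removing that imbalance.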
3. Symmetry
The problems traditional statistics has when thinking about causality stem from a fundamental property of algebra: symmetry. The left-hand side of an equation equals the right-hand side (that's the point of algebra). The equal sign implies symmetry. Causality, however, is fundamentally asymmetric: causes lead to effects, and not the other way around.
This distinction further implies that causal relations cannot be expressed in the language of probability and, hence, that any mathematical approach to causal analysis must acquire new notation; probability calculus is insufficient. To illustrate, the syntax of probability calculus does not permit us to express the simple fact that “symptoms do not cause diseases,” let alone draw mathematical conclusions from such facts. All we can say is that two events are dependent, meaning that if we find one, we can expect to encounter the other. But we cannot distinguish statistical dependence, quantified by the conditional probability P(disease|symptom), from causal dependence, for which we have no expression in standard probability calculus.
Let’s look at a simple example taken from Pearl's book. Suppose we model the relationship between a disease and the symptoms it produces with the linear expression

Y = mX + b

where Y represents the severity of the symptoms, X the severity of the disease, m the connection between the two, and b all other factors.
Using the rules of algebra, we can invert the equation above to get

X = (Y − b) / m
Here’s the problem: if we interpret the first equation as “diseases cause symptoms,” then we have to interpret the second equation as “symptoms cause diseases,” which is of course not true.
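A tiny simulation makes the symmetry concrete. Below, X plays the role of disease severity and Y of symptom severity; the slope of 2 and the Gaussian noise are invented purely for illustration. The correlation coefficient comes out identical in both directions, so it cannot encode which variable causes which:

```python
import random
import statistics

random.seed(0)

# Simulated "disease severity" X and "symptom severity" Y = 2*X + noise
# (slope and noise model are assumptions made for illustration only).
x = [random.gauss(0, 1) for _ in range(10_000)]
y = [2 * xi + random.gauss(0, 1) for xi in x]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length samples."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

# The statistic is symmetric in its arguments: swapping "cause" and
# "effect" changes nothing.
print(pearson(x, y), pearson(y, x))
```

Both calls print the same number. Any purely statistical summary of the joint distribution inherits this symmetry, which is why causal direction has to come from outside the data.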
Note : Linear relations are used here for illustration purposes only; they do not represent typical disease-symptom relations but illustrate the historical development of path analysis.
Why Association (Correlation) Is Not Causation?
Before moving to the next set of blog posts, I should precisely define what correlation is. I know you are all bored of hearing the oft-repeated saying "correlation is not causation"; so am I. So let's sort this out before moving on to anything else!
Before moving ahead, let's clarify one more thing: “correlation” is often used colloquially as a synonym for statistical dependence. Technically, however, correlation is only a measure of linear statistical dependence. From now on, we will largely use the term association to refer to statistical dependence.
Let's take an example from Brady Neal's causal inference course book.
Say you happen upon some data that relates wearing shoes to bed and waking up with a headache, as one does. It turns out that most times someone wears shoes to bed, that person wakes up with a headache. And most times someone doesn't wear shoes to bed, that person doesn't wake up with a headache. It is not uncommon for people to interpret data like this (with associations) as meaning that wearing shoes to bed causes people to wake up with headaches, especially if they are looking for a reason to justify not wearing shoes to bed.
We can explain how wearing shoes to bed and headaches are associated without either being a cause of the other: both are caused by a common cause, drinking the night before. This kind of variable is called a "confounder" or lurking variable, and we will call this kind of association confounding association, since it is facilitated by a confounder.
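We can simulate this story and watch a strong association appear with no causal link at all. The probabilities below (a 30% chance of drinking, and 90% chances of shoes and of a headache given drinking) are made-up numbers for illustration; note that the headache mechanism never looks at the shoes:

```python
import random

random.seed(42)

def night():
    """One simulated night: drinking causes both shoes-to-bed and headache."""
    drank = random.random() < 0.3               # confounder
    shoes = drank and random.random() < 0.9     # shoes only happen after drinking
    headache = drank and random.random() < 0.9  # headache only caused by drinking
    return shoes, headache

samples = [night() for _ in range(100_000)]

def p_headache(given_shoes):
    """Empirical P(headache | shoes == given_shoes)."""
    group = [h for s, h in samples if s == given_shoes]
    return sum(group) / len(group)

# A large associational gap, even though shoes never enter the
# headache mechanism above.
print(p_headache(True), p_headache(False))
```

Conditioning on the confounder (comparing shoe-wearers and non-shoe-wearers among drinkers only) would make the gap vanish, which is exactly the adjustment logic from the Simpson's paradox section.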
The main problem motivating causal inference is that association is not causation.
If the two were the same, then causal inference would be easy. Traditional statistics and machine learning would already have causal inference solved, as measuring causation would be as simple as just looking at measures such as correlation and predictive performance in data.
Let's look at another example. Let's attempt to determine the causal effect of vitamin C intake on resistance to sickness. Let X be a binary indicator of whether the subject takes vitamin C, and let Y be a binary indicator of being healthy (not getting sick). X is also referred to as the 'treatment' in more general settings. Now, let C1 be the value of Y if X=1 (vitamin C is taken) and C0 the value of Y if X=0 (vitamin C is not taken). We call C0 and C1 the potential outcomes of this experiment.
For a single person, the causal effect of taking vitamin C in this context would be the difference between the expected outcome of taking vitamin C and the expected outcome of not taking vitamin C.
Causal Effect = E(C1) – E(C0)
Unfortunately, we can only ever observe one of the potential outcomes, C0 or C1. We cannot perfectly reset all conditions to see the result of the opposite treatment. Instead, we can use multiple samples and calculate the association between vitamin C and being healthy.
Association = E(Y|X=1) – E(Y|X=0)
Suppose there are eight subjects: four take vitamin C and all stay healthy (Y=1), while four do not take it and all get sick (Y=0). Suppose further that, unknown to us, each subject's potential outcomes are identical (C0 = C1), i.e. the treatment has no effect on anyone. Then we calculate the

Association as being (1+1+1+1)/4 – (0+0+0+0)/4 = 1

Causal effect, using the unobserved potential outcomes, as being (4*0 + 4*1)/8 – (4*0 + 4*1)/8 = 0
We just calculated that, in this case, association does not equal causation. Observationally, there seems to be a perfect association between vitamin C intake and being healthy. However, because we are privileged with the values of the unobserved potential outcomes, we can see there is no causal effect. The discrepancy could be explained by the people who stayed healthy having practiced healthy habits that happened to include taking vitamin C.
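The scenario behind these numbers can be written out directly as a table of potential outcomes: eight hypothetical subjects, four who take vitamin C and stay healthy, four who do not and fall sick, with C0 = C1 for everyone so the treatment genuinely does nothing:

```python
# Eight hypothetical subjects as (x, c0, c1):
# x  = 1 if the subject takes vitamin C,
# c0 = outcome Y if untreated, c1 = outcome Y if treated.
# Treatment has no effect (c0 == c1), but only already-healthy
# people choose to take vitamin C.
subjects = [
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1),
    (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0),
]

# We only ever observe the outcome matching the treatment actually taken.
observed = [(x, c1 if x else c0) for x, c0, c1 in subjects]

# Association: E(Y | X=1) - E(Y | X=0), from observed data alone.
association = (sum(y for x, y in observed if x) / 4
               - sum(y for x, y in observed if not x) / 4)

# Causal effect: E(C1) - E(C0), averaged over ALL eight subjects;
# only computable here because we invented both potential outcomes.
causal_effect = (sum(c1 for _, _, c1 in subjects) / 8
                 - sum(c0 for _, c0, _ in subjects) / 8)

print(association, causal_effect)  # 1.0 0.0
```

Note the denominators: each conditional expectation averages over the 4 subjects in its treatment group, while the causal effect averages the potential outcomes over all 8 subjects.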
Okay, one more motivating example:
In response to a large study that studied the relationship between income and life expectancy, Vox published an article titled “Want to live longer, even if you’re poor? Then move to a big city in California” (Klein, 2016). However, as is implied by the title of the study “The Association Between Income and Life Expectancy in the United States, 2001-2014”, the study did not presume to make this recommendation and in fact the closest statement made to the Vox recommendation was “… the strongest pattern in the data was that low-income individuals tend to live longest (and have more healthful behaviors) in cities with highly educated populations, high incomes, and high levels of government expenditures, such as New York, New York, and San Francisco, California.” (Chetty et al., 2016).
Similar to the example regarding vitamin C and health, this study only found associative effects. However, just like it is incorrect to say that vitamin C causes a person to be healthy, it is also incorrect to say that moving to California will cause you to live longer.
That's all for now! I hope you enjoyed reading so far!
In the next blog post, we will further investigate the differences between association and causation, starting with Pearl’s three-level causal hierarchy. That will be well worth watching out for!