This post is the fifth in the series on Causal Machine Learning. As stated before, the starting point for all causal inference is a causal model. Usually, however, we don’t have a good causal model in hand. This is where Causal Discovery can be helpful. In this post, Causal Discovery will be discussed in detail. As always, I will try to keep things as simple as possible. So stay with me, and enjoy reading!

What is Causal Discovery?

Causal inference focuses on estimating the causal effect of a specific intervention or exposure on an outcome.
Causal discovery focuses on identifying the underlying causal relationships between variables in a system.

Causal inference aims to answer questions involving cause and effect, but, as stated before, its starting point is always a causal model. Usually we don’t have a good causal model in hand, and this is where causal discovery can be helpful.

Causal discovery aims to infer causal structure from data. In other words, given a dataset, derive a causal model that describes it.

Finding causal relationships is one of the fundamental tasks in science. A widely used approach is the randomized experiment. For example, to examine whether a recently developed medicine is useful for cancer treatment, researchers recruit subjects and randomly divide them into two groups: a control group, where the subjects are given a placebo, and a treatment group, where the subjects are given the newly developed drug. The reason for randomization is to remove the influence of possible confounders. For example, age can be a confounder that affects both whether a subject takes the drug and the outcome. Randomizing the assignment keeps the distribution of ages in the two groups almost the same.
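
To make this concrete, here is a tiny simulation (a hypothetical drug/recovery example with made-up numbers, using numpy): when older subjects are more likely to take the drug, a naive comparison of group means is biased, but randomly assigning the drug balances age across the groups and recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: age affects both drug uptake and recovery.
age = rng.normal(50, 10, n)
true_effect = 2.0  # made-up true benefit of the drug

# Confounded assignment: older subjects are more likely to take the drug.
drug_conf = (rng.random(n) < 1 / (1 + np.exp(-(age - 50) / 10))).astype(float)
recovery_conf = true_effect * drug_conf - 0.1 * age + rng.normal(0, 1, n)

# Randomized assignment: a coin flip, independent of age.
drug_rand = rng.integers(0, 2, n).astype(float)
recovery_rand = true_effect * drug_rand - 0.1 * age + rng.normal(0, 1, n)

def naive_diff(outcome, treated):
    """Difference in mean outcome between treated and untreated groups."""
    return outcome[treated == 1].mean() - outcome[treated == 0].mean()

print("confounded estimate:", round(naive_diff(recovery_conf, drug_conf), 2))  # biased away from 2.0
print("randomized estimate:", round(naive_diff(recovery_rand, drug_rand), 2))  # close to 2.0
```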

How Does It Work?

However, in many cases, randomized experiments are very expensive and hard to implement, and they may even raise ethical issues. In recent decades, inferring causal relations from purely observational data, known as the task of causal discovery, has drawn much attention in machine learning, philosophy, statistics, and computer science.

Causal discovery is an example of an inverse problem. This is like predicting the shape of an ice cube based on the puddle it left on the kitchen counter. Clearly, this is a hard problem, since any number of shapes could generate the same puddle. Connecting this to causality, the puddle of water is like the statistical associations embedded in data, and the ice cube is like the underlying causal model.

Causal Discovery Assumptions & Properties

The usual approach to solving inverse problems is to make assumptions about what you are trying to uncover. This narrows down the possible solutions and hopefully makes the problem solvable. There are four common assumptions made across causal discovery algorithms.

👉 Acyclicity — The causal structure can be represented by a DAG (G)

👀 Markov Property — All nodes are independent of their non-descendants when conditioned on their parents (see the sketch after this list)

🙃 Faithfulness — All conditional independencies in the true underlying distribution p are represented in G

👍 Sufficiency — No pair of nodes in G has a common cause outside of G (i.e. no unobserved confounders)
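
To make the Markov property and faithfulness concrete, here is a minimal sketch assuming a hypothetical linear-Gaussian chain X → Y → Z, simulated with numpy: the Markov property implies that X and Z become independent once we condition on Y, and faithfulness means this independence actually shows up in the data, approximated here with a partial correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical chain X -> Y -> Z (linear-Gaussian, coefficients made up).
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after regressing both on c (least squares)."""
    C = np.column_stack([np.ones_like(c), c])
    resid_a = a - C @ np.linalg.lstsq(C, a, rcond=None)[0]
    resid_b = b - C @ np.linalg.lstsq(C, b, rcond=None)[0]
    return np.corrcoef(resid_a, resid_b)[0, 1]

print("corr(X, Z)     :", round(np.corrcoef(x, z)[0, 1], 3))  # clearly nonzero
print("corr(X, Z | Y) :", round(partial_corr(x, z, y), 3))    # approximately zero
```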

Although these assumptions help narrow down the number of possible models, they do not fully solve the problem. This is where a few tricks for causal discovery come in. There is no single causal discovery method that dominates all others: although most methods use the assumptions above (and sometimes more), the details of different algorithms can vary tremendously. A taxonomy of algorithms, organized by these tricks, is given in the figure below.

[Figure: taxonomy of causal discovery algorithms]

Conditional Independence Testing

One of the earliest causal discovery algorithms is the PC algorithm, named after its authors, Peter Spirtes and Clark Glymour. This algorithm (and others like it) uses the idea that two statistically independent variables are not causally linked. The PC algorithm is illustrative of this first trick. An outline of the algorithm is given in the figure below.

[Figure: outline of the PC algorithm]

The first step is to form a fully connected, undirected graph using every variable in the dataset. Next, edges are deleted if the corresponding variables are independent. Then, variables that are still connected undergo conditional independence testing, e.g. an independence test of the bottom and far-right nodes conditioned on the middle node in the figure above (step 2).

If conditioning on a variable kills the dependence, that variable is added to the separation set for those two variables. Depending on the size of the graph, conditional independence testing continues (i.e. conditioning on more variables) until there are no more candidates for testing.

Next, colliders (i.e. X → Y ← Z) are oriented based on the separation sets of node pairs. Finally, the remaining edges are oriented based on two constraints: 1) no new v-structures and 2) no directed cycles may be created.
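
For intuition, here is a simplified sketch of the first two steps (the skeleton phase), not the full PC algorithm: it starts from a complete undirected graph and removes an edge whenever a Fisher-z partial-correlation test fails to reject (conditional) independence. Edge orientation is omitted, and the toy data, significance level, and maximum conditioning-set size are all illustrative choices.

```python
import numpy as np
from itertools import combinations
from scipy.stats import norm

def fisher_z_pvalue(data, i, j, cond, n):
    """p-value for 'column i independent of column j given columns in cond',
    using partial correlation and the Fisher z transform (Gaussian assumption)."""
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.pinv(corr)  # partial correlations live in the inverse correlation matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.999999, 0.999999)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond) - 3)
    return 2 * norm.sf(abs(z))

def pc_skeleton(data, alpha=0.01, max_cond=2):
    """Steps 1-2 of a PC-style search: prune edges of a complete undirected graph."""
    n, d = data.shape
    adj = {i: set(range(d)) - {i} for i in range(d)}
    sepset = {}
    for size in range(max_cond + 1):  # grow the conditioning-set size
        for i, j in combinations(range(d), 2):
            if j not in adj[i]:
                continue
            for cond in combinations(adj[i] - {j}, size):
                if fisher_z_pvalue(data, i, j, cond, n) > alpha:  # independence not rejected
                    adj[i].discard(j)
                    adj[j].discard(i)
                    sepset[(i, j)] = set(cond)  # remember the separation set
                    break
    return adj, sepset

# Toy data from a hypothetical chain X0 -> X1 -> X2.
rng = np.random.default_rng(0)
x0 = rng.normal(size=5000)
x1 = x0 + rng.normal(size=5000)
x2 = x1 + rng.normal(size=5000)
adj, sepset = pc_skeleton(np.column_stack([x0, x1, x2]))
print(adj)     # expected: edges 0-1 and 1-2 remain, 0-2 is removed
print(sepset)  # expected: {(0, 2): {1}}
```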

Greedy Search of Graph Space

A greedy search is a way to navigate a space such that you always move in the direction that seems most beneficial based on the local surroundings. Although greedy searches cannot guarantee an optimal solution, the space of possible DAGs is so large that exhaustively searching for the true optimum is intractable for most problems. The Greedy Equivalence Search (GES) algorithm uses this trick. GES starts with an empty graph and iteratively adds edges such that the improvement in a model fitness measure (i.e. a score) is maximized, followed by a backward phase that greedily removes edges. An example score is the Bayesian Information Criterion (BIC).
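
The sketch below illustrates the greedy-search trick with a linear-Gaussian BIC-style score (higher is better). It is not the actual GES algorithm, which searches over equivalence classes rather than individual DAGs; this is just a forward-only greedy search over single edge additions on made-up data.

```python
import numpy as np
from itertools import permutations

def bic_score(data, parents_of):
    """Linear-Gaussian BIC-style score of a DAG: sum of per-node regression scores (higher is better)."""
    n, _ = data.shape
    score = 0.0
    for node, parents in parents_of.items():
        y = data[:, node]
        X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
        resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        score += -n * np.log(resid @ resid / n) - X.shape[1] * np.log(n)
    return score

def creates_cycle(parents_of, child, parent):
    """Would adding parent -> child create a directed cycle? (is child an ancestor of parent?)"""
    stack, seen = [parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents_of[node])
    return False

def greedy_forward_search(data):
    """Start from the empty graph; repeatedly add the single edge that most improves the score."""
    d = data.shape[1]
    parents_of = {i: [] for i in range(d)}
    best = bic_score(data, parents_of)
    while True:
        best_gain, best_edge = 0.0, None
        for parent, child in permutations(range(d), 2):
            if parent in parents_of[child] or creates_cycle(parents_of, child, parent):
                continue
            parents_of[child].append(parent)   # tentatively add the edge
            gain = bic_score(data, parents_of) - best
            parents_of[child].pop()            # undo
            if gain > best_gain:
                best_gain, best_edge = gain, (parent, child)
        if best_edge is None:                  # no addition improves the score
            return parents_of
        parent, child = best_edge
        parents_of[child].append(parent)
        best += best_gain

# Toy data from a hypothetical chain X0 -> X1 -> X2.
rng = np.random.default_rng(1)
x0 = rng.normal(size=5000)
x1 = 0.9 * x0 + rng.normal(size=5000)
x2 = 0.9 * x1 + rng.normal(size=5000)
# Expected: a chain-like result such as {0: [], 1: [0], 2: [1]}
# (some edge directions are only identified up to Markov equivalence).
print(greedy_forward_search(np.column_stack([x0, x1, x2])))
```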

Exploiting Asymmetries

A fundamental property of causality is asymmetry. A could cause B, but B may not cause A. There is a large space of algorithms that leverage this idea to select between causal model candidates.

Approaches based on functional asymmetry assume that models which fit a relationship better are better causal candidates. For example, given two variables X and Y, the nonlinear additive noise model (NANM) performs a nonlinear regression in both directions, e.g. y = f(x) + n and x = g(y) + n', where n is the noise/residual term. A causal direction is then accepted if the potential cause (e.g. x) is independent of the corresponding noise term (e.g. n).
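
Here is a hedged sketch of that idea: fit a nonlinear regression in both directions and keep the direction whose residual looks more independent of the presumed cause. A proper NANM implementation would use a kernel independence test such as HSIC; for brevity this sketch uses a crude histogram-based mutual information estimate, and the data-generating process, polynomial regression, and bin count are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical additive-noise ground truth: X causes Y via y = x + x**3 + noise.
x = rng.uniform(-1.5, 1.5, n)
y = x + x**3 + 0.3 * rng.normal(size=n)

def residual(cause, effect, degree=5):
    """Residual of a crude nonlinear regression effect ~ f(cause) (polynomial fit)."""
    coeffs = np.polyfit(cause, effect, degree)
    return effect - np.polyval(coeffs, cause)

def mutual_info(a, b, bins=20):
    """Histogram-based mutual information (in nats) as a crude dependence measure."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (px @ py)[mask])).sum())

# Fit both directions and check how dependent each residual is on its presumed cause.
mi_forward = mutual_info(x, residual(x, y))  # X -> Y: residual should be (nearly) independent of X
mi_reverse = mutual_info(y, residual(y, x))  # Y -> X: residual stays dependent on Y
print(f"dependence of residual on cause, X -> Y: {mi_forward:.3f}")
print(f"dependence of residual on cause, Y -> X: {mi_reverse:.3f}")
print("inferred direction:", "X -> Y" if mi_forward < mi_reverse else "Y -> X")
```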

Conclusion

There is no way I could fit a comprehensive review of causal discovery in a short blog post. Despite being young, causal discovery is a promising field that may help bridge the gap between machine and human knowledge.
