Causal Inference Workshop - Causal Inference & DAGs

This workshop is plagiarism!!

almost all of this content comes from Statistical Rethinking, a textbook and freely available online course by Richard McElreath
a good portion also comes from The Book of Why and other works by Judea Pearl
many scholars, ecologists and otherwise, use this method and explain it better than I ever will - resources at the end

Let’s not panic

the beauty of causal inference is that it relies on concepts that come very naturally to the human brain and is founded on using the expert scientific knowledge that every scientist brings to their studies
THIS DOES NOT CHANGE EVERYTHING - just gives you a framework to easily express what you already feel and know

What is causal inference?

the study of causes and effects: does X cause a change in Y?
is this different from correlation?
- a rooster crowing is highly correlated with the sunrise - did the rooster cause the sun to rise?
- https://www.tylervigen.com/spurious-correlations

Why don’t we talk or learn about causation?

Pearson & Galton, founders of modern statistics, failed to create the tools needed for causal inference and subsequently decided that it was impossible and “unscientific”
- they used their enormous influence to teach this to generations of scientists and to attack anyone who opposed them
Judea Pearl invented the math required to answer causal questions only ~40 years ago! Science is slow!
causation is not controversial - the field is just still transitioning

What is causal inference NOT?

prediction!! forecasting!!

if we want to use our models to estimate values in places or times that we do not have data for, but we DO NOT CARE about the relationships between the things in our model, that is prediction and not causal inference
prediction is cool!! it is, however, not what we do in our lab (for now…)
AIC is a tool for comparing the predictive power of models - it is not appropriate for deciding which variables belong in a causal model
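
a small simulation can make this concrete. This is only a sketch with made-up data: the variables x, y, z and every effect size below are assumptions, not anything from our lab. Here x truly raises y by 1, and z is caused by both x and y. Adding z gives a better AIC but a wrong answer about the effect of x.

```r
# simulated data: all names and effect sizes are illustrative assumptions
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 1 * x + rnorm(n)        # true causal effect of x on y is 1
z <- x + y + rnorm(n)        # z is caused by both x and y

m_causal  <- lm(y ~ x)       # the model that answers the causal question
m_predict <- lm(y ~ x + z)   # predicts y better, but adjusts for z

aic_causal  <- AIC(m_causal)
aic_predict <- AIC(m_predict)

# estimated effect of x in each model
effect_causal  <- unname(coef(m_causal)["x"])   # should sit near the true effect
effect_predict <- unname(coef(m_predict)["x"])  # should be pulled toward 0

print(c(aic_causal = aic_causal, aic_predict = aic_predict))
print(c(effect_causal = effect_causal, effect_predict = effect_predict))
```

the model with the lower AIC is the better forecaster, yet its coefficient for x no longer describes what happens to y when x changes.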

Who uses causal inference?

Ecology: Arif & MacNeil 2022, Siegel & Dee 2025, Laubach et al. 2021
Public health: Glass et al. 2013, Matthay & Glymour 2022
Sociology & ethnography: Knight & Winship 2013, Brett & Silver 2024, Snodgrass et al. 2024
Machine learning
Anyone who has observational data (and sometimes experimental data) and wants to understand a cause and effect relationship in their system

When do people use causal inference?

Level 1: association

how are the variables related? how does changing X shift my belief in Y?
example: what does canopy cover tell us about air temperature?

Level 2: intervention

what would Y be if I do X?
example: how will bird species richness change if I move from a park to a backyard?

Level 3: counterfactuals

what if X had not occurred? is it X that caused Y?
example: would survey respondents prefer different green space features if they lived in a different borough?

How do I do causal inference?

DAGs (directed acyclic graphs)!
arrows indicate a causal relationship from one variable to another
use your expert knowledge + literature to outline your system with your hypotheses and assumptions (you already make assumptions now, you just don’t visualize them!)
adjust your statistical test (e.g., model) using your DAG

Why do DAGs matter?

putting everything in your model does not test the relationship(s) you are interested in

complex systems have confounders that mislead us and that we need to adjust for
adjustments are dependent on our DAG and the variable of interest

Confounders: fork

Z is a common cause of both X and Y
X and Y are associated
Once stratified by Z, X and Y have no association
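
the three statements above can be checked with a quick simulation. This is a minimal sketch with made-up data (the effect sizes are assumptions): Z causes both X and Y, and there is no arrow from X to Y.

```r
# simulated fork: Z -> X and Z -> Y, no causal link between X and Y
set.seed(2)
n <- 1000
z <- rnorm(n)            # common cause
x <- z + rnorm(n)        # Z -> X
y <- z + rnorm(n)        # Z -> Y

cor_xy <- cor(x, y)                                    # X and Y are associated
slope_unadjusted <- unname(coef(lm(y ~ x))["x"])       # misleading non-zero slope
slope_adjusted   <- unname(coef(lm(y ~ x + z))["x"])   # near 0 once Z is in the model

print(c(cor_xy = cor_xy,
        slope_unadjusted = slope_unadjusted,
        slope_adjusted = slope_adjusted))
```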

EXAMPLE: the effect of canopy cover on temperature (simplified)

library(ggdag)

fork_dag <- dagify(
  temp ~ canopy + SVF,
  canopy ~ SVF,
  labels = c(
    "temp" = "Temperature",
    "canopy" = "Canopy Cover",
    "SVF" = "Sky View Factor"
  ),
  coords = list(x = c(canopy = -1, SVF = 0, temp = 1), 
                y = c(canopy = 1, SVF = 0, temp = 1)),
  exposure = "canopy",
  outcome = "temp"
)

ggdag(fork_dag, text = FALSE, use_labels = "label") + theme_dag()

how do we know how to adjust our model?

ggdag_adjustment_set(fork_dag, effect = "direct", text = FALSE, use_labels = "label", shadow = TRUE) + 
  theme_dag()

our DAG shows us that if we want to understand the effect of canopy cover on temperature, we need to add sky view factor to our model (“adjust for sky view factor”)
our model may look something like this:

library(lme4)

temp_model <- lmer(temperature ~ canopy + SVF + (1|date),
                   data = temp_df)

Confounders: pipe

what if we want to use the same example but understand the effect of SVF on temperature?
this is now a different type of confounder, a pipe!

X and Y are associated
influence of X on Y is transmitted through Z
Once stratified by Z, X and Y have no association
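
a minimal simulated pipe, to see the blocking in its simplest form. The data are made up and the effect sizes are assumptions; X affects Y only through Z.

```r
# simulated pipe: X -> Z -> Y, no direct arrow from X to Y
set.seed(3)
n <- 5000
x <- rnorm(n)
z <- x + rnorm(n)        # X -> Z
y <- z + rnorm(n)        # Z -> Y

total_effect   <- unname(coef(lm(y ~ x))["x"])      # recovers the effect of X (about 1)
blocked_effect <- unname(coef(lm(y ~ x + z))["x"])  # near 0: adjusting for Z blocks the pipe

print(c(total_effect = total_effect, blocked_effect = blocked_effect))
```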

the effect of SVF on temperature is both a direct effect and an effect that pipes through canopy
if we want to understand the total effect of sky view factor on temperature, we DO NOT add canopy to our models
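
the call below uses an object named pipe_dag, which is not defined anywhere in this text. A plausible definition, assuming it has the same arrows, labels and layout as fork_dag with SVF set as the exposure:

```r
library(ggdag)

# assumed definition: same DAG as fork_dag, but SVF is now the exposure
pipe_dag <- dagify(
  temp ~ canopy + SVF,
  canopy ~ SVF,
  labels = c(
    "temp" = "Temperature",
    "canopy" = "Canopy Cover",
    "SVF" = "Sky View Factor"
  ),
  coords = list(x = c(canopy = -1, SVF = 0, temp = 1),
                y = c(canopy = 1, SVF = 0, temp = 1)),
  exposure = "SVF",
  outcome = "temp"
)
```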

ggdag_adjustment_set(pipe_dag, effect = "total", text = FALSE, use_labels = "label", shadow = TRUE) + 
  theme_dag()

the adjustment set is empty! every other variable stays “unadjusted”
because part of the effect of SVF goes through canopy, adding canopy to your model blocks that part of the effect
your model might look like this:

temp_model <- lmer(temperature ~ SVF + (1|date),
                   data = temp_df)

Confounders: collider

X and Y are not associated (share no causes)
X and Y both influence Z
Once stratified by Z, X and Y appear associated

EXAMPLE: public and private tree species richness

both public tree species richness and private tree species richness contribute to the urban forest’s functional diversity
however, public tree species richness does not influence private tree species richness and vice versa

BUT if we have a model for private tree species richness with both urban forest functional diversity and public tree species richness included, an association between public and private species richness will appear when it does not truly exist
“spurious correlation”
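
the same trap can be reproduced with simulated data. This is a sketch only: the values are standardized made-up scores and the effect sizes are assumptions. Public and private richness are generated independently, and functional diversity depends on both.

```r
# simulated collider: public -> functional diversity <- private
set.seed(4)
n <- 5000
public_richness  <- rnorm(n)
private_richness <- rnorm(n)   # generated independently of public richness
functional_div   <- public_richness + private_richness + rnorm(n)

# without the collider: no association, as it should be
slope_plain <- unname(
  coef(lm(private_richness ~ public_richness))["public_richness"]
)

# with the collider in the model: a spurious negative association appears
slope_collider <- unname(
  coef(lm(private_richness ~ public_richness + functional_div))["public_richness"]
)

print(c(slope_plain = slope_plain, slope_collider = slope_collider))
```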

Confounders: descendant

a descendant behaves differently depending on which variable it is attached to
X and Y are causally associated through Z
A is a descendant of Z and holds information about Z
if stratified by A, X and Y are less associated

we know from before that we shouldn’t add proportion of native trees to our model if we want to test the effect of land use type on bird behaviour, because it’s a pipe
because proportion of invasive trees is a descendant, adding it to the model will have the same effect, but weaker
descendants can be used as proxies for our variables of interest
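
“the same effect but weaker” can be seen in a small simulation. This is a sketch with made-up data and assumed effect sizes, using the X, Z, A, Y notation from above: Z is a pipe between X and Y, and A is a noisy descendant of Z.

```r
# simulated pipe with a descendant: X -> Z -> Y and Z -> A
set.seed(5)
n <- 5000
x <- rnorm(n)
z <- x + rnorm(n)        # X -> Z
y <- z + rnorm(n)        # Z -> Y
a <- z + rnorm(n)        # Z -> A (A carries partial information about Z)

effect_unadjusted <- unname(coef(lm(y ~ x))["x"])      # full effect of X
effect_given_a    <- unname(coef(lm(y ~ x + a))["x"])  # partly blocked
effect_given_z    <- unname(coef(lm(y ~ x + z))["x"])  # fully blocked

print(c(effect_unadjusted = effect_unadjusted,
        effect_given_a = effect_given_a,
        effect_given_z = effect_given_z))
```

adjusting for A lands between the unadjusted estimate and the estimate with Z itself in the model.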

Adjusting your models for confounders

to test the effect of X on Y, we need to identify which variables to adjust for (aka add to the model) to block all confounding paths
working this out by hand gets complex when there are more than 4 or 5 variables in the system
to figure out which variables we need to adjust for, we can use dagitty!
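
a minimal sketch of what this can look like with the dagitty R package, reusing the canopy / SVF / temperature arrows from the fork example. The object names are hypothetical; dagitty() and adjustmentSets() are the package's own functions.

```r
library(dagitty)

# same arrows as the canopy cover example
temp_dag <- dagitty("dag {
  SVF -> canopy
  SVF -> temp
  canopy -> temp
}")

# which variables to adjust for, for each question we might ask
canopy_sets <- adjustmentSets(temp_dag, exposure = "canopy", outcome = "temp")
svf_sets    <- adjustmentSets(temp_dag, exposure = "SVF", outcome = "temp")

print(canopy_sets)  # adjustment set(s) for the effect of canopy
print(svf_sets)     # adjustment set(s) for the total effect of SVF
```

the answer depends on the question: the effect of canopy calls for adjusting for SVF, while the total effect of SVF calls for no adjustment at all.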

Some DAG notes / a petit sermon

variables that do not have shared causes in your system do not need to be included - your DAG does not need to include every variable in the world
do NOT exclude variables just because you haven’t measured them, these are still potential confounders and need to be part of your DAG!
you are an expert with good intuition and expertise, don’t be scared to put your assumptions down on paper
presenting the assumptions you are making about your system is good, transparent science and allows the development of the field
- !! you are doing this anyways !! when you decide what variables to collect / what to include in your models, you are just being less transparent about it! we must always do our best and be brave!

Table 2 fallacy

each variable has its own set of adjustments it needs in order to test its effects
therefore, not all coefficients in a summary table are causal relationships
- check your DAG!
this is described as the “Table 2 Fallacy” - we can only interpret the effect(s) whose adjustment set we have included, not everything in our summary table
see: “The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients” (Westreich & Greenland 2013)
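
a simulated version of the canopy example shows the problem. This is a sketch only: the data are made up and the effect sizes (-0.8, -1 and 0.5) are assumptions chosen for illustration.

```r
# simulated system: SVF -> canopy -> temp, plus a direct SVF -> temp arrow
set.seed(6)
n <- 5000
svf    <- rnorm(n)
canopy <- -0.8 * svf + rnorm(n)
temp   <- -1 * canopy + 0.5 * svf + rnorm(n)
# implied total effect of SVF on temp: 0.5 + (-0.8 * -1) = 1.3

fit_canopy <- lm(temp ~ canopy + svf)  # built to estimate the effect of canopy
fit_svf    <- lm(temp ~ svf)           # built to estimate the total effect of SVF

canopy_effect <- unname(coef(fit_canopy)["canopy"])  # interpretable: effect of canopy
svf_in_table  <- unname(coef(fit_canopy)["svf"])     # only the direct part of the SVF effect
svf_total     <- unname(coef(fit_svf)["svf"])        # total effect of SVF

print(c(canopy_effect = canopy_effect,
        svf_in_table = svf_in_table,
        svf_total = svf_total))
```

the SVF coefficient in the canopy model is not wrong, but it answers a different question (the direct effect only) than the one we would ask about SVF.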

dagitty.net - crowd-sourced example

Resources