Chapter 2 Developing a DAG

Causal diagrams represent the data generation process that creates the data that we observe (Huntington-Klein, 2022). We need to think through this process and jot down all of the potential nodes and pathways. Next, we need to simpify the process to see what is relevant

2.1 Relevant Variables in the Data Generation Process

Our focus of the directed acyclic graph (DAG) is to include everything relevant to our research question of interest into the DAG. Huntington-Klein (2022) uses the example of online classes (treatment) and dropping out (outcome).

Our treatment will be Online Class and our outcome will be Dropout

DAG 8

What else should we includee?

Demographics
1. Race
2. Sex/Gender
3. Age
4. Socioeconomic Status
Available Time
Work Hours
Internet Access
Academics
Location
Unobservables
1. Preferences

For our direction of our arrows among our covariates, treatment, and outcomes.

Online Classes cause Drop Outs
Prefernces cause Online Classes
Race causes Preferences, DropOut, Available Time, Work Hours, and Unobservable causes Race to be related Academics and SES
Sex causes Preferences, DropOut, Available Time, Work Hours, and Unobservable causes Sex to be related to Academics and SES
Age causes Preferences, DropOut, Available Time, and Work Hours
SES causes Preferences, DropOut, Internet Access, Available Time, and Work Hours
Available Time causes Online Classes and DropOut
Work Hours cause Available Time
Internet Access cause Online Classes
Academics cause DropOut, Work Hours, Unobservable cause Academics to be related to Race and Sex 11 Location causes Internet Access, Unobservable causes Location related to SES

DAG 9

So our first DAG is complicated and a bit of the mess. However, we have thought through the data generation process.

2.2 Simplify

The data generation process is complex, and developing a complex DAG will prevent you and the reader from understanding the data generation process. So, we should determine what is helpful and what is unhelpful in the data generation process. Simplification is a balancing process. We want to have a clear and concise DAG, but simultaneously we want a DAG that represents the data generation process (Huntington-Klein, 2022).

Unimportant
Redundancy
Mediators
Irrelevance

Unimportant: If you include a variable that is likely small or has unimportant effects, then you probably can leave it out of your DAG.

What about having a quiet cafe nearby? This be enticing for someone wanting to take an online class, but it is likely irrelevant, unimportant, or maybe in Location or Preferences

Redundancy: If there are any variables in your DAG within the same space, such that they have arrows coming in or going out of to or from the same variable, then we can combine (e.g.: race and ethnicity, sex \(\rightarrow\) SES).

There may be some redundancies that we can eliminate to simplify our DAG. Since Sex and Race have similar arrows going to Work Hours, Available Time, Preferences, and DropOut, we can combine these into one node for simplicity and call it Demographics

Mediators: If there is a mediator in your DAG that is only there as a way from treatment to affect outcome. Such that, our mediator, \(B\), in \(A \rightarrow B \rightarrow C\) is the only away for \(A\) to affect \(C\), then we can drop \(B\) and only include \(A \rightarrow C\).

Preferences appears to be a mediator among Age, Race, Sex, and SES onto Online Classes. We can just have our four factors affect Online Classes directly.
Internet Access is also a mediator along the path of \(Location \rightarrow InternetAccess \rightarrow OnlineClassess\), and it is not a variable of interest in our research design. Therefore, we can drop it for simplification.
Work Hours affects Online Classes through Available Time. However, we can simplify and have Work Hours affect Online Classes directly. While Academics does not directly affect Available Time, it does affect it through Work Hours, so this won’t be a problem.

Irrelevance: Some variables are relevant for the data generation process, but not for our research design. If a variable is not on either a front door path or back door path, then we can likely remove this variable from the DAG.

DAG 10

2.3 No Cylces

In the name is “acyclic”, which means there are no cycles. You should not start at one path and end up back where you started (Huntington-Klein, 2022). Why? Because, we will never be able to identify the causal effect.

Examples where we cannot identify the causal effect due to cycles:

\(A \rightarrow B \rightarrow A\)
\(A \rightarrow B \rightarrow C \rightarrow A\)

Problematic! Stuck in a loop.

dag3<-dagify(B ~ A, C ~ B, A ~ C,exposure="A",outcome="B",
             coords=list(x=c(A=0,B=1,C=0),
                         y=c(A=0,B=0,C=1)))
ggdag_status(dag3)+theme_dag()

DAG 11

Problematic! Stuck in a loop.

dag3<-dagify(B ~ A, A ~ B, exposure="A",outcome="B",
             coords=list(x=c(A=0,B=1),
                         y=c(A=0,B=0)))
ggdag_status(dag3)+theme_dag()

DAG 12

We cannot identify the causal effect in either scenario?

2.3.1 Solution

We can add a time dimension and the cycle is now gone. The arrows will now move in only one direction (Huntington-Klein, 2022).

dag3<-dagify(B2 ~ A1, A2 ~ B1,
             coords=list(x=c(A1=0,B1=0,A2=1,B2=1),
                         y=c(A1=1,B1=0,A2=1,B2=0)))
ggdag_status(dag3)+theme_dag()

DAG 13

Here we have \(A_{t} \rightarrow \ B_{t+1}\) and \(B_{t} \rightarrow A_{t+1}\), and we do not get stuck in a loop.

We can also add an instrument of exogenous variation to break the cycle, as well (Huntington-Klein, 2022).

2.4 Our Assumptions

What we do not show on our directed acyclic graph shows our assumptions. We cannot show everything in the data generation process, due to irrelevance, simplicification, etc. Our assumptions may not be dicotomous, such that are assumption are completely true or completely false. Our assumptions are likely true or likely false. It is our job to convince readers that our assumptions are true or our assumptions hold (Huntington-Klein, 2022).