Causal Simulation Exemplar

Herdiantri Sufriyana

Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan.Department of Medical Physiology, Faculty of Medicine, Universitas Nahdlatul Ulama Surabaya, Surabaya, Indonesia.
herdi@tmu.edu.tw

Emily Chia-Yu Su

Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan.Clinical Big Data Research Center, Taipei Medical University Hospital, Taipei, Taiwan.Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan.

2024-05-18

Introduction to Causal Graphs

To generate causally-simulated data effectively, it is essential to understand how to represent cause-and-effect relationships using causal graphs. We use the igraph package for creating and manipulating network structures and the ggnetwork package for visualizing these networks within the ggplot2 framework of the tidyverse package. Below is how to load these libraries in R:

library(igraph)
library(ggnetwork)
library(tidyverse)

We will use \(X\) and \(Y\) respectively to denote a cause and an effect. Both are represented as vertices in a graph. If there is a cause-and-effect relationship between both, then an edge is drawn from \(X\) to \(Y\). To do this, we need a 2-column data frame: (1) from, and (2) to.

# Create a data frame for the X and Y relationship
d <- data.frame(from = "X", to = "Y")

from	to
X	Y

We convert the data frame into an igraph object as a directed graph.

# Convert the data frame into an igraph object.
g <- graph_from_data_frame(d, directed = TRUE)
print(g)

## IGRAPH ae95e2f DN-- 2 1 -- 
## + attr: name (v/c)
## + edge from ae95e2f (vertex names):
## [1] X->Y

To visualize the graph, we need to automatically determine the coordinates to plot \(X\) and \(Y\) and draw a line from \(X\) to \(Y\).

# Lay out the graph as a tree
g_layout <- layout_as_tree(g)

# Determine the coordinates with ggnetwork
set.seed(1)
g_coord <- ggnetwork(g, layout = g_layout)

	x	y	name	xend	yend
2	0.5	1	X	0.5	0.025
1	0.5	1	X	0.5	1.000
21	0.5	0	Y	0.5	0.000

To draw the graph, we utilize ggplot for its flexibility in customizing graph aesthetics. Use closed and curved arrows to clearly indicate the direction of the causal effect and helps prevent overlapping of arrows.

We may need to make the tree layout horizontal. This orientation helps in visualizing the causal flow from left to right, making it easier to follow.

# Define the plot area
g_plot <- ggplot(g_coord, aes(x, y, xend = xend, yend = yend))

# Draw edges with closed, curved arrows to emphasize direction
g_plot <- g_plot + geom_edges(arrow = arrow(type = "closed"), curvature = 0.15)

# Add node labels
g_plot <- g_plot + geom_nodelabel(aes(label = name))

# Make the tree layout horizontal
g_plot <- g_plot + coord_flip()
g_plot <- g_plot + scale_y_reverse()

# Apply a minimal theme
g_plot <- g_plot + theme_void()

# Display the graph
print(g_plot)

Vertex and edge

A variable is represented as a vertex, and a path is represented as 1 or more edges. Two variables may have a path between them. It means \(X\) and \(Y\) are dependent.

It is also possible for two variables to have no path between them. It means \(X\) and \(Y\) are independent.

Path

If two variables have a path between them, there are four possible paths. First, a path consists of a single edge, which we call a causal path.

Second, a path consists of more than one edges with 1 or more mediators, which we call a mediator path. A mediator is connected by two edges: one from \(X\) to \(M\) and another from \(M\) to \(Y\).

Third, a path consists of more than one edges with 1 or more confounders, which is called a confounder path. A confounder is connected by two edges: one to \(X\) and another to \(Y\).

Fourth, a path consists of more than one edges with 1 or more colliders, i.e., a collider path. A collider is connected by two edges coming from \(X\) and \(Y\) to \(K\).

Correlation and causation

We will empirically test these theoretical distinctions by generating causally-simulated data and conducting regression analysis. For this, we utilize the rcausim package for data generation and the broom package to tidy up the results from our correlation analysis.

library(rcausim)
library(broom)

Causal and mediator paths allow both

We start by defining a Functions class based on the specified causal structure in our previous data frame d. This class will help specify the required arguments for each variable’s function.

path1_functions <- function_from_edge(d)
print(path1_functions)

## 0 / 2  vertices have functions.
## Please define functions for:
## X = function(n)
## Y = function(X)

Next, we define a function for \(X\) that generates a normally distributed numeric vector of length n.

function_X <- function(n){
  X <- rnorm(n, mean = 0, sd = 1)
  return(X)
}

For Y, we create a linear function that is dependent on X for \(Y\).

function_Y <- function(X){
  Y <- 0.5 * X + rnorm(length(X), mean = 0.1, sd = 0.005)
  return(Y)
}

We now define path1_functions using function_X and function_Y.

path1_functions <- define(path1_functions, which = 'X', what = function_X)
path1_functions <- define(path1_functions, which = 'Y', what = function_Y)
print(path1_functions)

## 2 / 2  vertices have functions.

Generate data using path1_functions.

set.seed(1)
path1_data <- data_from_function(path1_functions, n = 25000)

X	Y
-0.6264538	-0.2122441
0.1836433	0.1897219
-0.8356286	-0.3119980
1.5952808	0.8956116
0.3295078	0.2684744
-0.8204684	-0.3078511

Finally, we perform a regression to infer the correlation between \(X\) and \(Y\).

path1_reg <- lm(Y ~ X, data = path1_data)
path1_results <- tidy(path1_reg)

term	estimate	std.error	statistic	p.value
(Intercept)	0.0999709	3.2e-05	3126.557	0
X	0.5000290	3.2e-05	15639.931	0

This output should show the estimated coefficients, their standard errors, \(z\)-values, and \(p\)-values. Results shows that \(X\) has a significant effect on \(Y\) (\(p\)-value <0.05). The estimated coefficients reflect the data generation process, providing a consistency check for our simulation.

For the mediator path involving a mediator \(M\), we create a new data frame d2. This data frame specifies the edges from \(X\) to \(M\) and then from \(M\) to \(Y\).

d2 <- data.frame(from = "X", to = "M")
d2 <- add_row(d2, data.frame(from = "M", to = "Y"))

from	to
X	M
M	Y

A new class of Functions is defined using d2 to establish a path which involves a mediator. This step helps clarify the arguments in each function of \(X\), \(M\), and \(Y\).

path2_functions <- function_from_edge(d2)
print(path2_functions)

## 0 / 3  vertices have functions.
## Please define functions for:
## X = function(n)
## M = function(X)
## Y = function(M)

We reuse the same function for \(X\), where \(X\) is generated from a normal distribution.

function_X <- function(n){
  X <- rnorm(n, mean = 0, sd = 1)
  return(X)
}

We also define a function for \(M\), which is directly influenced by \(X\).

function_M <- function(X){
  M <- 0.7 * X + rnorm(length(X), mean = 0.3, sd = 0.005)
  return(M)
}

Finally, we define a function for \(Y\) as a linear function of \(M\).

function_Y <- function(M){
  Y <- 0.5 * M + rnorm(length(M), mean = 0.1, sd = 0.005)
  return(Y)
}

We define the path2_functions before generating data.

path2_functions <- define(path2_functions, which = 'X', what = function_X)
path2_functions <- define(path2_functions, which = 'M', what = function_M)
path2_functions <- define(path2_functions, which = 'Y', what = function_Y)
print(path2_functions)

## 3 / 3  vertices have functions.

With path2_functions, we generate data that represent the mediator path.

set.seed(1)
path2_data <- data_from_function(path2_functions, n = 25000)

X	M	Y
-0.6264538	-0.1375349	0.0338620
0.1836433	0.4264506	0.3107876
-0.8356286	-0.2791237	-0.0338706
1.5952808	1.4146678	0.8134096
0.3295078	0.5343759	0.3650638
-0.8204684	-0.2719448	-0.0432266

Another regression is performed to infer the correlation between \(X\) and \(Y\).

path2_reg <- lm(Y ~ X, data = path2_data)
path2_results <- tidy(path2_reg)

term	estimate	std.error	statistic	p.value
(Intercept)	0.2499616	3.52e-05	7110.367	0
X	0.3500053	3.52e-05	9957.264	0

In our data generation process, \(M\) is defined as \(0.7 \times X + 0.3\), and \(Y\) as \(0.5 \times M + 0.1\); therefore, \(Y(X)\) equals \(0.5 \times (0.7 \times X + 0.3) + 0.1\). Solving this equation, we find that the estimated coefficients reflect the data generation process.

Confounder path allows correlation

In the third path, both \(X\) and \(Y\) are influenced by \(C\), as shown in d3 below.

d3 <- data.frame(from = "C", to = "X")
d3 <- add_row(d3, data.frame(from = "C", to = "Y"))

from	to
C	X
C	Y

We clarify the arguments in the functions for each variable based on the data frame d3.

path3_functions <- function_from_edge(d3)
print(path3_functions)

## 0 / 3  vertices have functions.
## Please define functions for:
## C = function(n)
## X = function(C)
## Y = function(C)

We define a normal distribution for \(C\).

function_C <- function(n){
  C <- rnorm(n, mean = 0, sd = 1)
  return(C)
}

The value of \(X\) depends on \(C\).

function_X <- function(C){
  X <- 0.7 * C + rnorm(length(C), mean = 0.3, sd = 0.005)
  return(X)
}

Similarly, \(Y\) also depends on \(C\).

function_Y <- function(C){
  Y <- 0.5 * C + rnorm(length(C), mean = 0.1, sd = 0.005)
  return(Y)
}

The functions are connected into path3_functions

path3_functions <- define(path3_functions, which = 'C', what = function_C)
path3_functions <- define(path3_functions, which = 'X', what = function_X)
path3_functions <- define(path3_functions, which = 'Y', what = function_Y)
print(path3_functions)

## 3 / 3  vertices have functions.

Data is generated using path3_functions.

set.seed(1)
path3_data <- data_from_function(path3_functions, n = 25000)

C	X	Y
-0.6264538	-0.1375349	-0.2105975
0.1836433	0.4264506	0.1893839
-0.8356286	-0.2791237	-0.3121231
1.5952808	1.4146678	0.9037161
0.3295078	0.5343759	0.2626297
-0.8204684	-0.2719448	-0.3174884

A regression examines the correlation between \(X\) and \(Y\).

path3_reg <- lm(Y ~ X, data = path3_data)
path3_results <- tidy(path3_reg)

term	estimate	std.error	statistic	p.value
(Intercept)	-0.1142645	4.27e-05	-2676.865	0
X	0.7142049	5.60e-05	12748.446	0

Results indicate a statistically significant correlation between \(X\) and \(Y\), although there is no edge directly from \(X\) to \(Y\). Notably, the coefficients are not identical to those used in the data generation process.

Collider path allows none

In this path, both \(X\) and \(Y\) both influence \(K\) but do not influence each other. This scenario is illustrated by d4.

d4 <- data.frame(from = "X", to = "K")
d4 <- add_row(d4, data.frame(from = "Y", to = "K"))

from	to
X	K
Y	K

A class of Functions is created to clarify the arguments in the functions for \(X\), \(Y\), and \(K\).

path4_functions <- function_from_edge(d4)
print(path4_functions)

## 0 / 3  vertices have functions.
## Please define functions for:
## X = function(n)
## Y = function(n)
## K = function(X,Y)

The function for \(X\) determines its values to be normally-distributed.

Both \(X\) and \(Y\) are independently defined with normally distributed values.

function_X <- function(n){
  X <- rnorm(n, mean = 0, sd = 1)
  return(X)
}

function_Y <- function(n){
  Y <- rnorm(n, mean = 0.1, sd = 0.005)
  return(Y)
}

The value of \(K\) depends on both \(X\) and \(Y\).

function_K <- function(X, Y){
  K <- 0.7 * X + 0.5 * Y + rnorm(length(X), mean = 0.1, sd = 0.005)
  return(K)
}

Connect all functions into path4_functions.

path4_functions <- define(path4_functions, which = 'X', what = function_X)
path4_functions <- define(path4_functions, which = 'Y', what = function_Y)
path4_functions <- define(path4_functions, which = 'K', what = function_K)
print(path4_functions)

## 3 / 3  vertices have functions.

Finally, data are generated using path4_functions

set.seed(1)
path4_data <- data_from_function(path4_functions, n = 25000)

X	Y	K
-0.6264538	0.1009828	-0.2853968
0.1836433	0.0979003	0.2750627
-0.8356286	0.1058163	-0.4263406
1.5952808	0.0979712	1.2717578
0.3295078	0.1037205	0.3803915
-0.8204684	0.1023831	-0.4303905

We then evaluate the correlation between \(X\) and \(Y\) by conducting a regression.

path4_reg <- lm(Y ~ X, data = path4_data)
path4_results <- tidy(path4_reg)

term	estimate	std.error	statistic	p.value
(Intercept)	0.0999709	3.2e-05	3126.5572990	0.0000000
X	0.0000290	3.2e-05	0.9071066	0.3643592

Results shows no significant correlation between X and Y, although there is a path between them.

Causal Discovery

Among the four paths, only causal and mediator paths represent causal relationships, while the others are non-causal. We can guess the causal structure based on these properties using directional (\(d\))-separation. It is a criterion used to determine whether a set of variables, \(Z\), provides conditional independence between the pair of other variables (i.e., \(X\) and \(Y\)) within a causal graph.

Specifically, we need to identify if \(X\) and \(Y\) are dependent, then we condition them on \(Z\). If \(X\) and \(Y\) is conditionally independent, then there is no causal path between them. The rules of \(d\)-separation are:

The path contains a mediator \(X \rightarrow M \rightarrow Y\) or a confounder \(X \leftarrow C \rightarrow Y\) where both \(M\) and \(C\) are in \(Z\); and
The path contains a collider \(X \rightarrow K \leftarrow Y\) where \(K\) is not in Z and none in Z is dependent on \(K\).

Notably, when determining causal relationships, we include \(M\) in \(Z\) only if we aim to identify if there are any causal paths from \(X\) to \(Y\) that do not pass through \(M\).

Now, we create additional versions of the mediator, confounder, and collider paths, each including the causal path. These versions, without and with a causal path, represent scenarios in which the null hypothesis and the alternative hypothesis are respectively accepted.

d2_with_d <- add_row(d2, d)
d3_with_d <- add_row(d3, d)
d4_with_d <- add_row(d4, d)

Mediator without/with causal path

Below we outline the data generation process in which the null hypothesis (left) and the alternative hypothesis (right) are accepted.

A regression was already conducted for the mediator path without a causal path, identified as path2_results. We need to apply the same analysis to the version with a causal path.

path2_d_functions <- function_from_edge(d2_with_d)
print(path2_d_functions)

## 0 / 3  vertices have functions.
## Please define functions for:
## X = function(n)
## M = function(X)
## Y = function(M,X)

function_X <- function(n){
  X <- rnorm(n, mean = 0, sd = 1)
  return(X)
}

function_M <- function(X){
  M <- 0.7 * X + rnorm(length(X), mean = 0.3, sd = 0.005)
  return(M)
}

function_Y <- function(X, M){
  Y <- 0.25 * X + 0.5 * M + rnorm(length(X), mean = 0.1, sd = 0.005)
  return(Y)
}

path2_d_functions <- define(path2_d_functions, which = "X", what = function_X)
path2_d_functions <- define(path2_d_functions, which = "M", what = function_M)
path2_d_functions <- define(path2_d_functions, which = "Y", what = function_Y)

set.seed(1)
path2_d_data <- data_from_function(path2_d_functions, n = 25000)
path2_d_reg <- lm(Y ~ X, data = path2_d_data)
path2_d_results <- tidy(path2_d_reg)

We condition each version on \(Z\). In this case, it includes \(M\) to identify any causal paths from \(X\) to \(Y\) that do not pass through \(M\).

path2_cond_reg <- lm(Y ~ X + M, data = path2_data)
path2_cond_results <- tidy(path2_cond_reg)

path2_cond_d_reg <- lm(Y ~ X + M, data = path2_d_data)
path2_cond_d_results <- tidy(path2_cond_d_reg)

causal_path	conditioned	term	estimate	std.error	statistic	p.value
No	No	(Intercept)	0.2499616	0.0000352	7110.367085	0.0000000
No	No	X	0.3500053	0.0000352	9957.263573	0.0000000
No	Yes	(Intercept)	0.1044050	0.0018721	55.769205	0.0000000
No	Yes	X	0.0103260	0.0043683	2.363846	0.0180939
No	Yes	M	0.4852360	0.0062400	77.761933	0.0000000
Yes	No	(Intercept)	0.2499616	0.0000352	7110.367085	0.0000000
Yes	No	X	0.6000053	0.0000352	17069.487695	0.0000000
Yes	Yes	(Intercept)	0.1044050	0.0018721	55.769205	0.0000000
Yes	Yes	X	0.2603260	0.0043683	59.594236	0.0000000
Yes	Yes	M	0.4852360	0.0062400	77.761933	0.0000000

The regression results show that the coefficient of \(X\) is reduced from approximately 0.35 to nearly 0 when we condition on \(Z\) (which includes the mediator \(M\)) in the first version. When conditioning the second version on \(Z\), the coefficient of \(X\) is also reduced, from about 0.60 to 0.26; however, the effect of \(X\) on \(Y\) remains significant. This indicates that in the first scenario, conditioning on \(M\) successfully blocks the path from \(X\) to \(Y\), supporting the hypothesis of \(M\) being a complete mediator. In contrast, in the second scenario, \(M\) does not completely mediate the relationship between \(X\) and \(Y\), as \(X\) still has a direct effect on \(Y\).

Confounder without/with causal path

We now apply the data generation process similarly, but in this instance, \(Z\) includes \(C\), as illustrated below.

In addition to path3_results for the first version, we need to conduct a regression analysis for the second version.

path3_d_functions <- function_from_edge(d3_with_d)
print(path3_d_functions)

## 0 / 3  vertices have functions.
## Please define functions for:
## C = function(n)
## X = function(C)
## Y = function(C,X)

function_C <- function(n){
  X <- rnorm(n, mean = 0, sd = 1)
  return(X)
}

function_X <- function(C){
  X <- 0.7 * C + rnorm(length(C), mean = 0.3, sd = 0.005)
  return(X)
}

function_Y <- function(X, C){
  Y <- 0.25 * X + 0.5 * C + rnorm(length(C), mean = 0.1, sd = 0.005)
  return(Y)
}

path3_d_functions <- define(path3_d_functions, which = "C", what = function_C)
path3_d_functions <- define(path3_d_functions, which = "X", what = function_X)
path3_d_functions <- define(path3_d_functions, which = "Y", what = function_Y)

set.seed(1)
path3_d_data <- data_from_function(path3_d_functions, n = 25000)
path3_d_reg <- lm(Y ~ X, data = path3_d_data)
path3_d_results <- tidy(path3_d_reg)

Both versions are conditioned on \(Z\) to determine any causal paths from \(X\) to \(Y\) that can be identified when the non-causal path involving \(C\) is blocked.

path3_cond_reg <- lm(Y ~ X + C, data = path3_data)
path3_cond_results <- tidy(path3_cond_reg)

path3_cond_d_reg <- lm(Y ~ X + C, data = path3_d_data)
path3_cond_d_results <- tidy(path3_cond_d_reg)

causal_path	conditioned	term	estimate	std.error	statistic	p.value
No	No	(Intercept)	-0.1142645	0.0000427	-2676.865003	0.000000
No	No	X	0.7142049	0.0000560	12748.446459	0.000000
No	Yes	(Intercept)	0.1044050	0.0018721	55.769205	0.000000
No	Yes	X	-0.0147640	0.0062400	-2.366022	0.017988
No	Yes	C	0.5103260	0.0043683	116.824627	0.000000
Yes	No	(Intercept)	-0.1142645	0.0000427	-2676.865003	0.000000
Yes	No	X	0.9642049	0.0000560	17210.907481	0.000000
Yes	Yes	(Intercept)	0.1044050	0.0018721	55.769205	0.000000
Yes	Yes	X	0.2352360	0.0062400	37.697956	0.000000
Yes	Yes	C	0.5103260	0.0043683	116.824627	0.000000

For the first version, the coefficient of \(X\) is significantly reduced from approximately 0.7 to nearly 0 when we condition on \(Z\) (which includes the confounder \(C\)). In the second version, the coefficient is reduced from 0.96 to 0.24; however, \(X\) still exerts a significant effect on \(Y\).

Collider without/with causal path

Unlike the mediator and confounder paths, the collider path exhibits an opposite result when we condition the effect of \(X\) on \(Y\).

We proceed with a similar analysis.

path4_d_functions <- function_from_edge(d4_with_d)
print(path4_d_functions)

## 0 / 3  vertices have functions.
## Please define functions for:
## X = function(n)
## Y = function(X)
## K = function(X,Y)

function_X <- function(n){
  X <- rnorm(n, mean = 0, sd = 1)
  return(X)
}

function_Y <- function(X){
  Y <- 0.5 * X + rnorm(length(X), mean = 0.1, sd = 0.005)
  return(Y)
}

function_K <- function(X, Y){
  K <- 0.7 * X + 0.5 * Y + rnorm(length(X), mean = 0.1, sd = 0.005)
  return(K)
}

path4_d_functions <- define(path4_d_functions, which = "X", what = function_X)
path4_d_functions <- define(path4_d_functions, which = "Y", what = function_Y)
path4_d_functions <- define(path4_d_functions, which = "K", what = function_K)

set.seed(1)
path4_d_data <- data_from_function(path4_d_functions, n = 25000)
path4_d_reg <- lm(Y ~ X, data = path4_d_data)
path4_d_results <- tidy(path4_d_reg)

The effect of \(X\) on \(Y\) is also conditioned on \(Z\) which includes \(K\). Notably, this approach violates the rules of \(d\)-separation.

path4_cond_reg <- lm(Y ~ X + K, data = path4_data)
path4_cond_results <- tidy(path4_cond_reg)

path4_cond_d_reg <- lm(Y ~ X + K, data = path4_d_data)
path4_cond_d_results <- tidy(path4_cond_d_reg)

causal_path	conditioned	term	estimate	std.error	statistic	p.value
No	No	(Intercept)	0.0999709	0.0000320	3126.5572990	0.0000000
No	No	X	0.0000290	0.0000320	0.9071066	0.3643592
No	Yes	(Intercept)	0.0397725	0.0007747	51.3413419	0.0000000
No	Yes	X	-0.2809707	0.0036137	-77.7514571	0.0000000
No	Yes	K	0.4014251	0.0051622	77.7619331	0.0000000
Yes	No	(Intercept)	0.0999709	0.0000320	3126.5572990	0.0000000
Yes	No	X	0.5000290	0.0000320	15639.9309158	0.0000000
Yes	Yes	(Intercept)	0.0397725	0.0007747	51.3413419	0.0000000
Yes	Yes	X	0.1186730	0.0049042	24.1980828	0.0000000
Yes	Yes	K	0.4014251	0.0051622	77.7619331	0.0000000

Now, we can observe why including \(K\) in \(Z\) is problematic. In the first version, the coefficient of \(X\) changes from nearly 0 to -0.28, indicating a significant effect on \(Y\). In our data generation process for this scenario, \(X\) and \(Y\) are independent. In the second version, they are dependent, but conditioning on \(Z\) reverses the effect in the same direction observed in the first version. Hence, including \(K\) in \(Z\) reverses the relationship between a pair of other variables in a causal graph.

Causal Effect Estimation

In the previous section, we used different causal structures but similar functions to generate the data, specifically using linear functions. If the linear regression model is correctly specified, then the estimated coefficients should reflect the data generation process. However, what if we use non-linear functions to generate the data? Can the linear regression model, especially one that strictly involves additive terms, still accurately estimate the causal effect when properly specified?

Causal path with a logical “AND” rule

Below we create a data frame d5 which has a simple causal structure, i.e., two causal paths from either \(X1\) and \(X2\) to \(Y\).

d5 <- data.frame(from = "X1", to = "Y")
d5 <- add_row(d5, data.frame(from = "X2", to = "Y"))

We also define a class of Functions to clarify the arguments required for each function.

d5_functions <- function_from_edge(d5)
print(d5_functions)

## 0 / 3  vertices have functions.
## Please define functions for:
## X1 = function(n)
## X2 = function(n)
## Y = function(X1,X2)

For both \(X1\) and \(X2\), we determine their functions to generate data of 0 or 1 with the length n. Meanwhile, \(Y\) is determine by its function to apply a logical “AND” rule: if \(X1\) and \(X2\) are both equal to 1, then \(Y\) is equal to 1; otherwise, it is 0.

Interpreting the correctness of the estimated coefficient is not as intuitive as when using a linear function. Nonetheless, the logical “AND” rule for generating the data can be represented by the linear function:

\[Y = -0.25 + 0.5 \times X1 + 0.5 \times X2\]

If both \(X1\) and \(X2\) are equal to 0, then \(Y\) equals -0.25, rounded to 0;
If only one of \(X1\) and \(X2\) equals 1, then \(Y\) equals 0.25, rounded to 0; and
If both \(X1\) and \(X2\) equal 1, then \(Y\) equals 0.75, rounded to 1.

We conduct a regression using the generated data to estimate the effects of either \(X1\) or \(X2\) on \(Y\).

function_X1 <- function(n){
  X1 <- sample(c(0, 1), n, replace = T)
  return(X1)
}

function_X2 <- function(n){
  X2 <- sample(c(0, 1), n, replace = T)
  return(X2)
}

function_Y <- function(X1, X2){
  Y <- as.numeric(X1 == 1 & X2 == 1)
  return(Y)
}

d5_functions <- define(d5_functions, which = "X1", what = function_X1)
d5_functions <- define(d5_functions, which = "X2", what = function_X2)
d5_functions <- define(d5_functions, which = "Y", what = function_Y)

set.seed(1)
d5_data <- data_from_function(d5_functions, n = 25000)
d5_reg <- lm(Y ~ X1 + X2, data = d5_data)
d5_results <- tidy(d5_reg)

term	estimate	std.error	statistic
(Intercept)	-0.2504165	0.0027403	-91.38299
X1	0.4989603	0.0031625	157.77597
X2	0.4953746	0.0031623	156.64850

The results accurately show the estimated coefficients for \(X1\) and \(X2\). The interpretation is even more intuitive in this special case if we simply compute \(\hat{Y}\) (Y_hat) to predict \(Y\), where \(Y = \hat{Y} + \epsilon\); thus, \(\epsilon\) is the error of the model in predicting \(Y\).

We create test data which simply includes all the possible combinations of \(X1\) and \(X2\). Then, we compute \(\hat{Y}\) using the test data and the regression model fitted to the generated data.

test_data <- expand.grid(X1 = c(0, 1), X2 = c(0, 1))
d5_test <- mutate(test_data, Y = mapply(function_Y, X1, X2))
d5_test <- mutate(d5_test, Y_hat = predict(d5_reg, newdata = d5_test))
d5_test <- mutate(d5_test, Y_hat = round(Y_hat))

X1	X2	Y	Y_hat
0	0	0	0
1	0	0	0
0	1	0	0
1	1	1	1

As observed, even when using a logical “AND” rule to generate the data, a linear regression model can still accurately estimate the causal effect if the model is correctly specified.

Causal path with a logical “OR” rule

Now, we apply the same approach but using a logical “OR” rule. If either \(X1\) or \(X2\) equals 1, then \(Y\) is set to 1.

d6_functions <- function_from_edge(d5)

function_Y <- function(X1, X2){
  Y <- as.numeric(X1 == 1 | X2 == 1)
  return(Y)
}

d6_functions <- define(d6_functions, which = "X1", what = function_X1)
d6_functions <- define(d6_functions, which = "X2", what = function_X2)
d6_functions <- define(d6_functions, which = "Y", what = function_Y)

set.seed(1)
d6_data <- data_from_function(d6_functions, n = 25000)
d6_reg <- lm(Y ~ X1 + X2, data = d6_data)
d6_results <- tidy(d6_reg)

d6_test <- mutate(test_data, Y = mapply(function_Y, X1, X2))
d6_test <- mutate(d6_test, Y_hat = predict(d6_reg, newdata = d6_test))
d6_test <- mutate(d6_test, Y_hat = round(Y_hat))

term	estimate	std.error	statistic
(Intercept)	0.2504165	0.0027403	91.38299
X1	0.5010397	0.0031625	158.43351
X2	0.5046254	0.0031623	159.57383

As previously described, we can also represent the logical “OR” rule with the linear function:

\[Y = 0.25 + 0.5 \times X1 + 0.5 \times X2\]

If both \(X1\) and \(X2\) are equal to 0, then \(Y\) equals 0.25, rounded to 0;
If only one of \(X1\) and \(X2\) equals 1, then \(Y\) equals 0.75, rounded to 1; and
If both \(X1\) and \(X2\) equal 1, then \(Y\) equals 1.25, rounded to 1.

X1	X2	Y	Y_hat
0	0	0	0
1	0	1	1
0	1	1	1
1	1	1	1

As observed, when using a logical “OR” rule to generate the data, a linear regression model can also accurately estimate the causal effect if the model is correctly specified.

Are there other non-linear functions that cannot be estimated by a regression model, especially one that strictly involves additive terms?

Causal path with a logical “XOR” rule

Yes. A logical “XOR” rule is one of many data-generation processes that cannot be accurately estimated by such a regression model. This rule dictates that \(Y\) is set to 1, only if both \(X1\) and \(X2\) are either 0 or 1 but not both.

d7_functions <- function_from_edge(d5)

function_Y <- function(X1, X2){
  Y <- as.numeric((X1 == 1 & X2 == 0) | (X1 == 0 & X2 == 1))
  return(Y)
}

d7_functions <- define(d7_functions, which = "X1", what = function_X1)
d7_functions <- define(d7_functions, which = "X2", what = function_X2)
d7_functions <- define(d7_functions, which = "Y", what = function_Y)

set.seed(1)
d7_data <- data_from_function(d7_functions, n = 25000)
d7_reg <- lm(Y ~ X1 + X2, data = d7_data)
d7_results <- tidy(d7_reg)

d7_test <- mutate(test_data, Y = mapply(function_Y, X1, X2))
d7_test <- mutate(d7_test, Y_hat = predict(d7_reg, newdata = d7_test))
d7_test <- mutate(d7_test, Y_hat = round(Y_hat))

term	estimate	std.error	statistic	p.value
(Intercept)	0.5008329	0.0054806	91.3829889	0.0000000
X1	0.0020795	0.0063249	0.3287718	0.7423309
X2	0.0092509	0.0063247	1.4626646	0.1435717

Using the estimated coefficients of the regression model, we can formulate the linear function:

\[Y = 0.5 + 0 \times X1 + 0 \times X2\]

However, this function cannot represent a logical “XOR” rule:

If both \(X1\) and \(X2\) are equal to 0, then \(Y\) equals 0.5, which is rounded to 1;
If only one of \(X1\) and \(X2\) equals 1, then \(Y\) remains 0.5, also rounded to 1;
If both \(X1\) and \(X2\) equal 1, then \(Y\) again equals 0.5, rounded to 1.

X1	X2	Y	Y_hat
0	0	0	1
1	0	1	1
0	1	1	1
1	1	0	1

Therefore, a linear regression model cannot accurately estimate the causal effect when the data-generation process involves non-linear functions similar to a logical “XOR” rule, even if the model is correctly specified.

Miscellanous

Measurement Error

In the real world, our data consist primarily of measurements of actual variables, which are often latent variables (i.e., abstract constructs). For example, cognitive ability is often measured by an intelligence quotient (IQ) score. Furthermore, we also categorize IQ scores into several categories, which represent another form of variable: IQ category. First, IQ scoring captures only a portion of a cognitive ability, leading to information loss. When IQ is further categorized, even more information is lost.

We can represent a measured variable as a vertex to which an edge is drawn from another vertex that represents the actual variable. For instance, consider a causal path from \(X\) to \(Y\). Let’s denote their measured variables as \(X'\) and \(Y'\), respectively. Since value of \(X'\) depends on \(X\), and \(Y'\) depends on \(Y\), we draw an edge from \(X\) to \(X'\) and another from \(Y\) to \(Y'\). Therefore, in a causal graph, measurement is simply a form of causal path from an actual to a measured variable.

d8 <- data.frame(from = "X", to = "Y")
d8 <- add_row(d8, data.frame(from = "X", to = "Xm"))
d8 <- add_row(d8, data.frame(from = "UX", to = "Xm"))
d8 <- add_row(d8, data.frame(from = "Y", to = "Ym"))
d8 <- add_row(d8, data.frame(from = "UY", to = "Ym"))

If a measurement error introduced in a set of measured variables depends on the same variable \(U_XY\) (UXY) which is beyond their actual variables, then it is called a dependent measurement error; otherwise, it is independent. The figure below shows independent (left) and dependent (right) measurement errors.

If a measurement error introduced in a measured variable depends on any other actual variables beyond its own actual variable, then it is called differential measurement error; otherwise, it is nondifferential. Hence, the previous measurement errors are specifically: (1) independent, nondifferential; and (2) dependent, nondifferential. The figure below shows differential measurement error. Specifically, it shows an independent, differential measurement error. A measurement error of \(Y'\) depends on \(X\) (left), or a measurement error of \(X'\) depends on \(Y\) (right).

Therefore, we can categorize measurement errors based on their dependencies and differentials into four types:

Independent, nondifferential;
Dependent, nondifferential;
Independent, differential; and
Dependent, differential.

Now, let’s generate data with measured variables. We use the same variables and functions but introduce different versions of errors, ranging from smaller to larger errors.

I <- seq(10)
UX_epsilon <- seq(0.005, 4, length = length(I))
UY_epsilon <- seq(0.005, 8, length = length(I))

d8_results <- list()

for(i in I){
  d8_functions <- function_from_edge(d8)
  
  function_X <- function(n){
    X <- rnorm(n, mean = 0, sd = 1)
    return(X)
  }
  
  function_UX <- function(n){
    UX <- rnorm(n, mean = 0, sd = UX_epsilon[[i]])
    return(UX)
  }
  
  function_Xm <- function(X, UX){
    Xm <- X + UX
    return(Xm)
  }
  
  function_Y <- function(X){
    Y <- 0.5 * X + rnorm(length(X), mean = 0.1, sd = 0.005)
    return(Y)
  }
  
  function_UY <- function(n){
    UY <- rnorm(n, mean = 0, sd = UY_epsilon[[i]])
    return(UY)
  }
  
  function_Ym <- function(Y, UY){
    Ym <- Y + UY
    return(Ym)
  }
  
  d8_functions <- define(d8_functions, which = "X", what = function_X)
  d8_functions <- define(d8_functions, which = "UX", what = function_UX)
  d8_functions <- define(d8_functions, which = "Xm", what = function_Xm)
  d8_functions <- define(d8_functions, which = "Y", what = function_Y)
  d8_functions <- define(d8_functions, which = "UY", what = function_UY)
  d8_functions <- define(d8_functions, which = "Ym", what = function_Ym)
  
  set.seed(1)
  d8_data <- data_from_function(d8_functions, n = 25000)
  d8_reg <- lm(Ym ~ Xm, data = d8_data)
  d8_results[[i]] <- tidy(d8_reg)
}

We observe that larger errors shift the estimated coefficients away from the true values. An error \(\epsilon\) of 2.11 or more leads to the conclusion that the effect of \(X\) on \(Y\) is not significant, even though it should be significant according to the data-generation process.

Missing Value

We can simulate how missing values occur using the same principles as those of measurement error. Missing values introduced in a measured variable depend on its missingness variable \(R\).

There are three types of missing-value mechanisms:

Missing completely at random (MCAR): Missing values depend on neither the actual variable nor other variables. In the figure below, both \(X'\) and \(Y'\) demonstrate MCAR.

d9 <- data.frame(from = "X", to = "Y")
d9 <- add_row(d9, data.frame(from = "X", to = "Xm"))
d9 <- add_row(d9, data.frame(from = "UX", to = "Xm"))
d9 <- add_row(d9, data.frame(from = "RX", to = "Xm"))
d9 <- add_row(d9, data.frame(from = "Y", to = "Ym"))
d9 <- add_row(d9, data.frame(from = "UY", to = "Ym"))
d9 <- add_row(d9, data.frame(from = "RY", to = "Ym"))

Missing at random (MAR): Missing values depend on other variables but not on the actual variable. In the figure below, the missingness variable \(R_X\) (RX) depends on \(Y\), while \(R_Y\) (RY) depends on \(X\). Therefore, both \(X'\) and \(Y'\) are MAR.

Missing not at random (MNAR): Missing values depend on the actual variable, potentially among others. In the figure below, the missingness variable \(R_X\) (RX) depends on \(X\) itself, while \(R_Y\) (RY) depends on \(Y\) itself. Therefore, both \(X'\) and \(Y'\) demonstrate MNAR. This scenario can also includes situations where missingness is influenced by multiple variables. For instance, the figure below shows that the missingness variable \(R_Y\) (RY) depends on both \(X\) and \(Y\). This would still classify as MNAR.

Now, let’s generate data with measured variables where each also depends on its missingness variable \(R\). We use the same variables, functions, and errors but introduce different versions of missingness, ranging from lower to higher proportions of missing values.

I <- seq(10)
RX_p <- seq(0, 0.95, length = length(I))
RX_p <- mapply(c, 1 - RX_p, RX_p)
RY_p <- seq(0, 0.95, length = length(I))
RY_p <- mapply(c, 1 - RY_p, RY_p)

d9_results <- list()

for(i in I){
  d9_functions <- function_from_edge(d9)
  
  function_X <- function(n){
    X <- rnorm(n, mean = 0, sd = 1)
    return(X)
  }
  
  function_UX <- function(n){
    UX <- rnorm(n, mean = 0, sd = 0.01)
    return(UX)
  }
  
  function_RX <- function(n){
    RX <- sample(c(0, 1), size = n, replace = TRUE, prob = RX_p[, i])
    return(RX)
  }
  
  function_Xm <- function(X, UX, RX){
    Xm <- ifelse(RX == 1, NA, X + UX)
    return(Xm)
  }
  
  function_Y <- function(X){
    Y <- 0.5 * X + rnorm(length(X), mean = 0.1, sd = 0.005)
    return(Y)
  }
  
  function_UY <- function(n){
    UY <- rnorm(n, mean = 0, sd = 0.01)
    return(UY)
  }
  
  function_RY <- function(n){
    RY <- sample(c(0, 1), size = n, replace = TRUE, prob = RY_p[, i])
    return(RY)
  }
  
  function_Ym <- function(Y, UY, RY){
    Ym <- ifelse(RY == 1, NA, Y + UY)
    return(Ym)
  }
  
  d9_functions <- define(d9_functions, which = "X", what = function_X)
  d9_functions <- define(d9_functions, which = "UX", what = function_UX)
  d9_functions <- define(d9_functions, which = "RX", what = function_RX)
  d9_functions <- define(d9_functions, which = "Xm", what = function_Xm)
  d9_functions <- define(d9_functions, which = "Y", what = function_Y)
  d9_functions <- define(d9_functions, which = "UY", what = function_UY)
  d9_functions <- define(d9_functions, which = "RY", what = function_RY)
  d9_functions <- define(d9_functions, which = "Ym", what = function_Ym)
  
  set.seed(1)
  d9_data <- data_from_function(d9_functions, n = 25000)
  d9_reg <- lm(Ym ~ Xm, data = d9_data)
  d9_results[[i]] <- tidy(d9_reg)
}

We observe that higher proportions of missing values shift the estimated coefficients away from the true values. A missing proportion \(p\) of approximately 0.50 or more results in a substantially incorrect estimate for the effect of \(X\) on \(Y\), although the results remain statistically significant in cases of MCAR without missing value imputation.

Time-Varying Causation

The value of a variable at time \(t\) may determine that of the same variable at time \(t+1\). This variable may be a cause (or an exposure/intervention), an effect (or an outcome/competing risk), or a covariate (or a mediator/confounder/collider). A cause-effect relationship that involves such properties is called time-varying causation. The figure below demonstrates both time-varying exposure and outcome. Since a standard causal graph should be a directed acyclic graph (DAG), we represent the same variable at different times as multiple vertices.

To simulate time-varying causation, first, we need to generate data at the initial time \(t=0\). However, time-varying causation must be represented as a directed cyclic graph (DCG), in which loops are allowed. Nevertheless, when generating data at the first step, any loops must be avoided. Hence, any edge from and to the same vertex is not included yet.

d10 <- data.frame(from = "X", to = "Y")

d10_functions <- function_from_edge(d10)

function_X <- function(n){
  X <- rnorm(n, mean = 0, sd = 1)
  X <- abs(X)
  return(X)
}

function_Y <- function(X){
  Y <- 0.5 * X + rnorm(length(X), mean = 0.1, sd = 0.005)
  return(Y)
}

d10_functions <- define(d10_functions, which = "X", what = function_X)
d10_functions <- define(d10_functions, which = "Y", what = function_Y)

d10_data <- data_from_function(d10_functions, n = 10000)

We must note that rcausim uses looping to simulate time-varying causation without differentiating the names of a variable at different times. For example, the same name \(X\) is used for a variable \(X\) at time \(t\) and \(t+1\); hence, the causal graph will look like the figure below in the context of rcausim. In this example, we use time-varying outcome, where the value of \(Y\) at time \(t\) also determines the value of the same variable at time \(t+1\) in addition to the values of variable \(X\).

Therefore, before generating the data for \(t>0\), we need to add edges from and to the same variables. The function is also modified to include the variable itself as one of the arguments. The class of Functions should be redefined with the modified function.

Since this step assumes we already generated the data at \(t=0\), we need to use the output from the previous step, i.e., d10_data, as the input in this step.

In simulating time-varying causation, we also need to determine a maximum time \(t_{max}\) for each instance to be simulated. The T_max argument facilitates this feature.

d10 <- add_row(d10, data.frame(from = "Y", to = "Y"))

function_Y <- function(X, Y){
  Y <- 0.5 * X + 1.1 * Y
  return(Y)
}

d10_functions <- define(d10_functions, which = "Y", what = function_Y)

set.seed(1)
d10_T_max <- rpois(nrow(d10_data), lambda = 25)
d10_data <- time_varying(d10_functions, data = d10_data, T_max = d10_T_max)

The figure above shows the values of \(X\) and \(Y\) at different times for randomly-sampled instances (top) and their time-wise values summarized from all instances (bottom). We can observe the values of \(Y\) changing over time for each instance, while the values of \(X\) do not. The summary values of \(X\) vary at different times due to inter-instance variability each time.

Bidirectional Causation

Two variables may affect each other simultaneously or in a feedback loop. For example, the value of \(Y\) depends on the value of \(X\), and concurrently, the value of \(X\) also depends on the value of \(Y\). This inter-dependency is known as bidirectional causation.

In practical scenarios, however, even if two variables affect each other, we model their relationship at discrete time steps. Therefore, we can depict bidirectional causation in a standard causal graph which uses a DAG, representing the variables at different times \(t\) and \(t+1\), as shown below.

To simulate bidirectional causation using the time_varying function in rcausim, we first generate data at \(t=0\). This initial step requires a DAG, avoiding any loops.

d11 <- data.frame(from = "X", to = "Y")

d11_functions <- function_from_edge(d11)

function_X <- function(n){
  X <- rnorm(n, mean = 0, sd = 1)
  return(X)
}

function_Y <- function(X){
  Y <- 0.5 * X + rnorm(length(X), mean = 0, sd = 0.00001)
  return(Y)
}

d11_functions <- define(d11_functions, which = "X", what = function_X)
d11_functions <- define(d11_functions, which = "Y", what = function_Y)

d11_data <- data_from_function(d11_functions, n = 10000)

After generating initial data, subsequent time points involve adding reciprocal edges to model the ongoing interaction between \(X\) and \(Y\). In this example, we add an edge from the parent vertex of \(X\)’s vertex, which represents \(Y\). As described in the previous section, we need to provide the initial data d11_data and the maximum times d11_T_max.

d11 <- add_row(d11, data.frame(from = "Y", to = "X"))

function_X <- function(Y){
  X <- 2 * Y + rnorm(length(Y), mean = 0, sd = 0.00001)
  return(X)
}

d11_functions <- define(d11_functions, which = "X", what = function_X)

set.seed(1)
d11_T_max <- rpois(nrow(d11_data), lambda = 25)
d11_data <- time_varying(d11_functions, data = d11_data, T_max = d11_T_max)

The figure above demonstrates the values of \(X\) and \(Y\) over time for randomly-sampled instances (top) and summarizes their time-wise values across all instances (bottom). As time-varying causation, both variables exhibit changes over time, reflecting the bidirectional causation they exert on each other.