To generate causally-simulated data effectively, it is essential to
understand how to represent cause-and-effect relationships using causal
graphs. We use the igraph
package for creating and
manipulating network structures and the ggnetwork
package
for visualizing these networks within the ggplot2 framework of the
tidyverse
package. Below is how to load these libraries in
R:
library(igraph)
library(ggnetwork)
library(tidyverse)
We will use \(X\) and \(Y\) respectively to denote a cause and an effect. Both are represented as vertices in a graph. If there is a cause-and-effect relationship between both, then an edge is drawn from \(X\) to \(Y\). To do this, we need a 2-column data frame: (1) from, and (2) to.
# Create a data frame for the X and Y relationship
d <- data.frame(from = "X", to = "Y")
from | to |
---|---|
X | Y |
We convert the data frame into an igraph
object as a
directed graph.
# Convert the data frame into an igraph object.
g <- graph_from_data_frame(d, directed = TRUE)
print(g)
## IGRAPH ae95e2f DN-- 2 1 --
## + attr: name (v/c)
## + edge from ae95e2f (vertex names):
## [1] X->Y
To visualize the graph, we need to automatically determine the coordinates to plot \(X\) and \(Y\) and draw a line from \(X\) to \(Y\).
# Lay out the graph as a tree
g_layout <- layout_as_tree(g)
# Determine the coordinates with ggnetwork
set.seed(1)
g_coord <- ggnetwork(g, layout = g_layout)
x | y | name | xend | yend | |
---|---|---|---|---|---|
2 | 0.5 | 1 | X | 0.5 | 0.025 |
1 | 0.5 | 1 | X | 0.5 | 1.000 |
21 | 0.5 | 0 | Y | 0.5 | 0.000 |
To draw the graph, we utilize ggplot
for its flexibility
in customizing graph aesthetics. Use closed and curved arrows to clearly
indicate the direction of the causal effect and helps prevent
overlapping of arrows.
We may need to make the tree layout horizontal. This orientation helps in visualizing the causal flow from left to right, making it easier to follow.
# Define the plot area
g_plot <- ggplot(g_coord, aes(x, y, xend = xend, yend = yend))
# Draw edges with closed, curved arrows to emphasize direction
g_plot <- g_plot + geom_edges(arrow = arrow(type = "closed"), curvature = 0.15)
# Add node labels
g_plot <- g_plot + geom_nodelabel(aes(label = name))
# Make the tree layout horizontal
g_plot <- g_plot + coord_flip()
g_plot <- g_plot + scale_y_reverse()
# Apply a minimal theme
g_plot <- g_plot + theme_void()
# Display the graph
print(g_plot)
A variable is represented as a vertex, and a path is represented as 1 or more edges. Two variables may have a path between them. It means \(X\) and \(Y\) are dependent.
It is also possible for two variables to have no path between them. It means \(X\) and \(Y\) are independent.
If two variables have a path between them, there are four possible paths. First, a path consists of a single edge, which we call a causal path.
Second, a path consists of more than one edges with 1 or more mediators, which we call a mediator path. A mediator is connected by two edges: one from \(X\) to \(M\) and another from \(M\) to \(Y\).
Third, a path consists of more than one edges with 1 or more confounders, which is called a confounder path. A confounder is connected by two edges: one to \(X\) and another to \(Y\).
Fourth, a path consists of more than one edges with 1 or more colliders, i.e., a collider path. A collider is connected by two edges coming from \(X\) and \(Y\) to \(K\).
We will empirically test these theoretical distinctions by generating
causally-simulated data and conducting regression analysis. For this, we
utilize the rcausim
package for data generation and the
broom
package to tidy up the results from our correlation
analysis.
library(rcausim)
library(broom)
We start by defining a Functions
class based on the
specified causal structure in our previous data frame d
.
This class will help specify the required arguments for each variable’s
function.
path1_functions <- function_from_edge(d)
print(path1_functions)
## 0 / 2 vertices have functions.
## Please define functions for:
## X = function(n)
## Y = function(X)
Next, we define a function for \(X\)
that generates a normally distributed numeric vector of length
n
.
function_X <- function(n){
X <- rnorm(n, mean = 0, sd = 1)
return(X)
}
For Y, we create a linear function that is dependent on
X
for \(Y\).
function_Y <- function(X){
Y <- 0.5 * X + rnorm(length(X), mean = 0.1, sd = 0.005)
return(Y)
}
We now define path1_functions
using
function_X
and function_Y
.
path1_functions <- define(path1_functions, which = 'X', what = function_X)
path1_functions <- define(path1_functions, which = 'Y', what = function_Y)
print(path1_functions)
## 2 / 2 vertices have functions.
Generate data using path1_functions
.
set.seed(1)
path1_data <- data_from_function(path1_functions, n = 25000)
X | Y |
---|---|
-0.6264538 | -0.2122441 |
0.1836433 | 0.1897219 |
-0.8356286 | -0.3119980 |
1.5952808 | 0.8956116 |
0.3295078 | 0.2684744 |
-0.8204684 | -0.3078511 |
Finally, we perform a regression to infer the correlation between \(X\) and \(Y\).
path1_reg <- lm(Y ~ X, data = path1_data)
path1_results <- tidy(path1_reg)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0999709 | 3.2e-05 | 3126.557 | 0 |
X | 0.5000290 | 3.2e-05 | 15639.931 | 0 |
This output should show the estimated coefficients, their standard errors, \(z\)-values, and \(p\)-values. Results shows that \(X\) has a significant effect on \(Y\) (\(p\)-value <0.05). The estimated coefficients reflect the data generation process, providing a consistency check for our simulation.
For the mediator path involving a mediator \(M\), we create a new data frame
d2
. This data frame specifies the edges from \(X\) to \(M\) and then from \(M\) to \(Y\).
d2 <- data.frame(from = "X", to = "M")
d2 <- add_row(d2, data.frame(from = "M", to = "Y"))
from | to |
---|---|
X | M |
M | Y |
A new class of Functions
is defined using
d2
to establish a path which involves a mediator. This step
helps clarify the arguments in each function of \(X\), \(M\), and \(Y\).
path2_functions <- function_from_edge(d2)
print(path2_functions)
## 0 / 3 vertices have functions.
## Please define functions for:
## X = function(n)
## M = function(X)
## Y = function(M)
We reuse the same function for \(X\), where \(X\) is generated from a normal distribution.
function_X <- function(n){
X <- rnorm(n, mean = 0, sd = 1)
return(X)
}
We also define a function for \(M\), which is directly influenced by \(X\).
function_M <- function(X){
M <- 0.7 * X + rnorm(length(X), mean = 0.3, sd = 0.005)
return(M)
}
Finally, we define a function for \(Y\) as a linear function of \(M\).
function_Y <- function(M){
Y <- 0.5 * M + rnorm(length(M), mean = 0.1, sd = 0.005)
return(Y)
}
We define the path2_functions
before generating
data.
path2_functions <- define(path2_functions, which = 'X', what = function_X)
path2_functions <- define(path2_functions, which = 'M', what = function_M)
path2_functions <- define(path2_functions, which = 'Y', what = function_Y)
print(path2_functions)
## 3 / 3 vertices have functions.
With path2_functions
, we generate data that represent
the mediator path.
set.seed(1)
path2_data <- data_from_function(path2_functions, n = 25000)
X | M | Y |
---|---|---|
-0.6264538 | -0.1375349 | 0.0338620 |
0.1836433 | 0.4264506 | 0.3107876 |
-0.8356286 | -0.2791237 | -0.0338706 |
1.5952808 | 1.4146678 | 0.8134096 |
0.3295078 | 0.5343759 | 0.3650638 |
-0.8204684 | -0.2719448 | -0.0432266 |
Another regression is performed to infer the correlation between \(X\) and \(Y\).
path2_reg <- lm(Y ~ X, data = path2_data)
path2_results <- tidy(path2_reg)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.2499616 | 3.52e-05 | 7110.367 | 0 |
X | 0.3500053 | 3.52e-05 | 9957.264 | 0 |
In our data generation process, \(M\) is defined as \(0.7 \times X + 0.3\), and \(Y\) as \(0.5 \times M + 0.1\); therefore, \(Y(X)\) equals \(0.5 \times (0.7 \times X + 0.3) + 0.1\). Solving this equation, we find that the estimated coefficients reflect the data generation process.
In the third path, both \(X\) and
\(Y\) are influenced by \(C\), as shown in d3
below.
d3 <- data.frame(from = "C", to = "X")
d3 <- add_row(d3, data.frame(from = "C", to = "Y"))
from | to |
---|---|
C | X |
C | Y |
We clarify the arguments in the functions for each variable based on
the data frame d3
.
path3_functions <- function_from_edge(d3)
print(path3_functions)
## 0 / 3 vertices have functions.
## Please define functions for:
## C = function(n)
## X = function(C)
## Y = function(C)
We define a normal distribution for \(C\).
function_C <- function(n){
C <- rnorm(n, mean = 0, sd = 1)
return(C)
}
The value of \(X\) depends on \(C\).
function_X <- function(C){
X <- 0.7 * C + rnorm(length(C), mean = 0.3, sd = 0.005)
return(X)
}
Similarly, \(Y\) also depends on \(C\).
function_Y <- function(C){
Y <- 0.5 * C + rnorm(length(C), mean = 0.1, sd = 0.005)
return(Y)
}
The functions are connected into path3_functions
path3_functions <- define(path3_functions, which = 'C', what = function_C)
path3_functions <- define(path3_functions, which = 'X', what = function_X)
path3_functions <- define(path3_functions, which = 'Y', what = function_Y)
print(path3_functions)
## 3 / 3 vertices have functions.
Data is generated using path3_functions
.
set.seed(1)
path3_data <- data_from_function(path3_functions, n = 25000)
C | X | Y |
---|---|---|
-0.6264538 | -0.1375349 | -0.2105975 |
0.1836433 | 0.4264506 | 0.1893839 |
-0.8356286 | -0.2791237 | -0.3121231 |
1.5952808 | 1.4146678 | 0.9037161 |
0.3295078 | 0.5343759 | 0.2626297 |
-0.8204684 | -0.2719448 | -0.3174884 |
A regression examines the correlation between \(X\) and \(Y\).
path3_reg <- lm(Y ~ X, data = path3_data)
path3_results <- tidy(path3_reg)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.1142645 | 4.27e-05 | -2676.865 | 0 |
X | 0.7142049 | 5.60e-05 | 12748.446 | 0 |
Results indicate a statistically significant correlation between \(X\) and \(Y\), although there is no edge directly from \(X\) to \(Y\). Notably, the coefficients are not identical to those used in the data generation process.
In this path, both \(X\) and \(Y\) both influence \(K\) but do not influence each other. This
scenario is illustrated by d4
.
d4 <- data.frame(from = "X", to = "K")
d4 <- add_row(d4, data.frame(from = "Y", to = "K"))
from | to |
---|---|
X | K |
Y | K |
A class of Functions
is created to clarify the arguments
in the functions for \(X\), \(Y\), and \(K\).
path4_functions <- function_from_edge(d4)
print(path4_functions)
## 0 / 3 vertices have functions.
## Please define functions for:
## X = function(n)
## Y = function(n)
## K = function(X,Y)
The function for \(X\) determines its values to be normally-distributed.
Both \(X\) and \(Y\) are independently defined with normally distributed values.
function_X <- function(n){
X <- rnorm(n, mean = 0, sd = 1)
return(X)
}
function_Y <- function(n){
Y <- rnorm(n, mean = 0.1, sd = 0.005)
return(Y)
}
The value of \(K\) depends on both \(X\) and \(Y\).
function_K <- function(X, Y){
K <- 0.7 * X + 0.5 * Y + rnorm(length(X), mean = 0.1, sd = 0.005)
return(K)
}
Connect all functions into path4_functions
.
path4_functions <- define(path4_functions, which = 'X', what = function_X)
path4_functions <- define(path4_functions, which = 'Y', what = function_Y)
path4_functions <- define(path4_functions, which = 'K', what = function_K)
print(path4_functions)
## 3 / 3 vertices have functions.
Finally, data are generated using path4_functions
set.seed(1)
path4_data <- data_from_function(path4_functions, n = 25000)
X | Y | K |
---|---|---|
-0.6264538 | 0.1009828 | -0.2853968 |
0.1836433 | 0.0979003 | 0.2750627 |
-0.8356286 | 0.1058163 | -0.4263406 |
1.5952808 | 0.0979712 | 1.2717578 |
0.3295078 | 0.1037205 | 0.3803915 |
-0.8204684 | 0.1023831 | -0.4303905 |
We then evaluate the correlation between \(X\) and \(Y\) by conducting a regression.
path4_reg <- lm(Y ~ X, data = path4_data)
path4_results <- tidy(path4_reg)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0999709 | 3.2e-05 | 3126.5572990 | 0.0000000 |
X | 0.0000290 | 3.2e-05 | 0.9071066 | 0.3643592 |
Results shows no significant correlation between X and Y, although there is a path between them.
Among the four paths, only causal and mediator paths represent causal relationships, while the others are non-causal. We can guess the causal structure based on these properties using directional (\(d\))-separation. It is a criterion used to determine whether a set of variables, \(Z\), provides conditional independence between the pair of other variables (i.e., \(X\) and \(Y\)) within a causal graph.
Specifically, we need to identify if \(X\) and \(Y\) are dependent, then we condition them on \(Z\). If \(X\) and \(Y\) is conditionally independent, then there is no causal path between them. The rules of \(d\)-separation are:
Notably, when determining causal relationships, we include \(M\) in \(Z\) only if we aim to identify if there are any causal paths from \(X\) to \(Y\) that do not pass through \(M\).
Now, we create additional versions of the mediator, confounder, and collider paths, each including the causal path. These versions, without and with a causal path, represent scenarios in which the null hypothesis and the alternative hypothesis are respectively accepted.
d2_with_d <- add_row(d2, d)
d3_with_d <- add_row(d3, d)
d4_with_d <- add_row(d4, d)
Below we outline the data generation process in which the null hypothesis (left) and the alternative hypothesis (right) are accepted.
A regression was already conducted for the mediator path without a
causal path, identified as path2_results
. We need to apply
the same analysis to the version with a causal path.
path2_d_functions <- function_from_edge(d2_with_d)
print(path2_d_functions)
## 0 / 3 vertices have functions.
## Please define functions for:
## X = function(n)
## M = function(X)
## Y = function(M,X)
function_X <- function(n){
X <- rnorm(n, mean = 0, sd = 1)
return(X)
}
function_M <- function(X){
M <- 0.7 * X + rnorm(length(X), mean = 0.3, sd = 0.005)
return(M)
}
function_Y <- function(X, M){
Y <- 0.25 * X + 0.5 * M + rnorm(length(X), mean = 0.1, sd = 0.005)
return(Y)
}
path2_d_functions <- define(path2_d_functions, which = "X", what = function_X)
path2_d_functions <- define(path2_d_functions, which = "M", what = function_M)
path2_d_functions <- define(path2_d_functions, which = "Y", what = function_Y)
set.seed(1)
path2_d_data <- data_from_function(path2_d_functions, n = 25000)
path2_d_reg <- lm(Y ~ X, data = path2_d_data)
path2_d_results <- tidy(path2_d_reg)
We condition each version on \(Z\). In this case, it includes \(M\) to identify any causal paths from \(X\) to \(Y\) that do not pass through \(M\).
path2_cond_reg <- lm(Y ~ X + M, data = path2_data)
path2_cond_results <- tidy(path2_cond_reg)
path2_cond_d_reg <- lm(Y ~ X + M, data = path2_d_data)
path2_cond_d_results <- tidy(path2_cond_d_reg)
causal_path | conditioned | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|---|
No | No | (Intercept) | 0.2499616 | 0.0000352 | 7110.367085 | 0.0000000 |
No | No | X | 0.3500053 | 0.0000352 | 9957.263573 | 0.0000000 |
No | Yes | (Intercept) | 0.1044050 | 0.0018721 | 55.769205 | 0.0000000 |
No | Yes | X | 0.0103260 | 0.0043683 | 2.363846 | 0.0180939 |
No | Yes | M | 0.4852360 | 0.0062400 | 77.761933 | 0.0000000 |
Yes | No | (Intercept) | 0.2499616 | 0.0000352 | 7110.367085 | 0.0000000 |
Yes | No | X | 0.6000053 | 0.0000352 | 17069.487695 | 0.0000000 |
Yes | Yes | (Intercept) | 0.1044050 | 0.0018721 | 55.769205 | 0.0000000 |
Yes | Yes | X | 0.2603260 | 0.0043683 | 59.594236 | 0.0000000 |
Yes | Yes | M | 0.4852360 | 0.0062400 | 77.761933 | 0.0000000 |
The regression results show that the coefficient of \(X\) is reduced from approximately 0.35 to nearly 0 when we condition on \(Z\) (which includes the mediator \(M\)) in the first version. When conditioning the second version on \(Z\), the coefficient of \(X\) is also reduced, from about 0.60 to 0.26; however, the effect of \(X\) on \(Y\) remains significant. This indicates that in the first scenario, conditioning on \(M\) successfully blocks the path from \(X\) to \(Y\), supporting the hypothesis of \(M\) being a complete mediator. In contrast, in the second scenario, \(M\) does not completely mediate the relationship between \(X\) and \(Y\), as \(X\) still has a direct effect on \(Y\).
We now apply the data generation process similarly, but in this instance, \(Z\) includes \(C\), as illustrated below.
In addition to path3_results
for the first version, we
need to conduct a regression analysis for the second version.
path3_d_functions <- function_from_edge(d3_with_d)
print(path3_d_functions)
## 0 / 3 vertices have functions.
## Please define functions for:
## C = function(n)
## X = function(C)
## Y = function(C,X)
function_C <- function(n){
X <- rnorm(n, mean = 0, sd = 1)
return(X)
}
function_X <- function(C){
X <- 0.7 * C + rnorm(length(C), mean = 0.3, sd = 0.005)
return(X)
}
function_Y <- function(X, C){
Y <- 0.25 * X + 0.5 * C + rnorm(length(C), mean = 0.1, sd = 0.005)
return(Y)
}
path3_d_functions <- define(path3_d_functions, which = "C", what = function_C)
path3_d_functions <- define(path3_d_functions, which = "X", what = function_X)
path3_d_functions <- define(path3_d_functions, which = "Y", what = function_Y)
set.seed(1)
path3_d_data <- data_from_function(path3_d_functions, n = 25000)
path3_d_reg <- lm(Y ~ X, data = path3_d_data)
path3_d_results <- tidy(path3_d_reg)
Both versions are conditioned on \(Z\) to determine any causal paths from \(X\) to \(Y\) that can be identified when the non-causal path involving \(C\) is blocked.
path3_cond_reg <- lm(Y ~ X + C, data = path3_data)
path3_cond_results <- tidy(path3_cond_reg)
path3_cond_d_reg <- lm(Y ~ X + C, data = path3_d_data)
path3_cond_d_results <- tidy(path3_cond_d_reg)
causal_path | conditioned | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|---|
No | No | (Intercept) | -0.1142645 | 0.0000427 | -2676.865003 | 0.000000 |
No | No | X | 0.7142049 | 0.0000560 | 12748.446459 | 0.000000 |
No | Yes | (Intercept) | 0.1044050 | 0.0018721 | 55.769205 | 0.000000 |
No | Yes | X | -0.0147640 | 0.0062400 | -2.366022 | 0.017988 |
No | Yes | C | 0.5103260 | 0.0043683 | 116.824627 | 0.000000 |
Yes | No | (Intercept) | -0.1142645 | 0.0000427 | -2676.865003 | 0.000000 |
Yes | No | X | 0.9642049 | 0.0000560 | 17210.907481 | 0.000000 |
Yes | Yes | (Intercept) | 0.1044050 | 0.0018721 | 55.769205 | 0.000000 |
Yes | Yes | X | 0.2352360 | 0.0062400 | 37.697956 | 0.000000 |
Yes | Yes | C | 0.5103260 | 0.0043683 | 116.824627 | 0.000000 |
For the first version, the coefficient of \(X\) is significantly reduced from approximately 0.7 to nearly 0 when we condition on \(Z\) (which includes the confounder \(C\)). In the second version, the coefficient is reduced from 0.96 to 0.24; however, \(X\) still exerts a significant effect on \(Y\).
Unlike the mediator and confounder paths, the collider path exhibits an opposite result when we condition the effect of \(X\) on \(Y\).
We proceed with a similar analysis.
path4_d_functions <- function_from_edge(d4_with_d)
print(path4_d_functions)
## 0 / 3 vertices have functions.
## Please define functions for:
## X = function(n)
## Y = function(X)
## K = function(X,Y)
function_X <- function(n){
X <- rnorm(n, mean = 0, sd = 1)
return(X)
}
function_Y <- function(X){
Y <- 0.5 * X + rnorm(length(X), mean = 0.1, sd = 0.005)
return(Y)
}
function_K <- function(X, Y){
K <- 0.7 * X + 0.5 * Y + rnorm(length(X), mean = 0.1, sd = 0.005)
return(K)
}
path4_d_functions <- define(path4_d_functions, which = "X", what = function_X)
path4_d_functions <- define(path4_d_functions, which = "Y", what = function_Y)
path4_d_functions <- define(path4_d_functions, which = "K", what = function_K)
set.seed(1)
path4_d_data <- data_from_function(path4_d_functions, n = 25000)
path4_d_reg <- lm(Y ~ X, data = path4_d_data)
path4_d_results <- tidy(path4_d_reg)
The effect of \(X\) on \(Y\) is also conditioned on \(Z\) which includes \(K\). Notably, this approach violates the rules of \(d\)-separation.
path4_cond_reg <- lm(Y ~ X + K, data = path4_data)
path4_cond_results <- tidy(path4_cond_reg)
path4_cond_d_reg <- lm(Y ~ X + K, data = path4_d_data)
path4_cond_d_results <- tidy(path4_cond_d_reg)
causal_path | conditioned | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|---|
No | No | (Intercept) | 0.0999709 | 0.0000320 | 3126.5572990 | 0.0000000 |
No | No | X | 0.0000290 | 0.0000320 | 0.9071066 | 0.3643592 |
No | Yes | (Intercept) | 0.0397725 | 0.0007747 | 51.3413419 | 0.0000000 |
No | Yes | X | -0.2809707 | 0.0036137 | -77.7514571 | 0.0000000 |
No | Yes | K | 0.4014251 | 0.0051622 | 77.7619331 | 0.0000000 |
Yes | No | (Intercept) | 0.0999709 | 0.0000320 | 3126.5572990 | 0.0000000 |
Yes | No | X | 0.5000290 | 0.0000320 | 15639.9309158 | 0.0000000 |
Yes | Yes | (Intercept) | 0.0397725 | 0.0007747 | 51.3413419 | 0.0000000 |
Yes | Yes | X | 0.1186730 | 0.0049042 | 24.1980828 | 0.0000000 |
Yes | Yes | K | 0.4014251 | 0.0051622 | 77.7619331 | 0.0000000 |
Now, we can observe why including \(K\) in \(Z\) is problematic. In the first version, the coefficient of \(X\) changes from nearly 0 to -0.28, indicating a significant effect on \(Y\). In our data generation process for this scenario, \(X\) and \(Y\) are independent. In the second version, they are dependent, but conditioning on \(Z\) reverses the effect in the same direction observed in the first version. Hence, including \(K\) in \(Z\) reverses the relationship between a pair of other variables in a causal graph.
In the previous section, we used different causal structures but similar functions to generate the data, specifically using linear functions. If the linear regression model is correctly specified, then the estimated coefficients should reflect the data generation process. However, what if we use non-linear functions to generate the data? Can the linear regression model, especially one that strictly involves additive terms, still accurately estimate the causal effect when properly specified?
Below we create a data frame d5
which has a simple
causal structure, i.e., two causal paths from either \(X1\) and \(X2\) to \(Y\).
d5 <- data.frame(from = "X1", to = "Y")
d5 <- add_row(d5, data.frame(from = "X2", to = "Y"))
We also define a class of Functions
to clarify the
arguments required for each function.
d5_functions <- function_from_edge(d5)
print(d5_functions)
## 0 / 3 vertices have functions.
## Please define functions for:
## X1 = function(n)
## X2 = function(n)
## Y = function(X1,X2)
For both \(X1\) and \(X2\), we determine their functions to
generate data of 0 or 1 with the length n
. Meanwhile, \(Y\) is determine by its function to apply a
logical “AND” rule: if \(X1\) and \(X2\) are both equal to 1, then \(Y\) is equal to 1; otherwise, it is 0.
Interpreting the correctness of the estimated coefficient is not as intuitive as when using a linear function. Nonetheless, the logical “AND” rule for generating the data can be represented by the linear function:
\[Y = -0.25 + 0.5 \times X1 + 0.5 \times X2\]
We conduct a regression using the generated data to estimate the effects of either \(X1\) or \(X2\) on \(Y\).
function_X1 <- function(n){
X1 <- sample(c(0, 1), n, replace = T)
return(X1)
}
function_X2 <- function(n){
X2 <- sample(c(0, 1), n, replace = T)
return(X2)
}
function_Y <- function(X1, X2){
Y <- as.numeric(X1 == 1 & X2 == 1)
return(Y)
}
d5_functions <- define(d5_functions, which = "X1", what = function_X1)
d5_functions <- define(d5_functions, which = "X2", what = function_X2)
d5_functions <- define(d5_functions, which = "Y", what = function_Y)
set.seed(1)
d5_data <- data_from_function(d5_functions, n = 25000)
d5_reg <- lm(Y ~ X1 + X2, data = d5_data)
d5_results <- tidy(d5_reg)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.2504165 | 0.0027403 | -91.38299 | 0 |
X1 | 0.4989603 | 0.0031625 | 157.77597 | 0 |
X2 | 0.4953746 | 0.0031623 | 156.64850 | 0 |
The results accurately show the estimated coefficients for \(X1\) and \(X2\). The interpretation is even more intuitive in this special case if we simply compute \(\hat{Y}\) (Y_hat) to predict \(Y\), where \(Y = \hat{Y} + \epsilon\); thus, \(\epsilon\) is the error of the model in predicting \(Y\).
We create test data which simply includes all the possible combinations of \(X1\) and \(X2\). Then, we compute \(\hat{Y}\) using the test data and the regression model fitted to the generated data.
test_data <- expand.grid(X1 = c(0, 1), X2 = c(0, 1))
d5_test <- mutate(test_data, Y = mapply(function_Y, X1, X2))
d5_test <- mutate(d5_test, Y_hat = predict(d5_reg, newdata = d5_test))
d5_test <- mutate(d5_test, Y_hat = round(Y_hat))
X1 | X2 | Y | Y_hat |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
1 | 1 | 1 | 1 |
As observed, even when using a logical “AND” rule to generate the data, a linear regression model can still accurately estimate the causal effect if the model is correctly specified.
Now, we apply the same approach but using a logical “OR” rule. If either \(X1\) or \(X2\) equals 1, then \(Y\) is set to 1.
d6_functions <- function_from_edge(d5)
function_Y <- function(X1, X2){
Y <- as.numeric(X1 == 1 | X2 == 1)
return(Y)
}
d6_functions <- define(d6_functions, which = "X1", what = function_X1)
d6_functions <- define(d6_functions, which = "X2", what = function_X2)
d6_functions <- define(d6_functions, which = "Y", what = function_Y)
set.seed(1)
d6_data <- data_from_function(d6_functions, n = 25000)
d6_reg <- lm(Y ~ X1 + X2, data = d6_data)
d6_results <- tidy(d6_reg)
d6_test <- mutate(test_data, Y = mapply(function_Y, X1, X2))
d6_test <- mutate(d6_test, Y_hat = predict(d6_reg, newdata = d6_test))
d6_test <- mutate(d6_test, Y_hat = round(Y_hat))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.2504165 | 0.0027403 | 91.38299 | 0 |
X1 | 0.5010397 | 0.0031625 | 158.43351 | 0 |
X2 | 0.5046254 | 0.0031623 | 159.57383 | 0 |
As previously described, we can also represent the logical “OR” rule with the linear function:
\[Y = 0.25 + 0.5 \times X1 + 0.5 \times X2\]
X1 | X2 | Y | Y_hat |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 0 | 1 | 1 |
0 | 1 | 1 | 1 |
1 | 1 | 1 | 1 |
As observed, when using a logical “OR” rule to generate the data, a linear regression model can also accurately estimate the causal effect if the model is correctly specified.
Are there other non-linear functions that cannot be estimated by a regression model, especially one that strictly involves additive terms?
Yes. A logical “XOR” rule is one of many data-generation processes that cannot be accurately estimated by such a regression model. This rule dictates that \(Y\) is set to 1, only if both \(X1\) and \(X2\) are either 0 or 1 but not both.
d7_functions <- function_from_edge(d5)
function_Y <- function(X1, X2){
Y <- as.numeric((X1 == 1 & X2 == 0) | (X1 == 0 & X2 == 1))
return(Y)
}
d7_functions <- define(d7_functions, which = "X1", what = function_X1)
d7_functions <- define(d7_functions, which = "X2", what = function_X2)
d7_functions <- define(d7_functions, which = "Y", what = function_Y)
set.seed(1)
d7_data <- data_from_function(d7_functions, n = 25000)
d7_reg <- lm(Y ~ X1 + X2, data = d7_data)
d7_results <- tidy(d7_reg)
d7_test <- mutate(test_data, Y = mapply(function_Y, X1, X2))
d7_test <- mutate(d7_test, Y_hat = predict(d7_reg, newdata = d7_test))
d7_test <- mutate(d7_test, Y_hat = round(Y_hat))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.5008329 | 0.0054806 | 91.3829889 | 0.0000000 |
X1 | 0.0020795 | 0.0063249 | 0.3287718 | 0.7423309 |
X2 | 0.0092509 | 0.0063247 | 1.4626646 | 0.1435717 |
Using the estimated coefficients of the regression model, we can formulate the linear function:
\[Y = 0.5 + 0 \times X1 + 0 \times X2\]
However, this function cannot represent a logical “XOR” rule:
X1 | X2 | Y | Y_hat |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 0 | 1 | 1 |
0 | 1 | 1 | 1 |
1 | 1 | 0 | 1 |
Therefore, a linear regression model cannot accurately estimate the causal effect when the data-generation process involves non-linear functions similar to a logical “XOR” rule, even if the model is correctly specified.
In the real world, our data consist primarily of measurements of actual variables, which are often latent variables (i.e., abstract constructs). For example, cognitive ability is often measured by an intelligence quotient (IQ) score. Furthermore, we also categorize IQ scores into several categories, which represent another form of variable: IQ category. First, IQ scoring captures only a portion of a cognitive ability, leading to information loss. When IQ is further categorized, even more information is lost.
We can represent a measured variable as a vertex to which an edge is drawn from another vertex that represents the actual variable. For instance, consider a causal path from \(X\) to \(Y\). Let’s denote their measured variables as \(X'\) and \(Y'\), respectively. Since value of \(X'\) depends on \(X\), and \(Y'\) depends on \(Y\), we draw an edge from \(X\) to \(X'\) and another from \(Y\) to \(Y'\). Therefore, in a causal graph, measurement is simply a form of causal path from an actual to a measured variable.
d8 <- data.frame(from = "X", to = "Y")
d8 <- add_row(d8, data.frame(from = "X", to = "Xm"))
d8 <- add_row(d8, data.frame(from = "UX", to = "Xm"))
d8 <- add_row(d8, data.frame(from = "Y", to = "Ym"))
d8 <- add_row(d8, data.frame(from = "UY", to = "Ym"))
If a measurement error introduced in a set of measured variables depends on the same variable \(U_XY\) (UXY) which is beyond their actual variables, then it is called a dependent measurement error; otherwise, it is independent. The figure below shows independent (left) and dependent (right) measurement errors.
If a measurement error introduced in a measured variable depends on any other actual variables beyond its own actual variable, then it is called differential measurement error; otherwise, it is nondifferential. Hence, the previous measurement errors are specifically: (1) independent, nondifferential; and (2) dependent, nondifferential. The figure below shows differential measurement error. Specifically, it shows an independent, differential measurement error. A measurement error of \(Y'\) depends on \(X\) (left), or a measurement error of \(X'\) depends on \(Y\) (right).
Therefore, we can categorize measurement errors based on their dependencies and differentials into four types:
Now, let’s generate data with measured variables. We use the same variables and functions but introduce different versions of errors, ranging from smaller to larger errors.
I <- seq(10)
UX_epsilon <- seq(0.005, 4, length = length(I))
UY_epsilon <- seq(0.005, 8, length = length(I))
d8_results <- list()
for(i in I){
d8_functions <- function_from_edge(d8)
function_X <- function(n){
X <- rnorm(n, mean = 0, sd = 1)
return(X)
}
function_UX <- function(n){
UX <- rnorm(n, mean = 0, sd = UX_epsilon[[i]])
return(UX)
}
function_Xm <- function(X, UX){
Xm <- X + UX
return(Xm)
}
function_Y <- function(X){
Y <- 0.5 * X + rnorm(length(X), mean = 0.1, sd = 0.005)
return(Y)
}
function_UY <- function(n){
UY <- rnorm(n, mean = 0, sd = UY_epsilon[[i]])
return(UY)
}
function_Ym <- function(Y, UY){
Ym <- Y + UY
return(Ym)
}
d8_functions <- define(d8_functions, which = "X", what = function_X)
d8_functions <- define(d8_functions, which = "UX", what = function_UX)
d8_functions <- define(d8_functions, which = "Xm", what = function_Xm)
d8_functions <- define(d8_functions, which = "Y", what = function_Y)
d8_functions <- define(d8_functions, which = "UY", what = function_UY)
d8_functions <- define(d8_functions, which = "Ym", what = function_Ym)
set.seed(1)
d8_data <- data_from_function(d8_functions, n = 25000)
d8_reg <- lm(Ym ~ Xm, data = d8_data)
d8_results[[i]] <- tidy(d8_reg)
}
We observe that larger errors shift the estimated coefficients away from the true values. An error \(\epsilon\) of 2.11 or more leads to the conclusion that the effect of \(X\) on \(Y\) is not significant, even though it should be significant according to the data-generation process.
We can simulate how missing values occur using the same principles as those of measurement error. Missing values introduced in a measured variable depend on its missingness variable \(R\).
There are three types of missing-value mechanisms:
d9 <- data.frame(from = "X", to = "Y")
d9 <- add_row(d9, data.frame(from = "X", to = "Xm"))
d9 <- add_row(d9, data.frame(from = "UX", to = "Xm"))
d9 <- add_row(d9, data.frame(from = "RX", to = "Xm"))
d9 <- add_row(d9, data.frame(from = "Y", to = "Ym"))
d9 <- add_row(d9, data.frame(from = "UY", to = "Ym"))
d9 <- add_row(d9, data.frame(from = "RY", to = "Ym"))
Now, let’s generate data with measured variables where each also depends on its missingness variable \(R\). We use the same variables, functions, and errors but introduce different versions of missingness, ranging from lower to higher proportions of missing values.
I <- seq(10)
RX_p <- seq(0, 0.95, length = length(I))
RX_p <- mapply(c, 1 - RX_p, RX_p)
RY_p <- seq(0, 0.95, length = length(I))
RY_p <- mapply(c, 1 - RY_p, RY_p)
d9_results <- list()
for(i in I){
d9_functions <- function_from_edge(d9)
function_X <- function(n){
X <- rnorm(n, mean = 0, sd = 1)
return(X)
}
function_UX <- function(n){
UX <- rnorm(n, mean = 0, sd = 0.01)
return(UX)
}
function_RX <- function(n){
RX <- sample(c(0, 1), size = n, replace = TRUE, prob = RX_p[, i])
return(RX)
}
function_Xm <- function(X, UX, RX){
Xm <- ifelse(RX == 1, NA, X + UX)
return(Xm)
}
function_Y <- function(X){
Y <- 0.5 * X + rnorm(length(X), mean = 0.1, sd = 0.005)
return(Y)
}
function_UY <- function(n){
UY <- rnorm(n, mean = 0, sd = 0.01)
return(UY)
}
function_RY <- function(n){
RY <- sample(c(0, 1), size = n, replace = TRUE, prob = RY_p[, i])
return(RY)
}
function_Ym <- function(Y, UY, RY){
Ym <- ifelse(RY == 1, NA, Y + UY)
return(Ym)
}
d9_functions <- define(d9_functions, which = "X", what = function_X)
d9_functions <- define(d9_functions, which = "UX", what = function_UX)
d9_functions <- define(d9_functions, which = "RX", what = function_RX)
d9_functions <- define(d9_functions, which = "Xm", what = function_Xm)
d9_functions <- define(d9_functions, which = "Y", what = function_Y)
d9_functions <- define(d9_functions, which = "UY", what = function_UY)
d9_functions <- define(d9_functions, which = "RY", what = function_RY)
d9_functions <- define(d9_functions, which = "Ym", what = function_Ym)
set.seed(1)
d9_data <- data_from_function(d9_functions, n = 25000)
d9_reg <- lm(Ym ~ Xm, data = d9_data)
d9_results[[i]] <- tidy(d9_reg)
}
We observe that higher proportions of missing values shift the estimated coefficients away from the true values. A missing proportion \(p\) of approximately 0.50 or more results in a substantially incorrect estimate for the effect of \(X\) on \(Y\), although the results remain statistically significant in cases of MCAR without missing value imputation.
The value of a variable at time \(t\) may determine that of the same variable at time \(t+1\). This variable may be a cause (or an exposure/intervention), an effect (or an outcome/competing risk), or a covariate (or a mediator/confounder/collider). A cause-effect relationship that involves such properties is called time-varying causation. The figure below demonstrates both time-varying exposure and outcome. Since a standard causal graph should be a directed acyclic graph (DAG), we represent the same variable at different times as multiple vertices.
To simulate time-varying causation, first, we need to generate data at the initial time \(t=0\). However, time-varying causation must be represented as a directed cyclic graph (DCG), in which loops are allowed. Nevertheless, when generating data at the first step, any loops must be avoided. Hence, any edge from and to the same vertex is not included yet.
d10 <- data.frame(from = "X", to = "Y")
d10_functions <- function_from_edge(d10)
function_X <- function(n){
X <- rnorm(n, mean = 0, sd = 1)
X <- abs(X)
return(X)
}
function_Y <- function(X){
Y <- 0.5 * X + rnorm(length(X), mean = 0.1, sd = 0.005)
return(Y)
}
d10_functions <- define(d10_functions, which = "X", what = function_X)
d10_functions <- define(d10_functions, which = "Y", what = function_Y)
d10_data <- data_from_function(d10_functions, n = 10000)
We must note that rcausim
uses looping to simulate
time-varying causation without differentiating the names of a variable
at different times. For example, the same name \(X\) is used for a variable \(X\) at time \(t\) and \(t+1\); hence, the causal graph will look
like the figure below in the context of rcausim
. In this
example, we use time-varying outcome, where the value of \(Y\) at time \(t\) also determines the value of the same
variable at time \(t+1\) in addition to
the values of variable \(X\).
Therefore, before generating the data for \(t>0\), we need to add edges from and to
the same variables. The function is also modified to include the
variable itself as one of the arguments. The class of
Functions
should be redefined with the modified
function.
Since this step assumes we already generated the data at \(t=0\), we need to use the output from the
previous step, i.e., d10_data
, as the input in this
step.
In simulating time-varying causation, we also need to determine a
maximum time \(t_{max}\) for each
instance to be simulated. The T_max
argument facilitates
this feature.
d10 <- add_row(d10, data.frame(from = "Y", to = "Y"))
function_Y <- function(X, Y){
Y <- 0.5 * X + 1.1 * Y
return(Y)
}
d10_functions <- define(d10_functions, which = "Y", what = function_Y)
set.seed(1)
d10_T_max <- rpois(nrow(d10_data), lambda = 25)
d10_data <- time_varying(d10_functions, data = d10_data, T_max = d10_T_max)
The figure above shows the values of \(X\) and \(Y\) at different times for randomly-sampled instances (top) and their time-wise values summarized from all instances (bottom). We can observe the values of \(Y\) changing over time for each instance, while the values of \(X\) do not. The summary values of \(X\) vary at different times due to inter-instance variability each time.
Two variables may affect each other simultaneously or in a feedback loop. For example, the value of \(Y\) depends on the value of \(X\), and concurrently, the value of \(X\) also depends on the value of \(Y\). This inter-dependency is known as bidirectional causation.
In practical scenarios, however, even if two variables affect each other, we model their relationship at discrete time steps. Therefore, we can depict bidirectional causation in a standard causal graph which uses a DAG, representing the variables at different times \(t\) and \(t+1\), as shown below.
To simulate bidirectional causation using the
time_varying
function in rcausim
, we first
generate data at \(t=0\). This initial
step requires a DAG, avoiding any loops.
d11 <- data.frame(from = "X", to = "Y")
d11_functions <- function_from_edge(d11)
function_X <- function(n){
X <- rnorm(n, mean = 0, sd = 1)
return(X)
}
function_Y <- function(X){
Y <- 0.5 * X + rnorm(length(X), mean = 0, sd = 0.00001)
return(Y)
}
d11_functions <- define(d11_functions, which = "X", what = function_X)
d11_functions <- define(d11_functions, which = "Y", what = function_Y)
d11_data <- data_from_function(d11_functions, n = 10000)
After generating initial data, subsequent time points involve adding
reciprocal edges to model the ongoing interaction between \(X\) and \(Y\). In this example, we add an edge from
the parent vertex of \(X\)’s vertex,
which represents \(Y\). As described in
the previous section, we need to provide the initial data
d11_data
and the maximum times d11_T_max
.
d11 <- add_row(d11, data.frame(from = "Y", to = "X"))
function_X <- function(Y){
X <- 2 * Y + rnorm(length(Y), mean = 0, sd = 0.00001)
return(X)
}
d11_functions <- define(d11_functions, which = "X", what = function_X)
set.seed(1)
d11_T_max <- rpois(nrow(d11_data), lambda = 25)
d11_data <- time_varying(d11_functions, data = d11_data, T_max = d11_T_max)
The figure above demonstrates the values of \(X\) and \(Y\) over time for randomly-sampled instances (top) and summarizes their time-wise values across all instances (bottom). As time-varying causation, both variables exhibit changes over time, reflecting the bidirectional causation they exert on each other.