---
title: "Estimation of causal effects using stdReg2"
date: "`r format(Sys.Date(), '%Y-%m-%d')`"
output: 
  rmarkdown::html_vignette: 
    highlight: "haddock"
    self_contained: false
vignette: >
  %\VignetteIndexEntry{Estimation of causal effects using stdReg2}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
bibliography: references.bib
editor_options: 
  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(stdReg2)
```

## Introduction and context

Suppose $X$ denotes an exposure of interest that takes values 0 or 1. This could represent two different medical treatments, environmental exposures, economic policies, or genetic variants. We will most often use biomedical examples because we are biostatisticians. 

We consider the setting where it is of interest to quantify the effect of intervening with $X$ on outcome that we will denote $Y$. The outcome could represent some numeric value, it could be the presence or absence of a condition, or it could be the time between two events, such as time from cancer diagnosis to death. Let $Y(X = x)$ denote the potential outcome if all subjects in the population would hypothetically be exposed to $X = x$. 

To quantify the effect of $X$, we must summarize the distribution of $Y(X = x)$ with some statistic. If $Y$ is dichotomous, it is natural to use $p\{Y(X = x) = 1\}$, called the risk. If $Y$ is continuous, the mean is a natural summary statistic $E\{Y(X = x)\}$. If $Y$ is a continuous time-to-event, the probability of exceeding a particular value $t$ is a reasonable statistic: $p\{Y(X = x) > t\}$. In general, denote the summary statistic of choice as $T\{Y(X = x)\}$. The summary statistic can also be applied to conditional distributions, which we will denote, e.g., $T\{Y | X = x\}$.

To quantify the effect of $X$, we must also decide on a _contrast_ to measure the causal effect. The point of the contrast is to compare the chosen summary statistic between the $X = 1$ and $X = 0$ interventions. Typical choices would be the difference $T\{Y(X = 1)\} - T\{Y(X = 0)\}$ or the ratio $T\{Y(X = 1)\} / T\{Y(X = 0)\}$. It may also be of interest to quantify and report the summary statistics within each group $(T\{Y(X = 1)\}, T\{Y(X = 0)\})$. 

In observational studies, the relationship between $X$ and $Y$ is likely confounded by a set of other variables $\boldsymbol{Z}$. This means that the values of the outcome $Y$ are determined by at least a subset of $\boldsymbol{Z}$ and the values of the exposure $X$ are determined by a subset of $\boldsymbol{Z}$. Naively estimating the contrast would lead to biased estimates of the causal effect. 

## Regression standardization

See @sjolander2016regression and @sjolander2018estimation for more details. Suppose that the covariates $\boldsymbol{Z}$ are sufficient for confounding control. For more information on what constitutes a sufficient adjustment set, see @witte2019covariate. For a given summary statistic $T$, then 
\[
T\{Y(X = x)\} = E_{\boldsymbol{Z}}[T\{Y | X = x, \boldsymbol{Z}\}], 
\]
where the expectation is taken with respect to the population distribution of $\boldsymbol{Z}$. This is also known as the _g-formula_ or _adjustment formula_.

In order to estimate this quantity based on an independent and identically distributed sample $(X_1, Y_1, \boldsymbol{Z}_1), \ldots, (X_n, Y_n, \boldsymbol{Z}_n)$, we proceed by 

1. Specifying and estimating a regression model for $Y$ given $X$ and $\boldsymbol{Z}$. 
2. Use the fitted model to obtain estimates of $T\{Y_i | X_i = x, \boldsymbol{Z}_i\}$ for $i = 1, \ldots, n$. This is done by creating a copy of the observed dataset, replacing the observed $X_i$ with $x$ for each individual, and using the fitted model to get predicted values for the copy of the observed data. Denote these predicted values as $\hat{T}\{Y_i | X_i = x, \boldsymbol{Z}_i\}$.
3. Average over the empirical distribution of $\boldsymbol{Z}$ to obtain the estimate
\[
\hat{T}\{Y(X = x)\} = \frac{1}{n}\sum_{i = 1}^n \hat{T}\{Y_i | X_i = x, \boldsymbol{Z}_i\}.
\]

One can do this for each level of $X = 0, 1$ and compute the desired contrast. Under the assumptions that 1) $\boldsymbol{Z}$ is sufficient for confounding control, and 2) the regression model in step 1 is correctly specified, then this estimator is consistent and asymptotically normal. 


## Improving robustness by modeling the exposure


A doubly-robust estimator is an estimator that is consistent for a given estimand when one or more of the models used in the forming of the estimator is correctly specified for confounding. So far, we only have one model used in our estimator: the _outcome model_. We will now introduce a model for the exposure, and see how we can combine them. First, some terminology: 

_Misspecified model_ - The true generating mechanism is not contained in the possible mechanisms that are possible under the selected model.


_Correctly Specified_ - A model is correctly specified if it is not misspecified.


_Correctly specified for confounding_ - A correctly specified model that contains a sufficient set of confounders. 


If we can model $P(X=1|\boldsymbol{Z})$, and it is correctly specified and contains all confounders, then we can use that to estimate the probability that each individual $i$ received the treatment that they did $W_i = \frac{X_i}{P(X_i=1|\boldsymbol{Z}_i)} + \frac{1-X_i}{1-P(X_i=1|\boldsymbol{Z}_i)}$. Let $\hat{p}_i$ denote the estimated probability that subject $i$ received treatment $1$.

Any consistent estimation method can be used for the outcome and exposure models. As long as _either_ the outcome model _or_ the propensity score model is correctly specified for confounding, then a doubly robust estimator is consistent for the ATE. In this package, for generalized linear models, we use the estimator as described by @gabriel2023inverse. 


Recently, there seems to be a general misconception that combining an adjusted outcome model and a propensity score model always gives a doubly robust estimator. __This is not true__ -- it matters how you combine them! 


# In practice

Here we will use regression standardization to estimate the average causal effect of the exposure (quitting smoking) in the variable `qsmk` on the weight gain outcome in the variable `wt82_71` in the `nhefs_complete` dataset that comes with the `causaldata` package. The data was collected as part of a project by the National Center for Health Statistics and the National Institute on Aging in collaboration with other agencies of the United States Public Health Service. It was designed to investigate the relationships between clinical, nutritional, and behavioral factors and subsequent morbidity, mortality, and hospital utilization and changes in risk factors, functional limitation, and institutionalization. The dataset includes 1566 individuals and contains among others the following variables: 

- seqn: person id}
- wt82_71: weight gain in kilograms between 1971 and 1982
- qsmk: quit smoking between between 1st questionnaire and 1982, 1 = yes, 0 = no
- sex: 0 = male, 1 = female
- race: 0 = white, 1 = black or other
- age: age in years at baseline
- education: level of education in 1971, 1 = 8th grade or less, 2 = high school dropout, 3 = high school, 4 -  dropout, 5 = college or more
- smokeintensity: number of cigarettes smoked per day in 1971
- smokeyrs: number of years smoked
- exercise: level of physical activity in 1971, 0 = much exercise, 1 = moderate exercise, 2 = little or no exercise
- active: level of activity in 1971, 0 = very active, 1 = moderately active, 2 = inactive, 3 = missing 
- wt71: weight in kilograms in 1971
- ht: height in centimeters in 1971


```{r}
nhefs_dat <- causaldata::nhefs_complete
summary(nhefs_dat)
```


We will assume that the set of confounders in $\boldsymbol{Z}$ includes sex, race, age, education, number of cigarettes smoked per year, the number of years smoked, level of physical activity, and baseline weight in 1971. This equivalent to assuming that the counterfactual weight gain is independent of the exposure conditional on these variables. In other words, we are assuming the following directed acyclic graph: 

```{r, echo=FALSE, out.width='30%'}
library(survival)
states <- c("X", "Z", "Y")
connect <- matrix(0, 3, 3, dimnames = list(states, states))
connect[cbind(c(1, 2, 2), c(3, 1, 3))] <- 1
statefig(cbind(c(.2, .5, .8), c(.3, .6, .3)), connect)
```

For the specific forms of the conditional expectations required in the outcome we assume a linear regression model with both linear and quadratic forms of the continuous covariates. We can fit this as

```{r}
m <- glm(wt82_71 ~ qsmk + sex + race + age + I(age^2) + 
        as.factor(education) + smokeintensity + I(smokeintensity^2) + 
        smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) +
          wt71 + I(wt71^2), 
        data = nhefs_dat)
summary(m)
```

To perform regression standardization to estimate the causal effect we use `standardize_glm`. We must specify the same outcome regression model as a formula, provide the data, describe which values of the exposure we wish to estimate the counterfactual means, specify which contrasts we want, and specify the reference level for the contrasts. The following command estimates that model and we obtain the group-wise estimates, the difference, and the ratio.

```{r}
m2 <- standardize_glm(wt82_71 ~ qsmk + sex + race + age + I(age^2) + 
               as.factor(education) + smokeintensity + I(smokeintensity^2) + 
               smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) +
               wt71 + I(wt71^2), 
            data = nhefs_dat, 
            values = list(qsmk = c(0,1)),
            contrasts = c("difference", "ratio"),
            reference = 0)

m2

plot(m2)
plot(m2, contrast = "difference", reference = 0)
```

The output from the model shows the estimated potential outcome means in each exposure level, the difference and ratio thereof, and the associated standard error estimates, confidence intervals, and p-values. Inference is done using the sandwich method for variance calculation. Under the assumptions that the outcome model is correctly specified and contains all confounders, these are consistent estimates of the causal effects of interest. We can also get plots of the effects by using the plot function. 

To obtain the estimates and inference for the different contrasts as a tidy data frame, suitable for saving as a data table or using in downstream analyses or reports, we provide the `tidy` function: 

```{r}
(tidy(m2) -> tidy_m2) |> print()
```

To obtain doubly robust inference we use the following command. Note that we now specify a model for the exposure, the propensity score model.

```{r}
m2_dr <- standardize_glm_dr(formula_outcome = wt82_71 ~ qsmk + sex + race + age + I(age^2) + 
               as.factor(education) + smokeintensity + I(smokeintensity^2) + 
               smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) +
               wt71 + I(wt71^2), 
               formula_exposure = qsmk ~ sex + race + age + I(age^2) +
                as.factor(education) + smokeintensity + I(smokeintensity^2) + 
                smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + 
                wt71 + I(wt71^2),
            data = nhefs_dat, 
            values = list(qsmk = c(0,1)),
            contrast = c("difference", "ratio"),
            reference = 0) 

m2_dr

(tidy(m2_dr) -> tidy_m2_dr) |> print()
```

Based on these results, we can report that the estimated effect of smoking on average weight change as measured by the difference in potential outcome means is `r with(subset(tidy_m2, contrast == "difference" & qsmk == 1), sprintf("%.2f (%.2f to %.2f)", Estimate, lower.0.95, upper.0.95))` and as measured by the ratio of potential outcome means is `r with(subset(tidy_m2, contrast == "ratio" & qsmk == 1), sprintf("%.2f (%.2f to %.2f)", Estimate, lower.0.95, upper.0.95))`. Using the doubly-robust estimation method, we obtain similarly `r with(subset(tidy_m2_dr, contrast == "difference" & qsmk == 1), sprintf("%.2f (%.2f to %.2f)", Estimate, lower.0.95, upper.0.95))` and as measured by the ratio of potential outcome means is `r with(subset(tidy_m2_dr, contrast == "ratio" & qsmk == 1), sprintf("%.2f (%.2f to %.2f)", Estimate, lower.0.95, upper.0.95))`. In particular, if we believe that the necessary assumptions are valid, we can conclude that smoking causes an increase in weight which the numeric summaries indicate that smoking increases the weight change over 11 years by about 3.5 pounds (additive scale) or smoking increases the weight change over 11 years by a multiplicative factor of about 3. 

The doubly-robust method requires specification of an exposure model in addition to the specification of the outcome model. Both of these estimated models can be accessed and inspected by looking at the `fit_outcome` and `fit_exposure` elements of the return object: 

```{r}
m2_dr$res$fit_outcome
m2_dr$res$fit_exposure
```

This is useful when we want to inspect the distribution of the propensity scores by exposure status. This is an important thing to check in practice, we are looking for any practical violations of the positivity assumption. 

```{r}
hist(m2_dr$res$fit_exposure$fitted[nhefs_dat$qsmk == 0], 
    xlim = c(0, 1), main = "qsmk = 0", xlab = "estimated propensity")
hist(m2_dr$res$fit_exposure$fitted[nhefs_dat$qsmk == 1], 
    xlim = c(0, 1), main = "qsmk = 1", xlab = "estimated propensity")
```

## Other outcome types

The function `standardize_glm` and its doubly-robust counterpart `standardize_glm_dr` support any outcome variable type the can be used with `glm`. This includes binary, count, and more. Just like in `glm`, we specify the model we would like to use for the outcome by specifying the `family` argument of `standardize_glm` or the `family_outcome` argument of `standardize_glm_dr`. By default this is `"gaussian"` which corresponds to linear regression. For a binary outcome we might like to use `"binomial"` which corresponds to logistic regression. Let us see how it works using the a binary outcome that we create by dichotomizing the weight change at 0.


```{r}
nhefs_dat$gained_weight <- 1.0 * (nhefs_dat$wt82_71 > 0)
m3 <- standardize_glm(gained_weight ~ qsmk + sex + race + age + I(age^2) + 
               as.factor(education) + smokeintensity + I(smokeintensity^2) + 
               smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) +
               wt71 + I(wt71^2), 
            data = nhefs_dat, 
            family = "binomial",
            values = list(qsmk = c(0,1)),
            contrasts = c("difference", "ratio"),
            reference = 0)

m3
```

Here we interpret the estimates in terms of probabilities, or the risk of gaining weight over 11 years. In particular, smoking causes a 13% additive increase in the risk of gaining weight, or a 1.2 fold multiplicative increase in the risk of gaining weight. With binary outcomes, we can use transformations to get different contrasts, and also to improve the performance of the ratio contrast. By using the log transformation combined with the difference contrast, our estimates become the log risk ratios. We can exponential them to get the risk ratios, but often it is better to construct confidence intervals of ratios on the log scale and back transform. 

```{r}
m3_log <- standardize_glm(gained_weight ~ qsmk + sex + race + age + I(age^2) + 
               as.factor(education) + smokeintensity + I(smokeintensity^2) + 
               smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) +
               wt71 + I(wt71^2), 
            data = nhefs_dat, 
            family = "binomial",
            values = list(qsmk = c(0,1)),
            contrasts = c("difference"),
            transforms = c("log"),
            reference = 0)
m3_log
```

The estimates reported in the table are log risk ratios, which we now back transform by exponentiating and compare to the ratios estimated previously. 

```{r}
tm3_log <- tidy(m3_log)
tm3_log$rr <- exp(tm3_log$Estimate)
tm3_log$rr.lower <- exp(tm3_log$lower.0.95)
tm3_log$rr.upper <- exp(tm3_log$upper.0.95)

subset(tm3_log, contrast == "difference" & qsmk == 1)
m3
```

In this case, the estimates and inference using transformation are nearly identical to those without transformation. 

Other available transformations include the odds, and logit (log odds). These can be used to estimate causal odds ratios, untransformed and transformed, respectively. 

```{r}
m3_odds <- standardize_glm(gained_weight ~ qsmk + sex + race + age + I(age^2) + 
               as.factor(education) + smokeintensity + I(smokeintensity^2) + 
               smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) +
               wt71 + I(wt71^2), 
            data = nhefs_dat, 
            family = "binomial",
            values = list(qsmk = c(0,1)),
            contrasts = c("ratio"),
            transforms = c("odds"),
            reference = 0)
m3_odds

m3_logit <- standardize_glm(gained_weight ~ qsmk + sex + race + age + I(age^2) + 
               as.factor(education) + smokeintensity + I(smokeintensity^2) + 
               smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) +
               wt71 + I(wt71^2), 
            data = nhefs_dat, 
            family = "binomial",
            values = list(qsmk = c(0,1)),
            contrasts = c("difference"),
            transforms = c("logit"),
            reference = 0)
m3_logit

m3_logOR <- tidy(m3_logit) |> 
  subset(contrast == "difference" & qsmk == 1) 
  
sprintf("%.2f (%.2f to %.2f)", exp(m3_logOR$Estimate), 
        exp(m3_logOR$lower.0.95), exp(m3_logOR$upper.0.95))
```

In this case the estimate is identical, but the confidence interval is slightly different because it is symmetric on the odds ratio scale in the first case, and symmetric on the log odds ratio scale in the transformed case.