--- title: "Multibias example: Evans" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Multibias example: Evans} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(multibias) ``` We are interested in quantifying the effect of smoking (SMK) on coronary heart disease (CHD) using data from the [Evans County Heart Study](https://en.wikipedia.org/wiki/Evans_County_Heart_Study). Let's inspect the dataframe we have for 609 participants aged 40 and older. ```{r, eval = TRUE} head(evans) ``` This is clearly not the most robust data, but, for purposes of demonstrating `multibias`, let's proceed by pretending that our data was missing information on AGE. Let's inspect the resulting effect estimate, adjusted for hypertension (HPT). ```{r, eval = TRUE} biased_model <- glm(CHD ~ SMK + HPT, family = binomial(link = "logit"), data = evans ) or <- round(exp(coef(biased_model)[2]), 2) or_ci_low <- round( exp(coef(biased_model)[2] - 1.96 * summary(biased_model)$coef[2, 2]), 2 ) or_ci_high <- round( exp(coef(biased_model)[2] + 1.96 * summary(biased_model)$coef[2, 2]), 2 ) print(paste0("Biased Odds Ratio: ", or)) print(paste0("95% CI: (", or_ci_low, ", ", or_ci_high, ")")) ``` This estimate is around what we'd expect. In the IJE article [The association between tobacco smoking and coronary heart disease](https://doi.org/10.1093/ije/dyv124) the author notes: "Cigarette smokers have about twice as much coronary heart disease as non-smokers, whether measured by deaths, prevalence, or the incidence of new events." However, we also know from studies like [this BMJ meta-analysis](https://doi.org/10.1136/bmj.j5855) that the effect estimate can vary greatly depending on the degree of cigarette consumption. Can we anticipate whether this odds ratio without age-adjustment is biased towards or away from the null? Let's consider the association of the uncontrolled confounder with the exposure and outcome. ```{r} cor(evans$SMK, evans$AGE) cor(evans$CHD, evans$AGE) ``` In our data, AGE has a negative association with SMK (older people are **less** likely to be smokers) and a positive association with CHD (older people are **more** likely to have CHD). These opposite associations must be biasing the odds ratio towards the null, creating a distortion where those who are less likely to smoke are more likely to experience the outcome. We'll treat AGE as a binary indicator of over (1) or under (0) age 60. To adjust for the uncontrolled confounding from AGE, let's refer to the appropriate bias model for a binary uncontrolled confounder: logit(P(U=1)) = α0 + α1X + α2Y + α2+jCj. To derive the necessary bias parameters, let's make the following assumption: * the odds of an age>60 is half as likely in smokers than non-smokers * the odds of an age>60 is 2.5x in those with CHD than those without CHD * the odds of an age>60 is 2x in those with HPT than those without HPT To convert these relationships as parameters in the model, we'll log-transform them from odds ratios. For the model intercept, we can use the following reasoning: what is the probability that a non-smoker (X=0) without CHD (Y=0) and HPT (C=0) is over age 60 in this population? We'll assume this is a 25% probability. We'll use the inverse logit function `qlogis()` from the `stats` package to convert this from a probability to the intercept coefficient of the logistic regression model. ```{r, eval = TRUE} u_0 <- qlogis(0.25) u_x <- log(0.5) u_y <- log(2.5) u_c <- log(2) ``` Now let's plug these bias parameters into `adjust_uc()` along with our `data_observed` object to obtain our bias-adjusted odds ratio. ```{r, eval = TRUE} df_obs <- data_observed( data = evans, exposure = "SMK", outcome = "CHD", confounders = "HPT" ) set.seed(1234) adjust_uc( df_obs, u_model_coefs = c(u_0, u_x, u_y, u_c) ) ``` We get an odds ratio of 2.3 (95% CI: 1.3, 4.1). This matches our expectation that the bias adjustment would pull the odds ratio away from the null. How does this result compare to the result we would get if age **wasn't** missing in the data and was incorporated in the outcome regression? ```{r, eval = TRUE} full_model <- glm(CHD ~ SMK + HPT + AGE, family = binomial(link = "logit"), data = evans ) or <- round(exp(coef(full_model)[2]), 2) or_ci_low <- round( exp(coef(biased_model)[2] - 1.96 * summary(full_model)$coef[2, 2]), 2 ) or_ci_high <- round( exp(coef(biased_model)[2] + 1.96 * summary(full_model)$coef[2, 2]), 2 ) print(paste0("Odds Ratio: ", or)) print(paste0("95% CI: (", or_ci_low, ", ", or_ci_high, ")")) ``` It turns out that our bias-adjusted odds ratio using `multibias` is close to this complete-data odds ratio of 2.3.