--- title: "Simulate from a fitted glmmTMB model or a formula" author: "Mollie Brooks and Ben Bolker" date: "`r format(Sys.Date(), '%d %b %Y')`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Simulate from a fitted glmmTMB model or a formula} %\VignettePackage{glmmTMB} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ## Simulating from a fitted model `glmmTMB` has the capability to simulate from a fitted model. These simulations resample random effects from their estimated distribution. In future versions of `glmmTMB`, it may be possible to condition on estimated random effects. ```{r setup, include=FALSE, message=FALSE} library(knitr) knitr::opts_chunk$set(echo = TRUE) ``` ```{r libs,message=FALSE} library(glmmTMB) library(ggplot2); theme_set(theme_bw()) ``` Fit a typical model: ```{r fit1} data(Owls) owls_nb1 <- glmmTMB(SiblingNegotiation ~ FoodTreatment*SexParent + (1|Nest)+offset(log(BroodSize)), family = nbinom1, ziformula = ~1, data=Owls) ``` Then we can simulate from the fitted model with the `simulate.glmmTMB` function. It produces a list of simulated observation vectors, each of which is the same size as the original vector of observations. The default is to only simulate one vector (`nsim=1`) but we still return a list for consistency. ```{r sim} simo=simulate(owls_nb1, seed=1) Simdat=Owls Simdat$SiblingNegotiation=simo[[1]] Simdat=transform(Simdat, NegPerChick = SiblingNegotiation/BroodSize, type="simulated") Owls$type = "observed" Dat=rbind(Owls, Simdat) ``` Then we can plot the simulated data against the observed data to check if they are similar. ```{r plots,fig.width=7} ggplot(Dat, aes(NegPerChick, colour=type))+geom_density()+facet_grid(FoodTreatment~SexParent) ``` ## Simulating from scratch (*de novo*) what if you want to simulate data with specified parameters in the absence of a data set, for example for a power analysis? `glmmTMB` has a `simulate_new` function that can handle this case; the hardest part is understanding the meaning of the parameter values, especially for random-effects covariances. ### example 1: linear regression For the first example we'll simulate something that looks like the classic "sleep study" data, using the `sleepstudy` data set for structure and covariates. The conditional-fixed effects parameters (`beta`) are standard regression parameters (intercept and slope): we use 250 and 10, which are close to the values from the actual data. The only other parameter, `betadisp`, is the log of the dispersion parameter, which in the specific case of the Gaussian (default) family is the log of the conditional (residual) *variance*; the standard deviation from a simple regression on these data[^1] is approximately 50, so we use `2*log(50)`. [^1]: I realize this violates the assumption of *de novo* simulation that we don't know what the real data looks like yet ... 
```{r sleepstudy}
data("sleepstudy", package = "lme4")
set.seed(101)
ss_sim <- transform(sleepstudy,
                    Reaction = simulate_new(~ Days,
                                            newdata = sleepstudy,
                                            family = gaussian,
                                            newparams = list(beta = c(250, 10),
                                                             betadisp = 2*log(50)))[[1]])
```

For comparison, we'll also fit the model and use the built-in simulation method:

```{r simlm}
ss_fit <- glmmTMB(Reaction ~ Days, sleepstudy)
ss_simlm <- transform(sleepstudy,
                      Reaction = simulate(ss_fit)[[1]])
```

Comparing against the real data set:

```{r ss_plot, fig.width=10}
library(ggplot2); theme_set(theme_bw())
ss_comb <- rbind(data.frame(sleepstudy, sample = "real"),
                 data.frame(ss_sim, sample = "simulated"),
                 data.frame(ss_simlm, sample = "simulated (from fit)"))
ggplot(ss_comb, aes(x = Days, y = Reaction, colour = Subject)) +
    geom_line() +
    facet_wrap(~ sample)
```

The simulated data have about the right variability but, in contrast to the real data, no among-subject variation.

### Example 2: random effects (including correlations)

The next example is more complex, getting into the nuts and bolts of how to translate random-effect covariances into the terms that `glmmTMB` expects. The hardest piece is probably translating correlation parameters. The "covariance structures" vignette has more details on how correlation matrices are parameterized, and the `put_cor()` function is a general translator from a specified correlation matrix (or its lower-triangular elements) to the appropriate set of `theta` parameters. For the specific case of 2×2 correlation matrices (i.e. with a single correlation parameter), a correlation $\rho$ corresponds to a `glmmTMB` parameter of $\rho/\sqrt{1-\rho^2}$. Here's a utility function:

```{r rho-to-theta}
rho_to_theta <- function(rho) rho/sqrt(1-rho^2)
## tests
stopifnot(all.equal(get_cor(rho_to_theta(-0.2)), -0.2))
## equivalent to the general function
stopifnot(all.equal(rho_to_theta(-0.2), put_cor(-0.2, input_val = "vec")))
```

Setting up metadata/covariates (tools in the `faux` package may also be useful for this):

```{r sim1}
n_id <- 10
dd <- expand.grid(trt = factor(c("A", "B")),
                  id = factor(1:n_id),
                  time = 1:6)
```

We'll set up some reasonable fixed effects. When in doubt about the order of fixed-effect parameters, use `model.matrix()` to check:

```{r form}
form1 <- ~ trt*time + (1 + time | id)
colnames(model.matrix(lme4::nobars(form1), data = dd))
```

```{r params2}
## intercept, trtB effect, slope, trt x slope interaction
beta_vec <- c(1, 2, 0.1, 0.2)
```

We'll set the standard deviations so that the average coefficient of variation is 1 (SD = mean for among-subject variation in intercept and slope). As described in the "covstruct" vignette, the parameter vector for a random-effect covariance consists of the log standard deviations followed by the scaled correlations. Finally, we'll set the dispersion parameter for the negative binomial conditional distribution to 1 (more detail on the `betadisp` parameterization for different families is given in `?sigma.glmmTMB`).

```{r params3}
sdvec <- c(1.5, 0.15)
corval <- -0.2
thetavec <- c(log(sdvec), rho_to_theta(corval))
par1 <- list(beta = beta_vec,
             betadisp = log(1),  ## log of the NB dispersion parameter
             theta = thetavec)
```

Now simulate:

```{r sim3}
dd$y <- simulate_new(form1,
                     newdata = dd,
                     seed = 101,
                     family = nbinom2,
                     newparams = par1)[[1]]
range(dd$y)
```
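Since the motivating use case is power analysis, a natural check (a sketch added here, not part of the machinery above) is to refit the generating model to the simulated data and see whether the parameters are roughly recovered; with only 10 subjects the estimates will be noisy, and convergence warnings would not be surprising:

```{r refit-check}
## refit the generating model to the simulated data; compare the fixed
## effects with beta_vec = c(1, 2, 0.1, 0.2) and the random-effect
## SDs/correlation with sdvec = c(1.5, 0.15), corval = -0.2
fit1 <- glmmTMB(y ~ trt*time + (1 + time | id),
                data = dd, family = nbinom2)
fixef(fit1)$cond
VarCorr(fit1)
```
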
For comparison, we'll do this by hand (with some help from `lme4` machinery). `lme4` parameterizes covariance matrices by the lower triangle of the Cholesky factor rather than by `glmmTMB`'s log-SD/scaled-correlation method ...

```{r sim-by-hand}
library(lme4)
set.seed(101)
X <- model.matrix(nobars(form1), data = dd)
## generate random-effect values
rt <- mkReTrms(findbars(form1), model.frame(subbars(form1), data = dd))
Z <- t(rt$Zt)
## construct the covariance matrix
Sigma <- diag(sdvec) %*% matrix(c(1, corval, corval, 1), 2) %*% diag(sdvec)
lmer_thetavec <- t(chol(Sigma))[c(1, 2, 4)]
## plug values into the Cholesky factor of the random-effect covariance matrix
rt$Lambdat@x <- lmer_thetavec[rt$Lind]
u <- rnorm(nrow(rt$Lambdat))
b <- t(rt$Lambdat) %*% u
eta <- drop(X %*% par1$beta + Z %*% b)
mu <- exp(eta)
y <- rnbinom(nrow(dd), size = 1, mu = mu)  ## size = exp(betadisp) = 1
range(y)  ## the range varies a lot
```

Alternatively, we could have generated the random effects with:

```{r mvrnorm}
b <- MASS::mvrnorm(1,
                   mu = rep(0, 2*n_id),
                   Sigma = Matrix::.bdiag(replicate(n_id, Sigma, simplify = FALSE)))
```

### Example 3: AR1 model

We will simulate a single time series with AR1 structure: a nugget (measurement error) variance $\sigma^2_n = 1$, an autoregressive variance $\sigma^2_a = 1$, and an autoregressive parameter $\phi = 0.7$. First by brute force, using the code from the "covariance structures" vignette:

```{r acf1}
set.seed(101)
n <- 1000  ## number of time points
## simulate the AR1 process using the MASS package;
## as.matrix(dist(1:n)) constructs a banded matrix with m_{ij} = abs(i-j)
x <- MASS::mvrnorm(mu = rep(0, n),
                   Sigma = 0.7^as.matrix(dist(1:n)))
y <- x + rnorm(n)  ## add measurement noise/nugget
dat0 <- data.frame(y,
                   times = factor(1:n, levels = 1:n),
                   group = factor(rep(1, n)))
```

Now using `simulate_new()` with matching parameters: `beta = 0` (the only fixed effect is the intercept, which we set to zero); `betadisp = 0` (the log-variance of the measurement error; note that the Gaussian family uses a log-variance rather than a log-SD parameterization, although in this case it doesn't make any difference ...); `theta[1] = 0` (the log-SD of the autoregressive variance); and `theta[2]` specifying a correlation parameter $\phi = 0.7$ (see the "covariance structures" vignette for details).

```{r sim_new_ar1}
phi_to_AR1 <- function(phi) phi/sqrt(1-phi^2)
s2 <- simulate_new(~ ar1(times + 0 | group),
                   newdata = dat0,
                   seed = 101,
                   newparams = list(beta = 0,
                                    betadisp = 0,
                                    theta = c(0, phi_to_AR1(0.7))))[[1]]
```

With a nugget variance $\sigma^2_n = 1$, an autoregressive variance $\sigma^2_a = 1$, and an autoregressive parameter $\phi = 0.7$, we expect the ACF at lag $d$ to be $\left(\sigma^2_a/(\sigma^2_a + \sigma^2_n)\right) \phi^d$.

```{r plot_acf}
a1 <- drop(acf(dat0$y, plot = FALSE)$acf)
a2 <- drop(acf(s2, plot = FALSE)$acf)
par(las = 1, bty = "l")
matplot(0:(length(a1)-1), cbind(a1, a2),
        pch = 1,
        ylab = "autocorrelation", xlab = "lag")
curve(0.7^x/2, add = TRUE, col = 4, lwd = 2)
```

The two realizations differ (because the multivariate normal deviates are generated in different ways), but their ACFs match.

## FIXME/TO DO

* more examples! especially examples that are more complex or unavailable in `lme4` (spatial, zero-inflated, etc.; see the zero-inflation sketch below for a starting point). If necessary, more details on parameterizations (shape/scale for spatial covariance structures, etc.)
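As a starting point for the zero-inflation item, here is a minimal sketch that reuses the `Owls` covariate structure from the first section. All parameter values here are illustrative guesses rather than estimates, and we assume that `simulate_new()` passes `ziformula` (like its other extra arguments) through to `glmmTMB()`. `betazi` is the zero-inflation intercept on the logit scale, so `betazi = qlogis(0.25)` requests roughly 25% structural zeros:

```{r zi-sketch}
## zero-inflated NB1 simulation with illustrative (made-up) parameters:
## two fixed effects (intercept, FoodTreatmentSatiated), one ZI intercept,
## an NB1 dispersion of 2, and an among-nest log-SD of log(0.5)
zi_sim <- simulate_new(~ FoodTreatment + (1 | Nest) + offset(log(BroodSize)),
                       ziformula = ~ 1,
                       newdata = Owls,
                       seed = 101,
                       family = nbinom1,
                       newparams = list(beta = c(1, -0.5),
                                        betazi = qlogis(0.25),
                                        betadisp = log(2),
                                        theta = log(0.5)))[[1]]
mean(zi_sim == 0)  ## proportion of zeros; at least ~0.25 from the ZI component
```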