--- title: "Simulate from a fitted glmmTMB model or a formula" author: "Mollie Brooks and Ben Bolker" date: "`r format(Sys.Date(), '%d %b %Y')`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Simulate from a fitted glmmTMB model or a formula} %\VignettePackage{glmmTMB} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ## Simulating from a fitted model `glmmTMB` has the capability to simulate from a fitted model. These simulations resample random effects from their estimated distribution. In future versions of `glmmTMB`, it may be possible to condition on estimated random effects. ```{r setup, include=FALSE, message=FALSE} library(knitr) knitr::opts_chunk$set(echo = TRUE) ``` ```{r libs,message=FALSE} library(glmmTMB) library(ggplot2); theme_set(theme_bw()) ``` Fit a typical model: ```{r fit1} data(Owls) owls_nb1 <- glmmTMB(SiblingNegotiation ~ FoodTreatment*SexParent + (1|Nest)+offset(log(BroodSize)), family = nbinom1, ziformula = ~1, data=Owls) ``` Then we can simulate from the fitted model with the `simulate.glmmTMB` function. It produces a list of simulated observation vectors, each of which is the same size as the original vector of observations. The default is to only simulate one vector (`nsim=1`) but we still return a list for consistency. ```{r sim} simo=simulate(owls_nb1, seed=1) Simdat=Owls Simdat$SiblingNegotiation=simo[[1]] Simdat=transform(Simdat, NegPerChick = SiblingNegotiation/BroodSize, type="simulated") Owls$type = "observed" Dat=rbind(Owls, Simdat) ``` Then we can plot the simulated data against the observed data to check if they are similar. ```{r plots,fig.width=7} ggplot(Dat, aes(NegPerChick, colour=type))+geom_density()+facet_grid(FoodTreatment~SexParent) ``` ## Simulating from scratch (*de novo*) what if you want to simulate data with specified parameters in the absence of a data set, for example for a power analysis? `glmmTMB` has a `simulate_new` function that can handle this case; the hardest part is understanding the meaning of the parameter values, especially for random-effects covariances. ### example 1: linear regression For the first example we'll simulate something that looks like the classic "sleep study" data, using the `sleepstudy` data set for structure and covariates. The conditional-fixed effects parameters (`beta`) are standard regression parameters (intercept and slope): we use 250 and 10, which are close to the values from the actual data. The only other parameter, `betadisp`, is the log of the dispersion parameter, which in the specific case of the Gaussian (default) family is the log of the conditional (residual) *variance*; the standard deviation from a simple regression on these data[^1] is approximately 50, so we use `2*log(50)`. [^1]: I realize this violates the assumption of *de novo* simulation that we don't know what the real data looks like yet ... 
```{r sleepstudy}
data("sleepstudy", package = "lme4")
set.seed(101)
ss_sim <- transform(sleepstudy,
                    Reaction = simulate_new(~ Days,
                                            newdata = sleepstudy,
                                            family = gaussian,
                                            newparams = list(beta = c(250, 10),
                                                             betadisp = 2*log(50)))[[1]])
```

For comparison, we'll also fit the model and use the built-in simulation method:

```{r simlm}
ss_fit <- glmmTMB(Reaction ~ Days, sleepstudy)
ss_simlm <- transform(sleepstudy,
                      Reaction = simulate(ss_fit)[[1]])
```

Comparing against the real data set:

```{r ss_plot, fig.width=10}
library(ggplot2); theme_set(theme_bw())
ss_comb <- rbind(data.frame(sleepstudy, sample = "real"),
                 data.frame(ss_sim, sample = "simulated"),
                 data.frame(ss_simlm, sample = "simulated (from fit)"))
ggplot(ss_comb, aes(x = Days, y = Reaction, colour = Subject)) +
    geom_line() +
    facet_wrap(~ sample)
```

The simulated data have about the right variability but, in contrast to the real data, no among-subject variation.

### Example 2: random effects (including correlations)

The next example is more complex, getting into the nuts and bolts of how to translate random-effect covariances into the terms that `glmmTMB` expects. The hardest piece is probably translating correlation parameters. The "covariance structures" vignette has more details on how correlation matrices are parameterized, and the `put_cor()` function is a general translator from a specified correlation matrix (or its lower-triangular elements) to the appropriate set of `theta` parameters. For the specific case of 2×2 correlation matrices (i.e. with a single correlation parameter), a correlation $\rho$ corresponds to a `glmmTMB` parameter of $\rho/\sqrt{1-\rho^2}$. Here's a utility function:

```{r rho-to-theta}
rho_to_theta <- function(rho) rho/sqrt(1-rho^2)
## tests
stopifnot(all.equal(get_cor(rho_to_theta(-0.2)), -0.2))
## equivalent to the general function
stopifnot(all.equal(rho_to_theta(-0.2), put_cor(-0.2, input_val = "vec")))
```

Setting up metadata/covariates (tools in the `faux` package may also be useful for this):

```{r sim1}
n_id <- 10
dd <- expand.grid(trt = factor(c("A", "B")),
                  id = factor(1:n_id),
                  time = 1:6)
```

We'll set up some reasonable fixed effects. When in doubt about the order of fixed-effect parameters, use `model.matrix()` to check:

```{r form}
form1 <- ~ trt*time + (1 + time | id)
colnames(model.matrix(lme4::nobars(form1), data = dd))
```

```{r params2}
## intercept, trtB effect, slope, trt x slope interaction
beta_vec <- c(1, 2, 0.1, 0.2)
```

We'll set the standard deviations so that the average coefficient of variation is 1 (SD = mean for among-subject variation in intercept and slope). As described in the "covstruct" vignette, the parameter vector for a random-effect covariance consists of the log standard deviations followed by the scaled correlations. Finally, we'll set the dispersion parameter for the negative binomial conditional distribution to 1 (more detail on the `betadisp` parameterization for different families is given in `?sigma.glmmTMB`).

```{r params3}
sdvec <- c(1.5, 0.15)
corval <- -0.2
thetavec <- c(log(sdvec), rho_to_theta(corval))
par1 <- list(beta = beta_vec,
             betadisp = log(1),  ## log of the NB dispersion parameter
             theta = thetavec)
```

Now simulate:

```{r sim3}
dd$y <- simulate_new(form1,
                     newdata = dd,
                     seed = 101,
                     family = nbinom2,
                     newparams = par1)[[1]]
range(dd$y)
```
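Since the motivating use case is power analysis, a natural check (a sketch added here, not part of the machinery above) is to refit the generating model to the simulated data and see whether the parameters are roughly recovered; with only 10 subjects the estimates will be noisy, and convergence warnings would not be surprising:

```{r refit-check}
## refit the generating model to the simulated data; compare the fixed
## effects with beta_vec = c(1, 2, 0.1, 0.2) and the random-effect
## SDs/correlation with sdvec = c(1.5, 0.15), corval = -0.2
fit1 <- glmmTMB(y ~ trt*time + (1 + time | id),
                data = dd, family = nbinom2)
fixef(fit1)$cond
VarCorr(fit1)
```
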
For comparison, we'll do this by hand (with some help from `lme4` machinery). `lme4` parameterizes covariance matrices by the lower triangle of the Cholesky factor rather than by `glmmTMB`'s log-SD/scaled-correlation method ...

```{r sim-by-hand}
library(lme4)
set.seed(101)
X <- model.matrix(nobars(form1), data = dd)
## generate random-effect values
rt <- mkReTrms(findbars(form1), model.frame(subbars(form1), data = dd))
Z <- t(rt$Zt)
## construct the covariance matrix
Sigma <- diag(sdvec) %*% matrix(c(1, corval, corval, 1), 2) %*% diag(sdvec)
lmer_thetavec <- t(chol(Sigma))[c(1, 2, 4)]
## plug values into the Cholesky factor of the random-effect covariance matrix
rt$Lambdat@x <- lmer_thetavec[rt$Lind]
u <- rnorm(nrow(rt$Lambdat))
b <- t(rt$Lambdat) %*% u
eta <- drop(X %*% par1$beta + Z %*% b)
mu <- exp(eta)
y <- rnbinom(nrow(dd), size = 1, mu = mu)  ## size = exp(betadisp) = 1
range(y)  ## the range varies a lot
```

Alternatively, we could have generated the random effects with:

```{r mvrnorm}
b <- MASS::mvrnorm(1,
                   mu = rep(0, 2*n_id),
                   Sigma = Matrix::.bdiag(replicate(n_id, Sigma, simplify = FALSE)))
```

### Example 3: AR1 model

We will simulate a single time series with AR1 structure: a nugget (measurement error) variance $\sigma^2_n = 1$, an autoregressive variance $\sigma^2_a = 1$, and an autoregressive parameter $\phi = 0.7$. First by brute force, using the code from the "covariance structures" vignette:

```{r acf1}
set.seed(101)
n <- 1000  ## number of time points
## simulate the AR1 process using the MASS package;
## as.matrix(dist(1:n)) constructs a banded matrix with m_{ij} = abs(i-j)
x <- MASS::mvrnorm(mu = rep(0, n),
                   Sigma = 0.7^as.matrix(dist(1:n)))
y <- x + rnorm(n)  ## add measurement noise/nugget
dat0 <- data.frame(y,
                   times = factor(1:n, levels = 1:n),
                   group = factor(rep(1, n)))
```

Now using `simulate_new()` with matching parameters: `beta = 0` (the only fixed effect is the intercept, which we set to zero); `betadisp = 0` (the log-variance of the measurement error; note that the Gaussian family uses a log-variance rather than a log-SD parameterization, although in this case it doesn't make any difference ...); `theta[1] = 0` (the log-SD of the autoregressive variance); and `theta[2]` specifying a correlation parameter $\phi = 0.7$ (see the "covariance structures" vignette for details).

```{r sim_new_ar1}
phi_to_AR1 <- function(phi) phi/sqrt(1-phi^2)
s2 <- simulate_new(~ ar1(times + 0 | group),
                   newdata = dat0,
                   seed = 101,
                   newparams = list(beta = 0,
                                    betadisp = 0,
                                    theta = c(0, phi_to_AR1(0.7))))[[1]]
```

With a nugget variance $\sigma^2_n = 1$, an autoregressive variance $\sigma^2_a = 1$, and an autoregressive parameter $\phi = 0.7$, we expect the ACF at lag $d$ to be $\left(\sigma^2_a/(\sigma^2_a + \sigma^2_n)\right) \phi^d$.

```{r plot_acf}
a1 <- drop(acf(dat0$y, plot = FALSE)$acf)
a2 <- drop(acf(s2, plot = FALSE)$acf)
par(las = 1, bty = "l")
matplot(0:(length(a1)-1), cbind(a1, a2),
        pch = 1,
        ylab = "autocorrelation", xlab = "lag")
curve(0.7^x/2, add = TRUE, col = 4, lwd = 2)
```

The two realizations differ (because the multivariate normal deviates are generated in different ways), but their ACFs match.

## FIXME/TO DO

* more examples! especially examples that are more complex or unavailable in `lme4` (spatial, zero-inflated, etc.; see the zero-inflation sketch below for a starting point). If necessary, more details on parameterizations (shape/scale for spatial covariance structures, etc.)
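As a starting point for the zero-inflation item, here is a minimal sketch that reuses the `Owls` covariate structure from the first section. All parameter values here are illustrative guesses rather than estimates, and we assume that `simulate_new()` passes `ziformula` (like its other extra arguments) through to `glmmTMB()`. `betazi` is the zero-inflation intercept on the logit scale, so `betazi = qlogis(0.25)` requests roughly 25% structural zeros:

```{r zi-sketch}
## zero-inflated NB1 simulation with illustrative (made-up) parameters:
## two fixed effects (intercept, FoodTreatmentSatiated), one ZI intercept,
## an NB1 dispersion of 2, and an among-nest log-SD of log(0.5)
zi_sim <- simulate_new(~ FoodTreatment + (1 | Nest) + offset(log(BroodSize)),
                       ziformula = ~ 1,
                       newdata = Owls,
                       seed = 101,
                       family = nbinom1,
                       newparams = list(beta = c(1, -0.5),
                                        betazi = qlogis(0.25),
                                        betadisp = log(2),
                                        theta = log(0.5)))[[1]]
mean(zi_sim == 0)  ## proportion of zeros; at least ~0.25 from the ZI component
```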