Simulating datasets

Ken Kellner

July 7, 2024

1 Introduction

Simulating datasets is a powerful and varied tool when conducting unmarked analyses. Writing our own code to simulate a dataset based on a given model is an excellent learning tool, and can help us test if a given model is generating the expected results. If we simulate a series of datasets based on a fitted model, and calculate a statistic from each of those fits, we can generate a distribution of the statistic - this is what the parboot function does. This can be helpful, for example, when testing goodness of fit. Finally, simulation can be a useful component of power analysis when a closed-form equation for power is not available.

unmarked provides two different ways of generating simulated datasets, depending on the stage we are at in the modeling process.

  1. Generating simulated datasets from a fitted model we already have
  2. Generating simulated datasets from scratch

For (1), we simply call the simulate method on our fitted model object and new dataset(s) are generated. This is the approach used by parboot. In this vignette we will focus on (2), a more flexible approach to simulation, also using the simulate method, that allows us to generate a dataset corresponding to any unmarked model from scratch.

2 Components of a call to simulate

We will need to provide several pieces of information to simulate in order to simulate a dataset from scratch in unmarked.

  1. An unmarkedFrame of the appropriate type defining the desired experimental design
  2. The model to use (model), if the unmarkedFrame is used for multiple model types
  3. Other arguments required by the fitting function for which we are simulating. Most importantly this will include formulas for each submodel.
  4. A named list of parameter or coefficient values (coefs) controlling the simulation, which correspond to the formula(s) specified earlier.

3 Simulating an occupancy dataset

The easiest way to demonstrate how to use simulate is to look at an example: we’ll start with a simple one for occupancy.

3.1 1. The unmarkedFrame

Suppose we want to simulate a single-season occupancy dataset in which site occupancy is affected by elevation. The first step is to create an unmarkedFrame object of the appropriate type, which defines the experimental design and includes any covariates we want to use in the simulation. Since we want to simulate an occupancy dataset, we’ll create an unmarkedFrameOccu.

The unmarkedFrameOccu function takes three arguments: the observation matrix y, the site covariates siteCovs, and the observation-level covariates obsCovs. The dimensions of y define how many sites and replicate samples the study includes. We’ll create a blank y matrix (i.e., filled with NAs) of dimension 300 x 8, indicating we want our study to have 300 sites and 8 sampling occasions. The values you put in this y matrix don’t matter, you can put anything in there you want as they’ll be overwritten with the simulated values later. It’s only used to define the number of sites and occasions.

library(unmarked)
set.seed(123)
M <- 300
J <- 8
y <- matrix(NA, M, J)

Earlier we said we want to include an elevation covariate, so we’ll simulate the covariate now and add it to a data frame. We could create several covariates here, including factors, etc.

site_covs <- data.frame(elev = rnorm(M))

We’re not using any observation covariates, so we can now make the complete unmarkedFrameOccu:

umf <- unmarkedFrameOccu(y = y, siteCovs = site_covs)
head(umf)
## Data frame representation of unmarkedFrame object.
##    y.1 y.2 y.3 y.4 y.5 y.6 y.7 y.8        elev
## 1   NA  NA  NA  NA  NA  NA  NA  NA -0.56047565
## 2   NA  NA  NA  NA  NA  NA  NA  NA -0.23017749
## 3   NA  NA  NA  NA  NA  NA  NA  NA  1.55870831
## 4   NA  NA  NA  NA  NA  NA  NA  NA  0.07050839
## 5   NA  NA  NA  NA  NA  NA  NA  NA  0.12928774
## 6   NA  NA  NA  NA  NA  NA  NA  NA  1.71506499
## 7   NA  NA  NA  NA  NA  NA  NA  NA  0.46091621
## 8   NA  NA  NA  NA  NA  NA  NA  NA -1.26506123
## 9   NA  NA  NA  NA  NA  NA  NA  NA -0.68685285
## 10  NA  NA  NA  NA  NA  NA  NA  NA -0.44566197

3.2 2. Specify the model type

Since unmarkedFrameOccu is used by both the single-season occupancy model (occu) and the Royle-Nichols occupancy model (occuRN), we need to tell unmarked which one to use.

model <- occu

Most unmarkedFrame types in unmarked are used by only one model fitting function, so this step is often unnecessary.

3.3 3. Specify other arguments to the fitting function

Take a look at the help file for occu. When fitting a single-season occupancy model we need to provide, in addition to the data, the formula argument defining the model structure. We’ll need to provide these same argument(s) to simulate. Many fitting functions will have multiple required arguments, such as the mixture distribution to use, key functions, etc.

Here we specify a double right-hand-side formula as required by occu, specifying an effect of elevation on occupancy.

form <- ~1~elev

3.4 4. Specify the corresponding parameter values

The model structure, as defined by the formula above, implies a certain set of parameter/coefficient values (intercepts, slopes) we need to supply to simulate. These need to be supplied as a named list, where each list element corresponds to one submodel (such as state for occupancy and det for detection). Each list element is a numeric vector of the required parameter values. It can be tricky to figure out the structure of this list, so simulate allows you to not include it at first, in which case the function will return a template for you to fill in.

simulate(umf, model = model, formula = form)
## coefs should be a named list of vectors, with the following structure
## (replace 0s with your values):
##
## $state
## (Intercept)        elev
##           0           0
##
## $det
## (Intercept)
##           0
## Error: Specify coefs argument as shown above

We need to supply a list with two elements state and det. The state element contains two values, the intercept and the slope corresponding to elevation. The det element contains only the intercept since we have no covariates on detection. Note that all values supplied in this list must be on the inverse link scale, which will depend on the specific submodel used. So for example, a value of 0 for det implies a detection probability of 0.5, because we’re using the logit link function.

plogis(0)
## [1] 0.5

Now let’s make our own coefs list:

cf <- list(state = c(0, -0.4), det = 0)

Here we’re setting a negative effect of elevation on occupancy.

3.5 Run simulate

We now have all the pieces to simulate a dataset.

out <- simulate(umf, model = occu, formula = ~1~elev, coefs = cf)
## Assumed parameter order for state:
## (Intercept), elev
## Assumed parameter order for det:
## (Intercept)

The result is always a list of unmarkedFrames. By default, we just get one, but we can get more with the nsim argument.

head(out[[1]])
## Data frame representation of unmarkedFrame object.
##    y.1 y.2 y.3 y.4 y.5 y.6 y.7 y.8        elev
## 1    1   1   1   1   1   1   1   1 -0.56047565
## 2    0   0   0   0   0   0   0   0 -0.23017749
## 3    0   0   0   0   0   0   0   0  1.55870831
## 4    0   0   0   0   0   0   0   0  0.07050839
## 5    0   0   0   0   0   0   0   0  0.12928774
## 6    1   0   0   0   0   0   0   0  1.71506499
## 7    0   0   0   0   0   0   0   0  0.46091621
## 8    0   0   0   0   0   0   0   0 -1.26506123
## 9    1   1   1   0   1   0   1   0 -0.68685285
## 10   1   0   0   1   1   0   0   0 -0.44566197

The simulated unmarkedFrame now contains y values and is ready to use.

3.6 Fit a model to the simulated dataset

As a quick check, let’s fit a model to our simulated dataset.

occu(~1~elev, data = out[[1]])
## 
## Call:
## occu(formula = ~1 ~ elev, data = out[[1]])
## 
## Occupancy:
##             Estimate    SE     z  P(>|z|)
## (Intercept)   -0.146 0.120 -1.22 0.223241
## elev          -0.572 0.136 -4.20 0.000027
## 
## Detection:
##  Estimate     SE      z P(>|z|)
##    -0.038 0.0612 -0.621   0.535
## 
## AIC: 1929.538

We get out roughly the same parameters that we put in, as expected.

4 Simulating a more complex dataset: gdistremoval

The gdistremoval function fits the model of Amundson et al. (2014), which estimates abundance using a combination of distance sampling and removal sampling data. When simulating a dataset based on this model, we have to provide several additional pieces of information related to the structure of the distance and removal sampling analyses.

4.1 1. The unmarkedFrame

First create the appropriate type of unmarkedFrame, which is unmarkedFrameGDR. There’s two y-matrices: one for distance sampling and one for removal sampling. We’ll create a dataset with 4 distance bins and 5 removal periods.

set.seed(123)
M <- 100
Jdist <- 4
Jrem <- 5

y_dist <- matrix(NA, M, Jdist)
y_rem <- matrix(NA, M, Jrem)

We’ll create an elevation site covariate and a wind observation covariate. Observation-level covariates are only used by the removal part of the model, so they should have the same number of values as y_rem.

site_covs <- data.frame(elev = rnorm(M))
obs_covs <- data.frame(wind = rnorm(M * Jrem))

Finally we can create the unmarkedFrameGDR. We’ll also need to specify the distance bins and the units for the distance part of the model here. See ?unmarkedFrameGDR for more information.

umf <- unmarkedFrameGDR(yRem = y_rem, yDist = y_dist, siteCovs = site_covs, obsCovs = obs_covs,
                        dist.breaks = c(0,25,50,75,100), unitsIn = 'm')
head(umf)
## Data frame representation of unmarkedFrame object.
##    yDist.1 yDist.2 yDist.3 yDist.4 yRem.1 yRem.2 yRem.3 yRem.4 yRem.5
## 1       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 2       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 3       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 4       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 5       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 6       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 7       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 8       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 9       NA      NA      NA      NA     NA     NA     NA     NA     NA
## 10      NA      NA      NA      NA     NA     NA     NA     NA     NA
##           elev      wind.1     wind.2      wind.3      wind.4      wind.5
## 1  -0.56047565 -0.71040656  0.2568837 -0.24669188 -0.34754260 -0.95161857
## 2  -0.23017749 -0.04502772 -0.7849045 -1.66794194 -0.38022652  0.91899661
## 3   1.55870831 -0.57534696  0.6079643 -1.61788271 -0.05556197  0.51940720
## 4   0.07050839  0.30115336  0.1056762 -0.64070601 -0.84970435 -1.02412879
## 5   0.12928774  0.11764660 -0.9474746 -0.49055744 -0.25609219  1.84386201
## 6   1.71506499 -0.65194990  0.2353866  0.07796085 -0.96185663 -0.07130809
## 7   0.46091621  1.44455086  0.4515041  0.04123292 -0.42249683 -2.05324722
## 8  -1.26506123  1.13133721 -1.4606401  0.73994751  1.90910357 -1.44389316
## 9  -0.68685285  0.70178434 -0.2621975 -1.57214416 -1.51466765 -1.60153617
## 10 -0.44566197 -0.53090652 -1.4617556  0.68791677  2.10010894 -1.28703048

4.2 2. Arguments sent to gdistremoval

Looking at ?gdistremoval, required arguments include lambdaformula, removalformula, and distanceformula. We need to set these formula values to control the simulation. We’ll also use the negative binomial distribution for abundance.

lambdaformula <- ~elev # elevation effect on abundance
removalformula <- ~wind # wind effect on removal p
distanceformula <- ~1
mixture <- "NB"

4.3 3. Coefficient values

As in the previous section, we’ll leave the coefs argument blank at first and get the correct output structure.

simulate(umf, lambdaformula=~elev, removalformula=~wind, distanceformula=~1,
         mixture="NB")
## coefs should be a named list of vectors, with the following structure
## (replace 0s with your values):
##
## $lambda
## (Intercept)        elev
##           0           0
##
## $alpha
## alpha
##     0
##
## $dist
## (Intercept)
##           0
##
## $rem
## (Intercept)        wind
##           0           0
## Error: Specify coefs argument as shown above

We need to set two values for the abundance (lambda) model on the log scale, one for dist which represents the distance function sigma parameter (log scale), one for the negative binomial dispersion parameter alpha (log scale), and two for the removal detection probability model (logit scale).

We’ll pick the (relatively arbitrary) values below:

cf <- list(lambda = c(log(5), 0.7),
           dist = log(50),
           alpha = 0.1,
           rem = c(-1, -0.3))

4.4 4. Run simulation

Now provide everything to simulate. Note we don’t need to provide the model argument because unmarkedFrameGDR is used for only one fitting function (gdistremoval).

We’ll simulate 2 datasets.

out <- simulate(umf, lambdaformula=~elev, removalformula=~wind, distanceformula=~1,
                coefs=cf, mixture="NB", nsim=2)
## Assumed parameter order for lambda:
## (Intercept), elev
## Assumed parameter order for alpha:
## alpha
## Assumed parameter order for dist:
## (Intercept)
## Assumed parameter order for rem:
## (Intercept), wind
lapply(out, head)
## [[1]]
## Data frame representation of unmarkedFrame object.
##    yDist.1 yDist.2 yDist.3 yDist.4 yRem.1 yRem.2 yRem.3 yRem.4 yRem.5
## 1        0       1       0       0      1      0      0      0      0
## 2        1       0       1       2      1      1      1      0      1
## 3        4       6       7       3      7      3      5      2      3
## 4        1       0       1       0      0      0      0      1      1
## 5        0       0       1       0      1      0      0      0      0
## 6        0       0       0       0      0      0      0      0      0
## 7        3       3       0       1      1      0      1      3      2
## 8        0       0       0       0      0      0      0      0      0
## 9        0       0       0       0      0      0      0      0      0
## 10       0       0       0       0      0      0      0      0      0
##           elev      wind.1     wind.2      wind.3      wind.4      wind.5
## 1  -0.56047565 -0.71040656  0.2568837 -0.24669188 -0.34754260 -0.95161857
## 2  -0.23017749 -0.04502772 -0.7849045 -1.66794194 -0.38022652  0.91899661
## 3   1.55870831 -0.57534696  0.6079643 -1.61788271 -0.05556197  0.51940720
## 4   0.07050839  0.30115336  0.1056762 -0.64070601 -0.84970435 -1.02412879
## 5   0.12928774  0.11764660 -0.9474746 -0.49055744 -0.25609219  1.84386201
## 6   1.71506499 -0.65194990  0.2353866  0.07796085 -0.96185663 -0.07130809
## 7   0.46091621  1.44455086  0.4515041  0.04123292 -0.42249683 -2.05324722
## 8  -1.26506123  1.13133721 -1.4606401  0.73994751  1.90910357 -1.44389316
## 9  -0.68685285  0.70178434 -0.2621975 -1.57214416 -1.51466765 -1.60153617
## 10 -0.44566197 -0.53090652 -1.4617556  0.68791677  2.10010894 -1.28703048
## 
## [[2]]
## Data frame representation of unmarkedFrame object.
##    yDist.1 yDist.2 yDist.3 yDist.4 yRem.1 yRem.2 yRem.3 yRem.4 yRem.5
## 1        0       1       0       0      1      0      0      0      0
## 2        0       1       0       0      0      1      0      0      0
## 3        2       1       2       1      4      2      0      0      0
## 4        0       0       0       0      0      0      0      0      0
## 5        1       1       0       1      1      1      0      1      0
## 6        0       0       0       0      0      0      0      0      0
## 7        1       1       1       0      1      0      2      0      0
## 8        0       0       0       0      0      0      0      0      0
## 9        0       0       0       1      1      0      0      0      0
## 10       0       0       0       0      0      0      0      0      0
##           elev      wind.1     wind.2      wind.3      wind.4      wind.5
## 1  -0.56047565 -0.71040656  0.2568837 -0.24669188 -0.34754260 -0.95161857
## 2  -0.23017749 -0.04502772 -0.7849045 -1.66794194 -0.38022652  0.91899661
## 3   1.55870831 -0.57534696  0.6079643 -1.61788271 -0.05556197  0.51940720
## 4   0.07050839  0.30115336  0.1056762 -0.64070601 -0.84970435 -1.02412879
## 5   0.12928774  0.11764660 -0.9474746 -0.49055744 -0.25609219  1.84386201
## 6   1.71506499 -0.65194990  0.2353866  0.07796085 -0.96185663 -0.07130809
## 7   0.46091621  1.44455086  0.4515041  0.04123292 -0.42249683 -2.05324722
## 8  -1.26506123  1.13133721 -1.4606401  0.73994751  1.90910357 -1.44389316
## 9  -0.68685285  0.70178434 -0.2621975 -1.57214416 -1.51466765 -1.60153617
## 10 -0.44566197 -0.53090652 -1.4617556  0.68791677  2.10010894 -1.28703048

4.5 Fit model to simulated dataset

As a check, we’ll fit the same model used for simulation to one of the datasets.

gdistremoval(lambdaformula=~elev, removalformula=~wind, distanceformula=~1, data=out[[1]],
             mixture="NB")
## 
## Call:
## gdistremoval(lambdaformula = ~elev, removalformula = ~wind, distanceformula = ~1, 
##     data = out[[1]], mixture = "NB")
## 
## Abundance:
##             Estimate    SE     z  P(>|z|)
## (Intercept)    1.651 0.152 10.89 1.28e-27
## elev           0.887 0.142  6.25 4.16e-10
## 
## Dispersion:
##  Estimate    SE     z P(>|z|)
##     0.118 0.242 0.486   0.627
## 
## Distance:
##  Estimate    SE    z P(>|z|)
##      3.89 0.053 73.4       0
## 
## Removal:
##             Estimate    SE     z  P(>|z|)
## (Intercept)   -0.901 0.141 -6.38 1.74e-10
## wind          -0.373 0.088 -4.24 2.20e-05
## 
## AIC: 1162.882

Looks good.

5 Conclusion

The simulate function provides a flexible tool for simulating data from any model in unmarked. These datasets can be used for a variety of purposes, such as for teaching examples, testing models, or developing new tools that work with unmarked. Additionally, simulating datasets is a key component of the power analysis workflow in unmarked - see the power analysis vignette for more examples.

References

Amundson, C. L., J. A. Royle, and C. M. Handel. 2014. A hierarchical model combining distance sampling and time removal to estimate detection probability during avian point counts. The Auk 131:476–494.