\documentclass[12pt]{article}
%% vignette index specifications need to be *after* \documentclass{}
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{basic examples of glmmTMB usage}
%\VignettePackage{glmmTMB}
%\VignetteDepends{ggplot2}
%\VignetteDepends{grid}
%\VignetteDepends{bbmle}
%\VignetteDepends{mlmRev}
%\usepackage{lineno}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage[american]{babel}
\newcommand{\R}{{\sf R}}
\newcommand{\fixme}[1]{\textbf{\color{red} fixme: #1}}
\newcommand{\notimpl}[1]{\emph{\color{magenta} #1}}
\usepackage{url}
\usepackage{hyperref}
\usepackage{alltt}
\newcommand{\code}[1]{{\tt #1}}
\usepackage{fancyvrb}
\usepackage{natbib}
\VerbatimFootnotes
\bibliographystyle{chicago}
%% need these for output of citation() below ...
\newcommand{\bold}[1]{\textbf{#1}}
%% this part is no longer needed, hard-coded citation now
%% from Sebastian Meyer: needs LaTeX >= 2020-10-01
%% (probably? don't need to worry about backward compatibility,
%%  people who need to build the vignette can update?)
%% \NewCommandCopy\Rhref\href

\title{Getting started with the \code{glmmTMB} package}
\author{Ben Bolker}
\date{\today}
\begin{document}
\maketitle

%\linenumbers

<<setopts,echo=FALSE,message=FALSE, eval = TRUE>>=
library("knitr")
opts_chunk$set(fig.width=5,fig.height=5,
               out.width="0.8\\textwidth",
               echo = TRUE, error = FALSE,
               eval = identical(Sys.getenv("NOT_CRAN"), "true"))
Rver <- paste(R.version$major,R.version$minor,sep=".")
used.pkgs <- c("glmmTMB","bbmle")  ## packages to report below
@

\section{Introduction/quick start}

\code{glmmTMB} is an R package built on
the \href{https://github.com/kaskr/adcomp}{Template Model Builder} automatic differentiation engine, for fitting generalized
linear mixed models and extensions.

(Not-yet-implemented features are denoted \notimpl{like this})

\begin{itemize}
  \item response distributions: Gaussian, binomial, beta-binomial, Poisson, negative binomial (NB1 and NB2 parameterizations), Conway-Maxwell-Poisson, generalized Poisson, Gamma, Beta, Tweedie; as well as zero-truncated Poisson, Conway-Maxwell-Poisson, generalized Poisson, and negative binomial; Student $t$. See \code{?family\_glmmTMB} for a current list including details of parameterizations.
  \item link functions: log, logit, probit, complementary log-log, inverse, identity
  \item zero-inflation with fixed and random-effects components; hurdle models via truncated Poisson/NB
  \item single or multiple (nested or crossed) random effects
  \item offsets
  \item fixed-effects models for dispersion
  \item diagonal, compound-symmetric, or unstructured random effects variance-covariance matrices; first-order autoregressive (AR1) variance structures
\end{itemize}

In order to use \code{glmmTMB} effectively you should already
be reasonably familiar with generalized linear mixed models
(GLMMs), which in turn requires familiarity with (i) generalized
linear models (e.g. the special cases of logistic, binomial,
and Poisson regression) and (ii) `modern' mixed models (those
working via maximization of the marginal likelihood rather
than by manipulating sums of squares).   \cite{bolker_generalized_2009} and \cite{bolker_glmm_2014} are reasonable starting
points in this area (especially geared to biologists and less-technical readers), as are \cite{zuur_mixed_2009}, \cite{millar_maximum_2011}, and \cite{zuur_beginners_2013}.

In order to fit a model in \code{glmmTMB} you need to:
\begin{itemize}
\item specify a model for the conditional effects, in the standard
  R (Wilkinson-Rogers) formula notation (see \code{?formula}
  or Section 11.1 of the \href{http://cran.r-project.org/doc/manuals/R-intro.pdf}{Introduction to R}.
  Formulae can also include \emph{offsets}.
\item specify a model for the random effects, in the notation
  that is common to the \code{nlme} and \code{lme4} packages.
  Random effects are specified as \code{x|g}, where \code{x}
  is an effect and \code{g} is a grouping factor (which must
  be a factor variable, or a nesting of/interaction among factor variables).
  For example, the formula would be \code{1|block} for a random-intercept model
  or \code{time|block} for a model with random variation in slopes
  through time across groups specified by \code{block}. A model
  of nested random effects (block within site) could be
  \code{1|site/block} if block labels are reused across multiple sites, or \code{(1|site)+ (1|block)} if the nesting structure is explicit in the data and each level of block only occurs within one site. A model of crossed random effects
  (block and year) would be \code{(1|block)+(1|year)}.
  \item choose the error distribution by specifying the family (\code{family} argument). In general, you can specify the function (\code{binomial}, \code{gaussian}, \code{poisson}, \code{Gamma} from base R, or one of the options listed at \code{family\_glmmTMB} [\code{nbinom2}, \code{beta\_family()}, \code{betabinomial}, \ldots])).
  \item choose the error distribution by specifying the family (\code{family} argument). You can choose among
    \begin{itemize}
    \item Distributions defined in base R and documented in \code{?family}, \emph{except} the inverse Gaussian and quasi- families
    \item Distributions defined in \code{?glmmTMB::family\_glmmTMB}
    \end{itemize}
\item optionally specify a zero-inflation model (via the \code{ziformula} argument) with fixed and/or random effects
\item optionally specify a dispersion model with fixed effects
\end{itemize}

This document was
generated using \Sexpr{Rver} and package versions:
<<pkgversions, echo=FALSE, eval=TRUE>>=
pkgver <- vapply(sort(used.pkgs),function(x) as.character(packageVersion(x)),"")
print(pkgver,quote=FALSE)
@

The current citation for \code{glmmTMB} is:
\begin{quote}
Brooks ME, Kristensen K, van
Benthem KJ, Magnusson A, Berg CW, Nielsen A, Skaug HJ, Maechler M, Bolker BM (2017).
``glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling.''
\emph{The R Journal}, \bold{9}(2), 378--400.
\href{https://doi.org/10.32614/RJ-2017-066}{doi:10.32614/RJ-2017-066}.
% was: echo=FALSE, results="asis"
<<citation,eval=FALSE,echo=FALSE>>=
print(citation("glmmTMB"),style="latex")
@
\end{quote}

\section{Preliminaries: packages and data}

Load required packages:
<<pkgs,message=FALSE, eval=TRUE>>=
library("glmmTMB")
library("bbmle")    ## for AICtab
library("ggplot2")
## cosmetic
theme_set(theme_bw()+
  theme(panel.spacing=grid::unit(0,"lines")))
@

The data, taken from \cite{zuur_mixed_2009} and ultimately
from \cite{roulinbersier_2007}, quantify
the number of negotiations among owlets (owl chicks)
in different nests \emph{prior} to the arrival
of a provisioning parent as a function of food treatment
(deprived or satiated), the sex of the parent, and
arrival time.  The total number of calls from the
nest is recorded, along with the total brood size, which
is used as an offset to allow the use of a Poisson response.

Since the same nests are measured repeatedly, the nest is used as
a random effect.
The model can be expressed as a zero-inflated generalized
linear mixed model (ZIGLMM).

Various small manipulations of the data set:
(1) reorder nests by mean negotiations per chick, for plotting
purposes; (2) add log brood size variable (for offset);
(3) rename response variable and abbreviate one of the input variables.
%% FIXME: I get a warning message ("NAs introduced by coercion")  here, but only in knitr,
%%  and not on a clean start ... ?
%% some weird package interaction ?
<<owltransform,warning=FALSE>>=
Owls <- transform(Owls,
                  Nest=reorder(Nest,NegPerChick),
                  NCalls=SiblingNegotiation,
                  FT=FoodTreatment)
@
(If you were really using this data set you should start
with \code{summary(Owls)} to explore the data set.)

% fig.cap="Basic view of owl data from \\cite{roulinbersier_2007}."
<<owlplot1,echo=FALSE,message=FALSE,warning=FALSE,eval=FALSE>>=
G0 <- ggplot(Owls,aes(x=reorder(Nest,NegPerChick),
                      y=NegPerChick))+
  labs(x="Nest",y="Negotiations per chick")+coord_flip()+
  facet_grid(FoodTreatment~SexParent)
G0+stat_sum(aes(size=..n..),alpha=0.5)+
      scale_size_continuous(name="# obs",
                            breaks=seq(1,9,by=2))+
    theme(axis.title.x=element_text(hjust=0.5,size=12),
         axis.text.y=element_text(size=7))
@

We should explore the data before we start to
build models, e.g. by plotting it in
various ways, but this vignette is about \code{glmmTMB},
not about data visualization \ldots

Now fit some models:

The basic \code{glmmTMB} fit --- a zero-inflated Poisson model with a single zero-inflation parameter applying to all observations (\verb+ziformula~1+).  (Excluding zero-inflation is \code{glmmTMB}'s default:
to exclude it explicitly, use \verb+ziformula~0+.)

<<glmmTMBfit>>=
fit_zipoisson <- glmmTMB(NCalls~(FT+ArrivalTime)*SexParent+
                                     offset(log(BroodSize))+(1|Nest),
                                     data=Owls,
                                     ziformula=~1,
                                     family=poisson)
@

<<zipoisssum>>=
summary(fit_zipoisson)
@

We can also try a standard zero-inflated negative binomial model;
the default is the ``NB2'' parameterization (variance = $\mu(1+\mu/k)$: \cite{hardin_generalized_2007}).
To use families (Poisson, binomial, Gaussian) that are defined
in \R, you should specify them as in \code{?glm} (as a string
referring to the family function, as the family function itself,
or as the result of a call to the family function: i.e.
\code{family="poisson"}, \code{family=poisson},
\code{family=poisson()}, and \code{family=poisson(link="log")}
are all allowed and all equivalent (the log link is the default
for the Poisson family).  Some of the additional families that
are \emph{not} defined in base R (at this point \code{nbinom2}
and \code{nbinom1}) can be specified using the same format.
Otherwise, for families that are implemented in \code{glmmTMB}
but for which \code{glmmTMB} does not provide a function,
you should specify the \code{family} argument
as a list containing (at least) the (character) elements \code{family}
and \code{link}, e.g. \code{family=list(family="nbinom2",link="log")}.
(In order to be able to retrieve Pearson (variance-scaled) residuals
from a fit, you also need to specify a \code{variance} component;
see \code{?family\_glmmTMB}.)

<<glmmTMBnbinomfit>>=
fit_zinbinom <- update(fit_zipoisson,family=nbinom2)
@

%% FIXME: caching may lead to
%% ## Error in ICtab(..., mnames = mnames, type = "AIC"): memory block of size 3.1 Gb
%% downstream, in AICtab() ...
%% for now I'm removing caching, but we should
%%   (1) document this as an issue/make a MWE
%%   (2) fix it
%%   (3) we could also cache the AICtab chunk as well ..

Alternatively, we can use an ``NB1'' fit (variance = $\phi \mu$).
<<glmmTMBnbinom1fit>>=
fit_zinbinom1 <- update(fit_zipoisson,family=nbinom1)
@

Relax the assumption that total number of calls is strictly proportional
to brood size (i.e. using log(brood size) as an offset):
<<glmmTMBnbinom1vfit>>=
fit_zinbinom1_bs <- update(fit_zinbinom1,
                           . ~ (FT+ArrivalTime)*SexParent+
                               BroodSize+(1|Nest))
@

Every change we have made so far improves the fit --- changing distributions
improves it enormously, while changing the role of brood size makes only
a modest (-1 AIC unit) difference:
<<aictab>>=
AICtab(fit_zipoisson,fit_zinbinom,fit_zinbinom1,fit_zinbinom1_bs)
@

\subsection{Hurdle models}

In contrast to zero-inflated models, hurdle models treat zero-count
and non-zero outcomes as two completely separate categories, rather than
treating the zero-count outcomes as a mixture of structural and
sampling zeros.

\code{glmmTMB} includes truncated
Poisson and negative binomial familes and hence can fit hurdle models.

<<glmmTMBnbinomhfit>>=
fit_hnbinom1 <-  update(fit_zinbinom1_bs,
                        ziformula=~.,
                        data=Owls,
                        family=truncated_nbinom1)
@

Then we can use \code{AICtab} to compare all the models.

<<hurdle_AIC>>=
AICtab(fit_zipoisson,fit_zinbinom,
       fit_zinbinom1,fit_zinbinom1_bs,
       fit_hnbinom1)
@

\section{Sample timings}

To get a rough idea of \code{glmmTMB}'s speed relative to
\code{lme4} (the most commonly used mixed-model package for R),
we try a few standard problems, enlarging the data sets by
cloning the original data set (making multiple copies and
sticking them together).

<<contraception_sum,echo=FALSE, eval=TRUE>>=
data("Contraception",package="mlmRev")
nc <- nrow(Contraception)
nl <- length(levels(Contraception$district))
load("contraceptionTimings.rda")
meandiff <- mean(with(tmatContraception,
                      time[pkg=="glmer"]/time[pkg=="glmmTMB"]))
@

Figure~\ref{fig:contraception} shows the results of
replicating the \code{Contraception} data set
(\Sexpr{nc} observations, \Sexpr{nl} levels in the
random effects grouping level) from 1 to 40 times.
\code{glmmADMB} is sufficiently slow ($\approx 1$ minute for
a single copy of the data) that we didn't try replicating very much.
On average, \code{glmmTMB} is about \Sexpr{round(meandiff,1)} times
faster than \code{glmer} for this problem.

<<contraception,echo=FALSE,warning=FALSE,fig.cap="Timing for fitting the replicated Contraception data set.">>=
## NaN from geom_smooth because glmmADMB only has two points/
## can't compute confidence intervals
## suppressWarnings() doesn't actually work within ggplot ...
## instead set/reset options("warn")
op <- options(warn = -1)
ggplot(tmatContraception,
       aes(n, time, colour=pkg)) + geom_point() +
    scale_y_log10(breaks=c(1,2,5,10,20,50,100)) +
    scale_x_log10(breaks=c(1,2,4,10,20,40)) +
    labs(x="Replication (x 1934 obs.)",y="Elapsed time (s)") +
    geom_smooth(method="lm", formula = y ~ x) +
    scale_colour_brewer(palette="Set1")
options(op)
@


<<insteval,echo=FALSE,warning=FALSE,fig.cap="Timing for fitting subsets of the InstEval data set.", eval = TRUE>>=
load("InstEvalTimings.rda")
n_InstEval <- 73421L  ## seems silly to require lme4 just to get this number
meandiff_inst2 <- with(tmatInstEval,
     time[pkg=="lmer"]/time[pkg=="glmmTMB"])
ggplot(tmatInstEval,aes(n,time,colour=pkg))+geom_point()+
  scale_y_log10(breaks=c(1,2,5,10,20,50,100,200))+
      scale_x_log10(breaks=c(0.1,0.2,0.5,1.0))+
  labs(x=sprintf("Replication (x %d obs.)",n_InstEval),
       y="Elapsed time (s)")+
  geom_smooth(method="lm", formula = y ~ x)+
  scale_colour_brewer(palette="Set1")
@

Figure~\ref{fig:insteval} shows equivalent timings for the
\code{InstEval} data set, although in this case since the
original data set is large (\Sexpr{n_InstEval} observations)
we subsample the data set rather than cloning it:
in this case, the advantage is reversed
and \code{lmer} is about \Sexpr{round(1/mean(meandiff_inst2,1))} times
faster.

In general, we expect \code{glmmTMB}'s advantages over \code{lme4} to
be (1) greater flexibility (zero-inflation etc.); (2) greater speed for
GLMMs, especially those with large number of ``top-level'' parameters
(fixed effects plus random-effects variance-covariance parameters).
In contrast, \code{lme4} should be faster for LMMs (for maximum
speed, you may want to check the \href{https://github.com/dmbates/MixedModels.jl}{MixedModels.jl} package for Julia); \code{lme4} is more mature and at present has a wider variety of diagnostic checks and methods for using model results, including downstream packages.

\bibliography{glmmTMB}
\end{document}