---
title: "GLMMselect: Bayesian model selection for generalized linear mixed models"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
author: Shuangshuang Xu, Marco A. R. Ferreira, Erica M. Porter, and Christopher T. Franck
vignette: >
  %\VignetteIndexEntry{GLMMselect: Bayesian model selection for generalized linear mixed models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(GLMMselect)
```

# Introduction

The `GLMMselect` package provides a novel Bayesian model selection approach for the analysis of Poisson and binary data. `GLMMselect` contains functions to simultaneously select fixed effects and random effects for non-Gaussian data. In the `GLMMselect` package, we use pseudo likelihood method to solve the problem that the random effects cannot be analytically integrated out of GLMMs. In addition, we develop a fractional Bayes factor (FBF) approach to obtain posterior probabilities of the competing models. Full details on the methods implemented in `GLMMselect` can be found in the manuscript (Xu et al. 202X).

This vignette contains a estimated example for count data and a case study presented in the manuscript (Xu et al. 202X) to illustrate how `GLMMselect` works. For the simulated example, the count data are simulated from Poisson GLMM with four candidate fixed effects and two types of random effects (spatial random effects and overdispersion random effects).


# Function

The function `GLMMselect()` implemented in the `GLMMselect` package is described below:

* Function `GLMMselect()` performs the ARM and HCM (Xu et al. 202X) which performs model selection for generalized linear mixed models, given a numeric vector of binary or count data `Y`, a matrix of covariates `X`, a list of covariance matrices for random effects `Sigma`, a list of design matrices for random effects `Z`, the description of the error distribution `family`, and the prior for variance component of random effects `prior`. Arguments `pip_fixed` and `pip_random` are the thresholds that if the posterior inclusion probability of fixed effects or random effects is larger than `pip_fixed` or `pip_random`, the corresponding fixed effects and random effects will be included in the best model. In addition, the argument `NumofModel` can adjust the number of models with the largest posterior probabilities which are printed out in the output. The function `GLMMselect()` returns a list of the indices of fixed effects and random effects that are identified in the best model, a table of models' posterior probabilities (MPP) which are sorted in decreasing order, a table for fixed effects' inference, and a table for random effects' inference which includes point estimates and standard deviations for coefficients, and posterior inclusion probabilities (PIP) for each effect. 


# Model

The generalized linear mixed model used in the `GLMMselect` package is
$$ g(\pmb{Y})=X\pmb{\beta}+\sum_{q=1}^Q Z_q \pmb{\alpha}_q,$$
where

* $g()$ is the link function.
* $\pmb{Y}$ is an $n \times 1$ vector of observations, either binary data or count data.
* $X$ is the matrix of covariates.
* $\pmb{\beta}$ is the $p \times 1$ vector of regression coefficients.
* $Z_q$ is an incidence matrix relating the vector of random effects $\pmb{\alpha}_q$ to the observations.
* $\pmb{\alpha}_q$ is a vector of random effects with covariance matrix $\tau_q \Sigma_q$. $\Sigma_q$ is a known semi-positive definite matrix, and $\tau_q$ is an unknown scalar. The common types of $\pmb{\alpha}_q$ can be spatial random effects, longitudinal random effects and overdispersion random effects. 

Currently, the `GLMMselect` package can analyze binary responses (`family = "bernoulli"`) and Poisson responses (`family = "poisson"`). The manuscript that develops the methods in GLMMselect (Xu et al. 202X) provides details on the priors for $\pmb{\beta}$ and $\tau_q$, as well as details about the FBF approach used by GLMMselect. 

# Examples

The `GLMMselect()` function requires a vector of observations (either Bernoulli or assumed Poisson distributed), a matrix of covariates, a list of design matrices for random effects, and a list of covariance matrices for random effects. 

The vector of observations must be a numeric $n \times 1$ vector. In the GLMMselect package, there is an example simulated from the Poisson generalized linear mixed model with four candidate covariates, a vector of spatial random effects, and a vector of overdispersion random effects. Here are the first five elements of the Poisson simulated vector of observations:

```{r}
data("Y")
Y[1:5]
```

The covariate matrix passed to the function contains all candidate covariates. Each column corresponds to a candidate covariate. Each row corresponds to an observation. In this example, the covariate matrix contains four candidate covariates. The covariates in the first two columns are in the true model. Here are the first five rows of the covariate matrix:

```{r}
data("X")
X[1:5,]
```

The design matrices for candidate random effects are passed to the GLMMselect function as a list. In this example, this is a list of two design matrices for two types of random effects. The first component is for spatial random effects. The second component is for overdispersion random effects. Here are the first five rows and the first five columns for each of these design matrices:

```{r}
data("Z")
Z[[1]][1:5,1:5]
Z[[2]][1:5,1:5]
```

The covariance matrices for the candidate random effects are also passed to the GLMMselect function as a list. In this example, this is a list of two covariance matrices for two types of random effects. The first component is for spatial random effects. The second componet is for overdispersion random effects. Here are the first five rows and the first five columns for each of these covariance matrices:

```{r}
data("Sigma")
Sigma[[1]][1:5,1:5]
Sigma[[2]][1:5,1:5]
```

Data `Y`,`X`, `Sigma`, and `Z` above are attached in the `GLMMselect` package. 

# Example: Analysis of the simulated dataset
 
In this example, we use `GLMMselect` to analyze Poisson count data with an approximate reference (AR) prior for the variance components of random effects.

```{r}
Model_selection_output <- GLMMselect(Y=Y, X=X, Sigma=Sigma, Z=Z, 
                                     family="poisson", prior="AR", offset=NULL)
Model_selection_output$BestModel
Model_selection_output$PosteriorProb
Model_selection_output$FixedEffect
Model_selection_output$RandomEffect
```

`GLMMselect()` outputs the indices of the covariates and the indices of the random effects which are in the best model. The true model contains the first two covariates and spatial random effects. `GLMMselect` correctly identifies these two covariates and the spatial random effects. 


# Example: Analysis of Scotland lip cancer data

Here is a case study to help illustrate how to use `GLMMselect`. This dataset provides the number of male lip cancer cases in the 56 counties of Scotland during the period 1975-1980, as well as the percentage of the work force employed in agriculture, fishing, or forestry (AFF) as a covariate. The model we consider is
$$
y_i|\mu_i \stackrel{ind}{\sim} \mbox{Poisson}(\mu_i), \ \ \ i=1\dots 56, \\
\log(\mu_i) = \log(n_i)+\pmb{x}_i^T\pmb{\beta}+\alpha_{1i}+\alpha_{2i}, \\
\pmb{\alpha}_1 \sim N(\pmb{0},\tau_1 \Sigma_1), \\
\pmb{\alpha}_2 \sim N(\pmb{0}, \tau_2 \Sigma_2). 
$$

$n_i$ is the expected number of lip cancer cases in the $i^{th}$ county, which are assumed to be known constants. $\pmb{\alpha}_1$ is a vector of spatial random effects. $\pmb{\alpha}_2$ is a vector of overdispersion random effects.

In this example, we use `GLMMselect` to analyze lip cancer data with a half Cauchy (HC) prior for variance components of random effects. Data `lipcancer_Y`,`lipcancer_X`, `lipcancer_Sigma`, and `lipcancer_Z` are attached in the `GLMMselect` package. 

```{r}
lip_cancer_output <- GLMMselect(Y=lipcancer_Y, X=lipcancer_X, Sigma=lipcancer_Sigma, Z=lipcancer_Z,
                                family="poisson", prior="HC", offset=lipcancer_offset)
lip_cancer_output$BestModel
lip_cancer_output$PosteriorProb
lip_cancer_output$FixedEffect
lip_cancer_output$RandomEffect
```


# See also

`ref.ICAR` provides an objective Bayesian approach for modeling spatially correlated areal data using an intrinsic conditional autoregressive prior on a vector of spatial random effects. 

# Reference

Xu, Shuangshuang, Ferreira, M. A. R., Porter, E. M., and Franck, C. T. (202X).  Bayesian Model Selection for Generalized Linear Mixed Models, Biometrics, under review.