---
title: "GenEst - 2. Tutorial with Solar Example"
author: "J Mintz, D Dalthorp, J Simonis"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{GenEst - 2. Tutorial with Solar Example}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---

```{r setup, include = FALSE}
require(rmarkdown)
require(knitr)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

rm(list = ls())

library(GenEst)
vers <- packageVersion("GenEst")
today <- Sys.Date()
set.seed(951)
```

### Introduction

In this vignette we walk through an example illustrating how **GenEst** command 
line utilities could be used to estimate mortality for different size birds at a
large field of solar photovoltaic collectors.  Our objective is to estimate 
overall mortality, as well as how mortality varies over time, whether it 
constant throughout the facility, and finally how different size classes of
birds are affected.  

The general steps in the analysis are:

1. Construct a model for Searcher Efficiency
2. Construct a model for Carcass Persistance
3. Estimate mortality
4. Specify the type of summary desired (for example, by season and species)

There are five files in total which make up the example dataset. For
convenience, these files can be accessed in R as a list:

```{r}
library(GenEst)
data(solar_PV)
names(solar_PV)
```
Alternatively, the files may be downloaded as .csv files from
`https://code.usgs.gov/ecosystems/GenEst/-/releases` in the "For more info"
section.

### Part 1: Searcher Efficiency Modeling

Searcher efficiency (SE) is modeled as a function of the number of times a 
carcass has been missed in previous searches and any number of covariates. 
The probability of finding a carcass that is present at the time of search is `p`
on the first search after carcass arrival and is assumed to decrease by a factor
of `k` each time the carcass is missed in searches.  (For further background on
field trials, and information about how to format the results for use with 
GenEst, see the User Guide, which is available at the `code.usgs.gov` page cited
above).

Results of the SE field trials used in this example are stored in the `data_SE`
data frame:

```{r pk data}
data_SE <- solar_PV$SE
head(data_SE)
```

GenEst provides tools to construct and compare specific individual models, to 
explore which subsets of variables are most useful, and to automatically 
construct entire sets of models.  To start we will fit a basic model in which 
the probability of detecting a carcass, `p`, and compounding difficulty to 
detect, `k`, depend only on their respective intercepts (and not other factors 
such as season or size).  The function `pkm` is used to create a searcher 
efficiency model, which is returned as a `pkm` object.

```{r pk one model}
SE_model <- pkm(p ~ 1, k ~ 1, data = data_SE)
SE_model

```
To explore whether use of covariate is warranted, `pkm` is used with the
`allCombos = TRUE`. The specified model will be fit as will models formed
using all combinations of predictors listed for the `p` and `k` parameters.
orginal model.  For example, `p ~ Season` can be simplified into `p ~ 1`, our
original model in which `p` is independent of season.  

```{r pk two models}
SE_model_set <- pkm(p~Season, k~1, data = data_SE, allCombos = TRUE)
class(SE_model_set)
length(SE_model_set)
names(SE_model_set)
class(SE_model_set[[1]])
```
The set of models is contained in a `pkmSet` object.  We could inspect the two 
models stored in the `pkmSet` individually, or for convenience we can view the 
AICc values simultaneously for all models using the \code{aicc} function.
Summary plots can be obtained by plotting any of the individual objects or the
set as well.

```{r pk set AICc}
aicc(SE_model_set)
```

Rather than one searcher efficiency model for all birds, it is often preferable
to fit a seperate model for each size class.  The `sizeCol` argument of the
`pkm` function is the name of the column in `data_SE` that gives the size class
for each carcass in the SE trials. If a `sizeCol` is provided, `pkm` returns a
list of separate pk models fit for each size class.

```{r pk size set}
SE_size_model <- pkm(p ~ Season,
                     k ~ 1,
                     sizeCol = "Size",
                     data = data_SE)
class(SE_size_model)
names(SE_size_model)  # A list is created with a model set per size class.
class(SE_size_model$small)
names(SE_size_model$small) # Each model set contains one model in this case.

```
To fit all combinations of models for each size class, use `pkm` with a `sizeCol`
parameter and with `allCombos = T`.

Once we have decided on which models to use for each size class, we store the
corresponding pkm objects in a list for future use.  In this case, we will 
choose the models with the lower AICc.
```{r}
SE_size_model_set <- pkm(p ~ Season,
                     k ~ 1,
                     sizeCol = "Size",
                     data = data_SE, allCombos = TRUE)
aicc(SE_size_model_set)
SE_models <- list()
```
Size small:
```{r}
SE_models$small <- SE_size_model_set$small[[2]]
```
Size Medium:
```{r pk size Medium}
SE_models$med <- SE_size_model_set$med[[2]]
```

Size Large:
```{r pk Size Large}
SE_models$lrg <- SE_size_model_set$lrg[[1]]
```


### Part 2: Carcass Persistence Modeling
A carcass persistence model estimates the amount of time a carcass would persist
for, given the conditions under which it arrived.  A number of carcasses have 
been placed in the field and periodically checked for scavanging.  Results of 
the CP field trials used in this example are stored in the `data_CP` data frame:

```{r cp data}
data_CP <- solar_PV$CP
head(data_CP)
```

\code{LastPresent} and \code{FirstAbsent} represent the left (start) and right
(end) endpoints of the interval over which a carcass went missing.  For further
information about CP trials and how to format results for use with GenEst, see
the User Guide (link found on help menu of the GUI, which can be accessed by
entering `runGenEst()` from the R console).

Four classes of parameteric models may be used for carcass persistance:
exponential, Weibull, logistic, and lognormal.  As with Searcher Efficiency we
can fit one specific model, test a set of covariates and choose our favorite 
single model, or fit seperate models dependent on size class.   First we will 
fit a single Weibull models for all birds.  Weibull distributions have two 
parameters, location and scale.  We will specify that the location depends on 
season by setting `l ~ season`, but scale only depends on the intercept using 
`s ~ 1`.

```{r cp}
cpm(l ~ Season, s ~ 1, data = data_CP,
                left = "LastPresent",
                right = "FirstAbsent",
                dist = "weibull")
```

Next, we try a CP model set considering whether the `season` covariate for
location is necessary, by comparing the `l ~ season, s ~ 1` to `l ~ 1, s ~ 1`.
```{r cp set}
  CP_weibull_set <- cpm(l ~ Season, s ~ 1, data = data_CP,
                  left = "LastPresent",
                  right = "FirstAbsent",
                  dist = "weibull", allCombos = TRUE)
class(CP_weibull_set)
aicc(CP_weibull_set)
```

Finally we will construct sets of CP models for each size class, however this 
time we will also consider models based on both exponential and weibull 
distributions.  To compare models for multiple distributions, set `dist` to a
vector of the distribution names to be considered.  With a `sizeCol` provided
and `allCombos = TRUE`, `cpm` returns a list of `cpmSet` objects, one for
each size class.
```{r cp Size Set}
CP_size_model_set <- cpm(formula_l = l ~ Season,
                           formula_s = s ~ 1, 
                           left = "LastPresent",
                           right = "FirstAbsent",
                           dist = c("exponential", "weibull"),
                           sizeCol = "Size",
                           data = data_CP, allCombos = TRUE)
class(CP_size_model_set)
names(CP_size_model_set)
class(CP_size_model_set$small)
length(CP_size_model_set$small)
names(CP_size_model_set$small)
```

We now have the flexibility to select models from different families for 
different size classes.  We will choose to use the models with lower AICc, which
requires storing the corresponding `cpm` objects in a list for later use.  


```{r}
aicc(CP_size_model_set)
CP_models <- list()
```
Size small:
```{r cp Size Small}
CP_models$small <- CP_size_model_set$small[[4]]
```

Size med:
```{r cp size Medium}
CP_models$med <- CP_size_model_set$med[[4]]
```

Size lrg:
```{r Size Large}
CP_models$lrg <- CP_size_model_set$lrg[[2]]
```


### Part 3: Mortality Estimation
Estimating mortality requires bringing together models, carcass observation 
data (CO), and information on how the data was gathered.  In particular the
search schedule (SS) and proportion of carcasses in searchable areas (the 
density weighted proportion, or DWP), are needed.  We will breifly inspect the 
files.  Further information on the formatting of the CO, SS, and DWP files can 
be found in the User Guide.

Carcass observations:
```{r Load CO SS and DWP}
data_CO <- solar_PV$CO
head(data_CO)
```
Search schedule:
```{r SS Data}
data_SS <- solar_PV$SS
data_SS[1:5 , 1:10]
```
(Note that there are 300 arrays columns altogether: Unit1, ..., Unit300)

Density weighted proportion:
```{r DWP data}
data_DWP <- solar_PV$DWP
head(data_DWP)
```

These elements combine in the function `estM`, producing an object containing 
simulated arrival, detection, and mortality distributions.  We also have the 
opportunity to provide the fraction of the facility being surveyed, `frac`, if 
it happens to be less than 100\%.  Increasing the number of simulations, `nsim`, 
will improve the accuracy of the estimates but comes at a cost of computer
runtime.

When estimating mortality, it is not currently possible to mix CP and SE models 
which differ in their dependence on size.  Either both models depend on size 
class, or both models must be independent of size class.  In this case we will 
choose here to use size dependence.

```{r Arrival Times, options}
  Mest <- estM(
    nsim = 100, frac = 1, 
    data_CO = data_CO, data_SS = data_SS, data_DWP = data_DWP,
    model_SE = SE_models, model_CP = CP_models,
    unitCol = "Unit", sizeCol = "Size",
    COdate = "DateFound", SSdate = "DateSearched"
  )
```

We are now able to get a confidence interval for estimated total mortality by
taking summary of the estM object.  Plotting it shows us us the estimated 
probability density for number of fatalities.


```{r, fig.show = "hold", fig.height = 4, fig.width = 6, fig.align = 'center'}
plot(Mest)
```
A point estimate for overall sitewide mortality is listed at the top of the 
plot, satisfying our first objective.  The period of inference only covers the 
period over which we have have fatality monitoring data, which in this case is 
from `r min(data_SS$DateSearched)` to `r max(data_SS$DateSearched)`.


### Part 4: Summaries

Having calculated the estimated arrival densities for each of the carcases,
we can now use them to create a variety of summaries.  Suppose that we are 
interested in how mortality changes with respect to three kinds of variables:

1. Temporally - by season, or finer resolutions
2. Spatially - summaries by search unit
3. Among Size classes, or other groups

To create summaries, we split the data by differnt covariates, using a function 
called `calcSplits`.  This requires the simulated mortality `$Mhat` and arrival 
times `$Aj` stored in the `estM` object, plus the search schedule and carcass 
observation data.

Splits to the search schedule (splits in time) are specified by assigning a 
covariate to `split_SS`.  These must be variables present in the Search Schedule
file.  To investigate differences in mortality between season, we will set 
`split_SS` to `Season`.

```{r Summary - Season}
unique(data_SS[, "Season"])

M_season <- calcSplits(M = Mest,
  split_SS = "Season", data_SS = data_SS,
  split_CO = NULL,  data_CO = data_CO
)
```

Splitting the estM creates a `splitFull` object, a plot of which shows boxplots
for each season.

```{r splitFull plot, fig.height = 4, fig.width = 4, fig.align = 'center'}
plot(M_season)
```

Taking a summary of the `splitFull` object gives us a confidence interval for
each level of the split covariate.  The size of the confidence interval can be
specified for both plots or summaries using the CL argument.

```{r SplitFull Summary}
summary(M_season, CL = 0.95)
```

To get a finer summary of mortality, we need to parse the search schedule, using
the function `prepSS`.  This allows us to specify the exact time intervals over
which we will split, in this case we will create a weekly summary.

```{r Summary - Weekly}

SSdat <- prepSS(data_SS) # Creates an object of type prepSS.
schedule <- seq(from = 0, to = max(SSdat$days), by = 7)
tail(schedule)
```

When we plot the splitFull object for a split with a custom schedule, we must 
specify that the rate is per split catagory by setting `rate = T`.
```{r Summary - Weekly Part 2, fig.height = 4, fig.width = 7, fig.align = 'center'}
M_week <- calcSplits(M = Mest,
  split_time = schedule,
  data_SS = SSdat,
  data_CO = data_CO
)
plot(x = M_week, rate = TRUE)

```

Next we will look at at splitting by covariates present in the Carcass 
Observation file.  We specify a CO split by assigning split_CO to the name (or 
names) of the variables we wish to split on.  Suppose we would like a summary of
estimated mortality by unit.  
```{r Summary - Unit, fig.height = 4, fig.width = 7, fig.align = 'center'}

M_unit <- calcSplits(M = Mest,
  split_CO = "Unit",
  data_CO = data_CO,
  data_SS = data_SS
)
plot(M_unit, rate = FALSE)
```
There are 300 units in this example, each one gets a boxplot when we plot the 
splitFull.   For those arrays which have at least one observation, we can create
a summary.  In this case we will only create a summary for arrays 8 and 100.

```{r individual unit summary}
dim(summary(M_unit))  # only 164 arrays had observations.

# A list of the arrays without observed carcasses:
setdiff(paste0("Unit", 1:300), data_CO$Unit)

# Create summaries for arrays Unit12 and Unit100.
whichRow <- rownames(summary(M_unit))  %in% c("Unit12", "Unit100")
summary(M_unit)[whichRow, ]

```

It is possible to create summaries that split on both Carcass Observation 
variables and Search Schedule variables.  To do so, include both a `split_SS`
and a `split_CO` argument.

```{r Summary - season and species, fig.height = 5, fig.width = 3, fig.align = 'center'}

M_unit_and_species <- calcSplits(M = Mest,
  split_SS = c("Season"),
  split_CO = c("Size"),
  data_CO = data_CO,
  data_SS = data_SS
)
plot(M_unit_and_species, rate = FALSE)

```

Two CO variables can be compared simultaneously by specifying an ordered pair of
covariates for `split_CO`, however currently there are a limited total number 
(two) of splits which can be allocated among temporal or carcass covariates.