---
title: "BranchGLM Vignette"
output: 
  rmarkdown::html_vignette:
    toc: TRUE
    number_sections: TRUE

vignette: >
  %\VignetteIndexEntry{BranchGLM Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Fitting GLMs

- `BranchGLM()` allows fitting of gaussian, binomial, gamma, and Poisson GLMs with a variety of links available.  
- Parallel computation can also be done to speed up the fitting process, but it is only useful for larger datasets.

## Optimization methods

- The optimization method can be specified, the default method is fisher scoring, but BFGS and L-BFGS are also available. 
- BFGS and L-BFGS typically perform better when there are many predictors in the model (at least 50 predictors), otherwise fisher scoring is typically faster.  
- The `grads` argument is for L-BFGS only and it is the number of gradients that are stored at a time and are used to approximate the inverse information. The default value for this is 10, but another common choice is 5.
- The `tol` argument controls how strict the convergence criteria are, lower values of this will lead to more accurate results, but may also be slower.
- The `method` argument is ignored for linear regression and the OLS solution is 
used.

## Initial values

- Initial values for the coefficient estimates may be specified via the `init` 
argument.
- If no initial values are specified, then the initial values are estimated 
via linear regression with the response variable transformed by the link function.

## Parallel computation

- Parallel computation can be employed via OpenMP by setting the parallel argument 
to `TRUE` and setting the `nthreads` argument to the desired number of threads used.
- For smaller datasets this can actually slow down the model fitting process, so 
parallel computation should only be used for larger datasets.

# Families 
## Gaussian

- Permissible links for the gaussian family are 
  - identity, which results in linear regression
  - inverse
  - log
  - square root (sqrt)
- The most commonly used link function for the gaussian family is the identity link.
- The dispersion parameter for this family is estimated by using the mean square 
error.

```{r}
# Loading in BranchGLM
library(BranchGLM)

# Fitting gaussian regression models for mtcars dataset
cars <- mtcars

## Identity link
BranchGLM(mpg ~ ., data = cars, family = "gaussian", link = "identity")

```

## Gamma

- Permissible links for the gamma family are 
  - identity
  - inverse, this is the canonical link for the gamma family
  - log
  - square root (sqrt)
- The most commonly used link functions for the gamma family are inverse and log.
- The dispersion parameter for this family is estimated via maximum likelihood,  
similar to the `MASS::gamma.dispersion()` function.

```{r}
# Fitting gamma regression models for mtcars dataset
## Inverse link
GammaFit <- BranchGLM(mpg ~ ., data = cars, family = "gamma", link = "inverse")
GammaFit

## Log link
GammaFit <- BranchGLM(mpg ~ ., data = cars, family = "gamma", link = "log")
GammaFit

```

## Poisson

- Permissible links for the Poisson family are 
  - identity
  - log, this is the canonical link for the Poisson family
  - square root (sqrt)
- The most commonly used link function for the Poisson family is the log link.
- The dispersion parameter for this family is always 1.

```{r}
# Fitting poisson regression models for warpbreaks dataset
warp <- warpbreaks

## Log link
BranchGLM(breaks ~ ., data = warp, family = "poisson", link = "log")

```

## Binomial

- Permissible links for the binomial family are 
  - cloglog
  - log
  - logit, this is the canonical link for the binomial family
  - probit
- The most commonly used link functions for the binomial family are logit and probit.
- The dispersion parameter for this family is always 1.

```{r}
# Fitting binomial regression models for toothgrowth dataset
Data <- ToothGrowth

## Logit link
BranchGLM(supp ~ ., data = Data, family = "binomial", link = "logit")

## Probit link
BranchGLM(supp ~ ., data = Data, family = "binomial", link = "probit")

```

### Functions for binomial GLMs

- **BranchGLM** has some utility functions for binomial GLMs
  - `Table()` creates a confusion matrix based on the predicted classes and observed classes
  - `ROC()` creates an ROC curve which can be plotted with `plot()`
  - `AUC()` and `Cindex()` calculate the area under the ROC curve
  - `MultipleROCCurves()` allows for the plotting of multiple ROC curves on the same plot

#### Table

```{r}
# Fitting logistic regression model for toothgrowth dataset
catFit <- BranchGLM(supp ~ ., data = Data, family = "binomial", link = "logit")

Table(catFit)

```

#### ROC

```{r}
# Creating ROC curve
catROC <- ROC(catFit)

plot(catROC, main = "ROC Curve", col = "indianred")

```

#### Cindex/AUC

```{r}
# Getting Cindex/AUC
Cindex(catFit)

AUC(catFit)

```

#### MultipleROCPlots

```{r, fig.width = 4, fig.height = 4}
# Showing ROC plots for logit, probit, and cloglog
probitFit <- BranchGLM(supp ~ . ,data = Data, family = "binomial", 
                       link = "probit")

cloglogFit <- BranchGLM(supp ~ . ,data = Data, family = "binomial", 
                       link = "cloglog")

MultipleROCCurves(catROC, ROC(probitFit), ROC(cloglogFit), 
                  names = c("Logistic ROC", "Probit ROC", "Cloglog ROC"))

```

#### Using predictions

- For each of the methods used in this section predicted probabilities and observed
classes can also be supplied instead of the `BranchGLM` object.

```{r}

preds <- predict(catFit)

Table(preds, Data$supp)

AUC(preds, Data$supp)

ROC(preds, Data$supp) |> plot(main = "ROC Curve", col = "deepskyblue")

```


# Useful functions

- **BranchGLM** has many utility functions for GLMs such as
  - `coef()` to extract the coefficients
  - `logLik()` to extract the log likelihood
  - `AIC()` to extract the AIC
  - `BIC()` to extract the BIC
  - `predict()` to obtain predictions from the fitted model
- The coefficients, standard errors, Wald test statistics, and p-values are stored in the `coefficients` slot of the fitted model

```{r}
# Predict method
predict(GammaFit)

# Accessing coefficients matrix
GammaFit$coefficients

```