---
title: "BranchGLM Vignette"
output: 
  rmarkdown::html_vignette:
    toc: TRUE
    number_sections: TRUE
vignette: >
  %\VignetteIndexEntry{BranchGLM Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Fitting GLMs

- `BranchGLM()` fits gaussian, binomial, gamma, and Poisson GLMs with a variety of links.
- Parallel computation can speed up fitting, but it is only useful for larger datasets.

## Optimization methods

- The optimization method can be specified. The default is Fisher scoring, but BFGS and L-BFGS are also available.
- BFGS and L-BFGS typically perform better when the model has many predictors (at least 50); otherwise Fisher scoring is typically faster.
- The `grads` argument applies to L-BFGS only. It is the number of gradients stored at a time and used to approximate the inverse information. The default value is 10, but another common choice is 5.
- The `tol` argument controls how strict the convergence criteria are. Lower values lead to more accurate results, but fitting may also be slower.
- The `method` argument is ignored for linear regression, where the OLS solution is used.

## Initial values

- Initial values for the coefficient estimates may be specified via the `init` argument.
- If no initial values are specified, they are estimated via linear regression with the response variable transformed by the link function.

## Parallel computation

- Parallel computation can be employed via OpenMP by setting the `parallel` argument to `TRUE` and setting the `nthreads` argument to the desired number of threads.
- For smaller datasets this can actually slow down model fitting, so parallel computation should only be used for larger datasets.

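
The fitting arguments described above can be combined as in the following sketch. It assumes that `method` accepts the strings `"Fisher"`, `"BFGS"`, and `"LBFGS"`, and that `init` takes one value per coefficient, including the intercept; check `?BranchGLM` for the installed version before relying on either. The hand-made starting values only imitate the default described above by regressing the log-transformed response on the predictors.

```{r}
library(BranchGLM)

# Default optimizer (Fisher scoring) with a stricter convergence tolerance
fisherFit <- BranchGLM(mpg ~ ., data = mtcars, family = "gamma", link = "log",
                       method = "Fisher", tol = 1e-8)

# L-BFGS storing 5 gradients (method string is an assumption, see above)
lbfgsFit <- BranchGLM(mpg ~ ., data = mtcars, family = "gamma", link = "log",
                      method = "LBFGS", grads = 5)

# Hand-made starting values: linear regression on the link-transformed response
initVals <- unname(coef(lm(log(mpg) ~ ., data = mtcars)))
initFit <- BranchGLM(mpg ~ ., data = mtcars, family = "gamma", link = "log",
                     init = initVals)

# Parallel fitting via OpenMP; unlikely to help on a dataset this small
parFit <- BranchGLM(mpg ~ ., data = mtcars, family = "gamma", link = "log",
                    parallel = TRUE, nthreads = 2)

# The fits should agree up to the convergence tolerance
cbind(Fisher = coef(fisherFit), LBFGS = coef(lbfgsFit), Init = coef(initFit))
```

On a dataset as small as `mtcars` the choice of optimizer should make little practical difference; the sketch is only meant to show where each argument goes.
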
# Families

## Gaussian

- Permissible links for the gaussian family are
  - identity, which results in linear regression
  - inverse
  - log
  - square root (sqrt)
- The most commonly used link function for the gaussian family is the identity link.
- The dispersion parameter for this family is estimated by using the mean square error.

```{r}
# Loading in BranchGLM
library(BranchGLM)

# Fitting gaussian regression models for mtcars dataset
cars <- mtcars

## Identity link
BranchGLM(mpg ~ ., data = cars, family = "gaussian", link = "identity")
```

## Gamma

- Permissible links for the gamma family are
  - identity
  - inverse, this is the canonical link for the gamma family
  - log
  - square root (sqrt)
- The most commonly used link functions for the gamma family are inverse and log.
- The dispersion parameter for this family is estimated via maximum likelihood, similar to the `MASS::gamma.dispersion()` function.

```{r}
# Fitting gamma regression models for mtcars dataset
## Inverse link
GammaFit <- BranchGLM(mpg ~ ., data = cars, family = "gamma", link = "inverse")
GammaFit

## Log link
GammaFit <- BranchGLM(mpg ~ ., data = cars, family = "gamma", link = "log")
GammaFit
```

## Poisson

- Permissible links for the Poisson family are
  - identity
  - log, this is the canonical link for the Poisson family
  - square root (sqrt)
- The most commonly used link function for the Poisson family is the log link.
- The dispersion parameter for this family is always 1.

```{r}
# Fitting poisson regression models for warpbreaks dataset
warp <- warpbreaks

## Log link
BranchGLM(breaks ~ ., data = warp, family = "poisson", link = "log")
```

## Binomial

- Permissible links for the binomial family are
  - cloglog
  - log
  - logit, this is the canonical link for the binomial family
  - probit
- The most commonly used link functions for the binomial family are logit and probit.
- The dispersion parameter for this family is always 1.

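
The link names listed for each family follow the usual GLM definitions. As a package-independent illustration, base R's `stats::make.link()` provides the same link functions and their inverses. This sketch is not BranchGLM's internal implementation; it only shows how a mean `mu` maps to the linear predictor `eta` and back.

```{r}
# Link names mentioned in this section; all are known to stats::make.link()
links <- c("identity", "inverse", "log", "sqrt", "logit", "probit", "cloglog")
mu <- 0.25  # a mean that is valid for every link above

# Linear predictor eta = g(mu) for each link
eta <- sapply(links, function(l) make.link(l)$linkfun(mu))

# Mapping eta back through the inverse link should recover mu
muBack <- mapply(function(l, e) make.link(l)$linkinv(e), links, eta)

round(cbind(eta = eta, muBack = muBack), 4)
```
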
```{r}
# Fitting binomial regression models for ToothGrowth dataset
Data <- ToothGrowth

## Logit link
BranchGLM(supp ~ ., data = Data, family = "binomial", link = "logit")

## Probit link
BranchGLM(supp ~ ., data = Data, family = "binomial", link = "probit")
```

### Functions for binomial GLMs

- **BranchGLM** has some utility functions for binomial GLMs
  - `Table()` creates a confusion matrix based on the predicted classes and observed classes
  - `ROC()` creates an ROC curve which can be plotted with `plot()`
  - `AUC()` and `Cindex()` calculate the area under the ROC curve
  - `MultipleROCCurves()` allows for the plotting of multiple ROC curves on the same plot

#### Table

```{r}
# Fitting logistic regression model for ToothGrowth dataset
catFit <- BranchGLM(supp ~ ., data = Data, family = "binomial", link = "logit")

Table(catFit)
```

#### ROC

```{r}
# Creating ROC curve
catROC <- ROC(catFit)

plot(catROC, main = "ROC Curve", col = "indianred")
```

#### Cindex/AUC

```{r}
# Getting Cindex/AUC
Cindex(catFit)

AUC(catFit)
```

#### MultipleROCCurves

```{r, fig.width = 4, fig.height = 4}
# Showing ROC plots for logit, probit, and cloglog
probitFit <- BranchGLM(supp ~ ., data = Data, family = "binomial",
                       link = "probit")

cloglogFit <- BranchGLM(supp ~ ., data = Data, family = "binomial",
                        link = "cloglog")

MultipleROCCurves(catROC, ROC(probitFit), ROC(cloglogFit),
                  names = c("Logistic ROC", "Probit ROC", "Cloglog ROC"))
```

#### Using predictions

- For each of the functions used in this section, predicted probabilities and observed classes can also be supplied instead of the `BranchGLM` object.

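
Because these functions only need predicted probabilities and observed classes, the quantity reported by `AUC()` and `Cindex()` can be described without reference to a fitted object: it is the proportion of (event, non-event) pairs in which the event receives the higher predicted probability, counting ties as one half. The sketch below computes that proportion by hand with base R, using `stats::glm()` as a stand-in fit. It assumes the second factor level (`VC`) is the event, which is how `glm()` codes a factor response; it is an illustration of the definition, not the package's own code.

```{r}
# Stand-in logistic fit with base R (not BranchGLM)
glmFit <- glm(supp ~ ., data = ToothGrowth, family = binomial(link = "logit"))
probs <- fitted(glmFit)

# glm() treats the second factor level as the event
isEvent <- ToothGrowth$supp == levels(ToothGrowth$supp)[2]

# Pairwise definition: compare every event with every non-event
diffs <- outer(probs[isEvent], probs[!isEvent], "-")
cPairs <- (sum(diffs > 0) + 0.5 * sum(diffs == 0)) / length(diffs)

# Equivalent rank-based (Mann-Whitney) form, using mid-ranks for ties
n1 <- sum(isEvent)
n0 <- sum(!isEvent)
cRanks <- (sum(rank(probs)[isEvent]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(pairwise = cPairs, rankBased = cRanks)
```
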
```{r}
preds <- predict(catFit)

Table(preds, Data$supp)

AUC(preds, Data$supp)

ROC(preds, Data$supp) |> plot(main = "ROC Curve", col = "deepskyblue")
```

# Useful functions

- **BranchGLM** has many utility functions for GLMs such as
  - `coef()` to extract the coefficients
  - `logLik()` to extract the log likelihood
  - `AIC()` to extract the AIC
  - `BIC()` to extract the BIC
  - `predict()` to obtain predictions from the fitted model
- The coefficients, standard errors, Wald test statistics, and p-values are stored in the `coefficients` element of the fitted model.

```{r}
# Predict method
predict(GammaFit)

# Accessing coefficients matrix
GammaFit$coefficients
```
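
The remaining extractor functions listed above can be exercised on a single fit as in the sketch below. The model is refitted here so the chunk stands on its own. The `newdata` argument to `predict()` is an assumption based on the usual R convention; see `?predict.BranchGLM` for the installed version.

```{r}
library(BranchGLM)

# Gamma regression with log link, as in the gamma section
fit <- BranchGLM(mpg ~ ., data = mtcars, family = "gamma", link = "log")

# Coefficient estimates
coef(fit)

# Log likelihood and information criteria
logLik(fit)
AIC(fit)
BIC(fit)

# Predictions for the first five rows (`newdata` is assumed, see above)
newPreds <- predict(fit, newdata = mtcars[1:5, ])
newPreds
```
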