For count response variables, the glm framework has two options. The Poisson family and the negative binomial family. In this vignette, we will consider both and learn when to use one or the other.
First lets get an idea of how the Poisson family looks. If lambda is near 1, the Poisson family is right skewed. As lambda increases, the Poisson family becomes more symmetrical.
library(stats)
library(MASS)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
library(GlmSimulatoR)
set.seed(1)
# lambda 1
poisson <- rpois(n = 10000, lambda = 1)
poisson_df <- as_tibble(poisson)
ggplot(poisson_df, aes(value)) +
geom_histogram(bins = 100)
Lets create data and train a model.
set.seed(1)
simdata <- simulate_poisson(N = 10000, weights = c(.5, 1), link = "log")
# Response looks similar to above histograms
ggplot(simdata, aes(Y)) +
geom_histogram(bins = 100)
glm_poisson <- glm(Y ~ X1 + X2, data = simdata, family = poisson(link = "log"))
summary(glm_poisson)
##
## Call:
## glm(formula = Y ~ X1 + X2, family = poisson(link = "log"), data = simdata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.996422 0.014963 66.59 <2e-16 ***
## X1 0.506335 0.006625 76.43 <2e-16 ***
## X2 0.996936 0.006816 146.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 38086 on 9999 degrees of freedom
## Residual deviance: 10164 on 9997 degrees of freedom
## AIC: 60925
##
## Number of Fisher Scoring iterations: 4
The estimated weights are close the the weights argument in simulate_poisson. The model estimates are accurate.
For modeling, the main difference between Poisson and the negative binomial is the extra parameter. For the Poisson distribution, the mean-variance relationship is \(\sigma^2 =\mu\). For the negative binomial distribution, the mean-variance relationship is \(\sigma^2 = \mu + \mu^2/\theta\). Through \(\theta\), a more flexible relationships is possible.
When \(\theta\) is large, \(\mu + \mu^2/\theta\) is roughly equal to \(\mu\) . Therefore the negative binomial will be similar to Poisson distribution.
set.seed(1)
poisson <- rpois(n = 10000, lambda = 1)
poisson_df <- as_tibble(poisson)
ggplot(poisson_df, aes(value)) +
geom_histogram(bins = 100)
neg_bin_rv <- rnegbin(n = 10000, mu = 1, theta = 1000)
neg_bin_df <- as_tibble(neg_bin_rv)
ggplot(neg_bin_df, aes(value)) +
geom_histogram(bins = 100)
If \(\theta\) is small, \(\mu + \mu^2/\theta\) does not approximately equal \(\mu\). The negative binomial will look different than the Poisson distribution. Note the difference in scale on the x axis and y axis.
Lets create data where the flexibility of the negative binomial is needed and compare the Poisson estimates to the negative binomial estimates.
set.seed(1)
simdata <- simulate_negative_binomial(
N = 10000, weights = c(.5, 1),
ancillary = 5, link = "log"
) # ancillary is theta.
# Response looks like a negative binomial distribution.
ggplot(simdata, aes(Y)) +
geom_histogram(bins = 200)
glm_poisson <- glm(Y ~ X1 + X2, data = simdata, family = poisson(link = "log"))
glm_nb <- glm.nb(Y ~ X1 + X2, data = simdata, link = "log")
summary(glm_poisson)
##
## Call:
## glm(formula = Y ~ X1 + X2, family = poisson(link = "log"), data = simdata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.995434 0.014925 66.70 <2e-16 ***
## X1 0.511899 0.006608 77.47 <2e-16 ***
## X2 0.995539 0.006797 146.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 91470 on 9999 degrees of freedom
## Residual deviance: 63327 on 9997 degrees of freedom
## AIC: 113055
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm.nb(formula = Y ~ X1 + X2, data = simdata, link = "log", init.theta = 4.967762132)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.98915 0.03658 27.04 <2e-16 ***
## X1 0.51132 0.01689 30.28 <2e-16 ***
## X2 1.00018 0.01707 58.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(4.9678) family taken to be 1)
##
## Null deviance: 14739 on 9999 degrees of freedom
## Residual deviance: 10364 on 9997 degrees of freedom
## AIC: 77883
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 4.9678
## Std. Err.: 0.0835
##
## 2 x log-likelihood: -77874.7800
The estimated slopes are very similar. However, the standard errors are not. The fact \(\mu\) does not equal \(\sigma^2\) has caused a major under estimation of standard errors for the Poisson GLM.
For Poisson models, it is important to look at residual deviance divided by the degrees of freedom. When this quotient is larger than 1, the negative binomial should be considered. In the above the quotient was 6.33 (63327 / 9997) for the Poisson glm.