---
title: "Linear regression diagnostics"
description: A tutorial on linear regression diagnostics using the regressinator.
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Linear regression diagnostics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(regressinator)
library(ggplot2)
library(broom)
```

The basic linear regression model assumes that
$$
Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon,
$$
where $\epsilon$ has mean zero and variance $\sigma^2$. The regressors $X_1,
\dots, X_p$ might be the original predictors, or could be transformations and
combinations of them, such as interactions or polynomials. The assumptions on
the distribution of $\epsilon$ are used to obtain the variance of $\hat \beta$
and hence are important for inference, while the assumption of a linear
relationship determines whether our mean function is accurate.

For these examples, we'll consider a simple multivariate setting where the
population relationship is nonlinear and we misspecify the model:

```{r}
nonlinear_pop <- population(
  x1 = predictor(runif, min = 1, max = 8),
  x2 = predictor(runif, min = 4, max = 12),
  y = response(0.7 + 0.8 * x1**2 + 1.2 * x2,
               family = gaussian(), error_scale = 4.0)
)

nonlinear_data <- sample_x(nonlinear_pop, n = 100) |>
  sample_y()

fit <- lm(y ~ x1 + x2, data = nonlinear_data)
```

We'll use `broom::augment()` to get diagnostic information from the fit in a
standardized data frame, making it easy to use ggplot2 to produce different
diagnostic plots.

## Residual plots

Residual plots can detect misspecification of the mean function. When we plot
the residuals against any linear combination of the regressors, they should have
mean zero and constant variance. If they do not, the model may be incorrect.

One such linear combination is the fitted values $\hat Y$. We can obtain the
residuals and fitted values from `broom::augment()`:

```{r}
augment(fit) |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x = "Fitted value", y = "Residual")
```

There is a suspicious trend here, but it is hard to tell what specifically is
wrong from the model from this. It would be more helpful to know which predictor
is modeled incorrectly.

To plot residuals versus the regressors, the regressinator provides
`augment_longer()`, which is like `augment()` but converts the data to long form
with one row per predictor per observation, making it easy to facet the
residuals:

```{r}
augment_longer(fit) |>
  ggplot(aes(x = .predictor_value, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_wrap(vars(.predictor_name), scales = "free_x") +
  labs(x = "Predictor", y = "Residual")
```

The trend in the residuals against `x1` makes it clear that something is wrong
with our specification of `x1` in the model, which is correct: we modeled it as
linear when it should be quadratic.

Detecting misspecification is a matter of judgment, as the residuals are random
and trends can appear even when the model is well-specified. Using
`model_lineup()`, we can compare the true residual plots to several where the
model is correctly specified. Each row gives the plots for one simulation, and
one of the fives rows (at random) is the true residual plots:

```{r, fig.height=8}
model_lineup(fit, fn = augment_longer, n = 5) |>
  ggplot(aes(x = .predictor_value, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_grid(rows = vars(.sample), cols = vars(.predictor_name),
             scales = "free_x") +
  labs(x = "Predictor", y = "Residual")
```

This gives a sense of the variation to be expected when the model is
well-specified. It should still be easy to spot the row containing the plots for
our misspecified model.

## Partial residual plots

The `partial_residuals()` function fetches partial residuals in a convenient
data frame format. Partial residuals are defined in terms of the original
predictors, not the regressors; we can think of a partial residual plot for a
particular predictor as showing the relationship between that predictor and the
response, after "adjusting for" the other predictors according to our fitted
model. See the function documentation for references on the use and
interpretation of partial residuals.

### Detecting basic misspecification

Here the black line gives the fitted predictor effects (i.e. the model estimate
of the relationship), while the blue line smooths the partial residuals. We can
see that for `x1`, the blue line deviates systematically from the black line:

```{r}
partial_residuals(fit) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid)) +
  geom_point() + # partial residuals
  geom_smooth(se = FALSE) + # smoothed residuals
  geom_line(aes(x = .predictor_value, y = .predictor_effect)) + # effects
  facet_wrap(vars(.predictor_name), scales = "free") +
  labs(x = "Predictor", y = "Partial residual")
```

The shape of the blue line approximates the shape of the true relationship
between `x1` and the response (provided the rest of the model is
well-specified!), so we can use its shape to determine how to change our model.

We can again use `model_lineup()` to see how these compare to the partial
residuals when the model is correctly specified:

```{r, fig.height=8}
model_lineup(fit, partial_residuals, n = 5) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_line(aes(x = .predictor_value, y = .predictor_effect)) +
  facet_grid(rows = vars(.sample), cols = vars(.predictor_name),
             scales = "free") +
  labs(x = "Predictor", y = "Partial residual")
```

In the plots where the model is correctly specified, the blue and black lines
coincide, making it easy to spot the misspecified model.

### Checking for interactions

Partial residuals can also be used to detect interaction terms that should be
included in the model. For example, let's define a population in which $X_2$ is
a factor variable with two levels, and the slope and intercept of the
relationship between $X_1$ and $Y$ depend on the value of $X_2$.

```{r}
intercepts <- c(
  "foo" = -0.3,
  "bar" = 1.7
)

slopes <- c(
  "foo" = 1.8,
  "bar" = -0.4
)

interact_pop <- population(
  x1 = predictor(runif, min = 1, max = 8),
  x2 = predictor(rfactor, levels = c("foo", "bar")),
  y = response(by_level(x2, intercepts) + by_level(x2, slopes) * x1,
               family = gaussian(), error_scale = 4.0)
)

interact_data <- sample_x(interact_pop, n = 100) |>
  sample_y()

no_interact_fit <- lm(y ~ x1 + x2, data = interact_data)
```

We have fit a model that does not account for the interaction, allowing a
different intercept for the two groups but forcing the slope of $X_1$ to be the
same. If we are concerned the interaction may be present, we can make the
conventional partial residual plot, but with the points colored by the value of
$X_2$:

```{r}
partial_residuals(no_interact_fit) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid, color = x2)) +
  geom_point() + # partial residuals
  geom_smooth(se = FALSE) + # smoothed residuals
  geom_line(aes(x = .predictor_value, y = .predictor_effect, color = NULL)) + # effects
  facet_wrap(vars(.predictor_name), scales = "free") +
  labs(x = "Predictor", y = "Partial residual")
```

Notice the very distinct smoothing lines: the `foo` group appears to have a much
steeper slope than the `bar` group. This is in fact true in the population. This
may alert us that while our average slope is accurate, and the relationship
appears reasonably linear, the slope may differ for the groups. We could, of
course, produce a lineup to assess if the slope difference is significant, but
it appears fairly large here.

We can now fit a model with the interaction term and generate partial residuals
again:

```{r}
interact_fit <- lm(y ~ x1 * x2, data = interact_data)

partial_residuals(interact_fit) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid, color = x2)) +
  geom_point() + # partial residuals
  geom_smooth(se = FALSE) + # smoothed residuals
  geom_line(aes(x = .predictor_value, y = .predictor_effect, color = NULL)) + # effects
  facet_wrap(vars(.predictor_name), scales = "free") +
  labs(x = "Predictor", y = "Partial residual")
```

In this plot, both groups appear to have the same linear trend, and so adding
the interaction appears to have been a good idea.

Note the positive slope in this plot. In the calculation of partial residuals
for $X_1$, all other predictors are set to 0 or their baseline level, so $X_2$
is set to `foo`. Hence the overall slope is approximately 1.8, even for points
with $X_2 = \text{bar}$, which can be confusing.

## Cook's distances

The Cook's distance for an observation represents how much the model fitted
values would change if that observation were removed, scaled by the model's mean
squared error. A Cook's distance of 1 is often considered a cutoff for a highly
influential observation. Returning to our original model fit to nonlinear data,
we can produce the Cook's distances:

```{r}
augment(fit) |>
  ggplot(aes(x = seq_along(.cooksd), y = .cooksd)) +
  geom_col() +
  labs(x = "Row index", y = "Cook's distance")
```

Note that the Cook's distance measures changes in fitted values. If several
regressors are collinear, a small change in an observation may change $\hat
\beta$ dramatically but not change the fitted values much. (It can also be
interpreted as the Mahalanobis distance between the fitted $\hat \beta$ with and
without each observation, where the Mahalanobis distance takes into account the
variance of $\hat \beta$; collinear regressors will tend to have high variance
in their parameter estimates, so while dropping a covariate may change their
entries in $\hat \beta$, this will not cause a great increase in the Mahalanobis
distance.)

## Residual Q-Q plots

Q-Q plots are used to see the distribution of the standardized residuals, and
compare it to a normal distribution:

```{r}
augment(fit) |>
  ggplot(aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Normal Q-Q plot of standardized residuals",
       x = "Theoretical quantiles", y = "Observed quantiles")
```

It is difficult for novices to judge normality from a Q-Q plot, so a lineup can
be helpful:

```{r, fig.height=8}
model_lineup(fit) |>
  ggplot(aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(vars(.sample)) +
  labs(title = "Normal Q-Q plot of standardized residuals",
       x = "Theoretical quantiles", y = "Observed quantiles")
```

The true Q-Q plot does not stand out here, so we have little evidence for
systematically non-normal residuals.

## A case study

Let's illustrate the use of these diagnostics on a real dataset: the [Palmer
Penguins](https://allisonhorst.github.io/palmerpenguins/) data, giving
measurements of 344 penguins in the Palmer Archipelago in Antarctica.

```{r}
library(palmerpenguins)
```

Let's consider building a model of bill length as a function of flipper length
and species:

```{r}
penguin_1 <- lm(bill_length_mm ~ flipper_length_mm + species,
                data = penguins)
```

This model allows a different intercept per species, but not a different slope.
Let's plot the partial residuals for flipper length:

```{r}
partial_residuals(penguin_1, flipper_length_mm) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid)) +
  geom_point(aes(color = species)) +
  geom_smooth(aes(color = species), se = FALSE) +
  geom_line(aes(y = .predictor_effect)) +
  labs(x = "Flipper length (mm)", y = "Partial residual",
       color = "Species")
```

Here we can see that gentoo penguins tend to have longer flippers than Adélie
and chinstrap penguins. Looking at the partial residuals in each group, there
appear to be varying slopes: there is a consistent upward trend in the residuals
for gentoo penguins relative to the overall trend, for instance. This suggests
that perhaps there is an interaction, and the slope of the relationship differs
for each species.

Let's fit a new model with an interaction.

```{r}
penguin_2 <- lm(bill_depth_mm ~ flipper_length_mm * species,
                data = penguins)
```

Now the residuals do not seem to have separate trends in each group:

```{r}
partial_residuals(penguin_2, flipper_length_mm) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid)) +
  geom_point(aes(color = species)) +
  geom_smooth(aes(color = species), se = FALSE) +
  geom_line(aes(y = .predictor_effect)) +
  labs(x = "Flipper length (mm)", y = "Partial residual",
       color = "Species")
```

We can now proceed to other diagnostics, such as checking for equal variance and
normality of the residuals. Let's make a Q-Q plot of the standardized residuals
for each species from this fit:

```{r}
augment(penguin_2) |>
  ggplot(aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(vars(species)) +
  labs(title = "Normal Q-Q plot of standardized residuals",
       x = "Theoretical quantiles", y = "Observed quantiles")
```

The residuals appear reasonably normally distributed, but notice the slopes are
different for each species, indicating they have different variances. The
residual variance appears to be largest for Adélie penguins and smallest for
gentoo penguins. If we were using this model purely for prediction, this would
not be a concern; but if we intend to conduct tests or produce confidence
intervals, this may be an issue.

There are no obvious major outliers either in the partial residual plots or
residual Q-Q plots, so we have no reason to suspect there are highly influential
observations. Nonetheless, we can check the Cook's distances:

```{r}
augment(penguin_2) |>
  ggplot(aes(x = seq_along(.cooksd), y = .cooksd)) +
  geom_col() +
  labs(x = "Row index", y = "Cook's distance")
```

This looks fine. Even the most influential point still has a very small Cook's
distance.