--- title: "Linear regression diagnostics" description: A tutorial on linear regression diagnostics using the regressinator. output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Linear regression diagnostics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(regressinator) library(ggplot2) library(broom) ``` The basic linear regression model assumes that $$ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon, $$ where $\epsilon$ has mean zero and variance $\sigma^2$. The regressors $X_1, \dots, X_p$ might be the original predictors, or could be transformations and combinations of them, such as interactions or polynomials. The assumptions on the distribution of $\epsilon$ are used to obtain the variance of $\hat \beta$ and hence are important for inference, while the assumption of a linear relationship determines whether our mean function is accurate. For these examples, we'll consider a simple multivariate setting where the population relationship is nonlinear and we misspecify the model: ```{r} nonlinear_pop <- population( x1 = predictor(runif, min = 1, max = 8), x2 = predictor(runif, min = 4, max = 12), y = response(0.7 + 0.8 * x1**2 + 1.2 * x2, family = gaussian(), error_scale = 4.0) ) nonlinear_data <- sample_x(nonlinear_pop, n = 100) |> sample_y() fit <- lm(y ~ x1 + x2, data = nonlinear_data) ``` We'll use `broom::augment()` to get diagnostic information from the fit in a standardized data frame, making it easy to use ggplot2 to produce different diagnostic plots. ## Residual plots Residual plots can detect misspecification of the mean function. When we plot the residuals against any linear combination of the regressors, they should have mean zero and constant variance. If they do not, the model may be incorrect. One such linear combination is the fitted values $\hat Y$. We can obtain the residuals and fitted values from `broom::augment()`: ```{r} augment(fit) |> ggplot(aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth(se = FALSE) + labs(x = "Fitted value", y = "Residual") ``` There is a suspicious trend here, but it is hard to tell what specifically is wrong from the model from this. It would be more helpful to know which predictor is modeled incorrectly. To plot residuals versus the regressors, the regressinator provides `augment_longer()`, which is like `augment()` but converts the data to long form with one row per predictor per observation, making it easy to facet the residuals: ```{r} augment_longer(fit) |> ggplot(aes(x = .predictor_value, y = .resid)) + geom_point() + geom_smooth(se = FALSE) + facet_wrap(vars(.predictor_name), scales = "free_x") + labs(x = "Predictor", y = "Residual") ``` The trend in the residuals against `x1` makes it clear that something is wrong with our specification of `x1` in the model, which is correct: we modeled it as linear when it should be quadratic. Detecting misspecification is a matter of judgment, as the residuals are random and trends can appear even when the model is well-specified. Using `model_lineup()`, we can compare the true residual plots to several where the model is correctly specified. Each row gives the plots for one simulation, and one of the fives rows (at random) is the true residual plots: ```{r, fig.height=8} model_lineup(fit, fn = augment_longer, n = 5) |> ggplot(aes(x = .predictor_value, y = .resid)) + geom_point() + geom_smooth(se = FALSE) + facet_grid(rows = vars(.sample), cols = vars(.predictor_name), scales = "free_x") + labs(x = "Predictor", y = "Residual") ``` This gives a sense of the variation to be expected when the model is well-specified. It should still be easy to spot the row containing the plots for our misspecified model. ## Partial residual plots The `partial_residuals()` function fetches partial residuals in a convenient data frame format. Partial residuals are defined in terms of the original predictors, not the regressors; we can think of a partial residual plot for a particular predictor as showing the relationship between that predictor and the response, after "adjusting for" the other predictors according to our fitted model. See the function documentation for references on the use and interpretation of partial residuals. ### Detecting basic misspecification Here the black line gives the fitted predictor effects (i.e. the model estimate of the relationship), while the blue line smooths the partial residuals. We can see that for `x1`, the blue line deviates systematically from the black line: ```{r} partial_residuals(fit) |> ggplot(aes(x = .predictor_value, y = .partial_resid)) + geom_point() + # partial residuals geom_smooth(se = FALSE) + # smoothed residuals geom_line(aes(x = .predictor_value, y = .predictor_effect)) + # effects facet_wrap(vars(.predictor_name), scales = "free") + labs(x = "Predictor", y = "Partial residual") ``` The shape of the blue line approximates the shape of the true relationship between `x1` and the response (provided the rest of the model is well-specified!), so we can use its shape to determine how to change our model. We can again use `model_lineup()` to see how these compare to the partial residuals when the model is correctly specified: ```{r, fig.height=8} model_lineup(fit, partial_residuals, n = 5) |> ggplot(aes(x = .predictor_value, y = .partial_resid)) + geom_point() + geom_smooth(se = FALSE) + geom_line(aes(x = .predictor_value, y = .predictor_effect)) + facet_grid(rows = vars(.sample), cols = vars(.predictor_name), scales = "free") + labs(x = "Predictor", y = "Partial residual") ``` In the plots where the model is correctly specified, the blue and black lines coincide, making it easy to spot the misspecified model. ### Checking for interactions Partial residuals can also be used to detect interaction terms that should be included in the model. For example, let's define a population in which $X_2$ is a factor variable with two levels, and the slope and intercept of the relationship between $X_1$ and $Y$ depend on the value of $X_2$. ```{r} intercepts <- c( "foo" = -0.3, "bar" = 1.7 ) slopes <- c( "foo" = 1.8, "bar" = -0.4 ) interact_pop <- population( x1 = predictor(runif, min = 1, max = 8), x2 = predictor(rfactor, levels = c("foo", "bar")), y = response(by_level(x2, intercepts) + by_level(x2, slopes) * x1, family = gaussian(), error_scale = 4.0) ) interact_data <- sample_x(interact_pop, n = 100) |> sample_y() no_interact_fit <- lm(y ~ x1 + x2, data = interact_data) ``` We have fit a model that does not account for the interaction, allowing a different intercept for the two groups but forcing the slope of $X_1$ to be the same. If we are concerned the interaction may be present, we can make the conventional partial residual plot, but with the points colored by the value of $X_2$: ```{r} partial_residuals(no_interact_fit) |> ggplot(aes(x = .predictor_value, y = .partial_resid, color = x2)) + geom_point() + # partial residuals geom_smooth(se = FALSE) + # smoothed residuals geom_line(aes(x = .predictor_value, y = .predictor_effect, color = NULL)) + # effects facet_wrap(vars(.predictor_name), scales = "free") + labs(x = "Predictor", y = "Partial residual") ``` Notice the very distinct smoothing lines: the `foo` group appears to have a much steeper slope than the `bar` group. This is in fact true in the population. This may alert us that while our average slope is accurate, and the relationship appears reasonably linear, the slope may differ for the groups. We could, of course, produce a lineup to assess if the slope difference is significant, but it appears fairly large here. We can now fit a model with the interaction term and generate partial residuals again: ```{r} interact_fit <- lm(y ~ x1 * x2, data = interact_data) partial_residuals(interact_fit) |> ggplot(aes(x = .predictor_value, y = .partial_resid, color = x2)) + geom_point() + # partial residuals geom_smooth(se = FALSE) + # smoothed residuals geom_line(aes(x = .predictor_value, y = .predictor_effect, color = NULL)) + # effects facet_wrap(vars(.predictor_name), scales = "free") + labs(x = "Predictor", y = "Partial residual") ``` In this plot, both groups appear to have the same linear trend, and so adding the interaction appears to have been a good idea. Note the positive slope in this plot. In the calculation of partial residuals for $X_1$, all other predictors are set to 0 or their baseline level, so $X_2$ is set to `foo`. Hence the overall slope is approximately 1.8, even for points with $X_2 = \text{bar}$, which can be confusing. ## Cook's distances The Cook's distance for an observation represents how much the model fitted values would change if that observation were removed, scaled by the model's mean squared error. A Cook's distance of 1 is often considered a cutoff for a highly influential observation. Returning to our original model fit to nonlinear data, we can produce the Cook's distances: ```{r} augment(fit) |> ggplot(aes(x = seq_along(.cooksd), y = .cooksd)) + geom_col() + labs(x = "Row index", y = "Cook's distance") ``` Note that the Cook's distance measures changes in fitted values. If several regressors are collinear, a small change in an observation may change $\hat \beta$ dramatically but not change the fitted values much. (It can also be interpreted as the Mahalanobis distance between the fitted $\hat \beta$ with and without each observation, where the Mahalanobis distance takes into account the variance of $\hat \beta$; collinear regressors will tend to have high variance in their parameter estimates, so while dropping a covariate may change their entries in $\hat \beta$, this will not cause a great increase in the Mahalanobis distance.) ## Residual Q-Q plots Q-Q plots are used to see the distribution of the standardized residuals, and compare it to a normal distribution: ```{r} augment(fit) |> ggplot(aes(sample = .std.resid)) + geom_qq() + geom_qq_line() + labs(title = "Normal Q-Q plot of standardized residuals", x = "Theoretical quantiles", y = "Observed quantiles") ``` It is difficult for novices to judge normality from a Q-Q plot, so a lineup can be helpful: ```{r, fig.height=8} model_lineup(fit) |> ggplot(aes(sample = .std.resid)) + geom_qq() + geom_qq_line() + facet_wrap(vars(.sample)) + labs(title = "Normal Q-Q plot of standardized residuals", x = "Theoretical quantiles", y = "Observed quantiles") ``` The true Q-Q plot does not stand out here, so we have little evidence for systematically non-normal residuals. ## A case study Let's illustrate the use of these diagnostics on a real dataset: the [Palmer Penguins](https://allisonhorst.github.io/palmerpenguins/) data, giving measurements of 344 penguins in the Palmer Archipelago in Antarctica. ```{r} library(palmerpenguins) ``` Let's consider building a model of bill length as a function of flipper length and species: ```{r} penguin_1 <- lm(bill_length_mm ~ flipper_length_mm + species, data = penguins) ``` This model allows a different intercept per species, but not a different slope. Let's plot the partial residuals for flipper length: ```{r} partial_residuals(penguin_1, flipper_length_mm) |> ggplot(aes(x = .predictor_value, y = .partial_resid)) + geom_point(aes(color = species)) + geom_smooth(aes(color = species), se = FALSE) + geom_line(aes(y = .predictor_effect)) + labs(x = "Flipper length (mm)", y = "Partial residual", color = "Species") ``` Here we can see that gentoo penguins tend to have longer flippers than Adélie and chinstrap penguins. Looking at the partial residuals in each group, there appear to be varying slopes: there is a consistent upward trend in the residuals for gentoo penguins relative to the overall trend, for instance. This suggests that perhaps there is an interaction, and the slope of the relationship differs for each species. Let's fit a new model with an interaction. ```{r} penguin_2 <- lm(bill_depth_mm ~ flipper_length_mm * species, data = penguins) ``` Now the residuals do not seem to have separate trends in each group: ```{r} partial_residuals(penguin_2, flipper_length_mm) |> ggplot(aes(x = .predictor_value, y = .partial_resid)) + geom_point(aes(color = species)) + geom_smooth(aes(color = species), se = FALSE) + geom_line(aes(y = .predictor_effect)) + labs(x = "Flipper length (mm)", y = "Partial residual", color = "Species") ``` We can now proceed to other diagnostics, such as checking for equal variance and normality of the residuals. Let's make a Q-Q plot of the standardized residuals for each species from this fit: ```{r} augment(penguin_2) |> ggplot(aes(sample = .std.resid)) + geom_qq() + geom_qq_line() + facet_wrap(vars(species)) + labs(title = "Normal Q-Q plot of standardized residuals", x = "Theoretical quantiles", y = "Observed quantiles") ``` The residuals appear reasonably normally distributed, but notice the slopes are different for each species, indicating they have different variances. The residual variance appears to be largest for Adélie penguins and smallest for gentoo penguins. If we were using this model purely for prediction, this would not be a concern; but if we intend to conduct tests or produce confidence intervals, this may be an issue. There are no obvious major outliers either in the partial residual plots or residual Q-Q plots, so we have no reason to suspect there are highly influential observations. Nonetheless, we can check the Cook's distances: ```{r} augment(penguin_2) |> ggplot(aes(x = seq_along(.cooksd), y = .cooksd)) + geom_col() + labs(x = "Row index", y = "Cook's distance") ``` This looks fine. Even the most influential point still has a very small Cook's distance.