In a model with geographic components, we want to express a functional \(T\) (usually the expectation or a quantile) of a response \(Y\) as a function \(f\) of a set of geographic features (latitude/longitude and/or postal code and/or other features varying with location), and other features:
\[ T(Y \mid X^\textrm{geo}, X^\textrm{other}) \approx f(X^\textrm{geo}, X^\textrm{other}) \] Like any feature, the effect of a single geographic feature \(X^{\textrm{geo}, j}\) can be described using SHAP dependence plots. However, studying the effect of latitude (or any other location dependent feature) alone is often not very illuminating - simply due to strong interaction effects and correlations with other geographic features.
That’s where the additivity of SHAP values comes into play: The sum of SHAP values of all geographic components represent the total effect of \(X^\textrm{geo}\), and this sum can be visualized as a heatmap or 3D scatterplot against latitude/longitude (or any other geographic representation).
For illustration, we will use a beautiful house price dataset containing information on about 14’000 houses sold in 2016 in Miami-Dade County. Some of the columns are as follows:
(Italic features are geographic components.) For more background on this dataset, see Mayer et al. (2022).
We will fit an XGBoost model to explain the expected log(price) as a function of lat/long, size, and quality/age.
library(xgboost)
library(ggplot2)
library(shapviz)
miami <- miami |>
transform(
log_living = log(TOT_LVG_AREA),
log_land = log(LND_SQFOOT),
log_price = log(SALE_PRC)
)
x_coord <- c("LATITUDE", "LONGITUDE")
x_nongeo <- c("log_living", "log_land", "structure_quality", "age")
xvars <- c(x_coord, x_nongeo)
# Select training data
set.seed(1)
ix <- sample(nrow(miami), 0.8 * nrow(miami))
train <- miami[ix, ]
X_train <- train[xvars]
y_train <- train$log_price
# Fit XGBoost model
params <- list(learning_rate = 0.2, nthread = 1)
dtrain <- xgb.DMatrix(data.matrix(X_train), label = y_train, nthread = 1)
fit <- xgb.train(params, dtrain, nrounds = 200)
Let’s first study selected SHAP dependence plots for an explanation dataset of size 2000.
X_explain <- X_train[1:2000, ]
sv <- shapviz(fit, X_pred = data.matrix(X_explain))
sv_dependence(
sv,
v = c("log_living", "structure_quality", "LONGITUDE", "LATITUDE"),
alpha = 0.2
)
# And now the two-dimensional plot of the sum of SHAP values
sv_dependence2D(sv, x = "LONGITUDE", y = "LATITUDE") +
coord_equal()
The last plot gives a good impression on price levels.
Notes:
We will now change above model in two ways, not unlike the model in Mayer et al. (2022):
The second step leads to a model that is additive in each non-geographic component and also additive in the combined location effect. According to the technical report Mayer (2022), SHAP dependence plots of additive components in a boosted trees model are shifted versions of corresponding partial dependence plots (evaluated at observed values). This allows a “Ceteris Paribus” interpretation of SHAP dependence plots of corresponding components.
# Extend the feature set
more_geo <- c("CNTR_DIST", "OCEAN_DIST", "RAIL_DIST", "HWY_DIST")
xvars <- c(xvars, more_geo)
X_train <- train[xvars]
dtrain <- xgb.DMatrix(data.matrix(X_train), label = y_train, nthread = 1)
# Build interaction constraint vector and add it to params
ic <- c(
list(which(xvars %in% c(x_coord, more_geo)) - 1),
as.list(which(xvars %in% x_nongeo) - 1)
)
params$interaction_constraints <- ic
# Fit XGBoost model
fit <- xgb.train(params, dtrain, nrounds = 200)
# SHAP analysis
X_explain <- X_train[2:2000, ]
sv <- shapviz(fit, X_pred = data.matrix(X_explain))
# Two selected features: Thanks to additivity, structure_quality can be read as
# Ceteris Paribus
sv_dependence(sv, v = c("structure_quality", "LONGITUDE"), alpha = 0.2)
# Total geographic effect (Ceteris Paribus thanks to additivity)
sv_dependence2D(sv, x = "LONGITUDE", y = "LATITUDE", add_vars = more_geo) +
coord_equal()
Again, the resulting total geographic effect looks reasonable. Note that, unlike in the first example, there are no interactions to non-geographic components, leading to a Ceteris Paribus interpretation. Furthermore, it contains the effect of the other regional features.