---
title: "6. Modelling Mortality"
output:
  bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    fig_caption: yes
    toc: true
    toc_depth: 2
    number_sections: true
vignette: >
  %\VignetteIndexEntry{6. Modelling Mortality}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

In this vignette, we estimate mortality rates, and summary indicators such as life expectancy. We work with data from Iceland. We try several methods for representing age patterns, including one that uses typically patterns from the Human Mortality Database. All our model-based estimates come with measures of uncertainty. 

In addition to **bage** itself, we need package **poputils**, which contains functions for calculating life expectancy, and **rvec**, which contains data structures and functions for working with random draws.

```{r setup}
library(bage)
library(poputils)
library(rvec)
```

We use standard tidyverse tools for data manipulation and graphing.

```{r}
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(ggplot2)
```

# The data

The Iceland deaths data is included in the **bage** package. It contains counts of deaths, and estimates of the mid-year population, disaggregated by age and sex, for the years 1998--2022.
```{r}
dth <- bage::deaths
dth
```

The oldest age group is 105 years and older.
```{r}
tail(dth, n = 3)
```

The data is sparse. Twenty-two percent of death counts are 0, and half are 3 or less.

```{r}
dth |>
  count(deaths) |>
  mutate(percent = round(100 * n / sum(n)),
         cumulative_percent = cumsum(percent)) |>
  head()
```

We plot 'direct' estimates of death rates in the first and least years of the data. Direct estimates are calculated by dividing the number of deaths by the corresponding population at risk, independently for each combination of age, sex, and time. The results are shown below, on a log scale. The dots at the bottom of the graph represent log-rates of negative infinity, which occur in cells where no deaths were observed. Random variation obscures any patterns before about age 50. After age 50, rates increase more or less linearly.

```{r, fig.width = 7, fig.height = 6}
dth |>
  filter(time %in% c(1998, 2010, 2022)) |>
  mutate(rate = deaths / popn) |>
  ggplot(aes(x = age_mid(age), y = rate)) + ## 'age_mid()' returns the mid point
  facet_grid(vars(sex), vars(time)) +       ## of the age group, which is useful
  geom_point(alpha = 0.5) +                 ## for plotting
  scale_y_log10() +
  ggtitle("Direct estimates of mortality rates")
```


# Initial model

## Specifying the model

We fit an initial simple model. In our prior model, we allow for an interaction between age and sex, but do not allow for any interactions involving time. We assume, in other words, that mortality rates rise or fall at the same rate across all age-sex groups. This assumption is unlikely to be met exactly in practice, so we revisit it later. We accept all the default priors for main effects and interactions. As can be seen at the bottom of the printout, `mod_pois()` has guessed which variables represent age, sex, and time. (It bases its guesses on the variable names.) Age and time main effects, by default, get "random walk" priors, and other terms get "normal" priors. 

But before running the model there is an annoying, but common, problem with the data with.

It turns out there are 5 cases with non-zero deaths but zero population.
```{r}
dth |>
  filter(deaths > 0 & popn == 0)
```
We could edit the data directly, removing or modify these cases. However, dealing with these cases is a modelling decision, which we would like to include in the model itself. We make use of the fact that exposure can be specified via a formula to modify cases where we have deaths but no population. Our model is then

Our model is then
```{r}
mod_base <- mod_pois(deaths ~ age * sex + time,
                 data = dth,
                 exposure = ~ ifelse(deaths > 0 & popn == 0, 0.5, popn))
mod_base
```
## Mathematical description of the model

The model implemented by `mod_base` assumes that
\begin{equation}
  y_i \sim \text{Poisson}(\gamma_i w_i)
\end{equation}
where

- $y_i$ is deaths in age-sex-time cell $i$,
- $w_i$ is the population at risk, and
- $\gamma_i$ is the underlying death rate.

By modelling deaths as draws from a Poisson distribution, the model recognizes the contribution of individual-level randomness to observed death counts. Recognizing individual-level randomness is important when analyzing data where cell counts are small. 

Death rates $\gamma_i$ are treated as draws from a Gamma distribution,

\begin{equation}
  \gamma_i \sim \text{Gamma}(\xi^{-1}, (\xi \mu_i)^{-1}).
\end{equation}

The expected value for $\gamma_i$ is $\mu_i$, and the variance is $\xi \mu_i^2$. By modelling $\gamma_i$ as a draw from a distribution, we are recognizing our model is only able to explain some of the variation in $\gamma_i$. Larger values for $\xi$ imply more unexplained variation.

We model $\mu_i$, on the log scale, as a sum of factors formed from the the dimensions of the data,
\begin{equation}
  \log \mu_i = \sum_{m=0}^4 \beta_{j_i^m}^{(m)} 
\end{equation}
where

- $\pmb{\beta}^{(0)}$ is an intercept,
- $\pmb{\beta}^{(m)}$, $m = 1, \cdots, 4$, is a main effect or interaction, and
- $j_i^m$ is the element of $\pmb{\beta}^{(m)}$ that is associated with cell $i$.

Each of the $\beta^{(m)}$ in the model for $\mu_i$ receives a prior. The default prior for the intercept is
\begin{equation}
  \beta^{(0)} \sim \text{N}(0, 10^2)
\end{equation}
This prior allows for a much broader range of values than we would expect to see in practice. We would only rarely expect to see values of $\gamma_i$ that were lower than 0.001 (which is -7 on a log scale) or higher than 1 (which is 0).

The default prior for age is a first-order random walk,
\begin{equation}
  \beta_j^{(1)} \sim \text{N}(\beta_{j-1}^{(1)}, \tau_1^2), \quad j = 2, \dots, J_1
\end{equation}
A random walk prior embodies the idea that we expect changes from one age group to the next to be relatively small, and that neighboring age groups are more strongly correlated than distant age groups.

The value for the standard deviation parameter $\tau_1$ is estimated as part of the model, and has its own prior,
\begin{equation}
    \tau_1^2 \sim \text{N}^+(0, 1)
\end{equation}
$\text{N}^+(0, 1)$ denotes a half-normal distribution, that is, a normal distribution restricted to non-negative values.

For this prior to be properly identified, we need to place some constraint on the overall level. We use a soft constraint on the mean, 
\begin{equation}
  \frac{1}{J_1}\sum_{j=1}^{J_1} \beta_j^{(1)} \sim \text{N}(0, 1)
\end{equation}

The sex main effect has the prior
\begin{align}
  \beta_j^{(2)} & \sim \text{N}(0, \tau_2^2), \quad j = 1, 2 \\
  \tau_2^2 & \sim \text{N}^+(0, 1)
\end{align}

The time term $\beta^{(3)}$ has the same random-walk prior as the age effect. The age-sex interaction $\beta^{(4)}$ has the same prior as the sex effect.

Finally, the dispersion term $\xi$ has a penalized-complexity prior [@simpson2022priors],
\begin{equation}
    p(\xi) = \frac{1}{2 \sqrt{\xi}}e^{-\sqrt{\xi}}.
\end{equation}


## Fitting the model

We fit the model by calling function `fit()`:
```{r}
mod_base <- fit(mod_base)
mod_base
```
## Extracting rates

To extract estimated rates from the fitted model object, we use function `augment()`. 
```{r}
aug_base <- augment(mod_base)
aug_base
```

Function `augment()` starts with the original data, and adds 

- a column called `.observed` containing direct estimates ($y_i/w_i$), 
- a column called `.fitted` containing estimates of the rates ($\gamma_i$), and
- a column called `.expected` containing expected values for the rates ($\mu_i$).

The `.fitted` and `.expected` columns both consist of [rvecs](https://bayesiandemography.github.io/rvec/reference/rvec.html). An rvec is a vector-like object that holds multiple draws, in this case draws from posterior distributions. 

Next we extract rates estimates for selected years, and summarize the posterior distributions.

```{r}
rates_base <- aug_base |>
  filter(time %in% c(1998, 2010, 2022)) |>
  select(age, sex, time, .observed, .fitted) |>
  mutate(draws_ci(.fitted))
rates_base
```

We plot point estimates and 95% credible intervals for the modelled estimates, together with the original direct estimates,
```{r, fig.width = 7, fig.height = 5}
ggplot(rates_base, aes(x = age_mid(age), 
             ymin = .fitted.lower,
             y = .fitted.mid,
             ymax = .fitted.upper)) +
  facet_grid(vars(sex), vars(time)) +
  geom_ribbon(fill = "lightblue") +
  geom_line(col= "darkblue") +
  geom_point(aes(y = .observed),
             size = 0.2) +
  scale_y_log10() +
  ggtitle("Modelled and direct estimates of mortality rates - base model")
```

## Extracting higher-level terms

We can gain insights into the model by extracting and graphing estimates of the main effects and interactions, the $\pmb{\beta}^{(m)}$.

Estimates of the main effects and interactions, and of other higher-level parameters, can be obtained with function `components()`.
```{r}
comp_base <- components(mod_base)
comp_base
```

`components()` returns a tibble contain estimates of all the higher-level parameters. To extract the age effect, and prepare it for graphing, we use
```{r}
age_effect <- comp_base |>
  filter(component == "effect",
         term == "age") |>
  mutate(draws_ci(.fitted))
```

A graph of the age effect reveals a typical profile for mortality rates,
```{r}
ggplot(age_effect,
       aes(x = age_mid(level),
           y = .fitted.mid,
           ymin = .fitted.lower,
           ymax = .fitted.upper)) +
  geom_ribbon(fill = "lightblue") +
  geom_line() +
  ggtitle("Age effect")
```


# Revised model

## Mathematical structure

The [Human Mortality Database](https://www.mortality.org) provides estimates of mortality rates for many countries. Let $\pmb{M}$ denote a matrix holding these estimates, on a log scale. By applying a [Singlar Value Decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) to $\pmb{M}$, and then rescaling, we can construct a matrix $\pmb{F}$ and a column vector $\pmb{g}$, that parsimoniously summarize the age-profiles, or age-sex profiles, that are observed in $\pmb{M}$. If $z$ is a vector of independent draws from a $\text{N}(0,1)$ distribution, then the resulting vector
\begin{equation}
  \pmb{m} = \pmb{F} \pmb{z} + \pmb{g}
\end{equation}
looks like a randomly-selected column from $\pmb{M}$.

We use $\pmb{F}$ and $\pmb{g}$ to build priors for age effects and age-sex interactions. The prior for age effects is
\begin{align}
  \pmb{\beta}^{(1)} & = \pmb{F} \pmb{\alpha} + \pmb{g} \\
  \alpha_k & \sim \text{N}(0, 1), \quad k = 1, \cdots, K
\end{align}
where $K$ is chosen by the user. The prior for age-sex interactions looks the same except that, by default, the calculations are done separately for each sex.

## Specifying the new model

```{r}
mod_hmd <- mod_pois(deaths ~ age:sex + time,
                    data = dth,
                    exposure = ~ ifelse(deaths > 0 & popn == 0, 0.5, popn)) |>
  set_prior(age:sex ~ SVD(HMD))
mod_hmd
```


```{r}
mod_hmd <- fit(mod_hmd)
mod_hmd
```


```{r, fig.width = 7, fig.height = 5}
aug_hmd <- augment(mod_hmd)

rates_hmd <- aug_hmd |>
  filter(time %in% c(1998, 2010, 2022)) |>
  select(age, sex, time, .observed, .fitted) |>
  mutate(draws_ci(.fitted))

ggplot(rates_hmd, aes(x = age_mid(age), 
             ymin = .fitted.lower,
             y = .fitted.mid,
             ymax = .fitted.upper)) +
  facet_grid(vars(sex), vars(time)) +
  geom_ribbon(fill = "lightblue") +
  geom_line(col= "darkblue") +
  geom_point(aes(y = .observed),
             size = 0.2) +
  scale_y_log10() +
  ggtitle("Modelled and direct estimates of mortality rates - HMD model")
```


```{r}
comp_hmd <- components(mod_hmd)

age_sex_interact <- comp_hmd |>
  filter(component == "effect",
         term == "age:sex") |>
  separate_wider_delim(level, delim = ".", names = c("age", "sex")) |>
  mutate(draws_ci(.fitted))

ggplot(age_sex_interact,
       aes(x = age_mid(age),
           y = .fitted.mid,
           ymin = .fitted.lower,
           ymax = .fitted.upper)) +
  geom_ribbon(aes(fill = sex),
              alpha = 0.3) +
  geom_line(aes(col = sex)) +
  ggtitle("Age-sex interaction")
```


# Model testing

```{r, fig.width = 7, fig.height = 7}
rep_data_base <- replicate_data(mod_base, condition_on = "expected")

data <- rep_data_base |>
  filter(time == 2022) |>
  select(-popn) |>
  pivot_wider(names_from = sex, values_from = deaths) |>
  mutate(diff = Female - Male)
```
  
```{r, fig.width = 7, fig.height = 7}
ggplot(data, aes(x = age_mid(age), y = diff)) +
  facet_wrap(vars(.replicate)) +
  geom_point(size = 0.2)
```

```{r, fig.width = 7, fig.height = 7}
rep_data_hmd <- replicate_data(mod_hmd, condition_on = "expected")

data <- rep_data_hmd |>
  filter(time == 2022) |>
  select(-popn) |>
  pivot_wider(names_from = sex, values_from = deaths) |>
  mutate(diff = Female - Male)
```
  
```{r, fig.width = 7, fig.height = 7}
ggplot(data, aes(x = age_mid(age), y = diff)) +
  facet_wrap(vars(.replicate)) +
  geom_point(size = 0.2)
```


# Life expectancy and life tables

```{r}
lifeexp_hmd <- mod_hmd |>
  augment() |>
  lifeexp(mx = .fitted,
          by = c(time, sex))
lifeexp_hmd
```

```{r}
lifeexp_hmd <- mod_hmd |>
  augment() |>
  lifeexp(mx = .fitted,
          by = c(time, sex))
lifeexp_hmd
```