---
title: "Introduction to the construction of decision trees"
author: "Paola Cognigni and Andrew Sims"
date: "October 2021"
bibliography: "REFERENCES.bib"
csl: "nature-no-et-al.csl"
output: 
  rmarkdown::html_vignette:
    fig_width: 7
    fig_height: 5
    fig_caption: true
    df_print: kable
vignette: >
  %\VignetteIndexEntry{Introduction to the construction of decision trees}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r}
#| include = FALSE,
#| purl = FALSE
knitr::opts_chunk$set(
  collapse = TRUE,
  echo = FALSE,
  comment = "#>"
)
```

```{r}
#| purl = FALSE
#nolint start
```

```{r}
library("rdecision")
```

```{r}
#| purl = FALSE
#nolint end
```

# Decision Tree representations

A decision tree is a decision model that represents all possible pathways
through sequences of events (**nodes**), which can be under the experimenter's 
control (decisions) or not (chances). A decision tree can be represented 
visually according to a standardised grammar:

* **Decision nodes** (represented graphically by a square $\square$): these 
represent alternative paths that the model should compare, for example different
treatment plans. Each Decision node must be the source of two or more Actions. 
A decision tree can have one or more Decision nodes, which determine the 
possible strategies that the model compares. 
* **Chance nodes** (represented graphically by a circle $\bigcirc$): these 
represent alternative paths that are out of the experiment's control, for 
example the probability of developing a certain side effect.  Each Chance node
must be the source of one or more Reactions, each with a specified probability.
The probability of Reactions originating from a single Chance node must sum 
to 1.
* **Leaf nodes** (represented graphically by a triangle $\lhd$): these represent
the final outcomes of a path. No further Actions or Reactions can occur after a
Leaf node. A Leaf node can have a utility value (to a maximum of 1, indicating
perfect utility) and an interval over which the utility applies.

Nodes are linked by **edges**:

* **Actions** arise from Decision nodes, and
* **Reactions** arise from Chance nodes.

`rdecision` builds a Decision Tree model by defining these elements and their
relationships. For example, consider the fictitious and idealized decision
problem, introduced in the package README file, of choosing between providing 
two forms of lifestyle advice, offered to
people with vascular disease, which reduce the risk of needing an interventional
procedure. The cost to a healthcare provider of the interventional procedure
(e.g., inserting a stent) is 5000 GBP; the cost of providing the current form of
lifestyle advice, an appointment with a dietician (“diet”), is 50 GBP and the
cost of providing an alternative form, attendance at an exercise programme
(“exercise”), is 750 GBP. If an advice programme is successful, there is no
need for an interventional procedure. These costs can be defined as scalar
variables, as follows:

```{r}
#| echo = TRUE
cost_diet <- 50.0
cost_exercise <- 750.0
cost_stent <- 5000.0
```

The model for this fictional scenario can be defined by the following elements:

* Decision node: which programme to enrol the patient in.

```{r}
#| echo = TRUE
decision_node <- DecisionNode$new("Programme")
```

* Chance nodes: the chance that the patient will need an interventional 
procedure. This is different for the two programmes, so two chance nodes must 
be defined.

```{r}
#| echo = TRUE
chance_node_diet <- ChanceNode$new("Outcome")
chance_node_exercise <- ChanceNode$new("Outcome")
```

* Leaf nodes: the possible final states of the model, depending both on the
decision (which programme) and the chance of needing an intervention. Here, we
assume that the model has a time horizon of 1 year, and that the utility is the
same for all patients (the default values).

```{r}
#| echo = TRUE
leaf_node_diet_no_stent <- LeafNode$new("No intervention")
leaf_node_diet_stent <- LeafNode$new("Intervention")
leaf_node_exercise_no_stent <- LeafNode$new("No intervention")
leaf_node_exercise_stent <- LeafNode$new("Intervention")
```

These nodes can then be wired into a decision tree graph by defining the edges
that link pairs of nodes as actions or reactions.

## Actions

These are the two programmes being tested. The cost of each action, as
described in the example, is embedded into the action definition.

```{r}
#| echo = TRUE
action_diet <- Action$new(
  decision_node, chance_node_diet, cost = cost_diet, label = "Diet"
)
action_exercise <- Action$new(
  decision_node, chance_node_exercise, cost = cost_exercise, label = "Exercise"
)
```

## Reactions

These are the possible outcomes of each programme (success or failure), with
their relevant probabilities. 

To continue our fictional example, in a small trial of the “diet” programme,
12 out of 68 patients (17.6%) avoided having an interventional procedure within
one year, and in a separate small trial of the “exercise” programme 18 out of
58 patients (31.0%) avoided the interventional procedure within one year (it is
assumed that the baseline characteristics in the two trials were comparable). 

```{r}
#| echo = TRUE
s_diet <- 12L
f_diet <- 56L
s_exercise <- 18L
f_exercise <- 40L
```

Epidemiologically, we can interpret the trial results in terms of the incidence
proportions of having an adverse event (needing a stent) within one year, for
each treatment strategy. 

```{r}
#| echo = TRUE
ip_diet <- f_diet / (s_diet + f_diet)
ip_exercise <- f_exercise / (s_exercise + f_exercise)
nnt <- 1.0 / (ip_diet - ip_exercise)
```

The incidence proportions are `r round(ip_diet, 2L)` and 
`r round(ip_exercise, 2L)` for diet and exercise, respectively, noting that we
define a programme failure as the need to insert a stent. The number
needed to treat is the reciprocal of the difference in incidence proportions;
`r round(nnt, 2L)` people must be allocated to the exercise programme rather
than the diet programme to save one adverse event.

These trial results can be represented as probabilities of outcome success
(`p_diet`, `p_exercise`) derived from the incidence proportions of the trial
results.

```{r}
#| echo = TRUE
p_diet <- 1.0 - ip_diet
p_exercise <- 1.0 - ip_exercise

```

These probabilities, as well as the cost associated with each outcome, can then
be embedded into the reaction definition. The probabilities of traversing the
reactions for programme failure are set to `NA_real_`, which leaves `rdecision`
to calculate them at each evaluation of the decision tree, ensuring that the
total probability of leaving each chance node adds to 1.

```{r}
#| echo = TRUE
reaction_diet_success <- Reaction$new(
  chance_node_diet, leaf_node_diet_no_stent,
  p = p_diet, cost = 0.0, label = "Success"
)

reaction_diet_failure <- Reaction$new(
  chance_node_diet, leaf_node_diet_stent,
  p = NA_real_, cost = cost_stent, label = "Failure"
)

reaction_exercise_success <- Reaction$new(
  chance_node_exercise, leaf_node_exercise_no_stent,
  p = p_exercise, cost = 0.0, label = "Success"
)

reaction_exercise_failure <- Reaction$new(
  chance_node_exercise, leaf_node_exercise_stent,
  p = NA_real_, cost = cost_stent, label = "Failure"
)
```

When all the elements are defined and satisfy the restrictions of a Decision
Tree (see the documentation for the `DecisionTree` class for details), the whole
model can be built:

```{r}
#| echo = TRUE
dt <- DecisionTree$new(
  V = list(
    decision_node,
    chance_node_diet,
    chance_node_exercise,
    leaf_node_diet_no_stent,
    leaf_node_diet_stent,
    leaf_node_exercise_no_stent,
    leaf_node_exercise_stent
  ),
  E = list(
    action_diet,
    action_exercise,
    reaction_diet_success,
    reaction_diet_failure,
    reaction_exercise_success,
    reaction_exercise_failure
  )
)
```

`rdecision` includes a `draw` method to generate a diagram of a defined
Decision Tree.

```{r}
#| echo = TRUE
dt$draw()
```

# Evaluating a decision tree

## Base case

As a decision model, a Decision Tree takes into account the costs, probabilities
and utilities encountered as each strategy is traversed from left to right. In
this example, only two strategies (Diet or Exercise) exist in the model and can
be compared using the `evaluate()` method.

```{r}
#| echo = TRUE
rs <- dt$evaluate()
```

```{r}
with(data = rs, expr = {
  data.frame(
    Programme = Programme,
    Probability = round(Probability, digits = 2L),
    Cost = round(Cost, digits = 2L),
    stringsAsFactors = FALSE
  )
})
```

From the evaluation of the two strategies, it is apparent that the Diet strategy
has a marginally lower net cost by 
`r round(rs[[2L, "Cost"]] - rs[[1L, "Cost"]], 2L)` GBP.

```{r}
o_netc_diet <- rs[[which(rs[, "Programme"] == "Diet"), "Cost"]]
e_netc_diet <- cost_diet + (1.0 - p_diet) * cost_stent
o_netc_exercise <- rs[[which(rs[, "Programme"] == "Exercise"), "Cost"]]
e_netc_exercise <- cost_exercise + (1.0 - p_exercise) * cost_stent
incc <- nnt * (cost_exercise - cost_diet)
o_deltac <- o_netc_exercise - o_netc_diet
e_deltac <- (incc - cost_stent) / nnt
e_cost_threshold <- (cost_stent / nnt) + cost_diet
nnt_threshold <- cost_stent / (cost_exercise - cost_diet)
e_success_threshold <- 1.0 - (ip_diet - (1.0 / nnt_threshold))
```

```{r}
#| purl = FALSE
# test that base case evaluation agrees with direct calculation
stopifnot(
  all.equal(o_netc_diet, e_netc_diet, tolerance = 5.0, scale = 1.0),
  all.equal(o_netc_exercise, e_netc_exercise, tolerance = 5.0, scale = 1.0),
  all.equal(o_deltac, e_deltac, tolerance = 1.0, scale = 1.0)
)
```

Because this example is structurally simple, we can verify the results by
direct calculation. The net cost per patient of the diet programme is the cost
of delivering the advice (`r gbp(cost_diet, p = TRUE)` GBP) plus the cost of
inserting a stent (`r gbp(cost_stent, p = TRUE)` GBP) multiplied by the
proportion who require a stent, or the failure rate of the programme,
`r round((1.0 - p_diet), 2L)`, equal to `r gbp(e_netc_diet, p = TRUE)` GBP. By a
similar
argument, the net cost per patient of the exercise programme is 
`r gbp(e_netc_exercise, p = TRUE)` GBP, as the model predicts. If
`r round(nnt, 2L)` patients are required to change from the diet programme to
the exercise programme to save one stent, the incremental increase in cost of
delivering the advice is the number needed to treat multiplied by the difference
in the cost of the programmes, `r gbp(cost_exercise, p = TRUE)` GBP - 
`r gbp(cost_diet, p = TRUE)` GBP, or `r gbp(incc, p = TRUE)` GBP. Because this
is greater than the cost saved by avoiding one stent 
(`r gbp(cost_stent, p = TRUE)` GBP), we can see that the additional net cost
per patient of delivering the programme is the difference between these costs,
divided by the number needed to treat, or `r gbp(e_deltac, p = TRUE)` GBP, as
the model predicts.

Note that this approach aggregates multiple paths that belong to the same
strategy (for example, the Success and Failure paths of the Diet strategy).
The option `by = "path"` can be used to evaluate each path separately.

```{r}
#| echo = TRUE
rp <- dt$evaluate(by = "path")
```

```{r}
with(data = rp, expr = {
  data.frame(
    Programme = Programme,
    Leaf = Leaf,
    Probability = round(Probability, digits = 2L),
    Cost = round(Cost, digits = 2L),
    stringsAsFactors = FALSE
  )
})
```

## Adjustment for disutility
Cost is not the only consideration that can be modelled using a 
Decision Tree. Suppose that requiring an intervention reduces the quality of
life of patients, associated with attending pre-operative appointments, pain
and discomfort of the procedure, and adverse events. This is estimated to be
associated with a mean disutility of 0.05 for those who receive a stent, assumed
to persist over 1 year.

To incorporate this into the model, we can set the utility of the two leaf nodes
which are associated with having a stent. Because we are changing the property
of two nodes in the tree, and not changing the tree structure, we do not have to
rebuild the tree.

```{r}
#| echo = TRUE
du_stent <- 0.05
leaf_node_diet_stent$set_utility(1.0 - du_stent)
leaf_node_exercise_stent$set_utility(1.0 - du_stent)
rs <- dt$evaluate()
```

```{r}
with(data = rs, expr = {
  data.frame(
    Programme = Programme,
    Probability = round(Probability, digits = 2L),
    Cost = round(Cost, digits = 2L),
    QALY = round(QALY, digits = 4L),
    stringsAsFactors = FALSE
  )
})
```

In this case, while the Diet strategy is preferred from a cost perspective, 
the utility of the Exercise strategy is superior. `rdecision` also calculates 
Quality-adjusted life-years (QALYs) taking into account the time horizon of 
the model (in this case, the default of one year was used, and therefore 
QALYs correspond to the Utility values). From these figures, the Incremental 
cost-effectiveness ratio (ICER) can be easily calculated:

```{r}
#| echo = TRUE
delta_c <- rs[[which(rs[, "Programme"] == "Exercise"), "Cost"]] -
  rs[[which(rs[, "Programme"] == "Diet"), "Cost"]]
delta_u <- rs[[which(rs[, "Programme"] == "Exercise"), "Utility"]] -
  rs[[which(rs[, "Programme"] == "Diet"), "Utility"]]
icer <- delta_c / delta_u
```

```{r}
e_du <- du_stent * (p_exercise - p_diet)
e_icer <- (e_netc_exercise - e_netc_diet) / e_du
```

```{r}
#| purl = FALSE
# test that ICER agrees with direct calculation
stopifnot(
  all.equal(icer, e_icer, tolerance = 100.0, scale = 1.0)
)
```

resulting in a cost of `r round(icer, 2L)` GBP per QALY gained in choosing the 
more effective Exercise strategy over the cheaper Diet strategy.

This can be verified by direct calculation, by dividing the difference in net
costs of the two programmes (`r gbp(e_netc_exercise - e_netc_diet, p = TRUE)`
GBP) by the increase in QALYs due to stents saved (`r round(du_stent, 3L)`
multiplied by the difference in success rates of the programme, 
(`r round(p_exercise, 3L)` - `r round(p_diet, 3L)`), or `r round(e_du, 3L)`
QALYs), an ICER of  `r gbp(e_icer, p = TRUE)` GBP / QALY.

# Introducing probabilistic elements

The model shown above uses a fixed value for each parameter, resulting in a 
single point estimate for each model result. However, parameters may be 
affected by uncertainty: for example, the success probability of each strategy
is extracted from a small trial of few patients. This uncertainty can be
incorporated into the Decision Tree model by representing individual parameters
with a statistical distribution, then repeating the evaluation of the model
multiple times with each run randomly drawing parameters from these defined
distributions.

In `rdecision`, model variables that are described by a distribution are
represented by `ModVar` objects. Many commonly used distributions, such as the
Normal, Log-Normal, Gamma and Beta distributions are included in the package,
and additional distributions can be easily implemented from the generic 
`ModVar` class. Additionally, model variables that are calculated from other
r probabilistic variables using an expression can be represented as `ExprModVar`
objects.

Fixed costs can be left as numerical values, or also be represented by `ModVar`s
which ensures that they are included in variable tabulations.

```{r}
#| echo = TRUE
cost_diet <- ConstModVar$new("Cost of diet programme", "GBP", 50.0)
cost_exercise <- ConstModVar$new("Cost of exercise programme", "GBP", 750.0)
cost_stent <- ConstModVar$new("Cost of stent intervention", "GBP", 5000.0)
```

In our simplified example, the probability of success of each strategy should
include the uncertainty associated with the small sample that they are based on.
This can be represented statistically by a Beta distribution, a probability
distribution constrained to the interval [0, 1]. A Beta distribution that
captures the results of the trials can be defined by the _alpha_ (observed
successes) and _beta_ (observed failures) parameters. 

```{r}
#| echo = TRUE
p_diet <- BetaModVar$new(
  alpha = s_diet, beta = f_diet, description = "P(diet)", units = ""
)
p_exercise <- BetaModVar$new(
  alpha = s_exercise, beta = f_exercise, description = "P(exercise)", units = ""
)
```

These distributions describe the probability of success of each strategy; by
the constraints of a Decision Tree, the sum of all probabilities associated
with a chance node must be 1. By leaving the probabilities of the reaction edges
as `NA_real`, `rdecision` will ensure this for each run, even when the
probability of one edge is represented by a model variable.

The newly defined `ModVars` can be incorporated into the Decision Tree model
using the same grammar as the non-probabilistic model. Because the actions and
reactions are objects already included in the tree, we can change their
properties using `set_` calls and those new properties will be used when the
tree is evaluated.

```{r}
#| echo = TRUE
action_diet$set_cost(cost_diet)
action_exercise$set_cost(cost_exercise)
```

```{r}
#| echo = TRUE
reaction_diet_success$set_probability(p_diet)
reaction_diet_failure$set_cost(cost_stent)

reaction_exercise_success$set_probability(p_exercise)
reaction_exercise_failure$set_cost(cost_stent)
```

All the probabilistic variables included in the model can be tabulated using
the `modvar_table()` method, which details the distribution definition and
some useful parameters, such as mean, SD and 95% CI.

```{r}
with(data = dt$modvar_table(), expr = {
  data.frame(
    Description = Description,
    Units = Units,
    Distribution = Distribution,
    Mean = Mean,
    SD = SD,
    Q2.5 = Q2.5,
    Q97.5 = Q97.5,
    stringsAsFactors = FALSE
  )
})
```

A call to the `evaluate()` method with the default settings uses the expected
(mean) value of each variable, and so replicates the point estimate above.

```{r}
#| echo = TRUE
rs <- dt$evaluate()
```

```{r}
#| purl = FALSE
# test that base case evaluation agrees with direct calculation
local({
  # direct calculations
  o_netc_diet <- rs[[which(rs[, "Programme"] == "Diet"), "Cost"]]
  o_netc_exercise <- rs[[which(rs[, "Programme"] == "Exercise"), "Cost"]]
  o_deltac <- o_netc_exercise - o_netc_diet
  # check
  stopifnot(
    all.equal(o_netc_diet, e_netc_diet, tolerance = 5.0, scale = 1.0),
    all.equal(o_netc_exercise, e_netc_exercise, tolerance = 5.0, scale = 1.0),
    all.equal(o_deltac, e_deltac, tolerance = 1.0, scale = 1.0)
  )
})
```

```{r}
with(data = rs, expr = {
  data.frame(
    Programme = Programme,
    Probability = round(Probability, digits = 2L),
    Cost = round(Cost, digits = 2L),
    QALY = round(QALY, digits = 4L),
    stringsAsFactors = FALSE
  )
})
```

However, because each variable is described by a distribution, it is now
possible to explore the range of possible values consistent with the model.
For example, a lower and upper bound can be estimated by setting each variable
to its 2.5-th or 97.5-th percentile:

```{r}
#| echo = TRUE
rs_025 <- dt$evaluate(setvars = "q2.5")
rs_975 <- dt$evaluate(setvars = "q97.5")
```

The costs for each choice when all variables are at their upper and lower
confidence levels are as follows:

```{r}
#| purl = FALSE
data.frame(
  Q2.5 = round(rs_025[, "Cost"], digits = 2L),
  Q97.5 = round(rs_975[, "Cost"], digits = 2L),
  row.names = c("Diet", "Exercise"),
  stringsAsFactors = FALSE
)
```

To sample the possible outcomes in a completely probabilistic way, the
`setvar = "random"` option can be used, which draws a random value from the
distribution of each variable. Repeating this process a sufficiently large
number of times builds a collection of results compatible with the model
definition, which can then be used to calculate ranges and confidence intervals
of the estimated values.

```{r}
#| echo = TRUE
N <- 1000L
rs <- dt$evaluate(setvars = "random", by = "run", N = N)
```

The estimates of cost for each intervention can be plotted as follows:
```{r}
#| echo = TRUE,
#| purl = FALSE
plot(
  rs[, "Cost.Diet"],
  rs[, "Cost.Exercise"],
  pch = 20L,
  xlab = "Cost of diet (GBP)", ylab = "Cost of exercise (GBP)",
  main = paste(N, "simulations of vascular disease prevention model")
)
abline(a = 0.0, b = 1.0, col = "red")
```

A tabular summary is as follows:
```{r}
local({
  data.frame(
    Cost.Diet = round(unclass(summary(rs[, "Cost.Diet"])), digits = 2L),
    Cost.Exercise = round(unclass(summary(rs[, "Cost.Exercise"])), digits = 2L),
    row.names = names(summary(rs[, "Cost.Diet"])),
    stringsAsFactors = FALSE
  )
})
```

The variables can be further manipulated, for example calculating the 
difference in cost between the two strategies for each run of the 
randomised model:

```{r}
#| echo = TRUE
rs[, "Difference"] <- rs[, "Cost.Diet"] - rs[, "Cost.Exercise"]
CI <- quantile(rs[, "Difference"], c(0.025, 0.975))
```

```{r}
#| echo = TRUE
hist(
  rs[, "Difference"], 100L,  main = "Distribution of saving",
  xlab = "Saving (GBP)"
)
```

```{r}
#| echo = FALSE
with(data = rs[1L : 10L, ], expr = {
  data.frame(
    Run = Run,
    Cost.Diet = round(Cost.Diet, digits = 2L),
    Cost.Exercise = round(Cost.Exercise, digits = 2L),
    Difference = round(Difference, digits = 2L)
  )
})
```

Plotting the distribution of the difference of the two costs reveals that, in
this model, the uncertainties in the input parameters are large enough that
either strategy could have a lower net cost, within a 95% confidence interval
[`r round(CI[[1]], 2L)`, `r round(CI[[2L]], 2L)`].

## Univariate threshold analysis

`rdecision` provides a `threshold` method to compare two strategies and
identify, for a given variable, the value at which one strategy becomes
cost saving over the other:

```{r}
#| echo = TRUE
cost_threshold <- dt$threshold(
  index = list(action_exercise),
  ref = list(action_diet),
  outcome = "saving",
  mvd = cost_exercise$description(),
  a = 0.0, b = 5000.0, tol = 0.1
)

success_threshold <- dt$threshold(
  index = list(action_exercise),
  ref = list(action_diet),
  outcome = "saving",
  mvd = p_exercise$description(),
  a = 0.0, b = 1.0, tol = 0.001
)
```

```{r}
#| purl = FALSE
# test that thresholds are evaluated correctly
stopifnot(
  all.equal(cost_threshold, e_cost_threshold, tolerance = 5.0, scale = 1.0),
  all.equal(
    success_threshold, e_success_threshold, tolerance = 0.03, scale = 1.0
  )
)
```

By univariate threshold analysis, the exercise program will be cost saving 
when its cost of delivery is less than `r round(cost_threshold, 2L)` GBP or when
its success rate is greater than `r round(100.0 * success_threshold, 1L)`%.

These can be verified by direct calculation. The cost of delivering the
exercise programme at which its net cost equals the net cost of the diet
programme is when the difference between the two programme delivery costs
multiplied by the number needed to treat becomes equal to the cost saved by
avoiding one stent. This is `r gbp(e_cost_threshold, p = TRUE)`, in agreement
with the model. The threshold success rate for the exercise programme is when
the number needed to treat is reduced such that the net cost of the two
programmes is equal, i.e., to `r round(nnt_threshold, 2L)`, from which we can
calculate the programme success rate threshold as 
`r round(e_success_threshold, 2L)`, in agreement with the model.