--- title: "Introduction to the construction of decision trees" author: "Paola Cognigni and Andrew Sims" date: "October 2021" bibliography: "REFERENCES.bib" csl: "nature-no-et-al.csl" output: rmarkdown::html_vignette: fig_width: 7 fig_height: 5 fig_caption: true df_print: kable vignette: > %\VignetteIndexEntry{Introduction to the construction of decision trees} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r} #| include = FALSE, #| purl = FALSE knitr::opts_chunk$set( collapse = TRUE, echo = FALSE, comment = "#>" ) ``` ```{r} #| purl = FALSE #nolint start ``` ```{r} library("rdecision") ``` ```{r} #| purl = FALSE #nolint end ``` # Decision Tree representations A decision tree is a decision model that represents all possible pathways through sequences of events (**nodes**), which can be under the experimenter's control (decisions) or not (chances). A decision tree can be represented visually according to a standardised grammar: * **Decision nodes** (represented graphically by a square $\square$): these represent alternative paths that the model should compare, for example different treatment plans. Each Decision node must be the source of two or more Actions. A decision tree can have one or more Decision nodes, which determine the possible strategies that the model compares. * **Chance nodes** (represented graphically by a circle $\bigcirc$): these represent alternative paths that are out of the experiment's control, for example the probability of developing a certain side effect. Each Chance node must be the source of one or more Reactions, each with a specified probability. The probability of Reactions originating from a single Chance node must sum to 1. * **Leaf nodes** (represented graphically by a triangle $\lhd$): these represent the final outcomes of a path. No further Actions or Reactions can occur after a Leaf node. A Leaf node can have a utility value (to a maximum of 1, indicating perfect utility) and an interval over which the utility applies. Nodes are linked by **edges**: * **Actions** arise from Decision nodes, and * **Reactions** arise from Chance nodes. `rdecision` builds a Decision Tree model by defining these elements and their relationships. For example, consider the fictitious and idealized decision problem, introduced in the package README file, of choosing between providing two forms of lifestyle advice, offered to people with vascular disease, which reduce the risk of needing an interventional procedure. The cost to a healthcare provider of the interventional procedure (e.g., inserting a stent) is 5000 GBP; the cost of providing the current form of lifestyle advice, an appointment with a dietician (“diet”), is 50 GBP and the cost of providing an alternative form, attendance at an exercise programme (“exercise”), is 750 GBP. If an advice programme is successful, there is no need for an interventional procedure. These costs can be defined as scalar variables, as follows: ```{r} #| echo = TRUE cost_diet <- 50.0 cost_exercise <- 750.0 cost_stent <- 5000.0 ``` The model for this fictional scenario can be defined by the following elements: * Decision node: which programme to enrol the patient in. ```{r} #| echo = TRUE decision_node <- DecisionNode$new("Programme") ``` * Chance nodes: the chance that the patient will need an interventional procedure. This is different for the two programmes, so two chance nodes must be defined. ```{r} #| echo = TRUE chance_node_diet <- ChanceNode$new("Outcome") chance_node_exercise <- ChanceNode$new("Outcome") ``` * Leaf nodes: the possible final states of the model, depending both on the decision (which programme) and the chance of needing an intervention. Here, we assume that the model has a time horizon of 1 year, and that the utility is the same for all patients (the default values). ```{r} #| echo = TRUE leaf_node_diet_no_stent <- LeafNode$new("No intervention") leaf_node_diet_stent <- LeafNode$new("Intervention") leaf_node_exercise_no_stent <- LeafNode$new("No intervention") leaf_node_exercise_stent <- LeafNode$new("Intervention") ``` These nodes can then be wired into a decision tree graph by defining the edges that link pairs of nodes as actions or reactions. ## Actions These are the two programmes being tested. The cost of each action, as described in the example, is embedded into the action definition. ```{r} #| echo = TRUE action_diet <- Action$new( decision_node, chance_node_diet, cost = cost_diet, label = "Diet" ) action_exercise <- Action$new( decision_node, chance_node_exercise, cost = cost_exercise, label = "Exercise" ) ``` ## Reactions These are the possible outcomes of each programme (success or failure), with their relevant probabilities. To continue our fictional example, in a small trial of the “diet” programme, 12 out of 68 patients (17.6%) avoided having an interventional procedure within one year, and in a separate small trial of the “exercise” programme 18 out of 58 patients (31.0%) avoided the interventional procedure within one year (it is assumed that the baseline characteristics in the two trials were comparable). ```{r} #| echo = TRUE s_diet <- 12L f_diet <- 56L s_exercise <- 18L f_exercise <- 40L ``` Epidemiologically, we can interpret the trial results in terms of the incidence proportions of having an adverse event (needing a stent) within one year, for each treatment strategy. ```{r} #| echo = TRUE ip_diet <- f_diet / (s_diet + f_diet) ip_exercise <- f_exercise / (s_exercise + f_exercise) nnt <- 1.0 / (ip_diet - ip_exercise) ``` The incidence proportions are `r round(ip_diet, 2L)` and `r round(ip_exercise, 2L)` for diet and exercise, respectively, noting that we define a programme failure as the need to insert a stent. The number needed to treat is the reciprocal of the difference in incidence proportions; `r round(nnt, 2L)` people must be allocated to the exercise programme rather than the diet programme to save one adverse event. These trial results can be represented as probabilities of outcome success (`p_diet`, `p_exercise`) derived from the incidence proportions of the trial results. ```{r} #| echo = TRUE p_diet <- 1.0 - ip_diet p_exercise <- 1.0 - ip_exercise ``` These probabilities, as well as the cost associated with each outcome, can then be embedded into the reaction definition. The probabilities of traversing the reactions for programme failure are set to `NA_real_`, which leaves `rdecision` to calculate them at each evaluation of the decision tree, ensuring that the total probability of leaving each chance node adds to 1. ```{r} #| echo = TRUE reaction_diet_success <- Reaction$new( chance_node_diet, leaf_node_diet_no_stent, p = p_diet, cost = 0.0, label = "Success" ) reaction_diet_failure <- Reaction$new( chance_node_diet, leaf_node_diet_stent, p = NA_real_, cost = cost_stent, label = "Failure" ) reaction_exercise_success <- Reaction$new( chance_node_exercise, leaf_node_exercise_no_stent, p = p_exercise, cost = 0.0, label = "Success" ) reaction_exercise_failure <- Reaction$new( chance_node_exercise, leaf_node_exercise_stent, p = NA_real_, cost = cost_stent, label = "Failure" ) ``` When all the elements are defined and satisfy the restrictions of a Decision Tree (see the documentation for the `DecisionTree` class for details), the whole model can be built: ```{r} #| echo = TRUE dt <- DecisionTree$new( V = list( decision_node, chance_node_diet, chance_node_exercise, leaf_node_diet_no_stent, leaf_node_diet_stent, leaf_node_exercise_no_stent, leaf_node_exercise_stent ), E = list( action_diet, action_exercise, reaction_diet_success, reaction_diet_failure, reaction_exercise_success, reaction_exercise_failure ) ) ``` `rdecision` includes a `draw` method to generate a diagram of a defined Decision Tree. ```{r} #| echo = TRUE dt$draw() ``` # Evaluating a decision tree ## Base case As a decision model, a Decision Tree takes into account the costs, probabilities and utilities encountered as each strategy is traversed from left to right. In this example, only two strategies (Diet or Exercise) exist in the model and can be compared using the `evaluate()` method. ```{r} #| echo = TRUE rs <- dt$evaluate() ``` ```{r} with(data = rs, expr = { data.frame( Programme = Programme, Probability = round(Probability, digits = 2L), Cost = round(Cost, digits = 2L), stringsAsFactors = FALSE ) }) ``` From the evaluation of the two strategies, it is apparent that the Diet strategy has a marginally lower net cost by `r round(rs[[2L, "Cost"]] - rs[[1L, "Cost"]], 2L)` GBP. ```{r} o_netc_diet <- rs[[which(rs[, "Programme"] == "Diet"), "Cost"]] e_netc_diet <- cost_diet + (1.0 - p_diet) * cost_stent o_netc_exercise <- rs[[which(rs[, "Programme"] == "Exercise"), "Cost"]] e_netc_exercise <- cost_exercise + (1.0 - p_exercise) * cost_stent incc <- nnt * (cost_exercise - cost_diet) o_deltac <- o_netc_exercise - o_netc_diet e_deltac <- (incc - cost_stent) / nnt e_cost_threshold <- (cost_stent / nnt) + cost_diet nnt_threshold <- cost_stent / (cost_exercise - cost_diet) e_success_threshold <- 1.0 - (ip_diet - (1.0 / nnt_threshold)) ``` ```{r} #| purl = FALSE # test that base case evaluation agrees with direct calculation stopifnot( all.equal(o_netc_diet, e_netc_diet, tolerance = 5.0, scale = 1.0), all.equal(o_netc_exercise, e_netc_exercise, tolerance = 5.0, scale = 1.0), all.equal(o_deltac, e_deltac, tolerance = 1.0, scale = 1.0) ) ``` Because this example is structurally simple, we can verify the results by direct calculation. The net cost per patient of the diet programme is the cost of delivering the advice (`r gbp(cost_diet, p = TRUE)` GBP) plus the cost of inserting a stent (`r gbp(cost_stent, p = TRUE)` GBP) multiplied by the proportion who require a stent, or the failure rate of the programme, `r round((1.0 - p_diet), 2L)`, equal to `r gbp(e_netc_diet, p = TRUE)` GBP. By a similar argument, the net cost per patient of the exercise programme is `r gbp(e_netc_exercise, p = TRUE)` GBP, as the model predicts. If `r round(nnt, 2L)` patients are required to change from the diet programme to the exercise programme to save one stent, the incremental increase in cost of delivering the advice is the number needed to treat multiplied by the difference in the cost of the programmes, `r gbp(cost_exercise, p = TRUE)` GBP - `r gbp(cost_diet, p = TRUE)` GBP, or `r gbp(incc, p = TRUE)` GBP. Because this is greater than the cost saved by avoiding one stent (`r gbp(cost_stent, p = TRUE)` GBP), we can see that the additional net cost per patient of delivering the programme is the difference between these costs, divided by the number needed to treat, or `r gbp(e_deltac, p = TRUE)` GBP, as the model predicts. Note that this approach aggregates multiple paths that belong to the same strategy (for example, the Success and Failure paths of the Diet strategy). The option `by = "path"` can be used to evaluate each path separately. ```{r} #| echo = TRUE rp <- dt$evaluate(by = "path") ``` ```{r} with(data = rp, expr = { data.frame( Programme = Programme, Leaf = Leaf, Probability = round(Probability, digits = 2L), Cost = round(Cost, digits = 2L), stringsAsFactors = FALSE ) }) ``` ## Adjustment for disutility Cost is not the only consideration that can be modelled using a Decision Tree. Suppose that requiring an intervention reduces the quality of life of patients, associated with attending pre-operative appointments, pain and discomfort of the procedure, and adverse events. This is estimated to be associated with a mean disutility of 0.05 for those who receive a stent, assumed to persist over 1 year. To incorporate this into the model, we can set the utility of the two leaf nodes which are associated with having a stent. Because we are changing the property of two nodes in the tree, and not changing the tree structure, we do not have to rebuild the tree. ```{r} #| echo = TRUE du_stent <- 0.05 leaf_node_diet_stent$set_utility(1.0 - du_stent) leaf_node_exercise_stent$set_utility(1.0 - du_stent) rs <- dt$evaluate() ``` ```{r} with(data = rs, expr = { data.frame( Programme = Programme, Probability = round(Probability, digits = 2L), Cost = round(Cost, digits = 2L), QALY = round(QALY, digits = 4L), stringsAsFactors = FALSE ) }) ``` In this case, while the Diet strategy is preferred from a cost perspective, the utility of the Exercise strategy is superior. `rdecision` also calculates Quality-adjusted life-years (QALYs) taking into account the time horizon of the model (in this case, the default of one year was used, and therefore QALYs correspond to the Utility values). From these figures, the Incremental cost-effectiveness ratio (ICER) can be easily calculated: ```{r} #| echo = TRUE delta_c <- rs[[which(rs[, "Programme"] == "Exercise"), "Cost"]] - rs[[which(rs[, "Programme"] == "Diet"), "Cost"]] delta_u <- rs[[which(rs[, "Programme"] == "Exercise"), "Utility"]] - rs[[which(rs[, "Programme"] == "Diet"), "Utility"]] icer <- delta_c / delta_u ``` ```{r} e_du <- du_stent * (p_exercise - p_diet) e_icer <- (e_netc_exercise - e_netc_diet) / e_du ``` ```{r} #| purl = FALSE # test that ICER agrees with direct calculation stopifnot( all.equal(icer, e_icer, tolerance = 100.0, scale = 1.0) ) ``` resulting in a cost of `r round(icer, 2L)` GBP per QALY gained in choosing the more effective Exercise strategy over the cheaper Diet strategy. This can be verified by direct calculation, by dividing the difference in net costs of the two programmes (`r gbp(e_netc_exercise - e_netc_diet, p = TRUE)` GBP) by the increase in QALYs due to stents saved (`r round(du_stent, 3L)` multiplied by the difference in success rates of the programme, (`r round(p_exercise, 3L)` - `r round(p_diet, 3L)`), or `r round(e_du, 3L)` QALYs), an ICER of `r gbp(e_icer, p = TRUE)` GBP / QALY. # Introducing probabilistic elements The model shown above uses a fixed value for each parameter, resulting in a single point estimate for each model result. However, parameters may be affected by uncertainty: for example, the success probability of each strategy is extracted from a small trial of few patients. This uncertainty can be incorporated into the Decision Tree model by representing individual parameters with a statistical distribution, then repeating the evaluation of the model multiple times with each run randomly drawing parameters from these defined distributions. In `rdecision`, model variables that are described by a distribution are represented by `ModVar` objects. Many commonly used distributions, such as the Normal, Log-Normal, Gamma and Beta distributions are included in the package, and additional distributions can be easily implemented from the generic `ModVar` class. Additionally, model variables that are calculated from other r probabilistic variables using an expression can be represented as `ExprModVar` objects. Fixed costs can be left as numerical values, or also be represented by `ModVar`s which ensures that they are included in variable tabulations. ```{r} #| echo = TRUE cost_diet <- ConstModVar$new("Cost of diet programme", "GBP", 50.0) cost_exercise <- ConstModVar$new("Cost of exercise programme", "GBP", 750.0) cost_stent <- ConstModVar$new("Cost of stent intervention", "GBP", 5000.0) ``` In our simplified example, the probability of success of each strategy should include the uncertainty associated with the small sample that they are based on. This can be represented statistically by a Beta distribution, a probability distribution constrained to the interval [0, 1]. A Beta distribution that captures the results of the trials can be defined by the _alpha_ (observed successes) and _beta_ (observed failures) parameters. ```{r} #| echo = TRUE p_diet <- BetaModVar$new( alpha = s_diet, beta = f_diet, description = "P(diet)", units = "" ) p_exercise <- BetaModVar$new( alpha = s_exercise, beta = f_exercise, description = "P(exercise)", units = "" ) ``` These distributions describe the probability of success of each strategy; by the constraints of a Decision Tree, the sum of all probabilities associated with a chance node must be 1. By leaving the probabilities of the reaction edges as `NA_real`, `rdecision` will ensure this for each run, even when the probability of one edge is represented by a model variable. The newly defined `ModVars` can be incorporated into the Decision Tree model using the same grammar as the non-probabilistic model. Because the actions and reactions are objects already included in the tree, we can change their properties using `set_` calls and those new properties will be used when the tree is evaluated. ```{r} #| echo = TRUE action_diet$set_cost(cost_diet) action_exercise$set_cost(cost_exercise) ``` ```{r} #| echo = TRUE reaction_diet_success$set_probability(p_diet) reaction_diet_failure$set_cost(cost_stent) reaction_exercise_success$set_probability(p_exercise) reaction_exercise_failure$set_cost(cost_stent) ``` All the probabilistic variables included in the model can be tabulated using the `modvar_table()` method, which details the distribution definition and some useful parameters, such as mean, SD and 95% CI. ```{r} with(data = dt$modvar_table(), expr = { data.frame( Description = Description, Units = Units, Distribution = Distribution, Mean = Mean, SD = SD, Q2.5 = Q2.5, Q97.5 = Q97.5, stringsAsFactors = FALSE ) }) ``` A call to the `evaluate()` method with the default settings uses the expected (mean) value of each variable, and so replicates the point estimate above. ```{r} #| echo = TRUE rs <- dt$evaluate() ``` ```{r} #| purl = FALSE # test that base case evaluation agrees with direct calculation local({ # direct calculations o_netc_diet <- rs[[which(rs[, "Programme"] == "Diet"), "Cost"]] o_netc_exercise <- rs[[which(rs[, "Programme"] == "Exercise"), "Cost"]] o_deltac <- o_netc_exercise - o_netc_diet # check stopifnot( all.equal(o_netc_diet, e_netc_diet, tolerance = 5.0, scale = 1.0), all.equal(o_netc_exercise, e_netc_exercise, tolerance = 5.0, scale = 1.0), all.equal(o_deltac, e_deltac, tolerance = 1.0, scale = 1.0) ) }) ``` ```{r} with(data = rs, expr = { data.frame( Programme = Programme, Probability = round(Probability, digits = 2L), Cost = round(Cost, digits = 2L), QALY = round(QALY, digits = 4L), stringsAsFactors = FALSE ) }) ``` However, because each variable is described by a distribution, it is now possible to explore the range of possible values consistent with the model. For example, a lower and upper bound can be estimated by setting each variable to its 2.5-th or 97.5-th percentile: ```{r} #| echo = TRUE rs_025 <- dt$evaluate(setvars = "q2.5") rs_975 <- dt$evaluate(setvars = "q97.5") ``` The costs for each choice when all variables are at their upper and lower confidence levels are as follows: ```{r} #| purl = FALSE data.frame( Q2.5 = round(rs_025[, "Cost"], digits = 2L), Q97.5 = round(rs_975[, "Cost"], digits = 2L), row.names = c("Diet", "Exercise"), stringsAsFactors = FALSE ) ``` To sample the possible outcomes in a completely probabilistic way, the `setvar = "random"` option can be used, which draws a random value from the distribution of each variable. Repeating this process a sufficiently large number of times builds a collection of results compatible with the model definition, which can then be used to calculate ranges and confidence intervals of the estimated values. ```{r} #| echo = TRUE N <- 1000L rs <- dt$evaluate(setvars = "random", by = "run", N = N) ``` The estimates of cost for each intervention can be plotted as follows: ```{r} #| echo = TRUE, #| purl = FALSE plot( rs[, "Cost.Diet"], rs[, "Cost.Exercise"], pch = 20L, xlab = "Cost of diet (GBP)", ylab = "Cost of exercise (GBP)", main = paste(N, "simulations of vascular disease prevention model") ) abline(a = 0.0, b = 1.0, col = "red") ``` A tabular summary is as follows: ```{r} local({ data.frame( Cost.Diet = round(unclass(summary(rs[, "Cost.Diet"])), digits = 2L), Cost.Exercise = round(unclass(summary(rs[, "Cost.Exercise"])), digits = 2L), row.names = names(summary(rs[, "Cost.Diet"])), stringsAsFactors = FALSE ) }) ``` The variables can be further manipulated, for example calculating the difference in cost between the two strategies for each run of the randomised model: ```{r} #| echo = TRUE rs[, "Difference"] <- rs[, "Cost.Diet"] - rs[, "Cost.Exercise"] CI <- quantile(rs[, "Difference"], c(0.025, 0.975)) ``` ```{r} #| echo = TRUE hist( rs[, "Difference"], 100L, main = "Distribution of saving", xlab = "Saving (GBP)" ) ``` ```{r} #| echo = FALSE with(data = rs[1L : 10L, ], expr = { data.frame( Run = Run, Cost.Diet = round(Cost.Diet, digits = 2L), Cost.Exercise = round(Cost.Exercise, digits = 2L), Difference = round(Difference, digits = 2L) ) }) ``` Plotting the distribution of the difference of the two costs reveals that, in this model, the uncertainties in the input parameters are large enough that either strategy could have a lower net cost, within a 95% confidence interval [`r round(CI[[1]], 2L)`, `r round(CI[[2L]], 2L)`]. ## Univariate threshold analysis `rdecision` provides a `threshold` method to compare two strategies and identify, for a given variable, the value at which one strategy becomes cost saving over the other: ```{r} #| echo = TRUE cost_threshold <- dt$threshold( index = list(action_exercise), ref = list(action_diet), outcome = "saving", mvd = cost_exercise$description(), a = 0.0, b = 5000.0, tol = 0.1 ) success_threshold <- dt$threshold( index = list(action_exercise), ref = list(action_diet), outcome = "saving", mvd = p_exercise$description(), a = 0.0, b = 1.0, tol = 0.001 ) ``` ```{r} #| purl = FALSE # test that thresholds are evaluated correctly stopifnot( all.equal(cost_threshold, e_cost_threshold, tolerance = 5.0, scale = 1.0), all.equal( success_threshold, e_success_threshold, tolerance = 0.03, scale = 1.0 ) ) ``` By univariate threshold analysis, the exercise program will be cost saving when its cost of delivery is less than `r round(cost_threshold, 2L)` GBP or when its success rate is greater than `r round(100.0 * success_threshold, 1L)`%. These can be verified by direct calculation. The cost of delivering the exercise programme at which its net cost equals the net cost of the diet programme is when the difference between the two programme delivery costs multiplied by the number needed to treat becomes equal to the cost saved by avoiding one stent. This is `r gbp(e_cost_threshold, p = TRUE)`, in agreement with the model. The threshold success rate for the exercise programme is when the number needed to treat is reduced such that the net cost of the two programmes is equal, i.e., to `r round(nnt_threshold, 2L)`, from which we can calculate the programme success rate threshold as `r round(e_success_threshold, 2L)`, in agreement with the model.