---
title: "Creating drug cohorts"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Use DrugUtilisation to create a cohort}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
  markdown: 
    wrap: 72
---

```{r , include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

In this vignette we will introduce how to create a drug users cohorts. A cohort is a set of people that satisfy a certain inclusion criteria during a certain time frame. The cohort object is defined in : `vignette("cdm_reference", package = "omopgenerics")`.

The function `generateDrugUtilisationCohortSet` is used to generate cohorts of drug users based on the `drug_exposure` table and a conceptSet.

These cohorts can be subsetted to the exposures of interest applying the different inclusion criteria:

- Require that entries are in a certain date range `requireDrugInDateRange()`.

- Subset to the first entry `requireIsFirstDrugEntry()`.

- Require a certain time in observation before the entries `requireObservationBeforeDrug()`.

- Require a certain time before exposure `requirePriorDrugWashout()`.

# Creating a `cdm_reference` object

The first thing that we need is a `cdm_reference` object to our OMOP CDM instance. You can learn how to create cdm references using CDMConnector here: `vignette("a04_DBI_connection_examples", package = "CDMConnector")`.

The DrugUtilisation packages contains some mock data that can be useful to test the package:

```{r}
library(DrugUtilisation)

cdm <- mockDrugUtilisation(numberIndividuals = 100, seed = 1)

cdm
```

# Create a drug users cohort

To create a basic drug users cohort we need two things:

- A conceptSet: will determine which concepts we will use.
- A gapEra: will determine how we will collapse those exposures.

## Creating a conceptSet

There are three possible forms of a conceptSet:

- A named list of concept ids
```{r}
conceptSet <- list(acetaminophen = c(1, 2, 3))
conceptSet
```

- A `codelist` object, see `vignette("codelists", package = "omopgenerics")`
```{r}
conceptSet <- list(acetaminophen = c(1, 2, 3)) |> omopgenerics::newCodelist()
conceptSet
conceptSet$acetaminophen
```

- A `conceptSetExpression` object, see `vignette("codelists", package = "omopgenerics")`

```{r}
conceptSet <- list(acetaminophen = dplyr::tibble(
  concept_id = 1125315,
  excluded = FALSE,
  descendants = TRUE,
  mapped = FALSE
)) |>
  omopgenerics::newConceptSetExpression()
conceptSet
conceptSet$acetaminophen
```

The package [CodelistGenerator](https://cran.r-project.org/package=CodelistGenerator)
can be very useful to create conceptSet.

```{r}
library(CodelistGenerator)
```

For example we can create a conceptSet based in an ingredient with `getDrugIngredientCodes()`:
```{r}
codes <- getDrugIngredientCodes(cdm = cdm, name = "acetaminophen")
codes[["161_acetaminophen"]]
```

We could also use the function `codesFromConceptSet()` to read a concept set from a json file:
```{r}
codes <- codesFromConceptSet(path = system.file("acetaminophen.json", package = "DrugUtilisation"), cdm = cdm)
codes
```

## The gapEra parameter

The `gapEra` parameter is used to join exposures into episodes, let's say for example we have an individual with 4 drug exposures that we are interested in. The first two overlap each other, then there is a gap of 29 days and two consecutive exposures:

```{r, echo = FALSE, fig.width = 7, fig.height = 4}
dplyr::tibble(
  drug_exposure_id = c(1, 2, 3, 4),
  start = as.Date(c("2020-01-01", "2020-01-20", "2020-03-15", "2020-04-20")),
  end = as.Date(c("2020-01-30", "2020-02-15", "2020-04-19", "2020-05-15"))
) |>
  tidyr::pivot_longer(c("start", "end")) |>
  dplyr::mutate(lab = format(value, "%d %b")) |>
  ggplot2::ggplot(ggplot2::aes(x = value, y = drug_exposure_id, group = drug_exposure_id, label = lab)) +
  ggplot2::geom_point(size = 5) +
  ggplot2::geom_line(linewidth = 2) +
  ggplot2::lims(y = c(0.5, 4.5)) +
  ggplot2::geom_text(nudge_y = 0.2) +
  ggplot2::geom_line(
    data = dplyr::tibble(
      y = c(1.5, 2.5, 3.5),
      type = c("overlap", "gap", "gap"),
      start = as.Date(c("2020-01-20", "2020-03-15", "2020-04-20")),
      end = as.Date(c("2020-01-30", "2020-02-15", "2020-04-19"))
    ) |>
      tidyr::pivot_longer(c("start", "end")),
    mapping = ggplot2::aes(x = value, y = y, group = y, color = type),
    inherit.aes = FALSE,
    linewidth = 2
  ) +
  ggtext::geom_richtext(
    data = dplyr::tibble(
      y = c(1.5, 2.5, 3.5),
      type = c("overlap", "gap", "gap"),
      lab = c("overlap", "gap = 29 days", "gap = 1 day"),
      start = as.Date(c("2020-01-25", "2020-03-01", "2020-04-19"))
    ),
    mapping = ggplot2::aes(x = start, y = y, group = y, color = type, label = lab),
    inherit.aes = FALSE,
    nudge_y = 0.2
  ) +
  # ggplot2::ggtitle("subject_id = 1") +
  ggplot2::ylab("Drug exposure id") +
  ggplot2::xlab("Time") +
  ggplot2::theme(legend.position = "none")
```

If we would create the episode with **gapEra = 0**, we would have 3 resultant episodes, the first two that overlap would be joined in a single episode, but then the other two would be independent:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
x <- dplyr::tibble(
  "cohort_definition_id" = 1L,
  "subject_id" = 1L,
  "cohort_start_date" = as.Date(c("2020-01-01", "2020-03-15", "2020-04-20")),
  "cohort_end_date" = as.Date(c("2020-02-15", "2020-04-19", "2020-05-15"))
)
print(x)
x |>
  dplyr::mutate(record_id = dplyr::row_number()) |>
  tidyr::pivot_longer(c("cohort_start_date", "cohort_end_date")) |>
  ggplot2::ggplot(ggplot2::aes(x = value, y = 1, group = record_id, label = value)) +
  ggplot2::geom_line(linewidth = 1.5) +
  ggplot2::geom_point(size = 3, shape = 15, ggplot2::aes(color = name)) +
  ggplot2::lims(y = c(0.5, 1.5)) +
  ggplot2::ylab(" ") +
  ggplot2::xlab("Time") +
  ggplot2::theme(legend.position = "none", axis.ticks.y = ggplot2::element_blank())
```

If, instead we would use a **gapEra = 1**, we would have 2 resultant episodes, the first two that overlap would be joined in a single episode (as before), now the two consecutive exposures would be joined in a single episode:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
x <- dplyr::tibble(
  "cohort_definition_id" = 1L,
  "subject_id" = 1L,
  "cohort_start_date" = as.Date(c("2020-01-01", "2020-03-15")),
  "cohort_end_date" = as.Date(c("2020-02-15", "2020-05-15"))
)
print(x)
x |>
  dplyr::mutate(record_id = dplyr::row_number()) |>
  tidyr::pivot_longer(c("cohort_start_date", "cohort_end_date")) |>
  ggplot2::ggplot(ggplot2::aes(x = value, y = 1, group = record_id, label = value)) +
  ggplot2::geom_line(linewidth = 1.5) +
  ggplot2::geom_point(size = 3, shape = 15, ggplot2::aes(color = name)) +
  ggplot2::lims(y = c(0.5, 1.5)) +
  ggplot2::ylab(" ") +
  ggplot2::xlab("Time") +
  ggplot2::theme(legend.position = "none", axis.ticks.y = ggplot2::element_blank())
```

The result would be the same for any value between 1 and 28 (**gapEra $\in$ [1, 28]**).

Whereas, if we would use a **gapEra = 29** all the records would be collapsed into a single episode:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
x <- dplyr::tibble(
  "cohort_definition_id" = 1L,
  "subject_id" = 1L,
  "cohort_start_date" = as.Date(c("2020-01-01")),
  "cohort_end_date" = as.Date(c("2020-05-15"))
)
print(x)
x |>
  dplyr::mutate(record_id = dplyr::row_number()) |>
  tidyr::pivot_longer(c("cohort_start_date", "cohort_end_date")) |>
  ggplot2::ggplot(ggplot2::aes(x = value, y = 1, group = record_id, label = value)) +
  ggplot2::geom_line(linewidth = 1.5) +
  ggplot2::geom_point(size = 3, shape = 15, ggplot2::aes(color = name)) +
  ggplot2::lims(y = c(0.5, 1.5)) +
  ggplot2::ylab(" ") +
  ggplot2::xlab("Time") +
  ggplot2::theme(legend.position = "none", axis.ticks.y = ggplot2::element_blank())
```

## Create your cohort

We will then create now a cohort with all the drug users of acetaminophen with a gapEra of 30 days.

```{r, messages = TRUE}
codes <- getDrugIngredientCodes(cdm = cdm, name = "acetaminophen")
names(codes) <- "acetaminophen"
cdm <- generateDrugUtilisationCohortSet(cdm = cdm, name = "acetaminophen_cohort", conceptSet = codes, gapEra = 30)
cdm
```

NOTE that the `name` argument is used to create the new table in the cdm object. For database backends this is the name of the table that will be created.

We can compare what we see with what we would expect; if we look at the individual with more records we can see how all of them are joined into a single exposure as the records overlap each other:

```{r}
cdm$drug_exposure |>
  dplyr::filter(drug_concept_id %in% !!codes$acetaminophen & person_id == 69)
```

```{r}
cdm$acetaminophen_cohort |>
  dplyr::filter(subject_id == 69)
```

In this case gapEra did not have a big impact as we can see in the attrition:

```{r}
attrition(cdm$acetaminophen_cohort)
```

We can see this particular case of this individual:

```{r}
cdm$drug_exposure |>
  dplyr::filter(drug_concept_id %in% !!codes$acetaminophen & person_id == 50)
```

In this case we have 3 exposures separated by 3 days, so if we use the 30 days gap both exposures are joined into a single episode, whereas if we would use a gapEra smaller than 3 we would consider them as different episodes.

```{r}
cdm$acetaminophen_cohort |>
  dplyr::filter(subject_id == 50)
```

We can access the other cohort attributes using the adequate functions. In settings we can see that the gapEra used is recorded or with cohortCodelist we can see which was the codelist used to create the cohort.

```{r}
settings(cdm$acetaminophen_cohort)
cohortCount(cdm$acetaminophen_cohort)
cohortCodelist(cdm$acetaminophen_cohort, cohortId = 1)
```

# Apply inclusion criteria to drug cohorts

Once we have created our base cohort using a conceptSet and a gapEra we can apply different restrictions:

- require a prior unexposed time: `requirePriorDrugWashout()`
- require that it is the first entry: `requireIsFirstDrugEntry()`
- require a prior observation in the cdm: `requireObservationBeforeDrug()`
- require that date are within a certain interval: `requireDrugInDateRange()`

## `requirePriorDrugWashout()`

To require that the cohort entries (drug episodes) are incident we would usually define a time (`days`) where the individual is not exposed to the drug. This can be achieved using `requirePriorDrugWashout()` function. In this example we would restrict to individuals with 365 days of no exposure:

```{r}
cdm$acetaminophen_cohort <- cdm$acetaminophen_cohort |>
  requirePriorDrugWashout(days = 365)
```

The result will be a cohort with the individuals that fulfill the criteria:

```{r}
cdm$acetaminophen_cohort
```

This would also get recorded in the attrition, counts and settings.

In the settings a new column with the specified parameter used:

```{r}
settings(cdm$acetaminophen_cohort)
```

The counts will be updated:

```{r}
cohortCount(cdm$acetaminophen_cohort)
```

And the attrition will have a new line:

```{r}
attrition(cdm$acetaminophen_cohort)
```

The `name` argument can be used to put the result into a different table in our cdm (by default the function updates the current cohort table). Whereas the `cohortId` argument is used to apply this criteria to only a restricted set of cohorts (by default the same criteria is applied to all the cohort records). To show this in an example we will create two cohorts (metformin and simvastatin) inside a table named `my_cohort` and then apply the inclusion criteria to only one of them (simvastatin) and save the result to a table named: `my_new_cohort`

```{r}
codes <- getDrugIngredientCodes(cdm = cdm, name = c("metformin", "simvastatin"))
cdm <- generateDrugUtilisationCohortSet(cdm = cdm, name = "my_cohort", conceptSet = codes, gapEra = 30)
cdm
settings(cdm$my_cohort)
cdm$my_new_cohort <- cdm$my_cohort |>
  requirePriorDrugWashout(days = 365, cohortId = 2, name = "my_new_cohort")
cdm
attrition(cdm$my_new_cohort)
```

## `requireIsFirstDrugEntry()`

To require that the cohort entry (drug episodes) is the first one of the available ones we can use the `requireIsFirstDrugEntry()` function. See example:

```{r}
cdm$acetaminophen_cohort <- cdm$acetaminophen_cohort |>
  requireIsFirstDrugEntry()
```

The result will be a cohort with the individuals that fulfill the criteria:

```{r}
cdm$acetaminophen_cohort
```

This would also get recorded in the attrition, counts and settings on top of the already exiting ones.

In the settings a new column with the specified parameter used:

```{r}
settings(cdm$acetaminophen_cohort)
```

The counts will be updated:

```{r}
cohortCount(cdm$acetaminophen_cohort)
```

And the attrition will have a new line:

```{r}
attrition(cdm$acetaminophen_cohort)
```

## `requireObservationBeforeDrug()`

To require that a cohort entry (drug episodes) has a certain time of prior observation we can use the `requireObservationBeforeDrug()` function. See example:

```{r}
cdm$acetaminophen_cohort <- cdm$acetaminophen_cohort |>
  requireObservationBeforeDrug(days = 365)
```

The result will be a cohort with the individuals that fulfill the criteria:

```{r}
cdm$acetaminophen_cohort
```

This would also get recorded in the attrition, counts and settings on top of the already exiting ones.

In the settings a new column with the specified parameter used:

```{r}
settings(cdm$acetaminophen_cohort)
```

The counts will be updated:

```{r}
cohortCount(cdm$acetaminophen_cohort)
```

And the attrition will have a new line:

```{r}
attrition(cdm$acetaminophen_cohort)
```


## `requireDrugInDateRange()`

To require that a cohort entry (drug episodes) has a certain date within an specific range we can use the `requireDrugInDateRange()` function. In general you would like to apply this restriction to the incident date (cohort_start_date), but the function is flexible and you can use it to restrict to any other date. See example:

```{r}
cdm$acetaminophen_cohort <- cdm$acetaminophen_cohort |>
  requireDrugInDateRange(
    indexDate = "cohort_start_date",
    dateRange = as.Date(c("2000-01-01", "2020-12-31"))
  )
```

The result will be a cohort with the individuals that fulfill the criteria:

```{r}
cdm$acetaminophen_cohort
```

This would also get recorded in the attrition, counts and settings on top of the already exiting ones.

In the settings a new column with the specified parameter used:

```{r}
settings(cdm$acetaminophen_cohort)
```

The counts will be updated:

```{r}
cohortCount(cdm$acetaminophen_cohort)
```

And the attrition will have a new line:

```{r}
attrition(cdm$acetaminophen_cohort)
```

If you just want to restrict on the lower or upper bound you can just leave the other element as NA and then no condition will be applied, see for example:

```{r}
cdm$my_new_cohort <- cdm$my_new_cohort |>
  requireDrugInDateRange(dateRange = as.Date(c(NA, "2010-12-31")))
attrition(cdm$my_new_cohort)
```

# The order matters

It is very important to know that the different restrictions are not commutable operations and that different order can lead to different results. Let's see the following example where we have an individual with 4 cohort entries:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
x <- dplyr::tibble(
  "cohort_start_date" = as.Date(c("2010-03-01", "2011-02-15", "2012-03-05", "2013-06-15")),
  "cohort_end_date" = as.Date(c("2010-05-15", "2012-01-01", "2012-05-12", "2013-08-12"))
)
y <- dplyr::tibble(
  start = x$cohort_end_date[-nrow(x)] + 1, end = x$cohort_start_date[-1] - 1
) |>
  dplyr::mutate(
    record_id = dplyr::row_number(),
    distance = as.numeric(end - start + 1),
    mean = start + distance / 2,
    distance = paste0(distance, " days")
  ) |>
  tidyr::pivot_longer(c("start", "end"))
x |>
  dplyr::mutate(record_id = dplyr::row_number()) |>
  tidyr::pivot_longer(c("cohort_start_date", "cohort_end_date")) |>
  ggplot2::ggplot(ggplot2::aes(x = value, y = 1, group = record_id, label = value)) +
  ggplot2::geom_line(linewidth = 1.5) +
  ggplot2::lims(y = c(0.8, 1.2)) +
  ggplot2::geom_point(
    ggplot2::aes(x = as.Date("2010-01-01"), y = 1),
    shape = 23, color = "#00AF6A", size = 5, fill = "#00AF6A"
  ) +
  ggplot2::geom_text(
    ggplot2::aes(x = as.Date("2010-01-01"), y = 0.92, label = "Start observation", hjust = "left"),
    color = "#00AF6A"
  ) +
  ggplot2::geom_line(
    data = y,
    ggplot2::aes(x = value, y = 1.05, group = record_id, color = "red"),
    linewidth = 1.1
  ) +
  ggplot2::geom_text(data = y, ggplot2::aes(x = mean, y = 1.09, label = distance), nudge_y = 0.008) +
  ggplot2::xlim(as.Date(c("2010-01-01", "2014-01-01"))) +
  ggplot2::ylab("") +
  ggplot2::xlab("Time") +
  ggplot2::theme(
    legend.position = "none",
    axis.ticks.y = ggplot2::element_blank(),
    axis.text.y = ggplot2::element_blank()
  )
```

In this case we will see the result of combining in different ways 4 inclusion criteria:

- **first**: `requireIsFirstDrugEntry()`
- **washout**: `requirePriorDrugWashout(days = 365)`
- **minObs**: `requireObservationBeforeDrug(days = 365)`
- **2011-2012** `requireDrugInDateRange(dateRange = as.Date(c("2011-01-01", "2012-12-31)))`

## first and washout

If we would apply the initially the **first** requirement and then the **washout** one we would end with only the first record:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
colours <- c("#ffb3b3", "#2d862d")
doplot <- function(x) {
  x |>
    dplyr::mutate(record_id = dplyr::row_number()) |>
    tidyr::pivot_longer(c("cohort_start_date", "cohort_end_date")) |>
    ggplot2::ggplot(ggplot2::aes(
      x = value, y = 1, group = record_id, label = value, color = color
    )) +
    ggplot2::geom_line(linewidth = 1.5) +
    ggplot2::lims(y = c(0.75, 1.25)) +
    ggplot2::xlim(as.Date(c("2010-01-01", "2014-01-01"))) +
    ggplot2::ylab("") +
    ggplot2::xlab("Time") +
    ggplot2::scale_color_manual(values = colours) +
    ggplot2::theme(
      legend.position = "none",
      axis.ticks.y = ggplot2::element_blank(),
      axis.text.y = ggplot2::element_blank()
    )
}
x <- dplyr::tibble(
  "cohort_start_date" = as.Date(c("2010-03-01", "2011-02-15", "2012-03-05", "2013-06-15")),
  "cohort_end_date" = as.Date(c("2010-05-15", "2012-01-01", "2012-05-12", "2013-08-12"))
)
x |>
  dplyr::mutate(color = as.factor(c(1, 0, 0, 0))) |>
  doplot() +
  ggplot2::ggtitle("first + washout")
```

Whereas if we would apply initially the **washout** criteria and then the **first** one the resulting exposure would be the fourth one:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
x |>
  dplyr::mutate(color = as.factor(c(0, 0, 0, 1))) |>
  doplot() +
  ggplot2::ggtitle("washout + first")
```

## first and minObs

If we would apply the initially the **first** requirement and then the **minObs** one we would end with no record in the cohort:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
x |>
  dplyr::mutate(color = as.factor(c(0, 0, 0, 0))) |>
  doplot() +
  ggplot2::ggtitle("first + minObs")
```

Whereas if we would apply initially the **minObs** criteria and then the **first** one there would be an exposure selected, the second one:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
x |>
  dplyr::mutate(color = as.factor(c(0, 1, 0, 0))) |>
  doplot() +
  ggplot2::ggtitle("minObs + first")
```

## first and 2011-2012

If we would apply the initially the **first** requirement and then the **2011-2012** one we would end with no record in the cohort:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
x |>
  dplyr::mutate(color = as.factor(c(0, 0, 0, 0))) |>
  doplot() +
  ggplot2::ggtitle("first + 2011-2012")
```

Whereas if we would apply initially the **2011-2012** criteria and then the **first** one there would be an exposure selected, the second one:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
x |>
  dplyr::mutate(color = as.factor(c(0, 1, 0, 0))) |>
  doplot() +
  ggplot2::ggtitle("2011-2012 + first")
```

## washout and minObs

If we would apply the initially the **washout** requirement and then the **minObs** one we would end with only the last record selected:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
x |>
  dplyr::mutate(color = as.factor(c(0, 0, 0, 1))) |>
  doplot() +
  ggplot2::ggtitle("washout + first")
```

Whereas if we would apply initially the **minObs** criteria and then the **washout** one the second and the fourth exposures are the ones that would be selected:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
x |>
  dplyr::mutate(color = as.factor(c(0, 1, 0, 1))) |>
  doplot() +
  ggplot2::ggtitle("minObs + washout")
```

## washout and 2011-2012

If we would apply initially the **washout** requirement and then the **2011-2012** one no record would be selected:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
x |>
  dplyr::mutate(color = as.factor(c(0, 0, 0, 0))) |>
  doplot() +
  ggplot2::ggtitle("washout + 2011-2012")
```

Whereas if we would apply initially the **2011-2012** criteria and then the **washout** one the second record would be included:

```{r, echo = FALSE, fig.width = 7, fig.height = 2}
x |>
  dplyr::mutate(color = as.factor(c(0, 1, 0, 0))) |>
  doplot() +
  ggplot2::ggtitle("2011-2012 + washout")
```

## minObs and 2011-2012

Finally `requireObservationBeforeDrug` and `requireDrugInDateRange` will always be commutable operations so the other of this two will always be the same.

## Recommended order

Having all this into account the recommended order to apply criteria would be:

1. Require a prior drug washout or require first drug entry (particular case).

2. Require a prior observation before the drug episode.

3. Require the drugs to be in a certain date range.

Although this is the recommended order, your study design may have a different required specification, for example you may be interested on the first exposure that fulfills some criteria. Thus making applying the require first drug entry at the end.