---
title: "Applying cohort table requirements"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{a02_cohort_table_requirements}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  eval = TRUE, 
  message = FALSE, 
  warning = FALSE,
  comment = "#>"
)
```

In this vignette we'll show how requirements related to the data contained in the cohort table can be applied. For this we'll use the Eunomia synthetic data.

```{r}
library(CodelistGenerator)
library(CohortConstructor)
library(CohortCharacteristics)
library(ggplot2)
library(dplyr)
```

```{r, include = FALSE}
if (Sys.getenv("EUNOMIA_DATA_FOLDER") == ""){
  Sys.setenv("EUNOMIA_DATA_FOLDER" = file.path(tempdir(), "eunomia"))}
if (!dir.exists(Sys.getenv("EUNOMIA_DATA_FOLDER"))){ dir.create(Sys.getenv("EUNOMIA_DATA_FOLDER"))
  CDMConnector::downloadEunomiaData()  
}
```

```{r}
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = CDMConnector::eunomia_dir())
cdm <- CDMConnector::cdm_from_con(con, cdm_schema = "main", 
                    write_schema = c(prefix = "my_study_", schema = "main"))
```

Let's start by creating a cohort of acetaminophen users. Individuals will have a cohort entry for each drug exposure record they have for acetaminophen with cohort exit based on their drug record end date. Note when creating the cohort, any overlapping records will be concatenated. 

```{r}
acetaminophen_codes <- getDrugIngredientCodes(cdm, 
                                              name = "acetaminophen", 
                                              nameStyle = "{concept_name}")
cdm$acetaminophen <- conceptCohort(cdm = cdm, 
                                   conceptSet = acetaminophen_codes, 
                                   exit = "event_end_date",
                                   name = "acetaminophen")
```

At this point we have just created our base cohort without having applied any restrictions.

```{r}
summary_attrition <- summariseCohortAttrition(cdm$acetaminophen)
plotCohortAttrition(summary_attrition)
```


## Keep only the first record per person

We can see that in our starting cohort individuals have multiple entries for each use of acetaminophen. However, we could keep only their earliest cohort entry by using `requireIsFirstEntry()` from CohortConstructor.

```{r}
cdm$acetaminophen <- cdm$acetaminophen |> 
  requireIsFirstEntry()

summary_attrition <- summariseCohortAttrition(cdm$acetaminophen)
plotCohortAttrition(summary_attrition)
```

While the number of individuals remains unchanged, records after an individual's first have been excluded. 

If we wanted to keep the latest record per person instead of the earliest we would use `requireIsLastEntry()` instead. Or if we want to keep some range of records per person we can use the `requireIsEntry()` function.

## Keep only records within a date range

Individuals may contribute multiple records over extended periods. We can filter out records that fall outside a specified date range using the `requireInDateRange` function.

```{r}
cdm$acetaminophen <- conceptCohort(cdm = cdm, 
                                 conceptSet = acetaminophen_codes, 
                                 name = "acetaminophen")
```

```{r}
cdm$acetaminophen <- cdm$acetaminophen |> 
  requireInDateRange(dateRange = as.Date(c("2010-01-01", "2015-01-01")))

summary_attrition <- summariseCohortAttrition(cdm$acetaminophen)
plotCohortAttrition(summary_attrition)
```

## Applying multiple cohort requirements

Multiple restrictions can be applied to a cohort, however it is important to note that the order that requirements are applied will often matter.

```{r}
cdm$acetaminophen_1 <- conceptCohort(cdm = cdm, 
                                 conceptSet = acetaminophen_codes, 
                                 name = "acetaminophen_1") |> 
  requireIsFirstEntry() |>
  requireInDateRange(dateRange = as.Date(c("2010-01-01", "2016-01-01")))

cdm$acetaminophen_2 <- conceptCohort(cdm = cdm, 
                                 conceptSet = acetaminophen_codes, 
                                 name = "acetaminophen_2") |>
  requireInDateRange(dateRange = as.Date(c("2010-01-01", "2016-01-01"))) |> 
  requireIsFirstEntry()
```

```{r}
summary_attrition_1 <- summariseCohortAttrition(cdm$acetaminophen_1)
summary_attrition_2 <- summariseCohortAttrition(cdm$acetaminophen_2)
```

Here we see attrition if we apply our entry requirement before our date requirement. In this case we have a cohort of people with their first ever record of acetaminophen which occurs in our study period.

```{r}
plotCohortAttrition(summary_attrition_1)
```

And here we see attrition if we apply our date requirement before our entry requirement. In this case we have a cohort of people with their first record of acetaminophen in the study period, although this will not necessarily be their first record ever.

```{r}
plotCohortAttrition(summary_attrition_2)
```


## Keep only records from cohorts with a minimum number of individuals

Another useful functionality, particularly when working with multiple cohorts or performing a network study, is provided by `requireMinCohortCount`. Here we will only keep cohorts with a minimum count, filtering out records from cohorts with fewer than this number. 

As an example let's create a cohort for every drug ingredient we see in Eunomia. We can first get the drug ingredient codes.

```{r}
medication_codes <- getDrugIngredientCodes(cdm = cdm, nameStyle = "{concept_name}")
medication_codes
```

We can see that when we make all these cohorts many have only a small number of individuals.

```{r}
cdm$medications <- conceptCohort(cdm = cdm, 
                                 conceptSet = medication_codes,
                                 name = "medications")


cohortCount(cdm$medications) |> 
  filter(number_subjects > 0) |> 
  ggplot() +
  geom_histogram(aes(number_subjects),
                 colour = "black",
                 binwidth = 25) +  
  xlab("Number of subjects") +
  theme_bw()
```

If we apply a minimum cohort count of 500, we end up with far fewer cohorts that all have a sufficient number of study participants.

```{r}
cdm$medications <- cdm$medications |> 
  requireMinCohortCount(minCohortCount = 500)

cohortCount(cdm$medications) |> 
  filter(number_subjects > 0) |> 
  ggplot() +
  geom_histogram(aes(number_subjects),
                 colour = "black",
                 binwidth = 25) + 
  xlim(0, NA) + 
  xlab("Number of subjects") +
  theme_bw()
```