---
title: "Tidy your summarised result object"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{03_tidy_summarised_result}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)
options(rmarkdown.html_vignette.check_title = FALSE)
```

## `<summarised_result>` format

The `<summarised_result>` format is a standard output defined in [omopgenerics](https://darwin-eu-dev.github.io/omopgenerics/articles/summarised_result.html). The fact that it is standardised output make it a very powerful tool so multiple functions can export on the same format and built functionalities on top of it, as it can be seen in tables and [plots](https://darwin-eu.github.io/visOmopResults/articles/plots.html) vignettes. This standard output it can be some times hard to manipulate to do your custom analysis. `visOmopResults` contains tools to **tidy** your `<summarised_result>` object that are covered in this vignette.

## Tidy `<summarised_result>`

`visOmopResults` defines the method tidy for `<summarised_result>` object, what this function does is to:

### 1. Split *group*, *strata*, and *additional* pairs into separate columns:

The `<summarised_result>` object has the following pair columns: group_name-group_level, strata_name-strata_level, and additional_name-additional_level. These pairs use the `&&&` separator to combine multiple fields, for example if you want to combine cohort_name and age_group in group_name-group_level pair: `group_name = "cohort_name &&& age_group"` and `group_level = "my_cohort &&& <40"`. By default if no aggregation is produced in group_name-group_level pair: `group_name = "overall"` and `group_level = "overall"`. 

**ORIGINAL FORMAT:**
```{r, echo = FALSE}
dplyr::tibble(
  group_name = c("cohort_name", c("cohort_name &&& sex"), c("sex &&& age_group")),
  group_level = c("acetaminophen", c("acetaminophen &&& Female"), c("Male &&& <40"))
) |>
  gt::gt()
```

The tidy format puts each one of the values as a columns. Making it easier to manipulate but at the same time the output is not standardised anymore as each `<summarised_result>` object will have a different number and names of columns. Missing values will be filled with the "overall" label.

**TIDY FORMAT:**
```{r, echo = FALSE}
dplyr::tibble(
  group_name = c("cohort_name", c("cohort_name &&& sex"), c("sex &&& age_group")),
  group_level = c("acetaminophen", c("acetaminophen &&& Female"), c("Male &&& <40"))
) |>
  visOmopResults::splitGroup() |>
  gt::gt()
```

### 2. Add settings of the `<summarised_result>` object as columns:

Each `<summarised_result>` object has a setting attribute that relates the 'result_id' column with each different set of settings. The columns 'result_type', 'package_name' and 'package_version' are always present in settings, but then we may have some extra parameters depending how the object was created. So in the `<summarised_result>` format we need to use these `settings()` functions to see those variables:

**ORIGINAL FORMAT:**

`settings`:
```{r, echo = FALSE}
dplyr::tibble(
  result_id = c(1L, 2L),
  my_setting = c(TRUE, FALSE),
  package_name = "visOmopResults"
) |>
  gt::gt()
```

`<summarised_result>`:
```{r, echo = FALSE}
dplyr::tibble(
  result_id = c("1", "...", "2", "..."),
  cdm_name = c("omop", "...", "omop", "..."),
  " " = c("..."),
  additional_name = c("overall", "...", "overall", "...")
) |>
  gt::gt()
```

But in the tidy format we add the settings as columns, making that their value is repeated multiple times (there is only one row per result_id in settings, whereas there can be multiple rows in the `<summarised_result>` object). The column 'result_id' is eliminated as it does not provide information anymore. Again we loose on standardisation (multiple different settings), but we gain in flexibility:

**TIDY FORMAT:**
```{r, echo = FALSE}
dplyr::tibble(
  cdm_name = c("omop", "...", "omop", "..."),
  " " = c("..."),
  additional_name = c("overall", "...", "overall", "..."),
  my_setting = c("TRUE", "...", "FALSE", "..."),
  package_name = c("visOmopResults", "...", "visOmopResults", "...")
) |>
  gt::gt()
```

### 3. Pivot estimates as columns:

In the `<summarised_result>` format estimates are displayed in 3 columns: 

- 'estimate_name' indicates the name of the estimate.
- 'estimate_type' indicates the type of the estimate (as all of them will be casted to character). Possible values are: *`r omopgenerics::estimateTypeChoices()`*.
- 'estimate_value' value of the estimate as `<character>`.

**ORIGINAL FORMAT:**
```{r, echo = FALSE}
dplyr::tibble(
  variable_name = c("number individuals", "age", "age"),
  estimate_name = c("count", "mean", "sd"),
  estimate_type = c("integer", "numeric", "numeric"),
  estimate_value = c("100", "50.3", "20.7")
) |>
  gt::gt()
```

In the tidy format we pivot the estimates, creating a new column for each one of the 'estimate_name' values. The columns will be casted to 'estimate_type'. If there are multiple estimate_type(s) for same estimate_name they won't be casted and they will be displayed as character (a warning will be thrown). Missing data are populated with NAs.

**TIDY FORMAT:**
```{r, echo = FALSE}
dplyr::tibble(
  variable_name = c("number individuals", "age"),
  count = c(100L, NA),
  mean = c(NA, 50.3),
  sd = c(NA, 20.7)
) |>
  gt::gt()
```

### Example

Let's see a simple example with some toy data:

```{r}
library(visOmopResults)
result <- mockSummarisedResult()
result |>
  tidy()
```

## Customise your tidy summarised_result

We have several functions to customise the tidy version of the `<summarised_result>` object. 

### Split

The functions split are provided independent:

- `splitGroup()` only splits the pair group_name-group_level columns.
- `splitStrata()` only splits the pair strata_name-strata_level columns.
- `splitAdditional()` only splits the pair additional_name-additional_level columns.

There is also the function:
- `splitAll()` that splits any pair x_name-x_level that is found on the data.

```{r}
splitAll(result)
```

### Pivot estimates

`pivotEstimates()` can be used to pivot the variables that we are interested in.

The argument `pivotEstimatesBy` specifies which are the variables that we want to use to pivot by, there are four options:

- `NULL/character()` to not pivot anything.
- `c("estimate_name")` to pivot only estimate_name.
- `c("variable_level", "estimate_name")` to pivot estimate_name and variable_level.
- `c("variable_name", "variable_level", "estimate_name")` to pivot estimate_name, variable_level and variable_name.

Note that `variable_level` can contain NA values, these will be ignored on the naming part.

```{r}
pivotEstimates(
  result, 
  pivotEstimatesBy = c("variable_name","variable_level", "estimate_name")
)
```

### Add settings

`addSettings()` is used to add the settings that we want as new columns to our `<summarised_result>` object. 

The `settingsColumns` argument is used to choose which are the settings we want to add.


```{r}
addSettings(
  result, 
  settingsColumns = "result_type"
)
```