In this vignette, we will explore the OmopSketch functions designed to provide an overview of the clinical tables within a CDM object (observation_period, visit_occurrence, condition_occurrence, drug_exposure, procedure_occurrence, device_exposure, measurement, observation, and death). Specifically, there are four key functions that facilitate this:
summariseClinicalRecords() and tableClinicalRecords(): Use them to create summary statistics with key basic information about a clinical table (e.g., number of records, number of concepts mapped, etc.).
summariseRecordCount() and plotRecordCount(): Use them to summarise the number of records within a specific time interval.
Let’s see an example of their functionality. To start with, we will load the essential packages and create a mock cdm using the mockOmopSketch() function.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(OmopSketch)
# Connect to mock database
cdm <- mockOmopSketch()
#> Note: method with signature 'DBIConnection#Id' chosen for function 'dbExistsTable',
#> target signature 'duckdb_connection#Id'.
#> "duckdb_connection#ANY" would also be valid
Let’s now use summariseClinicalRecords() from the OmopSketch package to get an overview of one of the clinical tables of the cdm (i.e., condition_occurrence).
summarisedResult <- summariseClinicalRecords(cdm, "condition_occurrence")
#> ℹ Summarising table counts
#> ℹ The following estimates will be computed:
#> → Start summary of data, at 2024-10-15 11:04:43.072078
#>
#> ✔ Summary finished, at 2024-10-15 11:04:43.101135
#> ℹ Summarising records per person
#> ℹ The following estimates will be computed:
#> • records_per_person: mean, sd, median, q25, q75, min, max
#> ! Table is collected to memory as not all requested estimates are supported on
#> the database side
#> → Start summary of data, at 2024-10-15 11:04:43.45866
#>
#> ✔ Summary finished, at 2024-10-15 11:04:43.500044
#> ℹ Summarising in_observation, standard, domain_id, and type information
summarisedResult |> print()
#> # A tibble: 18 × 13
#> result_id cdm_name group_name group_level strata_name strata_level
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 unknown omop_table condition_occurrence overall overall
#> 2 1 unknown omop_table condition_occurrence overall overall
#> 3 1 unknown omop_table condition_occurrence overall overall
#> 4 1 unknown omop_table condition_occurrence overall overall
#> 5 1 unknown omop_table condition_occurrence overall overall
#> 6 1 unknown omop_table condition_occurrence overall overall
#> 7 1 unknown omop_table condition_occurrence overall overall
#> 8 1 unknown omop_table condition_occurrence overall overall
#> 9 1 unknown omop_table condition_occurrence overall overall
#> 10 1 unknown omop_table condition_occurrence overall overall
#> 11 1 unknown omop_table condition_occurrence overall overall
#> 12 1 unknown omop_table condition_occurrence overall overall
#> 13 1 unknown omop_table condition_occurrence overall overall
#> 14 1 unknown omop_table condition_occurrence overall overall
#> 15 1 unknown omop_table condition_occurrence overall overall
#> 16 1 unknown omop_table condition_occurrence overall overall
#> 17 1 unknown omop_table condition_occurrence overall overall
#> 18 1 unknown omop_table condition_occurrence overall overall
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> # estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> # additional_name <chr>, additional_level <chr>
Notice that the output is in the summarised result format.
We can use the arguments to specify which statistics we want to compute. For example, use the argument recordsPerPerson to indicate which estimates you are interested in regarding the number of records per person.
summarisedResult <- summariseClinicalRecords(
  cdm,
  "condition_occurrence",
  recordsPerPerson = c("mean", "sd", "q05", "q95")
)
#> ℹ Summarising table counts
#> ℹ The following estimates will be computed:
#> → Start summary of data, at 2024-10-15 11:04:46.802842
#>
#> ✔ Summary finished, at 2024-10-15 11:04:46.830882
#> ℹ Summarising records per person
#> ℹ The following estimates will be computed:
#> • records_per_person: mean, sd, q05, q95
#> ! Table is collected to memory as not all requested estimates are supported on
#> the database side
#> → Start summary of data, at 2024-10-15 11:04:47.162589
#>
#> ✔ Summary finished, at 2024-10-15 11:04:47.202514
#> ℹ Summarising in_observation, standard, domain_id, and type information
summarisedResult |>
  filter(variable_name == "records_per_person") |>
  select(variable_name, estimate_name, estimate_value)
#> # A tibble: 4 × 3
#> variable_name estimate_name estimate_value
#> <chr> <chr> <chr>
#> 1 records_per_person mean 19
#> 2 records_per_person sd 4.5924839782964
#> 3 records_per_person q05 13
#> 4 records_per_person q95 27
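Note that estimate_value is stored as character in this output, so it has to be converted before doing any arithmetic. The following is a minimal sketch of that step. It uses a hypothetical toy tibble (toy) that copies the four values printed above instead of calling OmopSketch, and the coefficient of variation is only an illustrative calculation, not something the package reports.

```r
library(dplyr)

# Hypothetical stand-in for the filtered summarised result shown above;
# the values are copied from the printed output.
toy <- tibble(
  variable_name  = "records_per_person",
  estimate_name  = c("mean", "sd", "q05", "q95"),
  estimate_value = c("19", "4.5924839782964", "13", "27")
)

# estimate_value is character, so convert it before computing anything.
toy_numeric <- toy |>
  mutate(estimate_value = as.numeric(estimate_value))

# Illustrative calculation: coefficient of variation (sd / mean).
cv <- with(
  toy_numeric,
  estimate_value[estimate_name == "sd"] / estimate_value[estimate_name == "mean"]
)
print(round(cv, 3))
```

The same conversion applies to any numeric estimate pulled out of a result in this long format.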
You can further specify whether to include the number of records in observation (inObservation = TRUE), the number of concepts mapped (standardConcept = TRUE), the source vocabularies present in the table (sourceVocabulary = TRUE), the domains the concepts belong to (domainId = TRUE), or the concept’s type (typeConcept = TRUE).
summarisedResult <- summariseClinicalRecords(
  cdm,
  "condition_occurrence",
  recordsPerPerson = c("mean", "sd", "q05", "q95"),
  inObservation = TRUE,
  standardConcept = TRUE,
  sourceVocabulary = TRUE,
  domainId = TRUE,
  typeConcept = TRUE
)
#> ℹ Summarising table counts
#> ℹ The following estimates will be computed:
#> → Start summary of data, at 2024-10-15 11:04:50.356662
#>
#> ✔ Summary finished, at 2024-10-15 11:04:50.382675
#> ℹ Summarising records per person
#> ℹ The following estimates will be computed:
#> • records_per_person: mean, sd, q05, q95
#> ! Table is collected to memory as not all requested estimates are supported on
#> the database side
#> → Start summary of data, at 2024-10-15 11:04:50.730049
#>
#> ✔ Summary finished, at 2024-10-15 11:04:50.771273
#> ℹ Summarising in_observation, standard, domain_id, source, and type information
summarisedResult |>
  select(variable_name, estimate_name, estimate_value) |>
  glimpse()
#> Rows: 17
#> Columns: 3
#> $ variable_name <chr> "number records", "number subjects", "number subjects",…
#> $ estimate_name <chr> "count", "count", "percentage", "mean", "sd", "q05", "q…
#> $ estimate_value <chr> "1900", "100", "100", "19", "4.5924839782964", "13", "2…
You can also stratify the previous results by sex and age group:
summarisedResult <- summariseClinicalRecords(
  cdm,
  "condition_occurrence",
  recordsPerPerson = c("mean", "sd", "q05", "q95"),
  inObservation = TRUE,
  standardConcept = TRUE,
  sourceVocabulary = TRUE,
  domainId = TRUE,
  typeConcept = TRUE,
  sex = TRUE,
  ageGroup = list("<35" = c(0, 34), ">=35" = c(35, Inf))
)
#> ℹ Summarising table counts
#> ℹ The following estimates will be computed:
#> → Start summary of data, at 2024-10-15 11:04:55.753985
#>
#> ✔ Summary finished, at 2024-10-15 11:04:55.871513
#> ℹ Summarising records per person
#> ℹ The following estimates will be computed:
#> • records_per_person: mean, sd, q05, q95
#> ! Table is collected to memory as not all requested estimates are supported on
#> the database side
#> → Start summary of data, at 2024-10-15 11:04:56.388433
#>
#> ✔ Summary finished, at 2024-10-15 11:04:56.562709
#> ℹ Summarising in_observation, standard, domain_id, source, and type information
summarisedResult |>
  select(variable_name, strata_level, estimate_name, estimate_value) |>
  glimpse()
#> Rows: 153
#> Columns: 4
#> $ variable_name <chr> "number records", "number subjects", "number records", …
#> $ strata_level <chr> "overall", "overall", "<35", ">=35", "<35", ">=35", "Fe…
#> $ estimate_name <chr> "count", "count", "count", "count", "count", "count", "…
#> $ estimate_value <chr> "1900", "100", "1405", "495", "78", "29", "915", "985",…
Notice that, by default, the “overall” group is also included, as well as the crossed strata (for example, sex == “Female” and ageGroup == “>=35”).
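To make the set of strata concrete, here is a small base R sketch that enumerates the levels one would expect from two sex groups and two age groups. It does not call OmopSketch. The " &&& " separator used for crossed strata is an assumption about how the summarised result format labels combined levels, and the variable names are hypothetical.

```r
# Levels of each stratification variable used in the call above.
sex_levels <- c("Female", "Male")
age_levels <- c("<35", ">=35")

# Every age group crossed with every sex.
crossed <- expand.grid(
  age_group = age_levels,
  sex = sex_levels,
  stringsAsFactors = FALSE
)

# Assumed labelling: crossed levels joined with " &&& ".
strata_levels <- c(
  "overall",
  age_levels,
  sex_levels,
  paste(crossed$age_group, crossed$sex, sep = " &&& ")
)

print(strata_levels)
print(length(strata_levels))
```

This gives nine strata (overall, two age groups, two sexes, and four crossed groups), which is consistent with the row counts shown above: 17 rows without stratification and 153 = 9 × 17 rows with it.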
The analysis can also be conducted for multiple OMOP tables at the same time:
summarisedResult <- summariseClinicalRecords(
  cdm,
  c("observation_period", "drug_exposure"),
  recordsPerPerson = c("mean", "sd"),
  inObservation = FALSE,
  standardConcept = FALSE,
  sourceVocabulary = FALSE,
  domainId = FALSE,
  typeConcept = FALSE
)
#> ℹ Summarising table counts
#> ℹ The following estimates will be computed:
#> → Start summary of data, at 2024-10-15 11:05:01.840507
#>
#> ✔ Summary finished, at 2024-10-15 11:05:01.872147
#> ℹ Summarising records per person
#> ℹ The following estimates will be computed:
#> • records_per_person: mean, sd
#> → Start summary of data, at 2024-10-15 11:05:02.21122
#>
#> ✔ Summary finished, at 2024-10-15 11:05:02.2494
#> ℹ Summarising table counts
#> ℹ The following estimates will be computed:
#> → Start summary of data, at 2024-10-15 11:05:04.750835
#>
#> ✔ Summary finished, at 2024-10-15 11:05:04.777502
#> ℹ Summarising records per person
#> ℹ The following estimates will be computed:
#> • records_per_person: mean, sd
#> → Start summary of data, at 2024-10-15 11:05:05.125115
#>
#> ✔ Summary finished, at 2024-10-15 11:05:05.162365
summarisedResult |>
  select(group_level, variable_name, estimate_name, estimate_value) |>
  glimpse()
#> Rows: 10
#> Columns: 4
#> $ group_level <chr> "observation_period", "observation_period", "observatio…
#> $ variable_name <chr> "number records", "number subjects", "number subjects",…
#> $ estimate_name <chr> "count", "count", "percentage", "mean", "sd", "count", …
#> $ estimate_value <chr> "100", "100", "100", "1", "0", "5600", "100", "100", "5…
tableClinicalRecords() will help you tidy the previous results and create a gt table.
summarisedResult <- summariseClinicalRecords(
  cdm,
  "condition_occurrence",
  recordsPerPerson = c("mean", "sd", "q05", "q95"),
  inObservation = TRUE,
  standardConcept = TRUE,
  sourceVocabulary = TRUE,
  domainId = TRUE,
  typeConcept = TRUE,
  sex = TRUE
)
#> ℹ Summarising table counts
#> ℹ The following estimates will be computed:
#> → Start summary of data, at 2024-10-15 11:05:08.074747
#>
#> ✔ Summary finished, at 2024-10-15 11:05:08.129548
#> ℹ Summarising records per person
#> ℹ The following estimates will be computed:
#> • records_per_person: mean, sd, q05, q95
#> ! Table is collected to memory as not all requested estimates are supported on
#> the database side
#> → Start summary of data, at 2024-10-15 11:05:08.614597
#>
#> ✔ Summary finished, at 2024-10-15 11:05:08.701895
#> ℹ Summarising in_observation, standard, domain_id, source, and type information
summarisedResult |>
  tableClinicalRecords()
#> ! Results have not been suppressed.
Variable name | Variable level | Estimate name | Database name: unknown |
---|---|---|---|
condition_occurrence; overall | | | |
Number records | - | N | 1,900 |
Number subjects | - | N (%) | 100 (100.00%) |
Records per person | - | Mean (SD) | 19.00 (4.59) |
Records per person | - | q05 | 13 |
Records per person | - | q95 | 27 |
condition_occurrence; Female | | | |
Number records | - | N | 915 |
Number subjects | - | N (%) | 49 (100.00%) |
Records per person | - | Mean (SD) | 18.67 (4.83) |
Records per person | - | q05 | 13 |
Records per person | - | q95 | 27 |
In observation | Yes | N (%) | 915 (100.00%) |
Standard concept | Source | N (%) | 915 (100.00%) |
Source vocabulary | No matching concept | N (%) | 915 (100.00%) |
Domain | Condition | N (%) | 915 (100.00%) |
Type concept id | 1 | N (%) | 915 (100.00%) |
condition_occurrence; Male | | | |
Number records | - | N | 985 |
Number subjects | - | N (%) | 51 (100.00%) |
Records per person | - | Mean (SD) | 19.31 (4.38) |
Records per person | - | q05 | 12 |
Records per person | - | q95 | 26 |
In observation | Yes | N (%) | 985 (100.00%) |
Standard concept | Source | N (%) | 985 (100.00%) |
Source vocabulary | No matching concept | N (%) | 985 (100.00%) |
Domain | Condition | N (%) | 985 (100.00%) |
Type concept id | 1 | N (%) | 985 (100.00%) |
OmopSketch can also help you summarise the trend of the records of an OMOP table. See the example below, where we use summariseRecordCount() to count the number of records within each year, and then use plotRecordCount() to create a ggplot with the trend.
summarisedResult <- summariseRecordCount(cdm, "drug_exposure", unit = "year", unitInterval = 1)
#> ℹ The following estimates will be computed:
#> • interval_group: count
#> • sex: count
#> • age_group: count
#> → Start summary of data, at 2024-10-15 11:05:11.107661
#>
#> ✔ Summary finished, at 2024-10-15 11:05:11.169164
summarisedResult |> print()
#> # A tibble: 65 × 13
#> result_id cdm_name group_name group_level strata_name strata_level
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 mockOmopSketch omop_table drug_exposure overall overall
#> 2 1 mockOmopSketch omop_table drug_exposure overall overall
#> 3 1 mockOmopSketch omop_table drug_exposure overall overall
#> 4 1 mockOmopSketch omop_table drug_exposure overall overall
#> 5 1 mockOmopSketch omop_table drug_exposure overall overall
#> 6 1 mockOmopSketch omop_table drug_exposure overall overall
#> 7 1 mockOmopSketch omop_table drug_exposure overall overall
#> 8 1 mockOmopSketch omop_table drug_exposure overall overall
#> 9 1 mockOmopSketch omop_table drug_exposure overall overall
#> 10 1 mockOmopSketch omop_table drug_exposure overall overall
#> # ℹ 55 more rows
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> # estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> # additional_name <chr>, additional_level <chr>
summarisedResult |> plotRecordCount()
#> ! The following column type were changed:
#> • variable_level: from double to character
Note that you can adjust the time interval using the unit argument, which can be set to either “year” or “month”, and the unitInterval argument, which must be an integer specifying the number of years or months over which to count the records. The example below shows the number of records every 18 months:
summariseRecordCount(cdm, "drug_exposure", unit = "month", unitInterval = 18) |>
  plotRecordCount()
#> ℹ The following estimates will be computed:
#> • interval_group: count
#> • sex: count
#> • age_group: count
#> → Start summary of data, at 2024-10-15 11:05:13.014239
#>
#> ✔ Summary finished, at 2024-10-15 11:05:13.07713
#> ! The following column type were changed:
#> • variable_level: from double to character
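To illustrate what counting records "every 18 months" means, here is a minimal base R sketch that bins a few dates into consecutive 18-month intervals. It is not OmopSketch's implementation: the dates are made up, and the choice of 2000-01-01 as the start of the first interval is an assumption made only for this example.

```r
# Hypothetical record start dates.
dates <- as.Date(c("2000-01-15", "2001-06-30", "2001-07-01", "2003-02-10"))

# Interval start dates, 18 months apart (assumed origin: 2000-01-01).
breaks <- seq(from = as.Date("2000-01-01"), by = "18 months", length.out = 4)

# Index of the interval each date falls into (left-closed intervals).
bin <- findInterval(as.numeric(dates), as.numeric(breaks))

# Number of records per interval, including empty intervals.
counts <- tabulate(bin, nbins = length(breaks))

print(data.frame(interval_start = breaks, n_records = counts))
```

Each date is assigned to the interval whose start is the latest one not after it, so a record on 2001-07-01 falls in the second interval rather than the first.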
We can further stratify our counts by sex (setting the argument sex = TRUE) or by age (providing an age group). In both cases, the function also automatically creates an overall group that combines all the sex groups and all the age groups.
summariseRecordCount(
  cdm,
  "drug_exposure",
  unit = "month",
  unitInterval = 18,
  sex = TRUE,
  ageGroup = list("<30" = c(0, 29), ">=30" = c(30, Inf))
) |>
  plotRecordCount()
#> ℹ The following estimates will be computed:
#> • interval_group: count
#> • age_group: count
#> • sex: count
#> → Start summary of data, at 2024-10-15 11:05:14.927139
#>
#> ✔ Summary finished, at 2024-10-15 11:05:15.191202
#> ! The following column type were changed:
#> • variable_level: from double to character
By default, plotRecordCount() does not apply faceting or colour to any variables. This can be confusing when stratifying by different variables, as seen in the previous picture. We can use the visOmopResults package to find out which columns we can colour or facet by:
summariseRecordCount(
  cdm,
  "drug_exposure",
  unit = "month",
  unitInterval = 18,
  sex = TRUE,
  ageGroup = list("0-29" = c(0, 29), "30-Inf" = c(30, Inf))
) |>
  visOmopResults::tidyColumns()
#> ℹ The following estimates will be computed:
#> • interval_group: count
#> • age_group: count
#> • sex: count
#> → Start summary of data, at 2024-10-15 11:05:16.995968
#>
#> ✔ Summary finished, at 2024-10-15 11:05:17.285141
#> [1] "cdm_name" "omop_table" "age_group" "sex"
#> [5] "variable_name" "variable_level" "count" "time_interval"
#> [9] "result_type" "package_name" "package_version" "unit"
#> [13] "unitInterval"
Then, we can simply specify these columns using the facet and colour arguments of plotRecordCount():
summariseRecordCount(
  cdm,
  "drug_exposure",
  unit = "month",
  unitInterval = 18,
  sex = TRUE,
  ageGroup = list("0-29" = c(0, 29), "30-Inf" = c(30, Inf))
) |>
  plotRecordCount(facet = omop_table ~ age_group, colour = "sex")
#> ℹ The following estimates will be computed:
#> • interval_group: count
#> • age_group: count
#> • sex: count
#> → Start summary of data, at 2024-10-15 11:05:18.881616
#>
#> ✔ Summary finished, at 2024-10-15 11:05:19.149899
#> ! The following column type were changed:
#> • variable_level: from double to character
Finally, disconnect from the cdm:
PatientProfiles::mockDisconnect(cdm = cdm)