---
title: "Nested dataframes and `purrr` style list columns"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Nested dataframes and `purrr` style list columns}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(interfacer)
```

# Nesting & list columns

`interfacer` is designed to work with list columns, as generated by `purrr`.
`purrr` style list columns may contain any arbitrary data type within a list.
Consider the following complex dataframe for example, which includes a single
regular factor column, a nested dataframe as a list column, a nested S3 `lm`
object as a list column and a nested matrix as a list column:

```{r}
tmp = iris %>%
  tidyr::nest(by_species = -Species) %>%
  dplyr::mutate(
    model = purrr::map(by_species, ~ stats::lm(Sepal.Length ~ Sepal.Width, .x)),
    quantiles = purrr::map(by_species, ~ sapply(.x, quantile))
  )

tmp %>% dplyr::glimpse()
```

`interfacer` can be used to both represent and validate this data structure.
Here the initial specifications were generated using `iclip(tmp)` and hand
modified:

```{r}
# Pasted from `iclip(tmp)` with minor modification:
i_tmp = interfacer::iface(
  Species = enum(`setosa`,`versicolor`,`virginica`) ~ "the Species column",
  by_species = list(i_by_species) ~ "the by_species column",
  model = list(of_type(lm)) ~ "the model column",
  quantiles = list(matrix) ~ "the quantiles column",
  .groups = NULL
)

i_by_species = interfacer::iface(
  Sepal.Length = numeric ~ "the Sepal.Length column",
  Sepal.Width = numeric ~ "the Sepal.Width column",
  Petal.Length = numeric ~ "the Petal.Length column",
  Petal.Width = numeric ~ "the Petal.Width column",
  .groups = NULL
)
```

We can then test that the input matches this specification:

```{r}
tmp %>% iconvert(i_tmp) %>% dplyr::glimpse()
```

Such specifications could be used for validation, or controlling function dispatch.
However, it must be recognised that validation of nested dataframes is
potentially computationally expensive, as each individual nested dataframe must
be completely validated. This could create a high overhead in situations where
there are a large number of small nested dataframes.

Another example of a nested list column, using the diamonds dataframe,
demonstrates this overhead: 276 nested dataframes need to be validated
individually. This takes a few seconds on my machine.

```{r}
i_diamonds_cat = interfacer::iface(
  cut = enum(`Fair`,`Good`,`Very Good`,`Premium`,`Ideal`, .ordered=TRUE) ~ "the cut column",
  color = enum(`D`,`E`,`F`,`G`,`H`,`I`,`J`, .ordered=TRUE) ~ "the color column",
  clarity = enum(`I1`,`SI2`,`SI1`,`VS2`,`VS1`,`VVS2`,`VVS1`,`IF`, .ordered=TRUE) ~ "the clarity column",
  data = list(i_diamonds_data) ~ "A nested data column must be specified as a list",
  .groups = FALSE
)

i_diamonds_data = interfacer::iface(
  carat = numeric ~ "the carat column",
  depth = numeric ~ "the depth column",
  table = numeric ~ "the table column",
  price = integer ~ "the price column",
  x = numeric ~ "the x column",
  y = numeric ~ "the y column",
  z = numeric ~ "the z column",
  .groups = FALSE
)

nested_diamonds = ggplot2::diamonds %>%
  tidyr::nest(data = c(-cut,-color,-clarity))

system.time(
  nested_diamonds %>% iconvert(i_diamonds_cat) %>% dplyr::glimpse()
)
```

In the next example the `price` column is removed before nesting, so the nested
dataframes no longer match `i_diamonds_data`. Errors in the validation of nested
columns are bubbled up to the top level.

```{r}
try(
  ggplot2::diamonds %>%
    dplyr::select(-price) %>%
    tidyr::nest(data = c(-cut,-color,-clarity)) %>%
    iconvert(i_diamonds_cat) %>%
    dplyr::glimpse()
)
```

# Conclusion

`interfacer` works with nested dataframes, but there is a performance hit when
nested columns have their own `iface` specifications. If this capability is
used, care must be taken to keep data validation performant.
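
One possible way to limit this cost, sketched here under assumptions, is to run a cheap structural pre-check before full validation. The helper `has_columns()` below is hypothetical and not part of `interfacer`. It compares column names only and ignores column types, so it is not a replacement for `iconvert()`. It uses base R and a list of `iris` subsets as a stand-in for a list column.

```{r}
# Hypothetical helper (not part of `interfacer`): returns TRUE for each nested
# dataframe whose column names match `expected`, ignoring order and types.
has_columns = function(nested, expected) {
  vapply(nested, function(df) setequal(colnames(df), expected), logical(1))
}

# Base R stand-in for a list column: one dataframe per iris species.
nested = split(iris[, -5], iris$Species)
expected = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Every nested dataframe has the expected columns.
all(has_columns(nested, expected))

# Dropping a column from one nested dataframe is detected for that element only.
nested$setosa$Petal.Width = NULL
has_columns(nested, expected)
```

A check like this only flags missing or unexpected columns. Whether it is worth adding depends on how many nested dataframes there are and how often they are malformed.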