--- title: "Dataframe validation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Dataframe validation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(interfacer) ``` # Rationale `interfacer` is designed to support package authors who wish to use dataframes as input parameters to package functions. In this case assumptions about the structure of the input dataframe, in terms of expected column names, expected column data types, and expected grouping structure is a common problem that leads to a lot of code to validate input and detect edge cases in grouping, and creates the requirement for detailed documentation about the nature of accepted input dataframes. `interfacer` provides a mechanism for simply specifying input dataframe constraints as an `iface` specification, a one liner for validating input, and an `roxygen2` tag for automating documentation of dataframe inputs. This is not dissimilar conceptually to the definition of a table in a relational database, or the specification of an XML schema. `interfacer` also provides capabilities that support checking dataframe function outputs, dispatching to functions based on dataframe input structure, and flexibly handling unexpectedly grouped data. # Defining an interface An `iface` specification defines the structure of acceptable dataframes. It is a list of column names, plus types and some documentation about the column. ```{r} i_test = iface( id = integer ~ "an integer ID", test = logical ~ "the test result" ) ``` Printing an interface specification shows the structure that the `iface` defines. ```{r, results='markup'} cat(print(i_test)) ``` An `iface` specification is associated with a specific function parameter by being set as the default value for that parameter. This is a dummy default value but when combined with `ivalidate` in the function body a user supplied dataframe is validated to ensure it is of the right shape. We can use `@iparam ` in the `roxygen2` documentation to describe the dataframe constraints. ```{r} #' An example function #' #' @iparam mydata a dataframe input which should conform to `i_test` #' @param another an example #' @param ... not used #' #' @return the conformant dataframe #' @export example_fn = function( mydata = i_test, another = "value", ... ) { mydata = ivalidate(mydata) return(mydata) } ``` In this case when we later call `example_fn` the data is checked against the requirements by `ivalidate`, and if acceptable passed on to the rest of the function body (in this case it does nothing and the validated input is returned). If we call this function with data that conforms the validation succeeds and the validated input data is returned. ```{r} example_data = tibble::tibble( id = c(1,2,3), # this is a numeric vector test = c(TRUE,FALSE,TRUE) ) # this returns the qualifying data example_fn( example_data, "value for another" ) %>% dplyr::glimpse() ``` It should be noted that although we passed a numeric vector in the `id` column to the function it has been coerced into an `int` vector by `ivalidate`. Data type checking in `interfacer` is permissive in that if something can be coerced without warning it will be. If we pass non-conformant data `ivalidate` throws an informative error about what is wrong with the data. In this case the `test` column is missing: ```{r} bad_example_data = tibble::tibble( id = c(1,2,3), wrong_name = c(TRUE,FALSE,TRUE) ) # this causes an error as example_data_2$wrong_test is wrongly named try(example_fn( bad_example_data, "value for another" )) ``` We can recover from this error by renaming the columns before passing `bad_example_data` to `example_fn()`. In a second example the input data frame is non-conformant to the specification as the id column cannot be coerced to an integer. ```{r} bad_example_data_2 = tibble::tibble( id = c(1, 2.1, 3), # cannot be cleanly coerced to integer. test = c(TRUE,FALSE,TRUE) ) try(example_fn( bad_example_data_2, "value for another" )) ``` This error aims to be informative enough for the user to fix the problem. # Extension and composition Interface specifications can be composed and extended. In this case an extension of the `i_test` specification can be created: ```{r} i_test_extn = iface( i_test, extra = character ~ "a new value", .groups = FALSE ) print(i_test_extn) ``` This extended `iface` specification adds in the constraint for a character column named `extra` and that there must not be any grouping. This is used to constrain the input of another example function as before. We also constrain the output of this second function to be conformant to the original specification using `ireturn`. Examples of documenting the input parameter and the output parameter are provided here: ```{r} #' Another example function #' #' @iparam mydata a more constrained input #' @param another an example #' @param ... not used #' #' @return `r i_test` #' @export example_fn2 = function( mydata = i_test_extn, ... ) { mydata = ivalidate(mydata, ..., .prune = TRUE) mydata = mydata %>% dplyr::select(-extra) # check the return value conforms to a new specification ireturn(mydata, i_test) } ``` In this case the `ivalidate` call prunes unneeded data from the dataframe, removing any extra columns, and also ensures that the input is not grouped in any way. (Grouping is described in more detail below.) ```{r} grouped_example_data = tibble::tibble( id = c(1,2,3), test = c(TRUE,FALSE,TRUE), extra = c("a","b","c"), unneeded = c("x","y","z") ) %>% dplyr::group_by(id) ``` This is rejected because the grouping is incorrect. An informative error message is provided: ```{r} try(example_fn2(grouped_example_data)) ``` Following the instructions in the error message makes this previously failing data validate against `i_test_extn`: ```{r} grouped_example_data %>% dplyr::ungroup() %>% example_fn2() %>% dplyr::glimpse() ``` # Grouping Unanticipated grouping is a common cause of unexpected behaviour in functions that operate on dataframes. `interfacer` can also specify what degree of grouping is expected. This can take the form of constraints that a) enforce that no grouping is present, or b) enforce that the dataframe is grouped by exactly a given set of columns, or c) enforce that a data frame is grouped by at least a given set of columns (with possibly more). An `iface` specification can permissive or dogmatic about the grouping of the input. If the .groups option in an `iface` specification is NULL (e.g. `iface(..., .groups=NULL)`) then any grouping is allowed. If it is `FALSE` then no grouping is allowed. The third option is to supply a one sided formula. In this case the variables in the formula define the grouping that must be exactly present, e.g. `~ grp1 + grp2`, but if it also includes a `.`, then additional grouping is also permitted (e.g. `~ . + grp1 + grp2`). This permissive form would allow a grouping such as `df %>% group_by(anything, grp1, grp2)`. ```{r} i_diamonds = interfacer::iface( carat = numeric ~ "the carat column", color = enum(`D`,`E`,`F`,`G`,`H`,`I`,`J`, .ordered=TRUE) ~ "the color column", x = numeric ~ "the x column", y = numeric ~ "the y column", z = numeric ~ "the z column", # This specifies a permissive grouping with at least `carat` and `cut` columns .groups = ~ . + carat + cut ) if (rlang::is_installed("ggplot2")) { # permissive grouping with the `~ . + carat + cut` groups rule ggplot2::diamonds %>% dplyr::group_by(color, carat, cut) %>% # in a usual workflow this would be an `ivalidate` call within a package # function but for this example we are directly calling the underlying function # `iconvert` iconvert(i_diamonds, .prune = TRUE) %>% dplyr::glimpse() } ``` If a group column is specified it must be present, regardless of the rest of the `iface` specification. So in this example the `cut` column is required by the `i_diamonds` contract but its data type is not specified. Rather than create a third example function we have in this example used `iconvert` which is an interactive for of `ivalidate`. # Documentation The `roxygen2` block of documentation for this second interface is determined by the `#' @iparam` block, which uses the underlying function `idocument`. Demonstrating the behaviour of the `@iparam` `roxygen2` tag is hard in a vignette but essentially it inserts the following block into the documentation when `devtools::document` is called: ```{r} cat(idocument(example_fn2)) ``` # Type coercion `interfacer` does not implement a rigid type system, but rather a permissive one. If the provided data can be coerced to the specified type without major loss then this is automatically done, as long as it can proceed with no warnings. In this example `id` (expected to be an integer) is provided as a `character` and `extra` (expected to be a character) is coerced from the provided numeric. ```{r} tibble::tibble( id=c("1","2","3"), test = c(TRUE,FALSE,TRUE), extra = 1.1 ) %>% example_fn2() %>% dplyr::glimpse() ``` Completely incorrect data types on the other hand are picked up and rejected. In this case the data supplied for `id` cannot be cast to integer without loss. Similar behaviour is seen if logical data is anything other than 0 or 1 for example. ```{r} try(example_fn( tibble::tibble( id= c("1.1","2","3"), test = c(TRUE,FALSE,TRUE) ))) ``` Factors might have allowable levels as well. For this we define them as an `enum` which accepts a list of values, which then must be matched by the levels of a provided factor. The order of the levels will be taken from the `iface` specification and re-levelling of inputs is taken to ensure the factor levels match the specification. If `.drop = TRUE` is specified then values which don't match the levels will be cast to `NA` rather than causing failure to allow conformance to a subset of factor values. ```{r} if (rlang::is_installed("ggplot2")) { i_diamonds = iface( color = enum(D,E,F,G,H,I,J,extra) ~ "the colour", cut = enum(Ideal, Premium, .drop=TRUE) ~ "the cut", price = integer ~ "the price" ) ggplot2::diamonds %>% iconvert(i_diamonds, .prune = TRUE) %>% dplyr::glimpse() } ``` # More complex type constraints The type of a dataframe column can be defined as a basic data-type, however more complex constraints are also available provided in `interfacer`. These can be listed by searching the help system with `??interfacer::type.` at the console. ```{r echo=FALSE} tmp = help.search(package = "interfacer", pattern = "type\\..*") tmp$matches %>% dplyr::transmute(Topic = stringr::str_remove(Topic, "type\\."), Title) %>% knitr::kable() ``` The individual help files for these functions explain their use but in an `iface` specification they are used on the left hand side of a formula and can be composed to allow multiple constraints. For example: ```{r eval=FALSE} iface( col1 = double + finite ~ "A finite double", col2 = integer + in_range(0,100) ~ "an integer in the range 0 to 100 inclusive", col3 = numeric + in_range(0,10, include.max=FALSE) ~ "a numeric 0 <= x < 10", col4 = date ~ "A date", col5 = logical + not_missing ~ "A non-NA logical", col6 = logical + default(TRUE) ~ "A logical with missing (i.e. NA) values coerced to TRUE", col7 = factor ~ "Any factor", col8 = enum(`A`,`B`,`C`) + not_missing ~ "A factor with exactly 3 levels A, B and C and no NA values" ) ``` Column wise default values can be supplied with the `default(...)` pseudo-function and ranges with `in_range(...)`. Their documentation is available in `?interfacer::type.default` and `?interfacer::type.in_range`. It can be noted that although the internal functions are all prefixed with `type.XXX`, the prefix is not needed in the `iface` specification. It is also theoretically possible to supply your own checks in this specification. These must be in the form of a function that accepts one vector as input and produces one vector as output, or throws an error as in this example. ```{r} uppercase = function(x) { if (any(x != toupper(x))) stop("not upper case input",call. = FALSE) return(x) } custom_eg = function(df = iface( text = character + uppercase ~ "An uppercase input only" )) { df = ivalidate(df) return(df) } tibble::tibble(text = "SUCCESS") %>% custom_eg() try(tibble::tibble(text = "fail") %>% custom_eg()) ``` N.B. When using custom conditions within a package they must be visible to `interfacer` this normally means they will need to be exported and may need to be referred to with package prefix. A final option is to use an `as.XXX` function as a condition. In this example we define a column as a `POSIXct` type, and a second column is defined as a `ts` class vector: ```{r} # Coerce the `date_col` to a POSIXct and custom_eg_2 = function( df = iface( date_col = POSIXct ~ "a posix date", ts_col = of_type(ts) ~ "A timeseries vector" )) { df = ivalidate(df) return(lapply(df, class)) } tibble::tibble( date_col = c("2001-01-01","2002-01-01"), ts_col = ts(c(2,1)) ) %>% custom_eg_2() ``` # Default dataframe values Because `interfacer` hijacks the R default value for a function parameter to define the input dataframe constraints, there needs to be an alternative way to supply a default value if one is needed. To do this the `iface` specification can define a default. This can either be a) A zero length dataframe, or b) a dataframe supplied at the time of interface definition, or c) a data frame supplied at the time of function execution. To get a zero length dataframe as the default the value of `TRUE` is passed to the `.default` value of `iface`: ```{r} i_iris = interfacer::iface( Sepal.Length = numeric ~ "the Sepal.Length column", Sepal.Width = numeric ~ "the Sepal.Width column", Petal.Length = numeric ~ "the Petal.Length column", Petal.Width = numeric ~ "the Petal.Width column", Species = enum(`setosa`,`versicolor`,`virginica`) ~ "the Species column", .groups = NULL, .default = TRUE ) test_fn = function(i = i_iris, ...) { # if i is not provided (a missing value) the default zero length # dataframe defined by `i_iris` is used. i = ivalidate(i) return(i) } # Outputs a zero length data frame as the default value test_fn() %>% dplyr::glimpse() ``` In this second example the default value is specified during the interface specification. ```{r} i_iris_2 = interfacer::iface( Sepal.Length = numeric ~ "the Sepal.Length column", Sepal.Width = numeric ~ "the Sepal.Width column", Petal.Length = numeric ~ "the Petal.Length column", Petal.Width = numeric ~ "the Petal.Width column", Species = enum(`setosa`,`versicolor`,`virginica`) ~ "the Species column", .groups = NULL, .default = iris ) test_fn_2 = function(i = i_iris_2, ...) { i = ivalidate(i) return(i) } # Outputs the 150 row iris data frame as a default value from the definition of `i_iris_2` test_fn_2() %>% dplyr::glimpse() ``` In this third example we override the default on a per function basis by supplying a default to `ivalidate` within the function body. In this case the default is just the first 5 rows: ```{r} test_fn_3 = function(i = i_iris_2, ...) { i = ivalidate(i, .default = iris %>% head(5)) return(i) } # Outputs the first 5 rows of the iris data frame as the default value test_fn_3() %>% dplyr::glimpse() ``` # Conclusion This vignette covers the primary validation functions of `interfacer`, including missing columns, data-type checks and enforcing grouping structure. Automation of documentation and interface composition is also covered. Please see the other vignettes for topics such as function dispatch based on `iface` specifications, automatically handling grouped input, nesting and `purrr` style list columns, and a quick summary of tools to help developers.