---
title: "Fallback to dplyr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{15 Fallback}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
clean_output <- function(x, options) {
  x <- gsub("0x[0-9a-f]+", "0xdeadbeef", x)
  x <- gsub("dataframe_[0-9]*_[0-9]*", " dataframe_42_42 ", x)
  x <- gsub("[0-9]*\\.___row_number ASC", "42.___row_number ASC", x)
  x <- gsub("─", "-", x)
  x
}

local({
  hook_source <- knitr::knit_hooks$get("document")
  knitr::knit_hooks$set(document = clean_output)
})

knitr::opts_chunk$set(
  collapse = TRUE,
  eval = identical(Sys.getenv("IN_PKGDOWN"), "true") || (getRversion() >= "4.1" && rlang::is_installed(c("conflicted", "dbplyr", "nycflights13"))),
  comment = "#>"
)

options(conflicts.policy = list(warn = FALSE))

Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 0)
```

This article details the fallback mechanism in duckplyr, which allows support for all dplyr verbs and R functions.

```{r attach}
library(conflicted)
library(dplyr)
conflict_prefer("filter", "dplyr")
```

## Introduction

The duckplyr package aims at providing a fully compatible drop-in replacement for dplyr.
All operations, R functions, and data types that are supported by dplyr should work in an identical way with duckplyr.
This is achieved in two ways:

- A carefully selected subset of dplyr operations, R functions, and R data types are implemented in DuckDB, focusing on faithful translation.
- When DuckDB does not support an operation, duckplyr falls back to dplyr, guaranteeing identical behavior.

## DuckDB mode

The following operation is supported by duckplyr:

```{r}
duckdb <-
  duckplyr::duckdb_tibble(a = 1:3) |>
  arrange(desc(a)) |>
  mutate(b = a + 1) |>
  select(-a)
```

The `explain()` function shows what happens under the hood:

```{r}
duckdb |>
  explain()
```

The plan shows three operations:

- a data frame scan (the input),
- a sort operation,
- a projection (adding the `b` column and removing the `a` column).


Each operation is supported by DuckDB.
The resulting object contains a plan for the entire pipeline that is executed lazily, only when the data is needed.

## Relation objects

DuckDB accepts a tree of interconnected _relation objects_ as input.
Each relation object represents a logical step of the execution plan.
The duckplyr package translates dplyr verbs into relation objects.

The `last_rel()` function shows the last relation that has been materialized:

```{r}
duckplyr::last_rel()
```

It is `NULL` because nothing has been computed yet.
Converting the object to a data frame triggers the computation:

```{r}
duckdb |> collect()
duckplyr::last_rel()
```

The `last_rel()` function now shows a relation that describes the logical plan for executing the whole pipeline.

## Help from dplyr

Using a custom function with a side effect is not supported by DuckDB and triggers a dplyr fallback:

```{r}
verbose_plus_one <- function(x) {
  message("Adding one to ", paste(x, collapse = ", "))
  x + 1
}

fallback <-
  duckplyr::duckdb_tibble(a = 1:3) |>
  arrange(desc(a)) |>
  mutate(b = verbose_plus_one(a)) |>
  select(-a)
```

The `verbose_plus_one()` function is not supported by DuckDB, so the `mutate()` step is forwarded to dplyr and already executed (eagerly) when the pipeline is defined.
This is confirmed by the `last_rel()` function:

```{r}
duckplyr::last_rel()
```

Only the `arrange()` step is executed by DuckDB.
Because the dplyr implementation of `mutate()` needs the data before it can proceed, the data is first converted to a data frame, and this triggers the materialization of the first step.

The `explain()` function also confirms indirectly that at least a part of the operation is handled by dplyr:

```{r}
fallback |>
  explain()
```

The final plan now only consists of a data frame scan.
This is the result of the `mutate()` step, which at this stage already has been executed by dplyr.

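
The dispatch described in this section can be pictured as a try-then-fall-back pattern. The base R sketch below is an illustration only: `supported`, `translate_fast()`, `mutate_with_fallback()`, and `noisy_plus_one()` are hypothetical names, and none of this is how duckplyr is actually implemented. It only shows why a translatable expression can stay lazy while an untranslatable one is evaluated eagerly.

```r
# Illustration only: hypothetical names, not duckplyr internals.

# A pretend "fast engine" that understands only a fixed set of functions
supported <- c("+", "-", "*", "/", "(")

# A function with a side effect, unknown to the pretend engine
noisy_plus_one <- function(x) {
  message("Adding one to ", paste(x, collapse = ", "))
  x + 1
}

translate_fast <- function(expr) {
  # Names in the expression that are not variables, i.e. the functions called
  calls <- setdiff(all.names(expr), all.vars(expr))
  unsupported <- setdiff(calls, supported)
  if (length(unsupported) > 0) {
    stop("cannot translate: ", paste(unsupported, collapse = ", "))
  }
  # Return a deferred computation: nothing is evaluated yet
  function(data) eval(expr, data)
}

mutate_with_fallback <- function(data, expr) {
  tryCatch(
    list(engine = "fast", value = translate_fast(expr)),
    error = function(e) {
      # Fallback: evaluate eagerly with plain R, right away
      list(engine = "fallback", value = eval(expr, data))
    }
  )
}

df <- data.frame(a = 1:3)
lazy <- mutate_with_fallback(df, quote(a + 1))
eager <- mutate_with_fallback(df, quote(noisy_plus_one(a)))
```

In this sketch, `lazy$engine` is `"fast"` and `lazy$value` is a function that computes the result only when called with the data, whereas `eager$engine` is `"fallback"` and `eager$value` already holds the computed vector, with the message emitted at definition time.
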

Converting the final object to a data frame triggers the rest of the computation:

```{r}
fallback |> collect()
duckplyr::last_rel()
```

The `last_rel()` function confirms that only the final `select()` is handled by DuckDB again.

## Enforce DuckDB operation

For any duck frame, one can control the automatic materialization.
For fallbacks to dplyr, automatic materialization must be allowed for the duck frame at hand, as dplyr necessitates eager evaluation.

Therefore, by making a data frame stingy, one can ensure a pipeline will error when a fallback to dplyr would have normally happened.
See `vignette("prudence")` for details.

By using operations supported by duckplyr and avoiding fallbacks as much as possible, your pipelines will be executed by DuckDB in an optimized way.

## Configure fallbacks

Using the `fallback_sitrep()` and `fallback_config()` functions, you can examine and change settings related to fallbacks.

- You can choose to make fallbacks verbose with `fallback_config(info = TRUE)`.
- You can change settings related to logging and reporting fallbacks to the duckplyr development team to inform their work. See `vignette("telemetry")` for details.

## Conclusion

The fallback mechanism in duckplyr allows for a seamless integration of dplyr verbs and R functions that are not supported by DuckDB.
It is transparent to the user and only triggers when necessary.
With small or medium-sized data sets, it will not even be noticeable in most settings.

See `vignette("large")` for techniques for working with large data, `vignette("limits")` for the currently implemented translations, `vignette("prudence")` for details on controlling fallback behavior, and `vignette("telemetry")` for the automatic reporting of fallback situations.
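
As a sketch of the enforcement described under "Enforce DuckDB operation", the chunk below creates a stingy duck frame and runs one supported and one unsupported `mutate()`. It assumes that `duckdb_tibble()` accepts a `.prudence` argument as described in `vignette("prudence")`; check that vignette for the exact interface. The `noisy_plus_one()` helper is hypothetical, and the chunk is shown unevaluated.

```r
library(dplyr)

# Hypothetical helper with a side effect: DuckDB cannot translate it
noisy_plus_one <- function(x) {
  message("Adding one to ", paste(x, collapse = ", "))
  x + 1
}

# Assumption: `.prudence = "stingy"` disables automatic materialization
stingy <- duckplyr::duckdb_tibble(a = 1:3, .prudence = "stingy")

# Supported by DuckDB: the pipeline stays lazy and raises no error
ok <- stingy |>
  mutate(b = a + 1)

# Would need a dplyr fallback: expected to raise an error instead,
# which is captured here as a message string
res <- tryCatch(
  stingy |>
    mutate(b = noisy_plus_one(a)),
  error = function(e) conditionMessage(e)
)
```

If the assumptions hold, `ok` can still be materialized explicitly with `collect()`, while `res` holds the text of the error raised in place of the fallback.
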