| 1 | +--- |
| 2 | +title: "duckplyr" |
| 3 | +output: rmarkdown::html_vignette |
| 4 | +vignette: > |
| 5 | + %\VignetteIndexEntry{duckplyr} |
| 6 | + %\VignetteEngine{knitr::rmarkdown} |
| 7 | + %\VignetteEncoding{UTF-8} |
| 8 | +--- |
| 9 | + |
| 10 | +```{r, include = FALSE} |
| 11 | +knitr::opts_chunk$set( |
| 12 | + collapse = TRUE, |
| 13 | + comment = "#>" |
| 14 | +) |
| 15 | +``` |
| 16 | + |
| 17 | +```{r setup} |
| 18 | +library(duckplyr) |
| 19 | +``` |
| 20 | + |
| 21 | +## Design principles |
| 22 | + |
| 23 | +The duckplyr package uses **DuckDB under the hood** but is also a **drop-in replacement for dplyr**. |
| 24 | +These two facts create a tension: |
| 25 | + |
| 26 | +- When using dplyr, we are not used to explicitly collecting results: data.frames are eager by default. |
| 27 | + Adding a `collect()` step by default would confuse users and make "drop-in replacement" an exaggeration. |
| 28 | + Therefore, _duckplyr needs eagerness_! |
| 29 | + |
| 30 | +- The whole advantage of using DuckDB under the hood is letting DuckDB optimize computations, like dtplyr does with data.table. |
| 31 | + Therefore, _duckplyr needs laziness_! |
| 32 | + |
| 33 | +As a consequence, duckplyr is lazy on the inside for all DuckDB operations but eager on the outside, thanks to [ALTREP](https://duckdb.org/2024/04/02/duckplyr.html#eager-vs-lazy-materialization), a powerful R feature that among other things supports **deferred evaluation**. |
| 34 | + |
| 35 | +> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed." |
| 36 | + |
| 37 | +If the duckplyr data.frame is accessed by... |
| 38 | + |
| 39 | +- code outside duckplyr (say, ggplot2), then a special callback is executed, allowing materialization of the data frame. |
| 40 | +- duckplyr itself, then the operations remain lazy (until a call to `collect.duckplyr_df()`, for instance). |
| 41 | + |
| 42 | +Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world). |
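As a rough illustration of this behavior, the sketch below builds a small duck frame, chains two verbs, and then reads a column with plain base R. It assumes a duckplyr version that exports `duckdb_tibble()` as named later in this vignette; the object names, column names, and values are made up for the example.

```r
library(duckplyr)

# A small duck frame; the columns are invented for this illustration.
flights <- duckdb_tibble(
  carrier = c("AA", "UA", "AA", "DL", "UA"),
  delay = c(5, 20, 35, 10, 50)
)

# These verbs are handled by duckplyr, so the pipeline can stay lazy.
late <- flights |>
  dplyr::filter(delay > 15) |>
  dplyr::select(carrier, delay)

# Base R code accesses a column here. This access from outside duckplyr
# is the kind of event that triggers materialization.
mean_delay <- mean(late$delay)
```

No explicit `collect()` call appears anywhere in the pipeline, which is what makes the result usable like an ordinary data frame.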
| 43 | + |
| 44 | +Now, the default materialization can be problematic when dealing with large data: what if the materialization eats up all RAM? |
| 45 | +Therefore, the duckplyr package has a **safeguard called funneling** (in the current development version of the package). |
| 46 | +A funneled data.frame cannot be materialized by default; it needs a call to a `compute()` function. |
| 47 | +By default, duckplyr frames are _unfunneled_, but duckplyr frames created from Parquet data (presumably large) are _funneled_. |
| 48 | + |
| 49 | +## How to use duckplyr |
| 50 | + |
| 51 | +### For normal-sized data (instead of dplyr) |
| 52 | + |
| 53 | +To replace dplyr with duckplyr, you can either |
| 54 | + |
| 55 | +- load duckplyr and then keep your pipeline as is. |
| 56 | + |
| 57 | +```r |
| 58 | +library(conflicted) |
| 59 | +library(duckplyr) |
| 60 | +conflict_prefer("filter", "dplyr", quiet = TRUE) |
| 61 | +``` |
| 62 | + |
| 63 | +- convert individual data.frames to duck frames, which allows you to control their automatic materialization parameters. To do that, use `duckdb_tibble()` or `as_duckdb_tibble()`, or read data using `read_*()` functions like `read_csv_duckdb()`. |
| 64 | + |
| 65 | +In both cases, if an operation cannot be performed |
| 66 | +by duckplyr (see `vignette("limits")`), it will be outsourced to dplyr. |
| 67 | +You can choose to be informed about fallbacks to dplyr; see `?fallback_config`. |
| 68 | +You can disable fallbacks by turning off automatic materialization. |
| 69 | +In that case, if an operation cannot be performed by duckplyr, your code will error. |
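The second option above, converting an individual data frame, might look like the following sketch. It assumes `as_duckdb_tibble()` accepts a plain data frame, as the text suggests; the `scores` data and its columns are hypothetical.

```r
library(duckplyr)

# An ordinary data frame (values invented for this example).
scores <- data.frame(id = 1:4, score = c(10, 25, 40, 55))

# Convert this one data frame to a duck frame.
duck_scores <- as_duckdb_tibble(scores)

# The pipeline itself is written exactly as it would be with dplyr.
high <- duck_scores |>
  dplyr::filter(score >= 25) |>
  dplyr::arrange(id)

# Other code can still treat the result as a regular data frame.
is.data.frame(high)
```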
| 70 | + |
| 71 | +### For large data (instead of dbplyr) |
| 72 | + |
| 73 | +With large datasets, you want: |
| 74 | + |
| 75 | +- input data in an efficient format, like Parquet files. Therefore, you might read data using `read_parquet_duckdb()`. |
| 76 | +- efficient computation, which duckplyr provides via DuckDB's holistic optimization, without requiring any syntax other than dplyr's. |
| 77 | +- output that does not fill up memory. Therefore, you can make use of these features: |
| 78 | + - funneling (see vignette TODO ADD CURRENT NAME), to disable automatic materialization completely or to allow it only up to a certain output size. |
| 79 | + - computation to files using `compute_parquet()` or `compute_csv()`. |
| 80 | + |
| 81 | + |
| 82 | + |
| 83 | +A drawback of analyzing large data with duckplyr is that the limitations of duckplyr |
| 84 | +(unsupported verbs or data types, see `vignette("limits")`) won't be compensated for by fallbacks, since fallbacks to dplyr require loading the data into memory. |
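The large-data workflow described above can be sketched as follows, assuming `read_parquet_duckdb()` and `compute_parquet()` behave as named in the text (a file path in, a duck frame out). The example writes its own small Parquet file to a temporary location so that it is self-contained; the paths, object names, and columns are made up.

```r
library(duckplyr)

# Create a small Parquet file so the example is self-contained;
# real input would be a much larger file that already exists.
input_path <- tempfile(fileext = ".parquet")
duckdb_tibble(id = 1:1000, value = (1:1000) %% 7) |>
  compute_parquet(input_path)

# Read the Parquet file as a duck frame.
measurements <- read_parquet_duckdb(input_path)

# Filter with dplyr syntax and write the result to a file
# instead of collecting it into memory.
output_path <- tempfile(fileext = ".parquet")
filtered <- measurements |>
  dplyr::filter(value > 3) |>
  compute_parquet(output_path)
```

The pipeline uses only a simple comparison so that it does not depend on a fallback to dplyr.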
| 85 | + |
| 86 | +## How to improve duckplyr |
| 87 | + |
| 88 | +- telemetry |
| 89 | +- report issues, contribute |