Commit 71657d8

maelle authored and krlmlr committed
start work on vignette
1 parent 293870f commit 71657d8

File tree

1 file changed: +89 −0 lines


vignettes/duckplyr.Rmd

Lines changed: 89 additions & 0 deletions
---
title: "duckplyr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{duckplyr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(duckplyr)
```

## Design principles

The duckplyr package uses **DuckDB under the hood** but is also a **drop-in replacement for dplyr**.
These two facts create a tension:

- When using dplyr, we are not used to explicitly collecting results: data.frames are eager by default.
  Adding a `collect()` step by default would confuse users and make "drop-in replacement" an exaggeration.
  Therefore, _duckplyr needs eagerness_!

- The whole advantage of using DuckDB under the hood is letting DuckDB optimize computations, as dtplyr does with data.table.
  _Therefore, duckplyr needs laziness_!

As a consequence, duckplyr is lazy on the inside for all DuckDB operations but eager on the outside, thanks to [ALTREP](https://duckdb.org/2024/04/02/duckplyr.html#eager-vs-lazy-materialization), a powerful R feature that, among other things, supports **deferred evaluation**.

> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed."

If the duckplyr data.frame is accessed by...

- code other than duckplyr (say, ggplot2), then a special callback is executed, allowing materialization of the data frame.
- duckplyr, then the operations continue to be lazy (until a call to `collect.duckplyr_df()`, for instance).

Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
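
This dual behavior can be sketched as follows (a minimal illustration; `duckdb_tibble()` is introduced below, and the exact behavior may differ across duckplyr versions):

```r
library(duckplyr)

# Inside duckplyr, verbs build a lazy DuckDB computation:
duck <- duckdb_tibble(x = 1:3)
lazy <- duck |> mutate(y = x * 2)

# Accessing the data from outside duckplyr (here, base R's sum())
# triggers the ALTREP callback and materializes the result.
sum(lazy$y)
```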

Now, the default materialization can be problematic when dealing with large data: what if the materialization eats up all the RAM?
Therefore, the duckplyr package has a **safeguard called funneling** (in the current development version of the package).
A funneled data.frame cannot be materialized automatically; it needs an explicit call to a `compute()` function.
By default, duckplyr frames are _unfunneled_, but duckplyr frames created from Parquet data (presumably large) are _funneled_.

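In practice, funneling might surface like this (a sketch in comments only, since it assumes a hypothetical large file `large.parquet` and the development version of duckplyr):

```r
library(duckplyr)

# Frames created from Parquet data are funneled by default:
# df <- read_parquet_duckdb("large.parquet")

# Accessing a funneled frame from outside duckplyr errors
# instead of silently materializing a possibly huge result:
# nrow(df)

# Materialization has to be requested explicitly:
# df <- compute(df)
```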
## How to use duckplyr

### For normal-sized data (instead of dplyr)

To replace dplyr with duckplyr, you can either

- load duckplyr and then keep your pipeline as is.

  ```r
  library(conflicted)
  library(duckplyr)
  conflict_prefer("filter", "dplyr", quiet = TRUE)
  ```

- convert individual data.frames to duck frames, which allows you to control their automatic materialization parameters. To do that, use `duckdb_tibble()`, `as_duckdb_tibble()`, or read data using `read_*()` functions like `read_csv_duckdb()`.
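
For instance, an existing data.frame can be converted and then used with regular dplyr verbs (a minimal sketch using the built-in `mtcars` dataset):

```r
library(duckplyr)

# Convert an ordinary data.frame into a duck frame.
duck <- as_duckdb_tibble(mtcars)

# Regular dplyr verbs now build a lazy DuckDB computation.
res <- duck |>
  filter(mpg > 20) |>
  select(mpg, cyl)

# Accessing the result (here via nrow()) materializes it.
nrow(res)
```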

In both cases, if an operation cannot be performed by duckplyr (see `vignette("limits")`), it will be outsourced to dplyr.
You can choose to be informed about fallbacks to dplyr; see `?fallback_config`.
You can disable fallbacks by turning off automatic materialization.
In that case, if an operation cannot be performed by duckplyr, your code will error.

### For large data (instead of dbplyr)

With large datasets, you want:

- input data in an efficient format, like Parquet files. Therefore, you might read data using `read_parquet_duckdb()`.
- efficient computation, which duckplyr provides via DuckDB's holistic optimization, without your having to use a syntax other than dplyr's.
- output that does not clutter all the memory. Therefore, you can make use of these features:
  - funneling (see vignette TODO ADD CURRENT NAME) to disable automatic materialization completely, or beyond a certain output size.
  - computation to files, using `compute_parquet()` or `compute_csv()`.
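
Putting these together, a large-data pipeline might look like the following sketch (file paths and column names are placeholders, kept as comments):

```r
library(duckplyr)

# Hypothetical Parquet input, read lazily and funneled:
# df <- read_parquet_duckdb("events.parquet")

# The aggregation runs inside DuckDB, and the result is written
# straight to a Parquet file without materializing it in R:
# df |>
#   count(user_id) |>
#   compute_parquet("user_counts.parquet")
```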

A drawback of analyzing large data with duckplyr is that the limitations of duckplyr (unsupported verbs or data types, see `vignette("limits")`) won't be compensated by fallbacks, since fallbacks to dplyr necessitate putting the data into memory.

## How to improve duckplyr

- telemetry
- report issues, contribute
