Description
Feature Request
In the documentation for bootstraps(), note that the bootstrap method assumes data come from independent simple random samples (potentially within strata), and provide a reference about when this assumption is (in)appropriate. Optionally, point users to other packages that implement bootstrap methods appropriate to their data.
Why this matters:
This feature is important because the 'tidymodels' suite of packages is sometimes the first (or one of the first) places where users learn about the bootstrap. That's why materials such as this vignette take the time to give a quick introduction to it. For users without much statistical training, the bootstrap can seem like a silver bullet, but it's easy to forget (or never learn) that the basic bootstrap makes strong implicit assumptions about how your data were collected, and so it can easily be misused.
References on bootstrap methods for non-iid data:
For surveys:
- Zeinab Mashreghi, David Haziza, Christian Léger. "A survey of bootstrap methods in finite population sampling." Statist. Surv. 10, 1-52, 2016. https://doi.org/10.1214/16-SS113
For time series:
This is an area where I don't have much experience, so I'm not sure of a good general reference paper to recommend on bootstraps for time series.
R Packages:
For surveys:
- 'svrep': This vignette discusses bootstrap methods for survey data and how to implement them using the package. The key functions are as_bootstrap_design() and as_gen_boot_design(), which implement bootstrap methods for a wide variety of complex sampling methods.
- 'survey': This is the central package in R for analyzing data for complex surveys, and it has a few very useful functions for implementing bootstrap methods.
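To make the survey case concrete, here is a minimal sketch of how as_bootstrap_design() might be used on a stratified sample. It assumes the 'survey' and 'svrep' packages are installed and uses the apistrat example data that ships with 'survey'. The choice of bootstrap type and the number of replicates are illustrative assumptions, not recommendations.

```r
# Sketch only: skipped entirely if either package is missing
if (requireNamespace("survey", quietly = TRUE) &&
    requireNamespace("svrep", quietly = TRUE)) {

  # Example data shipped with 'survey': a stratified sample of schools
  data(api, package = "survey")

  # Describe the sampling design: strata, weights, and population sizes
  strat_design <- survey::svydesign(
    id = ~1, strata = ~stype, weights = ~pw, fpc = ~fpc, data = apistrat
  )

  # Convert to a bootstrap replicate design (type and replicate count
  # are arbitrary choices for illustration)
  set.seed(2024)
  boot_design <- svrep::as_bootstrap_design(
    strat_design, type = "Rao-Wu-Yue-Beaumont", replicates = 500
  )

  # Estimates from the replicate design use the bootstrap replicate
  # weights to compute the standard error
  print(survey::svymean(~api00, boot_design))
}
```

The point of the sketch is that the resampling is driven by the declared design (strata, weights, finite population sizes) rather than by treating rows as independent draws.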
For other types of complex data:
- The well-known 'boot' package implements a few different types of bootstrap, applicable to (for example) time series.
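As a rough illustration for time series, the sketch below uses tsboot() from 'boot' to run a block bootstrap with a fixed block length, and compares it with resampling individual observations. The simulated AR(1) series, the block length of 20, and the number of replicates are all arbitrary assumptions made for the example.

```r
library(boot)

set.seed(2024)

# Hypothetical data: an AR(1) series with positive autocorrelation
x <- arima.sim(model = list(ar = 0.7), n = 300)

# Statistic to bootstrap; tsboot() passes each resampled series to it
series_mean <- function(series) mean(series)

# Block bootstrap: resample blocks of 20 consecutive observations
# so that short-range dependence is preserved within each block
block_boot <- tsboot(x, statistic = series_mean, R = 500, l = 20, sim = "fixed")
se_block <- sd(block_boot$t[, 1])

# Naive bootstrap: resample single observations as if independent
iid_means <- replicate(500, mean(sample(as.numeric(x), replace = TRUE)))
se_iid <- sd(iid_means)

cat("Block bootstrap SE:", se_block, "\n")
cat("Naive bootstrap SE:", se_iid, "\n")
```

With positively autocorrelated data like this, the naive standard error would be expected to come out smaller than the block bootstrap one, because it ignores the dependence between neighbouring observations.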
Background
The basic bootstrap methods implemented in 'rsample' are based on an assumption of independent sampling (potentially within strata), which justifies the use of independent sampling with replacement as a method of forming bootstrap resamples. But this assumption is inappropriate for many datasets used in practice. A couple of big examples are complex survey data (such as the widely used American Community Survey) and cluster-randomized experiments, where the basic bootstrap method can produce drastic underestimates of sampling error.
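To illustrate the underestimation, here is a minimal base R sketch on simulated clustered data. It compares a naive bootstrap that resamples individual rows with a cluster bootstrap that resamples whole clusters. The data-generating settings and all variable names are hypothetical, and the cluster bootstrap shown is just one simple alternative, not something 'rsample' is claimed to implement this way.

```r
set.seed(2024)

n_clusters   <- 30   # number of sampled clusters
cluster_size <- 20   # observations per cluster

# Each cluster gets its own random effect, so rows within a cluster
# are correlated with each other
cluster_id     <- rep(seq_len(n_clusters), each = cluster_size)
cluster_effect <- rnorm(n_clusters, mean = 0, sd = 2)
y <- 10 + cluster_effect[cluster_id] + rnorm(n_clusters * cluster_size, sd = 1)

n_boot <- 1000

# Naive bootstrap: resample rows, ignoring the clustering
naive_means <- replicate(n_boot, mean(sample(y, replace = TRUE)))

# Cluster bootstrap: resample whole clusters with replacement.
# Averaging cluster means equals the overall mean here only because
# every cluster has the same size.
cluster_means <- tapply(y, cluster_id, mean)
cluster_boot_means <- replicate(n_boot, {
  sampled <- sample(n_clusters, replace = TRUE)
  mean(cluster_means[sampled])
})

se_naive   <- sd(naive_means)
se_cluster <- sd(cluster_boot_means)

cat("Naive bootstrap SE:  ", se_naive, "\n")
cat("Cluster bootstrap SE:", se_cluster, "\n")
```

Because most of the variation in this simulation is between clusters, the naive standard error should come out much smaller than the cluster bootstrap one, which is the kind of underestimate described above.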
There are many variations of the bootstrap that have been developed for handling data that aren't simple, independent random samples (the generalized bootstrap, the block bootstrap, the rescaled bootstrap, etc.). I don't know that 'rsample' makes sense as a place to implement these various bootstraps. But I do think that users of 'rsample' would be well served by documentation that makes them aware of the limitations of the bootstrap method implemented in the package.