
Better documentation about when the 'rsample' bootstrap function is (in)appropriate #405

Open
@bschneidr


Feature Request

In the documentation for bootstraps(), note that the bootstrap method assumes data come from independent simple random samples (potentially within strata), and provide a reference about when this assumption is (in)appropriate. Optionally, point users to other packages that implement bootstrap methods appropriate to their data.

Why this matters:

This matters because the 'tidymodels' suite of packages is often one of the first places where users learn about the bootstrap. That's why materials such as this vignette take the time to give a quick introduction to it. For users without much statistical training, the bootstrap can seem like a silver-bullet tool, but it's easy to forget (or never learn) that the basic bootstrap makes strong implicit assumptions about how the data were collected, so it can easily be misused.

References on bootstrap methods for non-iid data:

For surveys:

  • Zeinab Mashreghi, David Haziza, and Christian Léger. "A survey of bootstrap methods in finite population sampling." Statist. Surv. 10, 1-52, 2016. https://doi.org/10.1214/16-SS113

For time series:

This is an area where I don't have much experience, so I'm not sure of a good general reference paper to recommend on bootstraps for time series.

R Packages:

For surveys:

For other types of complex data:

Background

The basic bootstrap methods implemented in 'rsample' assume independent sampling (potentially within strata), which is what justifies forming bootstrap resamples by independent sampling with replacement. But this assumption is inappropriate for many datasets used in practice. Two prominent examples are complex survey data (such as the widely used American Community Survey) and cluster-randomized experiments, where the basic bootstrap can drastically underestimate sampling error.
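To make the underestimation concrete, here is a minimal simulation sketch. It is written in plain Python rather than with 'rsample', and everything in it is an assumption made for illustration: the simulated data (30 clusters of 20 rows, each cluster sharing a random effect), the function names, and the choice of the sample mean as the statistic. It compares a row-level bootstrap, which treats every row as independent, with a bootstrap that resamples whole clusters.

```python
import random
import statistics


def simulate_clustered(n_clusters, cluster_size, rng):
    """Simulated data: rows in a cluster share a random effect, so they are correlated."""
    data = []
    for cluster_id in range(n_clusters):
        effect = rng.gauss(0, 1)  # shared by every row in this cluster
        for _ in range(cluster_size):
            data.append((cluster_id, effect + rng.gauss(0, 1)))
    return data


def naive_bootstrap_se(data, times, rng):
    """Resample individual rows with replacement, ignoring cluster membership."""
    values = [y for _, y in data]
    means = [
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(times)
    ]
    return statistics.stdev(means)


def cluster_bootstrap_se(data, times, rng):
    """Resample whole clusters with replacement, keeping each cluster's rows together."""
    clusters = {}
    for cluster_id, y in data:
        clusters.setdefault(cluster_id, []).append(y)
    groups = list(clusters.values())
    means = []
    for _ in range(times):
        resampled = rng.choices(groups, k=len(groups))
        means.append(statistics.fmean([y for group in resampled for y in group]))
    return statistics.stdev(means)


rng = random.Random(405)  # fixed seed so the comparison is reproducible
data = simulate_clustered(n_clusters=30, cluster_size=20, rng=rng)
naive_se = naive_bootstrap_se(data, times=1000, rng=rng)
cluster_se = cluster_bootstrap_se(data, times=1000, rng=rng)
print(f"row-level bootstrap SE of the mean: {naive_se:.3f}")
print(f"cluster bootstrap SE of the mean:   {cluster_se:.3f}")
```

Under these assumptions, the row-level standard error should come out several times smaller than the cluster-level one, because the 600 rows carry far less independent information than their count suggests. This is only a toy illustration of the problem, not a recommendation for how any particular survey or experiment should be analyzed.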

Many variations of the bootstrap have been developed for data that aren't simple, independent random samples (the generalized bootstrap, the block bootstrap, the rescaled bootstrap, etc.). I don't know that 'rsample' makes sense as a place to implement these various bootstraps. But I do think that users of 'rsample' would be well served by documentation that makes them aware of the limitations of the bootstrap method implemented in the package.
