Reworking sample data #472

@lewisjared

Description

The problem

For a developer adding an extra diagnostic, the sample data has become unwieldy.
It takes a multi-step process to generate the data, release it, update versions in REF, and rerun the regression tests.
This is a blocker for users who want to contribute.

Having a common set of sample data was useful when we built the first prototype, but now it is getting in the way: updating the datasets causes unintended consequences for other packages.

**Pain points:**

  • Separate repo and versioning for sample data, which requires a multi-step process to add new data for a diagnostic
  • Decimating the data requires ongoing maintenance
  • Different packages require differently shaped sample data, but we currently assume a shared set of sample data
  • Updating a sample data release causes changes to other packages' diagnostics
  • The data has to be built via CI

**What worked:**

  • Tracking the regression test output for each diagnostic let us iterate quickly on the output bundle format without having to rerun the diagnostic
  • Edge cases were useful to identify for the system as a whole (even if they were annoying when working on a single diagnostic)
  • Smaller datasets allowed real end-to-end tests in a few minutes instead of hours
  • 100MB of local data instead of 100GB
  • Specifying the data requirements in terms of ESGF facets
  • Documentation of the required input data
  • Fetching and deduplicating data via intake-esgf (see the sketch after this list)
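
For context, the last three points look roughly like the sketch below. This is a minimal example assuming intake-esgf's `ESGFCatalog` interface; the facet values are illustrative only.

```python
from intake_esgf import ESGFCatalog

# Declare the required input data purely in terms of ESGF facets.
cat = ESGFCatalog().search(
    experiment_id="historical",
    source_id="CanESM5",
    variable_id="tas",
    frequency="mon",
)

# intake-esgf deduplicates results across ESGF index nodes, downloads
# the files into a local cache, and returns xarray datasets keyed by
# dataset id.
datasets = cat.to_dataset_dict()
```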

Proposed solution

After discussion with @bouweandela, an alternative approach emerged.

The gist is that we no longer maintain a single set of sample data;
instead, each package maintains the set of datasets it requires.
This decouples providers from each other and makes it easier to contribute new diagnostics.

  • Remove the centralised sample data
  • Packages maintain the set of test data they require (not sure if this is per-diagnostic or per-provider; see the sketch after this list)
  • We reuse the classes/fetching routines from the sample data repo
  • Package developers download these data locally (from ESGF, not decimated)
  • Regression tests for packages are selectively generated
  • We include a larger data catalog based on real world CMIP6 datasets on ESGF
  • We generate and track the regression output of a solve on this data catalog, so that changes to selectors are visible
  • PR tests are simpler (scoped to packages)
  • Weekly tests on main to check that everything runs
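
To make the per-package idea concrete, one possible shape is sketched below. `DataRequirement`, `REQUIREMENTS`, and `fetch_required_data` are hypothetical names (not an existing REF API), and the facets are examples only.

```python
from dataclasses import dataclass, field

from intake_esgf import ESGFCatalog


@dataclass
class DataRequirement:
    """One set of ESGF facets a diagnostic needs for its tests."""

    diagnostic: str
    facets: dict = field(default_factory=dict)


# Each package (or diagnostic) would declare its own requirements.
REQUIREMENTS = [
    DataRequirement(
        diagnostic="ecs",  # illustrative diagnostic name
        facets={
            "experiment_id": ["abrupt-4xCO2", "piControl"],
            "variable_id": ["tas", "rsdt", "rsut", "rlut"],
            "frequency": "mon",
        },
    ),
]


def fetch_required_data(requirements: list[DataRequirement]) -> None:
    """Download each requirement into the local intake-esgf cache."""
    for req in requirements:
        cat = ESGFCatalog().search(**req.facets)
        # to_dataset_dict() downloads the (deduplicated) files and
        # caches them locally for reuse by the tests.
        cat.to_dataset_dict()
```

A developer adding a new diagnostic would then only extend their own package's `REQUIREMENTS` and run `fetch_required_data` locally, without touching a shared sample data release.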

The CI will require a local archive of all applicable datasets, which suggests that GitHub Actions' hosted runners aren't applicable, as the required volume would exceed the maximum cache size.
Developers working on the core or across providers will likely need a significant volume of data.
We already handle that today: Climate Resource hosts a runner with access to a cache of datasets.

If there are specific datasets that are needed for CI/testing, we could still sync them to a publicly accessible S3 bucket to be fetched alongside the other data requirements.
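
A minimal sketch of that fetch step, assuming anonymous read access via fsspec/s3fs; the bucket name and layout are hypothetical.

```python
from pathlib import Path

import fsspec  # the "s3" protocol additionally requires s3fs

# Hypothetical public bucket holding the CI-only datasets.
fs = fsspec.filesystem("s3", anon=True)

target = Path.home() / ".esgf" / "ci"
target.mkdir(parents=True, exist_ok=True)

# Mirror the CI datasets next to the data fetched from ESGF.
fs.get("ref-ci-datasets/esgf/", str(target), recursive=True)
```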
