feat(config): adopt #652 multi-datastore schema (config + WeatherDataset constructor only) by sadamov · Pull Request #656 · mllam/neural-lam

sadamov · 2026-06-09T07:25:55Z

Describe your changes

Adopt the multi-datastore configuration schema proposed in #652 as an alternative to #635. Replaces the single datastore: top-level field with a datastores: mapping of named selections. This PR ships only the dict shape; per-category variable filtering, diagnostic outputs, and the name-collision validator each land in their own follow-up so we do not declare schema fields that have no runtime effect.

What is in this PR:

NeuralLAMConfig.datastores: Dict[str, DatastoreSelection] (was datastore: DatastoreSelection).
DatastoreSelection keeps the existing kind and config_path fields; no new schema decoration.
load_config_and_datastore returns (config, Dict[str, BaseDatastore]) so all loaded datastores travel together. It enforces exactly one entry today; the limit is removed in the filtering follow-up.
WeatherDataset and WeatherDataModule take a datastores dict and pull the single entry as the interior alias. The per-sample return shape is unchanged from current main (init_states, target_states, forcing, target_times).
New WeatherDataset.build_dataarray_from_tensor staticmethod so callers with only a datastore (the model in module.py) can construct DataArrays without instantiating a full WeatherDataset with the new dict signature.

What is intentionally not in this PR (each lands in a dedicated follow-up):

Boundary forcing in the per-sample tuple. @joeloskarsson's model-side follow-up brings this in via a per-source ForecastBatch.
Per-category inputs / outputs schema fields on DatastoreSelection, plus the variable filtering inside WeatherDataset that consumes them. Data-loader follow-up of mine; lands in lockstep with @joeloskarsson's model-side adapter so feature dims agree.
Diagnostic outputs (predict-only categories). Orthogonal to the schema and the filtering work, separate follow-up.
Variable name collision validator across datastores. Lands together with the filtering follow-up since the validator only matters once outputs: are declared.
Multi-datastore support beyond the single-entry case. Returns with the filtering / model-side adapter PRs.

This PR is an alternative to #635, not a successor. Exactly one of #656 / #635 should land on main. If #656 goes in, #635 closes and Joel's model-side follow-up plugs boundary forcing into the multi-source dict shape introduced here. If #635 goes in, #656 closes. #636 (boundary plotting) stays on #635 today and rebases onto whichever lands.

Issue Link

refs #652, alternative to #635

Type of change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📖 Documentation (Addition or improvements to documentation)

The schema change is breaking for any external YAML config or programmatic NeuralLAMConfig / WeatherDataset / WeatherDataModule construction. All in-repo example YAMLs and tests are updated.

Checklist before requesting a review

My branch is up-to-date with the target branch
I have performed a self-review of my code
For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
I have updated the README to cover introduced code changes (no user-facing CLI change; the schema change shows up in the example config only)
I have added tests that prove my fix is effective or that my feature works (existing test suite passes; test_config.py exercises the new datastores: dict shape)
I have given the PR a name that clearly describes the change, written in imperative form
I have requested a reviewer and an assignee

Checklist for reviewers

the code is readable
the code is well tested
the code is documented (including return types and parameters)
the code is easy to maintain

Author checklist after completed review

I have added a line to the CHANGELOG describing this change, in a section reflecting type of change (changed).

Checklist for assignee

PR is up to date with the base branch
the tests pass
(if the PR is not just maintenance/bugfix) the PR is assigned to the next milestone
author has added an entry to the changelog
Once the PR is ready to be merged, squash commits and merge the PR.

Replace the single `datastore:` top-level config field with a `datastores:` mapping keyed by user-chosen names. Each entry is a DatastoreSelection with optional per-category `inputs` / `outputs` declarations; one datastore must declare outputs (the interior / prognostic source) and zero or more may contribute input-only sources that are reserved for the model-side multi-source consumption (the mllam#652 follow-up). WeatherDataset and WeatherDataModule take `datastores` and `selections` dicts; their per-sample return shape and the model unpack are unchanged from current main, so this is a config + data-loader constructor refactor only. Internally the dataset still operates on the interior datastore alone. load_config_and_datastore returns (config, Dict[str, BaseDatastore]). A config-time validator rejects two datastores declaring the same output variable name, with an error message pointing at mdp's `dim_mapping.name_format` and `xr.Dataset.assign_coords` as the two ways to disambiguate. Other callers updated: - train_model.py resolves interior + boundary roles for the legacy single-source model side. - create_graph.py and plot_graph.py resolve the interior datastore via `_resolve_datastore_roles` instead of the old 2-tuple. - module.py refactors `_create_dataarray_from_tensor` to use a new `WeatherDataset.build_dataarray_from_tensor` staticmethod so the model doesn't need to instantiate a full WeatherDataset with the new dict signature. This PR is an alternative to mllam#635: it adopts the public schema proposed in mllam#652 without bringing in mllam#635's internal boundary loading. Boundary forcing, multi-source inputs and diagnostic outputs land via the mllam#652 model-side follow-up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Match the marker registration used on mllam#635/mllam#651 so any future @pytest.mark.slow test on this branch is recognised by pytest without warnings. No tests currently use the marker on mllam#656. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

`xr.Dataset.assign_coords` returns a new dataset and does not touch disk; the fix-hint in `_validate_output_name_collisions` claimed it was an in-place rename, which would have led users astray. Replace with a small zarr-python snippet that overwrites the `{category}_feature` coord array directly, which is the actual milliseconds-scale in-place op the message intends. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@joeloskarsson

Strip the optional `inputs` / `outputs` fields from `DatastoreSelection` and the matching `_validate_output_name_collisions` validator. They were parsed but never consumed at runtime, so reviewers would have asked what they do and the honest answer was "nothing yet". Both pieces (per-category include-lists and the output-name collision validator) belong with the data-loader filtering follow-up, which must land in lockstep with @joeloskarsson's model-side adapter so feature dimensions agree. `_resolve_datastore_roles` also goes away. With no `outputs` field to distinguish interior from boundary, role resolution would be a guess anyway. Instead `load_config_and_datastore` and `WeatherDataset` now require exactly one entry in the `datastores:` dict; multi-source support comes back with the filtering follow-up. The single entry is picked as the legacy `self.datastore` interior view used by slicing/windowing/plotting. Net effect: mllam#656 is now just the `datastore:` -> `datastores:` dict shape rename. The diagnostic / filtering / collision-validator work each land in their own follow-up PRs. Also gitignore `.github/draft-*.md` since the comment-draft files are local working notes, not part of the repo. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

sadamov force-pushed the feat/multi-datastore-config branch from 3c8a0f0 to f40dea6 Compare June 9, 2026 07:26

sadamov and others added 2 commits June 9, 2026 09:29

sadamov mentioned this pull request Jun 9, 2026

[RFC/Design]: Support for multiple datastores #652

Open

sadamov requested review from joeloskarsson and leifdenby June 9, 2026 07:48

sadamov self-assigned this Jun 9, 2026

sadamov added enhancement New feature or request discussion labels Jun 9, 2026

sadamov added this to the v0.8.0 milestone Jun 9, 2026

sadamov requested review from SimonKamuk and observingClouds June 9, 2026 07:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(config): adopt #652 multi-datastore schema (config + WeatherDataset constructor only)#656

feat(config): adopt #652 multi-datastore schema (config + WeatherDataset constructor only)#656
sadamov wants to merge 4 commits into
mllam:mainfrom
sadamov:feat/multi-datastore-config

sadamov commented Jun 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sadamov commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe your changes

Issue Link

Type of change

Checklist before requesting a review

Checklist for reviewers

Author checklist after completed review

Checklist for assignee

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sadamov commented Jun 9, 2026 •

edited

Loading