Skip to content

feat(config): adopt #652 multi-datastore schema (config + WeatherDataset constructor only)#656

Open
sadamov wants to merge 4 commits into
mllam:mainfrom
sadamov:feat/multi-datastore-config
Open

feat(config): adopt #652 multi-datastore schema (config + WeatherDataset constructor only)#656
sadamov wants to merge 4 commits into
mllam:mainfrom
sadamov:feat/multi-datastore-config

Conversation

@sadamov

@sadamov sadamov commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Describe your changes

Adopt the multi-datastore configuration schema proposed in #652 as an alternative to #635. Replaces the single datastore: top-level field with a datastores: mapping of named selections. This PR ships only the dict shape; per-category variable filtering, diagnostic outputs, and the name-collision validator each land in their own follow-up so we do not declare schema fields that have no runtime effect.

What is in this PR:

  • NeuralLAMConfig.datastores: Dict[str, DatastoreSelection] (was datastore: DatastoreSelection).
  • DatastoreSelection keeps the existing kind and config_path fields; no new schema decoration.
  • load_config_and_datastore returns (config, Dict[str, BaseDatastore]) so all loaded datastores travel together. It enforces exactly one entry today; the limit is removed in the filtering follow-up.
  • WeatherDataset and WeatherDataModule take a datastores dict and pull the single entry as the interior alias. The per-sample return shape is unchanged from current main (init_states, target_states, forcing, target_times).
  • New WeatherDataset.build_dataarray_from_tensor staticmethod so callers with only a datastore (the model in module.py) can construct DataArrays without instantiating a full WeatherDataset with the new dict signature.

What is intentionally not in this PR (each lands in a dedicated follow-up):

  • Boundary forcing in the per-sample tuple. @joeloskarsson's model-side follow-up brings this in via a per-source ForecastBatch.
  • Per-category inputs / outputs schema fields on DatastoreSelection, plus the variable filtering inside WeatherDataset that consumes them. Data-loader follow-up of mine; lands in lockstep with @joeloskarsson's model-side adapter so feature dims agree.
  • Diagnostic outputs (predict-only categories). Orthogonal to the schema and the filtering work, separate follow-up.
  • Variable name collision validator across datastores. Lands together with the filtering follow-up since the validator only matters once outputs: are declared.
  • Multi-datastore support beyond the single-entry case. Returns with the filtering / model-side adapter PRs.

This PR is an alternative to #635, not a successor. Exactly one of #656 / #635 should land on main. If #656 goes in, #635 closes and Joel's model-side follow-up plugs boundary forcing into the multi-source dict shape introduced here. If #635 goes in, #656 closes. #636 (boundary plotting) stays on #635 today and rebases onto whichever lands.

Issue Link

refs #652, alternative to #635

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

The schema change is breaking for any external YAML config or programmatic NeuralLAMConfig / WeatherDataset / WeatherDataModule construction. All in-repo example YAMLs and tests are updated.

Checklist before requesting a review

  • My branch is up-to-date with the target branch
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes (no user-facing CLI change; the schema change shows up in the example config only)
  • I have added tests that prove my fix is effective or that my feature works (existing test suite passes; test_config.py exercises the new datastores: dict shape)
  • I have given the PR a name that clearly describes the change, written in imperative form
  • I have requested a reviewer and an assignee

Checklist for reviewers

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section reflecting type of change (changed).

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • (if the PR is not just maintenance/bugfix) the PR is assigned to the next milestone
  • author has added an entry to the changelog
  • Once the PR is ready to be merged, squash commits and merge the PR.

Replace the single `datastore:` top-level config field with a
`datastores:` mapping keyed by user-chosen names. Each entry is a
DatastoreSelection with optional per-category `inputs` / `outputs`
declarations; one datastore must declare outputs (the interior /
prognostic source) and zero or more may contribute input-only sources
that are reserved for the model-side multi-source consumption (the
mllam#652 follow-up).

WeatherDataset and WeatherDataModule take `datastores` and
`selections` dicts; their per-sample return shape and the model unpack
are unchanged from current main, so this is a config + data-loader
constructor refactor only. Internally the dataset still operates on
the interior datastore alone.

load_config_and_datastore returns (config, Dict[str, BaseDatastore]).
A config-time validator rejects two datastores declaring the same
output variable name, with an error message pointing at mdp's
`dim_mapping.name_format` and `xr.Dataset.assign_coords` as the two
ways to disambiguate.

Other callers updated:
- train_model.py resolves interior + boundary roles for the legacy
  single-source model side.
- create_graph.py and plot_graph.py resolve the interior datastore
  via `_resolve_datastore_roles` instead of the old 2-tuple.
- module.py refactors `_create_dataarray_from_tensor` to use a new
  `WeatherDataset.build_dataarray_from_tensor` staticmethod so the
  model doesn't need to instantiate a full WeatherDataset with the
  new dict signature.

This PR is an alternative to mllam#635: it adopts the public schema
proposed in mllam#652 without bringing in mllam#635's internal boundary
loading. Boundary forcing, multi-source inputs and diagnostic
outputs land via the mllam#652 model-side follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sadamov sadamov force-pushed the feat/multi-datastore-config branch from 3c8a0f0 to f40dea6 Compare June 9, 2026 07:26
sadamov and others added 2 commits June 9, 2026 09:29
Match the marker registration used on mllam#635/mllam#651 so any future
@pytest.mark.slow test on this branch is recognised by pytest
without warnings. No tests currently use the marker on mllam#656.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`xr.Dataset.assign_coords` returns a new dataset and does not touch
disk; the fix-hint in `_validate_output_name_collisions` claimed it
was an in-place rename, which would have led users astray. Replace
with a small zarr-python snippet that overwrites the
`{category}_feature` coord array directly, which is the actual
milliseconds-scale in-place op the message intends.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sadamov sadamov self-assigned this Jun 9, 2026
@sadamov sadamov added enhancement New feature or request discussion labels Jun 9, 2026
@sadamov sadamov added this to the v0.8.0 milestone Jun 9, 2026
Strip the optional `inputs` / `outputs` fields from `DatastoreSelection`
and the matching `_validate_output_name_collisions` validator. They
were parsed but never consumed at runtime, so reviewers would have
asked what they do and the honest answer was "nothing yet".

Both pieces (per-category include-lists and the output-name collision
validator) belong with the data-loader filtering follow-up, which
must land in lockstep with @joeloskarsson's model-side adapter so
feature dimensions agree.

`_resolve_datastore_roles` also goes away. With no `outputs` field to
distinguish interior from boundary, role resolution would be a guess
anyway. Instead `load_config_and_datastore` and `WeatherDataset` now
require exactly one entry in the `datastores:` dict; multi-source
support comes back with the filtering follow-up. The single entry is
picked as the legacy `self.datastore` interior view used by
slicing/windowing/plotting.

Net effect: mllam#656 is now just the `datastore:` -> `datastores:` dict
shape rename. The diagnostic / filtering / collision-validator work
each land in their own follow-up PRs.

Also gitignore `.github/draft-*.md` since the comment-draft files are
local working notes, not part of the repo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

discussion enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant