feat(config): adopt #652 multi-datastore schema (config + WeatherDataset constructor only) #656
Open
sadamov wants to merge 4 commits into
Replace the single `datastore:` top-level config field with a `datastores:` mapping keyed by user-chosen names. Each entry is a DatastoreSelection with optional per-category `inputs` / `outputs` declarations; one datastore must declare outputs (the interior / prognostic source) and zero or more may contribute input-only sources that are reserved for the model-side multi-source consumption (the mllam#652 follow-up).

WeatherDataset and WeatherDataModule take `datastores` and `selections` dicts; their per-sample return shape and the model unpack are unchanged from current main, so this is a config + data-loader constructor refactor only. Internally the dataset still operates on the interior datastore alone.

load_config_and_datastore returns (config, Dict[str, BaseDatastore]). A config-time validator rejects two datastores declaring the same output variable name, with an error message pointing at mdp's `dim_mapping.name_format` and `xr.Dataset.assign_coords` as the two ways to disambiguate.

Other callers updated:
- train_model.py resolves interior + boundary roles for the legacy single-source model side.
- create_graph.py and plot_graph.py resolve the interior datastore via `_resolve_datastore_roles` instead of the old 2-tuple.
- module.py refactors `_create_dataarray_from_tensor` to use a new `WeatherDataset.build_dataarray_from_tensor` staticmethod so the model doesn't need to instantiate a full WeatherDataset with the new dict signature.

This PR is an alternative to mllam#635: it adopts the public schema proposed in mllam#652 without bringing in mllam#635's internal boundary loading. Boundary forcing, multi-source inputs and diagnostic outputs land via the mllam#652 model-side follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
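To make the collision rule in this commit message concrete, here is a minimal sketch of a check that rejects two datastores declaring the same output variable name. It is not the PR's `_validate_output_name_collisions`: the function names, the plain dict-of-lists input, and the datastore and variable names in the usage lines are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def find_output_name_collisions(
    outputs_by_datastore: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Map each output name declared by two or more datastores to its owners.

    Hypothetical helper: the input maps a datastore name to the output
    variable names it declares.
    """
    owners = defaultdict(list)
    for datastore_name, variable_names in outputs_by_datastore.items():
        # Deduplicate so a name repeated inside one datastore is counted once.
        for variable_name in sorted(set(variable_names)):
            owners[variable_name].append(datastore_name)
    return {name: names for name, names in owners.items() if len(names) > 1}


def validate_output_names(outputs_by_datastore: Dict[str, List[str]]) -> None:
    """Raise ValueError if any output variable name has more than one owner."""
    collisions = find_output_name_collisions(outputs_by_datastore)
    if collisions:
        details = "; ".join(
            f"{name!r} declared by {owners}" for name, owners in collisions.items()
        )
        raise ValueError(f"Output variable name collision: {details}")


# Illustrative names only; this call passes because no output name is shared.
validate_output_names({"danra": ["t2m"], "era5": ["u10"]})
```

Running the check at config-load time, as the commit describes, would surface a collision before any data is read.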
3c8a0f0 to
f40dea6
Compare
`xr.Dataset.assign_coords` returns a new dataset and does not touch
disk; the fix-hint in `_validate_output_name_collisions` claimed it
was an in-place rename, which would have led users astray. Replace
with a small zarr-python snippet that overwrites the
`{category}_feature` coord array directly, which is the actual
milliseconds-scale in-place op the message intends.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Strip the optional `inputs` / `outputs` fields from `DatastoreSelection` and the matching `_validate_output_name_collisions` validator. They were parsed but never consumed at runtime, so reviewers would have asked what they do and the honest answer was "nothing yet". Both pieces (per-category include-lists and the output-name collision validator) belong with the data-loader filtering follow-up, which must land in lockstep with @joeloskarsson's model-side adapter so feature dimensions agree.

`_resolve_datastore_roles` also goes away. With no `outputs` field to distinguish interior from boundary, role resolution would be a guess anyway. Instead `load_config_and_datastore` and `WeatherDataset` now require exactly one entry in the `datastores:` dict; multi-source support comes back with the filtering follow-up. The single entry is picked as the legacy `self.datastore` interior view used by slicing/windowing/plotting.

Net effect: mllam#656 is now just the `datastore:` -> `datastores:` dict shape rename. The diagnostic / filtering / collision-validator work each land in their own follow-up PRs.

Also gitignore `.github/draft-*.md` since the comment-draft files are local working notes, not part of the repo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
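The "exactly one entry" rule described in this commit can be sketched as below. This is an assumption about the shape of the check, not the code in the PR: `pick_single_entry` is a hypothetical helper name, and the values in the dict stand in for loaded datastore objects.

```python
from typing import Dict, Tuple, TypeVar

T = TypeVar("T")


def pick_single_entry(datastores: Dict[str, T]) -> Tuple[str, T]:
    """Return the only (name, datastore) pair, or raise if there is not exactly one.

    Hypothetical helper illustrating the single-entry restriction.
    """
    if len(datastores) != 1:
        raise ValueError(
            "Expected exactly one entry in `datastores`, "
            f"got {len(datastores)}: {sorted(datastores)}"
        )
    # Safe because the length check above guarantees one item.
    name, datastore = next(iter(datastores.items()))
    return name, datastore


# Usage with a placeholder object in place of a real datastore instance.
name, interior = pick_single_entry({"danra": object()})
```

Failing loudly on zero or several entries keeps the restriction visible to users until multi-source support returns in the follow-up.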
Describe your changes
Adopt the multi-datastore configuration schema proposed in #652 as an alternative to #635. Replaces the single `datastore:` top-level field with a `datastores:` mapping of named selections. This PR ships only the dict shape; per-category variable filtering, diagnostic outputs, and the name-collision validator each land in their own follow-up so we do not declare schema fields that have no runtime effect.

What is in this PR:
- `NeuralLAMConfig.datastores: Dict[str, DatastoreSelection]` (was `datastore: DatastoreSelection`). `DatastoreSelection` keeps the existing `kind` and `config_path` fields; no new schema decoration.
- `load_config_and_datastore` returns `(config, Dict[str, BaseDatastore])` so all loaded datastores travel together. It enforces exactly one entry today; the limit is removed in the filtering follow-up.
- `WeatherDataset` and `WeatherDataModule` take a `datastores` dict and pull the single entry as the interior alias. The per-sample return shape is unchanged from current `main` (`init_states, target_states, forcing, target_times`).
- `WeatherDataset.build_dataarray_from_tensor` staticmethod so callers with only a datastore (the model in `module.py`) can construct DataArrays without instantiating a full `WeatherDataset` with the new dict signature.

What is intentionally not in this PR (each lands in a dedicated follow-up):
- `ForecastBatch`.
- `inputs` / `outputs` schema fields on `DatastoreSelection`, plus the variable filtering inside `WeatherDataset` that consumes them. Data-loader follow-up of mine; lands in lockstep with @joeloskarsson's model-side adapter so feature dims agree.
- `outputs:` are declared.

This PR is an alternative to #635, not a successor. Exactly one of #656 / #635 should land on main. If #656 goes in, #635 closes and Joel's model-side follow-up plugs boundary forcing into the multi-source dict shape introduced here. If #635 goes in, #656 closes. #636 (boundary plotting) stays on #635 today and rebases onto whichever lands.
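Because the schema change is a shape rename, an external config in the old shape could be adapted mechanically. The sketch below works on an already-parsed config dict; `to_datastores_shape`, the default entry name, and the example `config_path` value are all hypothetical and are not part of this PR.

```python
from copy import deepcopy
from typing import Any, Dict


def to_datastores_shape(raw_config: Dict[str, Any], name: str = "interior") -> Dict[str, Any]:
    """Return a copy of raw_config with a legacy `datastore` entry wrapped
    under `datastores` using a caller-chosen name (hypothetical helper)."""
    config = deepcopy(raw_config)  # never mutate the caller's dict
    if "datastores" in config:
        if "datastore" in config:
            raise ValueError("Config declares both `datastore` and `datastores`")
        return config  # already in the new shape
    if "datastore" not in config:
        raise KeyError("Config has neither `datastore` nor `datastores`")
    config["datastores"] = {name: config.pop("datastore")}
    return config


# Placeholder values; `kind` and `config_path` are the two existing fields.
legacy = {"datastore": {"kind": "mdp", "config_path": "example.datastore.yaml"}}
migrated = to_datastores_shape(legacy, name="danra")
```

The helper returns a copy so the legacy dict stays available for comparison while configs are being updated.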
Issue Link
refs #652, alternative to #635
Type of change
The schema change is breaking for any external YAML config or programmatic `NeuralLAMConfig` / `WeatherDataset` / `WeatherDataModule` construction. All in-repo example YAMLs and tests are updated.
Checklist before requesting a review
`test_config.py` exercises the new `datastores:` dict shape)
Author checklist after completed review
Checklist for assignee