PCGB#733
Draft
alxmrs wants to merge 19 commits into
…more masks/drop out.
The drop-path reference checkpoint (samudra_om4_v2_drop_path_new_data) is no longer on torch's filesystem. Pivot the diagnostic to the strongest available 1° baseline, the E1 dense+dilated ckpt (/scratch/am16581/runs/om4_samudra_v2_dense_dilated_v1/), which the other Claude explicitly recommended as the PCGB target.
boosted_model_e1.yaml mirrors E1's architecture (dilation [1,8,16,16], drop_path_rate=0, conv_next_block); boosted_pcgb_e1.yaml drives both the diagnostic and (if greenlit) the eventual PCGB training run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-GPU rtx6000, 2h timeout. Bind-mounts host src/ + configs/ + scripts/ into the container (MOUNT_SOURCE=1 default) so the new pcgb_diagnostic.py and boosted_pcgb_e1.yaml resolve without a container rebuild, the same convention the kernel-branch train sbatch uses.
Invokes /workspace/scripts/pcgb_diagnostic.py with the config + CKPT_PATH override; CKPT_PATH falls back to whatever resume_ckpt_path is in the YAML.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The compact (lev-dim) OM4.zarr at /scratch/am16581/data/om4_onedeg_v3 makes `Normalize.__init__` fail with KeyError 'uo_0': the codebase's normal training path uses an upstream `scripts/stage_data.py` (kernel branch) that pre-flattens the zarr into level-encoded variables. We don't have stage_data on the boosting branch and the diagnostic doesn't need a trained Normalize. The model in the E1 ckpt has zero `corrector.*` state keys, so the only consumer of Normalize inside `cfg.model.build()` is dead code for this config.
- `boosted_model_e1.yaml`: set `corrector: null` so the build skips both the Correctors construction AND the Normalize requirement.
- `pcgb_diagnostic.py`: pass `normalize=None` to `cfg.model.build()`; also coerce TrainConfig's `backend=nccl` → `cuda` for the single-GPU eval backend (mypy fix).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After PR #669, the OM4 data has lev-dim prognostics but masks remain split (mask_0..mask_18). The mask names match the level-encoding regex, so _is_compact wrongly returned False, sending validate_data down the non-compact branch, where with_level_index_vars only handles "var_lev_<depth>" string names and never expands the lev dim. Result: Normalize.filter(["uo_0", ...]) hit KeyError "uo_0" at startup.
Excluding mask_* from the check restores compactness detection so the filter's compact branch decodes "uo_0" -> data["uo"].isel(lev=0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
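The detection fix described in this commit can be sketched as follows. This is a minimal illustration, not the repository's `_is_compact`: the regex, the function name `is_compact`, the `ignore_prefixes` parameter, and the variable names other than `uo`, `zos`, and `mask_*` are assumptions chosen to reproduce the behavior described.

```python
import re

# Assumed pattern for level-encoded names such as "uo_0" or "mask_18".
LEVEL_ENCODED = re.compile(r"^(?P<name>.+)_(?P<level>\d+)$")


def is_compact(var_names, ignore_prefixes=("mask_",)):
    """Return True when no non-ignored variable name is level-encoded.

    Names starting with an ignored prefix (the split masks) are skipped,
    so mask_0..mask_18 no longer make a lev-dim dataset look split.
    """
    for name in var_names:
        if name.startswith(ignore_prefixes):
            continue
        if LEVEL_ENCODED.match(name):
            return False
    return True


# Hypothetical variable lists for the two layouts.
compact_vars = ["uo", "vo", "thetao", "so", "zos"] + [f"mask_{i}" for i in range(19)]
split_vars = [f"uo_{i}" for i in range(19)] + ["zos"]

print(is_compact(compact_vars))                      # True
print(is_compact(compact_vars, ignore_prefixes=()))  # False (the old behavior)
print(is_compact(split_vars))                        # False
```

Passing an empty `ignore_prefixes` reproduces the misdetection: the mask names alone are enough to make the check return False.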
`Dataset.to_array().reshape(-1)` broadcasts non-lev variables (zos) over the lev dim, producing 5*19=95 elements instead of the expected 4*19+1=77. This caused `assert data.shape[-3] == self._prognostic_mean_np.shape[0]` to fire in `unnormalize_tensor_prognostic` during validation.
`_flatten` (defined in the same file) uses conditional_rearrange to handle mixed lev/non-lev variables correctly, producing the right per-channel order for both compact (lev-dim) and split data layouts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
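The 95-versus-77 channel count can be reproduced with plain NumPy. This is only an illustration of the arithmetic under assumed variable names (`vo`, `thetao`, `so` are placeholders); it does not use xarray or the repository's `_flatten`.

```python
import numpy as np

N_LEV = 19

# Four depth-resolved variables and one surface variable (names assumed).
lev_vars = {
    name: np.full(N_LEV, float(i))
    for i, name in enumerate(["uo", "vo", "thetao", "so"])
}
zos = np.array([9.0])

# Stacking every variable along a new axis forces the surface field to
# broadcast across lev, which mirrors the shape the to_array() path produced.
stacked = np.stack([*lev_vars.values(), np.broadcast_to(zos, (N_LEV,))])
print(stacked.reshape(-1).shape)  # (95,)

# Concatenating per-variable channels keeps the surface field at one channel.
flat = np.concatenate([*lev_vars.values(), zos])
print(flat.shape)  # (77,)
```

The second form matches the expected 4*19+1 layout because each variable contributes only as many channels as it actually has.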
…nted source.
YAML (`boosted_pcgb_e1.yaml`):
- `mask_no_repeat_window: 0 → 2`. The diagnostic on the E1 ckpt showed bimodal 16-mask spread (188.8%): 8 bit-3-dropped masks cluster near 0.52 MSE vs ~0.025 for bit-3-kept. Without no-repeat, the adversarial argmax would lock onto a single mask (e.g. s1110, +2203%) every round and starve the other 15 of capacity reallocation.
- `finetune: true`. Warm-start from the E1 ckpt: loads weights only, no optimizer/scheduler/epoch resumption. PCGB's round loop starts fresh from round 1 on top of the existing weights.
- `experiment.name: pcgb_diagnostic_e1 → pcgb_train_e1`.
sbatch (`slurm_apptainer_pcgb.sbatch`):
- Adds MOUNT_SOURCE bind-mounts (src/ + configs/ + scripts/) so PCGB code and configs from the host repo are visible inside the container without a rebuild, the same convention as the kernel-branch train sbatch.
- Adds CKPT_PATH parent-dir bind when the checkpoint lives outside DATA_ROOT/OUTPUT_BASE.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
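The effect of a no-repeat window on an adversarial argmax can be sketched as below. The functions `pick_mask` and `run_rounds` are hypothetical, and the scores are held fixed across rounds for simplicity, whereas in training they would change as the model updates.

```python
from collections import deque


def pick_mask(scores, recent):
    """Return the index of the highest-scoring mask not in `recent`."""
    candidates = [i for i in range(len(scores)) if i not in recent]
    return max(candidates, key=lambda i: scores[i])


def run_rounds(scores, n_rounds, window):
    """Select one mask per round, excluding the last `window` picks."""
    # window must be smaller than len(scores) so a candidate always remains.
    recent = deque(maxlen=window) if window > 0 else ()
    picks = []
    for _ in range(n_rounds):
        choice = pick_mask(scores, recent)
        picks.append(choice)
        if window > 0:
            recent.append(choice)
    return picks


# Toy per-mask losses: three hard masks and one easy one.
scores = [0.52, 0.51, 0.50, 0.03]
print(run_rounds(scores, 5, window=0))  # [0, 0, 0, 0, 0]
print(run_rounds(scores, 5, window=2))  # [0, 1, 2, 0, 1]
```

With `window=0` the argmax returns the same mask every round; with `window=2` the selection rotates through the hard cluster.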
The round-robin config (boosted_pcgb_round_robin.yaml) mirrors boosted_pcgb.yaml exactly except mask_searcher.schedule=round_robin, which isolates the adversarial-selection contribution from the mask-cycling contribution. Paired with boosted_pcgb_no_reweight.yaml, the two ablations form a clean 2-way decomposition of the PCGB algorithm: mask cycling alone, adversarial alone, both.
Also document Tier 1-3 follow-up experiments in boosted_samudra.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trainer.run() lazily calls init_data_loaders per-epoch (train.py:436), so the parent's __init__ never sets self.train_loader. PCGB.__init__ was calling _n_train_samples(), which reads self.train_loader.dataset, to size the SampleWeights tensor, triggering AttributeError on every launch. Jobs 8726020 (E1 warm-start) and 8726030 (V2 cold-start) both crashed on this in __init__ within ~2:30 of starting on torch.
Fix is to call init_data_loaders explicitly in PCGB.__init__ once. Since PCGB forces steps=[1] and step_transition=[] earlier in __init__, get_current_step returns 1 and one init is sufficient; PCGB.run() does not need to re-init per round.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
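The failure mode and the fix can be reproduced with stand-in classes. Everything here is a simplified assumption: the real `Trainer` and `PCGB` take configs rather than a dataset, and the `init_loaders` flag exists only to show the before and after behavior.

```python
class Trainer:
    """Stand-in for a trainer that builds its loaders lazily."""

    def __init__(self, dataset):
        self._dataset = dataset  # train_loader is deliberately not set here

    def init_data_loaders(self):
        self.train_loader = list(self._dataset)


class PCGB(Trainer):
    def __init__(self, dataset, init_loaders=True):
        super().__init__(dataset)
        if init_loaders:
            self.init_data_loaders()  # the explicit one-time init
        # Sizing per-sample weights needs the loader to exist already.
        self.sample_weights = [1.0] * self._n_train_samples()

    def _n_train_samples(self):
        return len(self.train_loader)


print(len(PCGB(range(8)).sample_weights))  # 8

try:
    PCGB(range(8), init_loaders=False)
except AttributeError as err:
    print("crash without explicit init:", err)
```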
PCGB.run() overrode Trainer.run() and skipped both train_aggregator.record_batch (train.py:539) and validate_one_epoch (train.py:686), so PCGB runs emitted only pcgb/round/* scalars, with no by-variable, depth, or channel breakdowns. This made it impossible to compare PCGB against baseline trainer wandb runs at the same metric granularity.
Changes:
- _train_round now also constructs a per-batch TrainBatchOutput (loss + decomposed-mse-shaped loss_per_channel) and records it into a round-scoped TrainAggregator. Returns (pcgb_scalars, aggregator_logs).
- Added _validate_round that mirrors Trainer.validate_one_epoch: it runs the full val_loader under the *unmasked* (deployed) backbone with the standard single-scale ValidateAggregator. Image aggregators disabled.
- New PCGBConfig.validate_every_n_rounds knob (default 2) gates val cadence; it matches save_round_freq so val and ckpt fire on the same cycle.
- run() merges train/<var> and val/<var> keys into the existing wandb log payload so PCGB plots line up directly with baseline Trainer runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
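A rough sketch of the validation cadence and the merged log payload follows. The helper names `should_validate` and `build_log_payload`, the 1-indexed round convention, and the example metric keys are assumptions for illustration; only the `pcgb/round/*`, `train/<var>`, and `val/<var>` key prefixes come from the commit message.

```python
def should_validate(round_idx, every_n):
    """Rounds are assumed 1-indexed; validate on every `every_n`-th round."""
    return every_n > 0 and round_idx % every_n == 0


def build_log_payload(pcgb_scalars, train_logs, val_logs=None):
    """Merge round scalars with aggregator logs into one flat dict."""
    payload = dict(pcgb_scalars)
    payload.update(train_logs)
    if val_logs:
        payload.update(val_logs)
    return payload


for round_idx in range(1, 5):
    # Hypothetical metric values; a real run would take these from the aggregators.
    val_logs = {"val/thetao": 0.02} if should_validate(round_idx, every_n=2) else None
    payload = build_log_payload(
        {"pcgb/round/loss": 0.1}, {"train/thetao": 0.03}, val_logs
    )
    print(round_idx, sorted(payload))
```

Keeping the payload flat means one logging call per round carries both the PCGB scalars and the per-variable breakdowns.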
UNetBackbone with ch_width=[280,380,480,520] and n_layers=[1,1,1,1]
registers 9 CoreBlocks (4 down + 1 middle + 4 up), not 10. The previous
config and design doc both claimed a "+1 final ConvNeXt block" that
does not exist in the backbone — the final 1x1 head is not a CoreBlock
and therefore not addressable by the mask searcher. PCGB.__init__
asserts searcher.num_blocks ∈ {0, backbone.num_blocks}, so V2 crashed
at startup with num_blocks=10.
Flagging the design doc inconsistency separately — either the backbone
needs a final CoreBlock added (matching the doc intent), or the doc
needs to be updated to say 9 (matching reality). For now, match reality
so V2 can run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
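The block count can be checked with a small helper. This is a sketch under a stated assumption about the layout (n_layers[i] blocks per down stage, the same number per up stage, one middle block, and a 1x1 head that is not counted); `count_core_blocks` and `check_searcher` are hypothetical names, not functions from the repository.

```python
def count_core_blocks(ch_width, n_layers):
    """Count addressable blocks: down stages + one middle block + up stages."""
    assert len(ch_width) == len(n_layers)
    down = sum(n_layers)
    up = sum(n_layers)
    return down + 1 + up


def check_searcher(searcher_num_blocks, backbone_num_blocks):
    """Mirror of the startup check: block masking is either off or covers every block."""
    return searcher_num_blocks in {0, backbone_num_blocks}


n_blocks = count_core_blocks([280, 380, 480, 520], [1, 1, 1, 1])
print(n_blocks)                      # 9
print(check_searcher(10, n_blocks))  # False: the config that crashed
print(check_searcher(9, n_blocks))   # True
```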
boosted_eval_e1.yaml mirrors boosted_eval.yaml but uses boosted_model_e1.yaml so the dilation pattern (1, 8, 16, 16) matches the E1 architecture under which the PCGB checkpoint was trained. Loading with the V1 dilation would silently apply the wrong forward-pass behavior to E1 weights.
boosted_pcgb_v2.yaml: lift the E1 fix to v2. `no_repeat_window: 2` prevents the adversarial argmax from sticking on a single cluster every round, addressing the bimodal score distribution seen in the diagnostic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V2 PCGB (8747769) crashed at round ~6 with:
RuntimeError: Expected to have finished reduction in the prior
iteration before starting a new one ... module has parameters that
were not used in producing loss. Parameter indices which did not
receive grad for rank 0: 14 15 16 ... 102 103
When MixtureSearcher samples a mask with block_drops[i]=True, the
corresponding CoreBlock's trunk is bypassed (Veit-style block skip:
y = y_{i-1}), so its parameters don't receive grad on that step.
DDP's bucket reducer requires every param to participate by default,
so we rewrap with find_unused_parameters=True when searcher.num_blocks
> 0. Skip-only masks (V1, A, B, E1) are unaffected: skip drops zero
the skip *tensor* but every parameter still participates in the
forward pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
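Why a dropped block leaves parameters without gradients can be shown on a toy model. `SkippableBlock` and `TinyBackbone` are hypothetical stand-ins for the CoreBlock stack, and the DDP wrap is shown only as a comment because it needs an initialized process group.

```python
import torch
from torch import nn


class SkippableBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.trunk = nn.Linear(dim, dim)

    def forward(self, x, drop=False):
        # Block skip: return the input untouched when the block is dropped.
        return x if drop else x + self.trunk(x)


class TinyBackbone(nn.Module):
    def __init__(self, dim=4, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList([SkippableBlock(dim) for _ in range(n_blocks)])

    def forward(self, x, block_drops):
        for block, drop in zip(self.blocks, block_drops):
            x = block(x, drop)
        return x


torch.manual_seed(0)
model = TinyBackbone()
loss = model(torch.randn(2, 4), block_drops=[False, True, False]).sum()
loss.backward()

# Parameters of the bypassed block never entered the graph, so grad is None.
unused = [name for name, p in model.named_parameters() if p.grad is None]
print(unused)  # ['blocks.1.trunk.weight', 'blocks.1.trunk.bias']

# Under DDP the wrapper would then need the flag, for example:
# ddp = nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
```

Setting `find_unused_parameters=True` adds a graph traversal on each forward pass, which is a reason to enable it only when block drops are actually in use.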
V2 trained with residual_drop_rate=0.1 in addition to drop_path_rate=0.5. Both stochastic-depth modules are bypassed at eval time, but loading the model with the matching config keeps state_dict registration explicit in case any future change makes a parameter conditional on those rates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
slurm_apptainer_eval.sbatch was missing the MOUNT_SOURCE bind that the
PCGB sbatch already has. Without it, the container only sees configs
baked into the image at build time, and the new boosted_eval_e1.yaml
and boosted_eval_v2.yaml configs (committed but not yet in a published
image) couldn't be loaded — eval jobs 8820978 and 8820983 failed at
the in-container pre-flight check with "config not found inside
container".
Mirror the PCGB convention: when MOUNT_SOURCE=1 (default), bind host
{src, configs, scripts} over the container's snapshotted equivalents
so source-only changes don't require a container rebuild.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
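The bind convention can be sketched as a helper that builds Apptainer `--bind` arguments. The sbatch itself is a shell script; this Python version is only an illustration. The function name `bind_args`, the host path, and the container root `/workspace` for `src` and `configs` are assumptions (only `/workspace/scripts` appears in the commit messages).

```python
import os


def bind_args(host_repo, container_root="/workspace", env=None):
    """Return --bind flags for src, configs, and scripts when MOUNT_SOURCE=1."""
    env = os.environ if env is None else env
    if env.get("MOUNT_SOURCE", "1") != "1":
        return []  # fall back to whatever was baked into the image
    args = []
    for sub in ("src", "configs", "scripts"):
        args += ["--bind", f"{host_repo}/{sub}:{container_root}/{sub}"]
    return args


# Hypothetical host checkout path.
print(bind_args("/home/user/repo", env={}))
print(bind_args("/home/user/repo", env={"MOUNT_SOURCE": "0"}))  # []
```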
Adds the Samudra-2 paper baseline comparison script: Niño 3.4 R²/RMSE on deseasoned SST, depth-banded global-mean T R² (0-700/700-2000/2000-7000 m), and deseasoned T snapshot near 2022-09-30 at three depths. Compares against the published Samudra-2 (1°) numbers.
Changes from the kernel-branch version:
- --pred is now required (no hardcoded default to a kernel-branch run).
- New --label arg parameterizes the markdown table column header so the same script works for any PCGB run (E1 v3/v4, V2, ablations).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
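The R²/RMSE-on-deseasoned-series metric can be sketched on a synthetic monthly index. This is not the comparison script: `deseason` and `r2_rmse` are hypothetical helpers, the series is assumed to be already area-averaged, and the climatology is taken as the per-calendar-month mean over the whole record.

```python
import numpy as np


def deseason(series, period=12):
    """Subtract the mean of each calendar month from a monthly series."""
    series = np.asarray(series, dtype=float)
    out = series.copy()
    for m in range(period):
        out[m::period] -= series[m::period].mean()
    return out


def r2_rmse(truth, pred):
    """Coefficient of determination and root-mean-square error."""
    truth = np.asarray(truth, dtype=float)
    pred = np.asarray(pred, dtype=float)
    resid = truth - pred
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((truth - truth.mean()) ** 2)
    return 1.0 - ss_res / ss_tot, float(np.sqrt(np.mean(resid ** 2)))


# Synthetic data: a shared seasonal cycle plus a slow 40-month oscillation
# that the "prediction" captures at 80% amplitude.
months = np.arange(120)
seasonal = 2.0 * np.cos(2 * np.pi * months / 12)
slow = np.sin(2 * np.pi * months / 40)
truth = seasonal + 0.5 * slow
pred = seasonal + 0.4 * slow

r2, rmse = r2_rmse(deseason(truth), deseason(pred))
print(f"R2={r2:.3f} RMSE={rmse:.4f}")
```

Deseasoning first keeps the shared annual cycle from inflating R²; on the raw series both curves would agree mostly because of the seasonal term.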
No description provided.