PCGB by alxmrs · Pull Request #733 · m2lines/Samudra

alxmrs · 2026-05-06T21:31:39Z

No description provided.

…more masks/drop out.

The drop-path reference checkpoint (samudra_om4_v2_drop_path_new_data) is no longer on torch's filesystem. Pivot the diagnostic to the strongest available 1° baseline, the E1 dense+dilated ckpt (/scratch/am16581/runs/om4_samudra_v2_dense_dilated_v1/), which the other Claude explicitly recommended as the PCGB target.

boosted_model_e1.yaml mirrors E1's architecture (dilation [1,8,16,16], drop_path_rate=0, conv_next_block); boosted_pcgb_e1.yaml drives both the diagnostic and (if greenlit) the eventual PCGB training run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Single-GPU rtx6000, 2h timeout. Bind-mounts host src/ + configs/ + scripts/ into the container (MOUNT_SOURCE=1 default) so the new pcgb_diagnostic.py and boosted_pcgb_e1.yaml resolve without a container rebuild. This is the same convention the kernel-branch train sbatch uses.

Invokes /workspace/scripts/pcgb_diagnostic.py with the config + CKPT_PATH override; CKPT_PATH falls back to whatever resume_ckpt_path is in the YAML.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The compact (lev-dim) OM4.zarr at /scratch/am16581/data/om4_onedeg_v3 makes `Normalize.__init__` fail with KeyError 'uo_0': the codebase's normal training path uses an upstream `scripts/stage_data.py` (kernel branch) that pre-flattens the zarr into level-encoded variables. We don't have stage_data on the boosting branch and the diagnostic doesn't need a trained Normalize. The model in the E1 ckpt has zero `corrector.*` state keys, so the only consumer of Normalize inside `cfg.model.build()` is dead code for this config.

- `boosted_model_e1.yaml`: set `corrector: null` so the build skips both the Correctors construction AND the Normalize requirement.
- `pcgb_diagnostic.py`: pass `normalize=None` to `cfg.model.build()`; also coerce TrainConfig's `backend=nccl` → `cuda` for the single-GPU eval backend (mypy fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After PR #669, the OM4 data has lev-dim prognostics but masks remain split (mask_0..mask_18). The mask names match the level-encoding regex, so _is_compact wrongly returned False, sending validate_data down the non-compact branch where with_level_index_vars only handles "var_lev_<depth>" string names and never expands the lev dim. Result: Normalize.filter(["uo_0", ...]) hit KeyError "uo_0" at startup.

Excluding mask_* from the check restores compactness detection so the filter's compact branch decodes "uo_0" -> data["uo"].isel(lev=0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
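The detection failure described in this commit can be sketched in a few lines. This is an illustration only: the regex, the `is_compact` signature, and the `decode` helper are hypothetical stand-ins, not the repository's `_is_compact` implementation.

```python
import re

# Hypothetical pattern for level-encoded names such as "uo_0" or "mask_18".
LEVEL_ENCODED = re.compile(r"^(?P<var>[A-Za-z]+)_(?P<lev>\d+)$")


def is_compact(var_names, dims_by_var, ignore_masks=True):
    """Return True when variables carry a `lev` dim rather than per-level names.

    With ignore_masks=True, split masks (mask_0..mask_18) are excluded from the
    check, because they stay level-encoded even in the compact layout.
    """
    names = [n for n in var_names if not (ignore_masks and n.startswith("mask_"))]
    has_level_encoded = any(LEVEL_ENCODED.match(n) for n in names)
    has_lev_dim = any("lev" in dims_by_var.get(n, ()) for n in names)
    return has_lev_dim and not has_level_encoded


def decode(name):
    """Split "uo_0" into ("uo", 0); names without a level suffix map to (name, None)."""
    match = LEVEL_ENCODED.match(name)
    if match is None:
        return name, None
    return match.group("var"), int(match.group("lev"))
```

With `ignore_masks=False` the mask names alone are enough to make the check report a non-compact layout, which mirrors the failure mode in the commit message.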

`Dataset.to_array().reshape(-1)` broadcasts non-lev variables (zos) over the lev dim, producing 5*19=95 elements instead of the expected 4*19+1=77. This caused `assert data.shape[-3] == self._prognostic_mean_np.shape[0]` to fire in `unnormalize_tensor_prognostic` during validation.

`_flatten` (defined in the same file) uses conditional_rearrange to handle mixed lev/non-lev variables correctly, producing the right per-channel order for both compact (lev-dim) and split data layouts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
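The channel arithmetic behind the 95-versus-77 mismatch can be checked with a small sketch. The variable names other than `uo` and `zos` are assumptions, and the two functions only count channels; they do not reproduce `_flatten` or `conditional_rearrange`.

```python
N_LEV = 19

# True means the variable has a `lev` dimension. Names other than "uo" and
# "zos" are assumed for illustration.
HAS_LEV = {"thetao": True, "so": True, "uo": True, "vo": True, "zos": False}


def channels_when_broadcast(has_lev, n_lev):
    """Every variable is broadcast over lev, including surface-only ones."""
    return len(has_lev) * n_lev


def channels_when_flattened(has_lev, n_lev):
    """Lev variables contribute n_lev channels; surface variables contribute one."""
    return sum(n_lev if flag else 1 for flag in has_lev.values())


print(channels_when_broadcast(HAS_LEV, N_LEV))  # 5 * 19
print(channels_when_flattened(HAS_LEV, N_LEV))  # 4 * 19 + 1
```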

…nted source.

YAML (`boosted_pcgb_e1.yaml`):
- `mask_no_repeat_window: 0 → 2`. The diagnostic on the E1 ckpt showed bimodal 16-mask spread (188.8%): 8 bit-3-dropped masks cluster near 0.52 MSE vs ~0.025 for bit-3-kept. Without no-repeat, the adversarial argmax would lock onto a single mask (e.g. s1110, +2203%) every round and starve the other 15 of capacity reallocation.
- `finetune: true`. Warm-start from the E1 ckpt: loads weights only, no optimizer/scheduler/epoch resumption. PCGB's round loop starts fresh from round 1 on top of the existing weights.
- `experiment.name: pcgb_diagnostic_e1 → pcgb_train_e1`.

sbatch (`slurm_apptainer_pcgb.sbatch`):
- Adds MOUNT_SOURCE bind-mounts (src/ + configs/ + scripts/) so PCGB code and configs from the host repo are visible inside the container without a rebuild. Same convention as kernel-branch train sbatch.
- Adds CKPT_PATH parent-dir bind when the checkpoint lives outside DATA_ROOT/OUTPUT_BASE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
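A minimal sketch of how a no-repeat window changes an adversarial argmax, assuming static per-mask scores (in a real run the scores would change as training proceeds). The function names and the toy scores are hypothetical and are not taken from the PCGB code.

```python
from collections import deque


def pick_mask(scores, recent):
    """Return the index of the highest-scoring mask that is not in `recent`."""
    allowed = [i for i in range(len(scores)) if i not in recent]
    return max(allowed, key=lambda i: scores[i])


def run_rounds(scores, n_rounds, window):
    """Select one mask per round, forbidding repeats within the last `window` picks."""
    recent = deque(maxlen=window)  # maxlen=0 keeps nothing, i.e. no restriction
    picks = []
    for _ in range(n_rounds):
        choice = pick_mask(scores, recent)
        picks.append(choice)
        recent.append(choice)
    return picks


# Toy bimodal scores: two "hard" masks and two "easy" ones.
toy_scores = [0.52, 0.025, 0.50, 0.03]
print(run_rounds(toy_scores, 5, window=0))  # locks onto a single mask
print(run_rounds(toy_scores, 5, window=2))  # rotates across masks
```

With `window=0` the same mask wins every round; with `window=2` the selection is forced to visit other masks before returning to the top scorer.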

The round-robin config (boosted_pcgb_round_robin.yaml) mirrors boosted_pcgb.yaml exactly except mask_searcher.schedule=round_robin, which isolates the adversarial-selection contribution from the mask-cycling contribution. Paired with boosted_pcgb_no_reweight.yaml, the two ablations form a clean 2-way decomposition of the PCGB algorithm: mask cycling alone, adversarial alone, both.

Also document Tier 1-3 follow-up experiments in boosted_samudra.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
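To make the difference between the two schedules concrete, here is a hedged sketch of a selector that switches between them. Only the value `round_robin` appears in the commit message; the `"adversarial"` label, the function name, and its arguments are assumptions for illustration.

```python
def next_mask(schedule, round_idx, scores):
    """Choose a mask index for a round (rounds are counted from 1).

    "round_robin" cycles through masks in order and ignores the scores;
    "adversarial" (an assumed label) picks the currently worst-scoring mask.
    """
    n_masks = len(scores)
    if schedule == "round_robin":
        return (round_idx - 1) % n_masks
    if schedule == "adversarial":
        return max(range(n_masks), key=lambda i: scores[i])
    raise ValueError(f"unknown schedule: {schedule!r}")


toy_scores = [0.1, 0.9, 0.3]
print([next_mask("round_robin", r, toy_scores) for r in range(1, 6)])
print([next_mask("adversarial", r, toy_scores) for r in range(1, 6)])
```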

Trainer.run() lazily calls init_data_loaders per-epoch (train.py:436), so the parent's __init__ never sets self.train_loader. PCGB.__init__ was calling _n_train_samples(), which reads self.train_loader.dataset, to size the SampleWeights tensor, triggering AttributeError on every launch. Jobs 8726020 (E1 warm-start) and 8726030 (V2 cold-start) both crashed on this in __init__ within ~2:30 of starting on torch.

Fix is to call init_data_loaders explicitly in PCGB.__init__ once. Since PCGB forces steps=[1] and step_transition=[] earlier in __init__, get_current_step returns 1 and one init is sufficient. PCGB.run() does not need to re-init per round.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
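The crash and its fix follow a common lazy-initialization pattern, sketched below with toy classes. These classes are simplified stand-ins that reuse the names from the commit message; the `eager_init` flag exists only to show both behaviours side by side and is not part of the real code.

```python
class Trainer:
    def __init__(self, data):
        self._data = data  # note: train_loader is intentionally NOT set here

    def init_data_loaders(self):
        # In this sketch the "loader" is just a list of samples.
        self.train_loader = list(self._data)

    def run(self, epochs):
        for _ in range(epochs):
            self.init_data_loaders()  # lazy, per-epoch initialization


class PCGB(Trainer):
    def __init__(self, data, eager_init=True):
        super().__init__(data)
        if eager_init:
            # The fix: build the loaders once before anything reads them.
            self.init_data_loaders()
        n = self._n_train_samples()  # AttributeError when eager_init=False
        self.sample_weights = [1.0 / n] * n

    def _n_train_samples(self):
        return len(self.train_loader)
```

Constructing `PCGB(data, eager_init=False)` reproduces the `AttributeError`, while the default path sizes the weights correctly.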

PCGB.run() overrode Trainer.run() and skipped both train_aggregator.record_batch (train.py:539) and validate_one_epoch (train.py:686), so PCGB runs emitted only pcgb/round/* scalars, with no by-variable, depth, or channel breakdowns. This made it impossible to compare PCGB against baseline trainer wandb runs at the same metric granularity.

Changes:
- _train_round now also constructs a per-batch TrainBatchOutput (loss + decomposed-mse-shaped loss_per_channel) and records it into a round-scoped TrainAggregator. Returns (pcgb_scalars, aggregator_logs).
- Added _validate_round that mirrors Trainer.validate_one_epoch: it runs the full val_loader under the *unmasked* (deployed) backbone with the standard single-scale ValidateAggregator. Image aggregators disabled.
- New PCGBConfig.validate_every_n_rounds knob (default 2) gates val cadence; matches save_round_freq so val and ckpt fire on the same cycle.
- run() merges train/<var> and val/<var> keys into the existing wandb log payload so PCGB plots line up directly with baseline Trainer runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
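The cadence gate and the log merge described in the change list could look roughly like the following. This is a sketch under assumptions: the helper names and the example metric keys are hypothetical, and only `validate_every_n_rounds` and the `pcgb/round/`, `train/`, `val/` prefixes come from the commit message.

```python
def should_validate(round_idx, validate_every_n_rounds=2):
    """True on rounds where validation fires; rounds are counted from 1."""
    if validate_every_n_rounds <= 0:
        return False
    return round_idx % validate_every_n_rounds == 0


def build_payload(pcgb_scalars, train_logs, val_logs=None):
    """Merge per-round scalars with aggregator logs into one flat dict."""
    payload = dict(pcgb_scalars)
    payload.update(train_logs)
    if val_logs:
        payload.update(val_logs)
    return payload


for round_idx in range(1, 5):
    val = {"val/uo": 0.2} if should_validate(round_idx) else None
    payload = build_payload({"pcgb/round/loss": 0.1}, {"train/uo": 0.3}, val)
    print(round_idx, sorted(payload))
```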

UNetBackbone with ch_width=[280,380,480,520] and n_layers=[1,1,1,1] registers 9 CoreBlocks (4 down + 1 middle + 4 up), not 10. The previous config and design doc both claimed a "+1 final ConvNeXt block" that does not exist in the backbone. The final 1x1 head is not a CoreBlock and therefore not addressable by the mask searcher. PCGB.__init__ asserts searcher.num_blocks ∈ {0, backbone.num_blocks}, so V2 crashed at startup with num_blocks=10.

Flagging the design doc inconsistency separately: either the backbone needs a final CoreBlock added (matching the doc intent), or the doc needs to be updated to say 9 (matching reality). For now, match reality so V2 can run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
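The block count and the startup check can be reproduced with a short sketch. It assumes that each entry of `n_layers` contributes that many blocks on both the down and up paths plus a single middle block, which matches the 4 + 1 + 4 count quoted above but is not verified against UNetBackbone. The `ValueError` stands in for the assertion in `PCGB.__init__`.

```python
def num_core_blocks(n_layers):
    """Count addressable blocks: down path + one middle block + up path."""
    down = sum(n_layers)
    middle = 1
    up = sum(n_layers)
    return down + middle + up


def check_searcher_blocks(searcher_num_blocks, backbone_num_blocks):
    """Allow either no block masking (0) or one mask bit per backbone block."""
    if searcher_num_blocks not in (0, backbone_num_blocks):
        raise ValueError(
            f"searcher.num_blocks={searcher_num_blocks} must be 0 or "
            f"{backbone_num_blocks}"
        )


backbone_blocks = num_core_blocks([1, 1, 1, 1])
check_searcher_blocks(9, backbone_blocks)  # passes
check_searcher_blocks(0, backbone_blocks)  # passes
```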

boosted_eval_e1.yaml mirrors boosted_eval.yaml but uses boosted_model_e1.yaml so the dilation pattern (1, 8, 16, 16) matches the E1 architecture under which the PCGB checkpoint was trained. Loading with the V1 dilation would silently apply the wrong forward-pass behavior to E1 weights.

boosted_pcgb_v2.yaml: lift the E1 fix to v2. `no_repeat_window: 2` prevents the adversarial argmax from sticking on a single cluster every round, addressing the bimodal score distribution seen in the diagnostic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

V2 PCGB (8747769) crashed at round ~6 with:

    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one ... module has parameters that were not used in producing loss.
    Parameter indices which did not receive grad for rank 0: 14 15 16 ... 102 103

When MixtureSearcher samples a mask with block_drops[i]=True, the corresponding CoreBlock's trunk is bypassed (Veit-style block skip: y = y_{i-1}), so its parameters don't receive grad on that step. DDP's bucket reducer requires every param to participate by default, so we rewrap with find_unused_parameters=True when searcher.num_blocks > 0.

Skip-only masks (V1, A, B, E1) are unaffected: skip drops zero the skip *tensor* but every parameter still participates in the forward pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
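The reasoning above can be sketched without PyTorch: a dropped block leaves its parameters without gradients, and the wrapper option is chosen from the searcher's block count. `find_unused_parameters` is the actual `DistributedDataParallel` keyword; the helper names and the toy parameter layout are hypothetical.

```python
def unused_param_indices(params_per_block, block_drops):
    """Indices of parameters that get no gradient when their block is bypassed."""
    unused = []
    for params, dropped in zip(params_per_block, block_drops):
        if dropped:
            unused.extend(params)
    return sorted(unused)


def ddp_kwargs(searcher_num_blocks):
    """Keyword arguments to pass when wrapping the model in DDP.

    Block-level masks can bypass whole blocks, so unused-parameter detection is
    enabled only when the searcher masks blocks at all.
    """
    return {"find_unused_parameters": searcher_num_blocks > 0}


# Toy layout: three blocks owning parameter indices 0-1, 2-3 and 4-5.
layout = [[0, 1], [2, 3], [4, 5]]
print(unused_param_indices(layout, [False, True, False]))
print(ddp_kwargs(9), ddp_kwargs(0))
```

Enabling the option only when needed is a reasonable default, since unused-parameter detection adds an extra traversal of the autograd graph on every iteration.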

V2 trained with residual_drop_rate=0.1 in addition to drop_path_rate=0.5. Both stochastic-depth modules are bypassed at eval time, but loading the model with the matching config keeps state_dict registration explicit in case any future change makes a parameter conditional on those rates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

slurm_apptainer_eval.sbatch was missing the MOUNT_SOURCE bind that the PCGB sbatch already has. Without it, the container only sees configs baked into the image at build time, and the new boosted_eval_e1.yaml and boosted_eval_v2.yaml configs (committed but not yet in a published image) couldn't be loaded. Eval jobs 8820978 and 8820983 failed at the in-container pre-flight check with "config not found inside container".

Mirror the PCGB convention: when MOUNT_SOURCE=1 (default), bind host {src, configs, scripts} over the container's snapshotted equivalents so source-only changes don't require a container rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the Samudra-2 paper baseline comparison script: Niño 3.4 R²/RMSE on deseasoned SST, depth-banded global-mean T R² (0-700/700-2000/2000-7000 m), and deseasoned T snapshot near 2022-09-30 at three depths. Compares against the published Samudra-2 (1°) numbers.

Changes from the kernel-branch version:
- --pred is now required (no hardcoded default to a kernel-branch run).
- New --label arg parameterizes the markdown table column header so the same script works for any PCGB run (E1 v3/v4, V2, ablations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
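For readers unfamiliar with the metrics named above, here is a minimal pure-Python sketch of deseasoning a monthly series and scoring it with RMSE and R². It assumes R² means the coefficient of determination and that deseasoning subtracts a per-calendar-month mean; the comparison script may define either differently. All function names are hypothetical.

```python
import math


def deseason(series, period=12):
    """Subtract the mean of each calendar month from a monthly series."""
    climatology = []
    for month in range(period):
        values = series[month::period]
        climatology.append(sum(values) / len(values))
    return [x - climatology[i % period] for i, x in enumerate(series)]


def rmse(pred, truth):
    """Root-mean-square error between two equal-length series."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def r2(pred, truth):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, truth))
    ss_tot = sum((t - mean) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


# A purely seasonal series has no anomaly left after deseasoning.
seasonal = [float(m % 12) for m in range(24)]
print(deseason(seasonal))
```

A prediction that matches the anomalies exactly scores R² = 1 and RMSE = 0, while predicting zero anomaly everywhere scores R² = 0 when the true anomalies have zero mean.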

alxmrs mentioned this pull request May 6, 2026

Add Dropout to deeper/wider unet #331

Open

alxmrs and others added 12 commits May 14, 2026 13:26

New training algorithm design doc.

21eaf0c

Few-shot impl of PCGB after design discussions.

40f9cd0

Further discussion, adding v1 and v2 experiments. This tests using …

4961e6a

…more masks/drop out.

revision from the other experiment.

70c1276

alxmrs force-pushed the u/alxmrs/boosting branch from f81df6d to dd8a7b8 Compare May 14, 2026 20:27

alxmrs and others added 7 commits May 14, 2026 14:58

alxmrs wants to merge 19 commits into main from u/alxmrs/boosting
