PCGB#733

Draft
alxmrs wants to merge 19 commits into main from u/alxmrs/boosting

Conversation

alxmrs commented May 6, 2026

Member

No description provided.

alxmrs and others added 12 commits May 14, 2026 13:26
The drop-path reference checkpoint (samudra_om4_v2_drop_path_new_data) is
no longer on torch's filesystem. Pivot the diagnostic to the strongest
available 1° baseline — the E1 dense+dilated ckpt
(/scratch/am16581/runs/om4_samudra_v2_dense_dilated_v1/) — which the
other Claude explicitly recommended as the PCGB target.

boosted_model_e1.yaml mirrors E1's architecture (dilation [1,8,16,16],
drop_path_rate=0, conv_next_block); boosted_pcgb_e1.yaml drives both
the diagnostic and (if greenlit) the eventual PCGB training run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-GPU rtx6000, 2h timeout. Bind-mounts host src/ + configs/ + scripts/
into the container (MOUNT_SOURCE=1 default) so the new pcgb_diagnostic.py
and boosted_pcgb_e1.yaml resolve without a container rebuild — same
convention the kernel-branch train sbatch uses.

Invokes /workspace/scripts/pcgb_diagnostic.py with the config + CKPT_PATH
override; CKPT_PATH falls back to whatever resume_ckpt_path is in the YAML.
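
The fallback described above can be sketched as follows. This is a minimal illustration, not the actual `pcgb_diagnostic.py` logic: `resolve_ckpt_path` is a hypothetical helper, and reading the override from an environment variable and the config from a plain dict are assumptions.

```python
import os


def resolve_ckpt_path(cfg, env=None):
    """Prefer a CKPT_PATH override; otherwise use resume_ckpt_path from the config.

    `cfg` is assumed to be a dict loaded from the YAML, and `env` defaults to
    the process environment (both are illustrative assumptions).
    """
    env = os.environ if env is None else env
    override = env.get("CKPT_PATH")
    # An unset or empty override falls through to the YAML value.
    return override if override else cfg["resume_ckpt_path"]


cfg = {"resume_ckpt_path": "/path/from/yaml.ckpt"}  # placeholder path
print(resolve_ckpt_path(cfg, env={}))
print(resolve_ckpt_path(cfg, env={"CKPT_PATH": "/path/from/override.ckpt"}))
```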

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The compact (lev-dim) OM4.zarr at /scratch/am16581/data/om4_onedeg_v3 makes
`Normalize.__init__` fail with KeyError 'uo_0': the codebase's normal
training path uses an upstream `scripts/stage_data.py` (kernel branch)
that pre-flattens the zarr into level-encoded variables. We don't have
stage_data on the boosting branch and the diagnostic doesn't need a
trained Normalize — the model in the E1 ckpt has zero `corrector.*` state
keys, so the only consumer of Normalize inside `cfg.model.build()` is
dead code for this config.

- `boosted_model_e1.yaml`: set `corrector: null` so the build skips both
  the Correctors construction AND the Normalize requirement.
- `pcgb_diagnostic.py`: pass `normalize=None` to `cfg.model.build()`; also
  coerce TrainConfig's `backend=nccl` → `cuda` for the single-GPU eval
  backend (mypy fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After PR #669, the OM4 data has lev-dim prognostics but masks remain
split (mask_0..mask_18). The mask names match the level-encoding regex,
so _is_compact wrongly returned False, sending validate_data down the
non-compact branch where with_level_index_vars only handles
"var_lev_<depth>" string names — never expanding the lev dim. Result:
Normalize.filter(["uo_0", ...]) hit KeyError "uo_0" at startup.

Excluding mask_* from the check restores compactness detection so the
filter's compact branch decodes "uo_0" -> data["uo"].isel(lev=0).
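
A rough sketch of the detection rule, for illustration only. The regex, `is_compact`, and `decode` below are assumptions standing in for the real `_is_compact` and filter code, which may differ in detail.

```python
import re

# Assumed pattern for split, level-encoded names such as "uo_0" or "thetao_18".
LEVEL_ENCODED = re.compile(r"^(?P<var>.+)_(?P<lev>\d+)$")


def is_compact(var_names):
    """Return True when no prognostic variable uses a split "<var>_<level>" name.

    Mask variables are skipped: they stay split (mask_0..mask_18) even when
    the prognostics carry a lev dimension, so they must not affect the check.
    """
    for name in var_names:
        if name.startswith("mask_"):
            continue
        if LEVEL_ENCODED.match(name):
            return False
    return True


def decode(name):
    """Split "uo_0" into ("uo", 0) so a caller can select lev=0 from "uo"."""
    m = LEVEL_ENCODED.match(name)
    if m is None:
        return name, None
    return m.group("var"), int(m.group("lev"))


names = ["uo", "vo", "thetao", "so", "zos"] + [f"mask_{i}" for i in range(19)]
print(is_compact(names), decode("uo_0"))
```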

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`Dataset.to_array().reshape(-1)` broadcasts non-lev variables (zos) over
the lev dim, producing 5*19=95 elements instead of the expected 4*19+1=77.
This caused `assert data.shape[-3] == self._prognostic_mean_np.shape[0]`
to fire in `unnormalize_tensor_prognostic` during validation.

`_flatten` (defined in the same file) uses conditional_rearrange to handle
mixed lev/non-lev variables correctly, producing the right per-channel
order for both compact (lev-dim) and split data layouts.
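
The element counts can be reproduced with a plain NumPy stand-in. This does not use xarray or the real `_flatten`; the variable names and values are illustrative assumptions chosen to match the 4 depth-resolved fields plus `zos` described above.

```python
import numpy as np

N_LEV = 19
# Four depth-resolved prognostics plus one surface-only field (zos).
means = {
    "uo": np.arange(N_LEV, dtype=float),
    "vo": np.arange(N_LEV, dtype=float),
    "thetao": np.arange(N_LEV, dtype=float),
    "so": np.arange(N_LEV, dtype=float),
    "zos": np.array(0.5),  # scalar: no lev dimension
}

# Stacking after broadcasting repeats zos across every level.
broadcast = np.stack(
    [np.broadcast_to(v, (N_LEV,)) for v in means.values()]
).reshape(-1)

# Flattening each variable on its own keeps zos as a single channel.
per_channel = np.concatenate([np.atleast_1d(v) for v in means.values()])

print(broadcast.shape[0], per_channel.shape[0])
```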

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nted source.

YAML (`boosted_pcgb_e1.yaml`):
- `mask_no_repeat_window: 0 → 2`. The diagnostic on the E1 ckpt showed
  bimodal 16-mask spread (188.8%): 8 bit-3-dropped masks cluster near
  0.52 MSE vs ~0.025 for bit-3-kept. Without no-repeat, the adversarial
  argmax would lock onto a single mask (e.g. s1110, +2203%) every round
  and starve the other 15 of capacity reallocation.
- `finetune: true`. Warm-start from the E1 ckpt — loads weights only,
  no optimizer/scheduler/epoch resumption. PCGB's round loop starts
  fresh from round 1 on top of the existing weights.
- `experiment.name: pcgb_diagnostic_e1 → pcgb_train_e1`.
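
To illustrate the effect of a no-repeat window on an argmax selector, here is a toy sketch. `pick_mask`, the four made-up scores, and the fixed-score loop are assumptions; the real mask searcher presumably re-scores masks each round.

```python
from collections import deque


def pick_mask(scores, recent, window):
    """Return the highest-scoring mask index not chosen in the last `window` rounds."""
    blocked = set(list(recent)[-window:]) if window > 0 else set()
    candidates = [i for i in range(len(scores)) if i not in blocked]
    if not candidates:
        # Window covers every mask; fall back to an unrestricted argmax.
        candidates = list(range(len(scores)))
    return max(candidates, key=lambda i: scores[i])


scores = [0.025, 0.52, 0.51, 0.03]  # made-up per-mask losses
recent = deque(maxlen=8)
history = []
for _ in range(6):
    choice = pick_mask(scores, recent, window=2)
    recent.append(choice)
    history.append(choice)
print(history)
```

With `window=0` the same loop would pick index 1 every round, which is the lock-on behavior the change is meant to avoid.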

sbatch (`slurm_apptainer_pcgb.sbatch`):
- Adds MOUNT_SOURCE bind-mounts (src/ + configs/ + scripts/) so PCGB code
  and configs from the host repo are visible inside the container without
  a rebuild — same convention as kernel-branch train sbatch.
- Adds CKPT_PATH parent-dir bind when the checkpoint lives outside
  DATA_ROOT/OUTPUT_BASE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The round-robin config (boosted_pcgb_round_robin.yaml) mirrors
boosted_pcgb.yaml exactly except mask_searcher.schedule=round_robin —
isolates the adversarial-selection contribution from the mask-cycling
contribution. Paired with boosted_pcgb_no_reweight.yaml, the two
ablations and the full configuration give a clean two-factor
decomposition of the PCGB algorithm: mask cycling alone, adversarial
selection alone, and both.

Also document Tier 1-3 follow-up experiments in boosted_samudra.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trainer.run() lazily calls init_data_loaders per-epoch (train.py:436), so
the parent's __init__ never sets self.train_loader. PCGB.__init__ was
calling _n_train_samples() — which reads self.train_loader.dataset — to
size the SampleWeights tensor, triggering AttributeError on every
launch.

Jobs 8726020 (E1 warm-start) and 8726030 (V2 cold-start) both crashed
on this in __init__ within ~2:30 of starting on torch. The fix is to
call init_data_loaders explicitly, once, in PCGB.__init__. Since PCGB
forces steps=[1] and step_transition=[] earlier in __init__,
get_current_step returns 1 and one init is sufficient; PCGB.run() does
not need to re-init per round.
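
The failure mode and the fix can be shown with toy classes. These are stand-ins, not the real `Trainer` or `PCGB`: the `init_data_loaders(step)` signature and the list used in place of a DataLoader are assumptions.

```python
class Trainer:
    def __init__(self):
        # train_loader is deliberately not set here; run() builds it lazily.
        pass

    def init_data_loaders(self, step):
        self.train_loader = list(range(10 * step))  # stand-in for a DataLoader

    def run(self):
        self.init_data_loaders(step=1)


class PCGB(Trainer):
    def __init__(self):
        super().__init__()
        # Build the loaders once, before anything needs the dataset size.
        self.init_data_loaders(step=1)
        self.sample_weights = [1.0] * self._n_train_samples()

    def _n_train_samples(self):
        return len(self.train_loader)


print(hasattr(Trainer(), "train_loader"), len(PCGB().sample_weights))
```

Without the explicit `init_data_loaders` call, `_n_train_samples` would raise `AttributeError` because the parent constructor never sets `train_loader`.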

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alxmrs force-pushed the u/alxmrs/boosting branch from f81df6d to dd8a7b8 on May 14, 2026 20:27
alxmrs and others added 7 commits May 14, 2026 14:58
PCGB.run() overrode Trainer.run() and skipped both train_aggregator.record_batch
(train.py:539) and validate_one_epoch (train.py:686), so PCGB runs emitted
only pcgb/round/* scalars — no by-variable, depth, or channel breakdowns.
This made it impossible to compare PCGB against baseline trainer wandb runs
at the same metric granularity.

Changes:
- _train_round now also constructs a per-batch TrainBatchOutput (loss +
  decomposed-mse-shaped loss_per_channel) and records it into a round-scoped
  TrainAggregator. Returns (pcgb_scalars, aggregator_logs).
- Added _validate_round that mirrors Trainer.validate_one_epoch — runs the
  full val_loader under the *unmasked* (deployed) backbone with the
  standard single-scale ValidateAggregator. Image aggregators disabled.
- New PCGBConfig.validate_every_n_rounds knob (default 2) gates val cadence;
  matches save_round_freq so val and ckpt fire on the same cycle.
- run() merges train/<var> and val/<var> keys into the existing wandb log
  payload so PCGB plots line up directly with baseline Trainer runs.
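
A small sketch of the cadence gate and log merge described in the list above. `should_validate` and `merge_logs` are hypothetical helpers, and firing on rounds divisible by the knob is an assumption about how the cadence is counted.

```python
def should_validate(round_idx, validate_every_n_rounds=2):
    """Assumed rule: validate on rounds that are multiples of the knob."""
    return validate_every_n_rounds > 0 and round_idx % validate_every_n_rounds == 0


def merge_logs(pcgb_scalars, train_logs, val_logs=None):
    """Combine PCGB round scalars with train/ and val/ keys into one payload."""
    payload = dict(pcgb_scalars)
    payload.update(train_logs)
    payload.update(val_logs or {})
    return payload


for round_idx in range(1, 5):
    val_logs = {"val/loss": 0.0} if should_validate(round_idx) else None
    payload = merge_logs({"pcgb/round/loss": 0.0}, {"train/loss": 0.0}, val_logs)
    print(round_idx, sorted(payload))
```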

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UNetBackbone with ch_width=[280,380,480,520] and n_layers=[1,1,1,1]
registers 9 CoreBlocks (4 down + 1 middle + 4 up), not 10. The previous
config and design doc both claimed a "+1 final ConvNeXt block" that
does not exist in the backbone — the final 1x1 head is not a CoreBlock
and therefore not addressable by the mask searcher. PCGB.__init__
asserts searcher.num_blocks ∈ {0, backbone.num_blocks}, so V2 crashed
at startup with num_blocks=10.

Flagging the design doc inconsistency separately — either the backbone
needs a final CoreBlock added (matching the doc intent), or the doc
needs to be updated to say 9 (matching reality). For now, match reality
so V2 can run.
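
The block count can be checked with simple arithmetic. This sketch assumes each stage contributes `n_layers[i]` blocks on the way down and again on the way up, with a single middle block; `count_core_blocks` and `check_searcher` are hypothetical helpers, not backbone code.

```python
def count_core_blocks(ch_width, n_layers):
    """Count addressable blocks: down stages + one middle block + up stages."""
    assert len(ch_width) == len(n_layers)
    down = sum(n_layers)
    up = sum(n_layers)
    return down + 1 + up


def check_searcher(searcher_num_blocks, backbone_num_blocks):
    """Mirror of the startup check: the searcher masks no blocks or all of them."""
    return searcher_num_blocks in {0, backbone_num_blocks}


num_blocks = count_core_blocks([280, 380, 480, 520], [1, 1, 1, 1])
print(num_blocks, check_searcher(9, num_blocks), check_searcher(10, num_blocks))
```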

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
boosted_eval_e1.yaml mirrors boosted_eval.yaml but uses
boosted_model_e1.yaml so the dilation pattern (1, 8, 16, 16) matches
the E1 architecture under which the PCGB checkpoint was trained.
Loading with the V1 dilation would silently apply the wrong forward-
pass behavior to E1 weights.

boosted_pcgb_v2.yaml: lift the E1 fix to v2 — `no_repeat_window: 2`
prevents the adversarial argmax from sticking on a single cluster
every round, addressing the bimodal score distribution seen in the
diagnostic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V2 PCGB (8747769) crashed at round ~6 with:
  RuntimeError: Expected to have finished reduction in the prior
  iteration before starting a new one ... module has parameters that
  were not used in producing loss. Parameter indices which did not
  receive grad for rank 0: 14 15 16 ... 102 103

When MixtureSearcher samples a mask with block_drops[i]=True, the
corresponding CoreBlock's trunk is bypassed (Veit-style block skip:
y = y_{i-1}), so its parameters don't receive grad on that step.
DDP's bucket reducer requires every param to participate by default,
so we rewrap with find_unused_parameters=True when searcher.num_blocks
> 0. Skip-only masks (V1, A, B, E1) are unaffected: skip drops zero
the skip *tensor* but every parameter still participates in the
forward pass.
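
The conditional rewrap can be sketched as a small helper that builds the keyword arguments. `ddp_kwargs` is a hypothetical name; `find_unused_parameters` is a real argument of `torch.nn.parallel.DistributedDataParallel`, but how the trainer actually wires it is an assumption.

```python
def ddp_kwargs(searcher_num_blocks):
    """Keyword arguments for DistributedDataParallel.

    Block-drop masks can bypass whole blocks, so their parameters may receive
    no gradient on a step; skip-only masks (num_blocks == 0) keep every
    parameter in the forward pass and can use the cheaper default.
    """
    return {"find_unused_parameters": searcher_num_blocks > 0}


# Intended use (not executed here, requires an initialized process group):
#   model = torch.nn.parallel.DistributedDataParallel(
#       model, device_ids=[local_rank], **ddp_kwargs(searcher.num_blocks))
print(ddp_kwargs(0), ddp_kwargs(9))
```

Enabling the flag only when needed avoids the extra graph traversal it adds on every iteration for the skip-only runs.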

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V2 trained with residual_drop_rate=0.1 in addition to drop_path_rate=0.5.
Both stochastic-depth modules are bypassed at eval time, but loading
the model with the matching config keeps state_dict registration
explicit in case any future change makes a parameter conditional on
those rates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
slurm_apptainer_eval.sbatch was missing the MOUNT_SOURCE bind that the
PCGB sbatch already has. Without it, the container only sees configs
baked into the image at build time, and the new boosted_eval_e1.yaml
and boosted_eval_v2.yaml configs (committed but not yet in a published
image) couldn't be loaded — eval jobs 8820978 and 8820983 failed at
the in-container pre-flight check with "config not found inside
container".

Mirror the PCGB convention: when MOUNT_SOURCE=1 (default), bind host
{src, configs, scripts} over the container's snapshotted equivalents
so source-only changes don't require a container rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the Samudra-2 paper baseline comparison script: Niño 3.4 R²/RMSE
on deseasoned SST, depth-banded global-mean T R² (0-700/700-2000/
2000-7000 m), and deseasoned T snapshot near 2022-09-30 at three
depths. Compares against the published Samudra-2 (1°) numbers.

Changes from the kernel-branch version:
- --pred is now required (no hardcoded default to a kernel-branch run).
- New --label arg parameterizes the markdown table column header so
  the same script works for any PCGB run (E1 v3/v4, V2, ablations).
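
For readers unfamiliar with the metrics, here is a minimal sketch of R² and RMSE on a deseasoned monthly series. It is not the comparison script: the synthetic data, the helper names, the assumption that the series is already box-averaged, and the choice to remove each series' own monthly climatology are all illustrative.

```python
import numpy as np


def deseason(monthly):
    """Subtract the mean of each calendar month from a 1-D monthly series."""
    monthly = np.asarray(monthly, dtype=float)
    out = monthly.copy()
    for m in range(12):
        out[m::12] -= monthly[m::12].mean()
    return out


def r2_and_rmse(truth, pred):
    """Coefficient of determination and root-mean-square error."""
    resid = truth - pred
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((truth - truth.mean()) ** 2))
    return 1.0 - ss_res / ss_tot, float(np.sqrt(np.mean(resid ** 2)))


rng = np.random.default_rng(0)
months = np.arange(240)  # 20 synthetic years
seasonal = 2.0 * np.sin(2 * np.pi * months / 12)
signal = rng.standard_normal(240)
truth = deseason(seasonal + signal)
pred = deseason(seasonal + signal + 0.1 * rng.standard_normal(240))
r2, rmse = r2_and_rmse(truth, pred)
print(round(r2, 3), round(rmse, 3))
```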

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>