U/alxmrs/experiments/kr1 v2 by alxmrs · Pull Request #732 · m2lines/Samudra

alxmrs · 2026-05-04T19:03:40Z

No description provided.

alxmrs · 2026-05-04T19:08:03Z

Looking at wandb, it seems like there are serious bugs in either the rollout or the model:

The second image indicates that the bug is in the rollout specifically, though I don't know how the _gen plots are calculated. If I understand the third image (the _target) correctly, maybe the model is fine but the rollout is busted.

jder · 2026-05-05T17:47:14Z

@alxmrs I believe the "_target" graphs are just the gold data we are expecting to predict. The "_gen" plots are the (autoregressive) model outputs.

Add per_scale_batch_size config field threaded through the equivalence group batch samplers. For the KR1 multi-scale run set [1, 2, 4] across [¼°, ½°, 1°] so the lower-res scales (which had 27 GB of headroom on RTX6000) amortize Python/kernel-launch overhead and lift GPU utilization past the v31 ~46% plateau. ¼° stays at bs=1 (memory-bound). Revert gradient_accumulation_steps to 1 since v33's no_sync experiment showed DDP allreduce wasn't the bottleneck. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
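A minimal sketch of how an int | list[int] per_scale_batch_size value could be resolved into one batch size per equivalence group. The function name, the scale keys, and the fallback of 1 are illustrative assumptions, not the repository's actual sampler code.

```python
from typing import Optional, Sequence, Union


def resolve_per_scale_batch_size(
    per_scale_batch_size: Optional[Union[int, Sequence[int]]],
    scale_keys: Sequence[str],
    default_batch_size: int = 1,
) -> dict:
    """Map each scale (group key) to the batch size its sampler should use."""
    if per_scale_batch_size is None:
        # null in the config: every scale falls back to the uniform default.
        sizes = [default_batch_size] * len(scale_keys)
    elif isinstance(per_scale_batch_size, int):
        # A single int applies to every scale.
        sizes = [per_scale_batch_size] * len(scale_keys)
    else:
        sizes = list(per_scale_batch_size)
        if len(sizes) != len(scale_keys):
            raise ValueError(
                f"expected {len(scale_keys)} batch sizes, got {len(sizes)}"
            )
    return dict(zip(scale_keys, sizes))


# Hypothetical group keys for the 1/4, 1/2 and 1 degree scales.
KR1_SCALES = ["quarter_deg", "half_deg", "one_deg"]
kr1_batch_sizes = resolve_per_scale_batch_size([1, 2, 4], KR1_SCALES)
```

Keeping the null case explicit means reverting the config field to null restores uniform batching without touching the sampler.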

v35 with [1,2,4] crashed with CUDA illegal memory access during a gemm at n=2211840, which corresponds to the 1° batch=4 forward pass through the FOMO perceiver. Drop the 1° batch back to 2 — still doubles throughput on ½° and 1° vs. uniform bs=1 without tripping the size limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

v35 ([1,2,4]) and v36 ([1,2,2]) both hit a CUDA illegal memory access right after model construction, before iter 1, with no Python frame in the C++ stack. Force synchronous launches so the next failure points at the actual Python call site. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

v37 still produced an async-only stack because apptainer strips host env vars unless prefixed APPTAINERENV_. Use the prefix so the flag actually reaches the python process inside the container. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

v35-v38 all crashed with CUDA illegal memory access at bs>1 in any group. CUDA_LAUNCH_BLOCKING (passed via APPTAINERENV_) couldn't pin the originating kernel because the abort fires inside ~TensorImpl() during GC, well after the offending kernel was queued — likely a latent bs>1 bug in FOMO's encoder/perceiver that v34 never exercised. Reverting per_scale_batch_size to null to ship the v31 baseline. The sampler infrastructure (config field, group-key→bs mapping, sampler support for int|list[int]) stays in place for a future debug pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The torchrun invocation inside the apptainer bash -c block was supposed to be captured into TRAIN_CMD="...", but the opening `TRAIN_CMD=\"` was dropped in 5a59ac4, leaving only the closing `\"` on line 379. The dangling quote opened an unterminated string in the inner bash, causing a syntax error ~line 65 of the inner script and immediate job failure (see slurm-6428130: 7s failure, never reached Python). Restore the `TRAIN_CMD=\"` assignment so the later `\${TRAIN_CMD} &` branches actually have something to invoke. Switch `[@]` → `[*]` since we're stringifying the array for the variable assignment. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Single-scale 1/2° rollout evaluation of the multi-scale FOMO model trained in part 1 (v43, epoch 19). Uses the eval_multiscale machinery from #652 with halfdeg as the first (evaluated) data source; the other two scales remain present so the encoder builds with the correct max grid for the Perceiver rotary positional embedding. boundary_vars_key matches training config (tau_hfds) so the v43 EMA checkpoint loads without channel-count mismatch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The default landing partition (rtx6000_lzanna) has GrpCPUs=128 and is fully consumed by torch_pr_144_general jobs with ~46h remaining. The unrestricted rtx6000 partition has the same nodes available and 0 CPUs in use. Explicit pinning gets the eval running immediately. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Account is restricted to rtx6000_lzanna for RTX6000 jobs (despite AllowAccounts=ALL on other partitions, sbatch rejects them with 'partition not valid for this job'). That QOS is fully consumed by torch_pr_144_general jobs with ~46h remaining. h200_courant is accessible to our account, has 414 free CPUs out of 768, and a test-submit estimates start in ~2h. Eval is GPU-agnostic (same checkpoint loads on any CUDA device), so the change is safe. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The eval container resolves CONFIG paths to /workspace/configs/..., which is baked at build time. Adding new configs (e.g. the halfdeg eval config) required a container rebuild before the job could find them. Binding ${REPO_DIR}/configs over /workspace/configs lets a freshly-pulled checkout drive the run without rebuilding. The training sbatch already does this implicitly via /workspace binding patterns; this brings eval to parity. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Mask variables (mask_0..mask_18) match _var_name_encode_level's _[0-9]+ regex and were forcing _is_compact to return False for compact-form data. Without this filter, DataSource.filter() takes the non-compact branch which tries data["uo_0"] on a dataset whose only prognostic variables are uo, vo, thetao, so, zos -> KeyError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
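The filter described above can be sketched as follows. This is an illustration under assumptions: the function name is_compact, the ignored_prefixes parameter, and the end-of-string anchor on the regex are hypothetical and may differ from _is_compact and _var_name_encode_level in the repository.

```python
import re

# Per-level variable names look like "uo_0"; the "$" anchor is an assumption.
_LEVEL_SUFFIX = re.compile(r"_[0-9]+$")


def is_compact(var_names, ignored_prefixes=("mask_",)):
    """Return True when no non-ignored variable carries a per-level suffix."""
    prefixes = tuple(ignored_prefixes)
    # Drop mask variables first so "mask_0" is not mistaken for a per-level name.
    candidates = [name for name in var_names if not name.startswith(prefixes)]
    return not any(_LEVEL_SUFFIX.search(name) for name in candidates)


compact_vars = ["uo", "vo", "thetao", "so", "zos"] + [f"mask_{i}" for i in range(19)]
with_filter = is_compact(compact_vars)
without_filter = is_compact(compact_vars, ignored_prefixes=())
```

With the filter the compact dataset is recognised as compact; without it the mask names alone flip the result, which is the failure mode that led to the KeyError on data["uo_0"].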

When a means/stds dataset mixes depth-resolved variables (with a lev dim) and surface variables, plain to_array().reshape(-1) broadcasts surface vars across all levels, producing too many channels (95 instead of 77 for thermo_dynamic_all). _flatten_var_lev keeps depth vars per-level and surface vars as one channel each, matching the prognostic tensor channel layout. Restores the fix from 604a047 that 776072b (partial reversion of data.py) rolled back. Multi-scale 8048549 crashed at the post-Epoch-1 validation unnormalize step without it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
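The channel counts above can be reproduced with a small worked example. It assumes 19 depth levels (consistent with mask_0..mask_18) and four depth-resolved variables plus one surface variable; the helper names are hypothetical and only model the channel layout, not the real _flatten_var_lev, which operates on datasets.

```python
def flatten_var_lev(variables, n_levels):
    """Depth variables contribute one channel per level, surface variables one."""
    channels = []
    for name, has_lev in variables.items():
        if has_lev:
            channels.extend(f"{name}_{k}" for k in range(n_levels))
        else:
            channels.append(name)
    return channels


def broadcast_all(variables, n_levels):
    """Model of the buggy path: every variable is broadcast across all levels."""
    return [f"{name}_{k}" for name in variables for k in range(n_levels)]


# Assumed variable set: True marks a variable with a lev dimension.
VARIABLES = {"uo": True, "vo": True, "thetao": True, "so": True, "zos": False}
N_LEVELS = 19

n_correct = len(flatten_var_lev(VARIABLES, N_LEVELS))  # 4 * 19 + 1
n_broadcast = len(broadcast_all(VARIABLES, N_LEVELS))  # 5 * 19
```

Under these assumptions the per-variable layout gives 77 channels and the broadcast layout gives 95, matching the two counts in the commit message.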

The per-scale loss accumulator stores running sums on CPU (batch.loss.detach().cpu()) to avoid pinning GPU memory, but get_logs then calls all_reduce_mean. The default DDP process group is NCCL, which does not accept CPU tensors. Move the local mean to the current CUDA device before reducing, then back to CPU for the float cast. Caught when multi-scale 8048549 ran for 46 minutes through Epoch 1 training + validation, then crashed in get_logs with: RuntimeError No backend type associated with device type cpu. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Configured directly in YAML (not via CLI overrides):
- model.yaml: pred_residuals false (was true), decoder.context_patches 3 (was 1)
- train_multiscale.yaml: data.hist 0 (was default 1)

Launch script: bump NAME_SUFFIX -> v48, PREEMPTIBLE=1 (use new PR #626 preempt-resume code), wandb group kr1_v48, drop CLI hparam overrides.

Rationale:
- hist=0: ocean state (T,S,u,v) is approximately Markovian; history adds channels without strong physics justification.
- pred_residuals=false: v47 1-yr rollout showed sawtooth oscillation that amplifies over time, classic failure mode of unstable residual predictor.
- decoder.context_patches=3: time-mean thetao plot from v47 rollout shows visible decoder-window seam (~72x120 px rectangle). Bumping context rings 1->3 grows window data context from 14x14=196 -> 18x18=324 latent tokens, sharing more context across adjacent windows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #626 sbatch dropped the src+configs binds; without them the container uses its stale baked-in code, so source-only changes (incl. the preempt code we just added) wouldn't take effect. Restore the bind pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #626 added a bind for ${RUN_DIR}/wandb but didn't create it; apptainer errors with "mount source doesn't exist" at startup, before the trainer runs (so preempt-resume can't kick in). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #626 attempted to set Comment="preemption=yes;requeue=true" via scontrol update from inside the job, but slurm rejects this for non-admin users with "Unspecified error". The comment must be set at submit time to opt into the Torch cluster's preemption-friendly partition behavior. Without it, slurm cancels (not requeues) on QOS-priority preemption, which is what killed v47. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The lab-priority partition rtx6000_lzanna has PreemptMode=OFF, meaning slurm won't auto-requeue cancelled jobs even with --requeue + Comment. The shared rtx6000 partition is configured with PreemptMode=REQUEUE (verified via scontrol show partition). Switching there sacrifices a small amount of priority for working auto-requeue on preemption. With intra-epoch checkpoints (CHECKPOINT_BATCH_INTERVAL=100, ~3 min cadence), worst-case loss per preemption is ~3 min of compute. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The launch script exports NCCL_P2P_DISABLE, NCCL_IB_DISABLE, UCX_TLS, NCCL_NET etc. as workarounds for gr101/gr102 NCCL/UCX segfaults. Apptainer drops these by default; the APPTAINERENV_ prefix forwards them into the container. Without this, 8237766 ran 2h on first attempt (lucky), got preempted, auto-requeued (Restarts=1, working as designed), but segfaulted 14s into the resumed attempt at NCCL init. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
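A rough sketch of the prefix-forwarding idea, written in Python for illustration even though the real logic lives in the shell launch script. The helper name forward_to_container and the example values are assumptions; only the APPTAINERENV_ prefix and the variable names come from the commit message.

```python
APPTAINER_PREFIX = "APPTAINERENV_"


def forward_to_container(names, environ):
    """Return prefixed copies of the named host variables that are set."""
    forwarded = {}
    for name in names:
        if name in environ:
            # Apptainer strips the prefix and sets the variable in the container.
            forwarded[APPTAINER_PREFIX + name] = environ[name]
    return forwarded


# Placeholder values; the real workaround values are not given in the source.
host_env = {"NCCL_P2P_DISABLE": "1", "NCCL_IB_DISABLE": "1", "HOME": "/home/user"}
extra_env = forward_to_container(
    ["NCCL_P2P_DISABLE", "NCCL_IB_DISABLE", "UCX_TLS", "NCCL_NET"], host_env
)
```

Variables that are unset on the host are skipped rather than forwarded as empty strings, so the container keeps its own defaults for them.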

steps: [1, 2, 4, 8] with step_transition: [5, 15, 30]:
- epochs 1-4 → 1-step rollout (initial fitting)
- epochs 5-14 → 2-step
- epochs 15-29 → 4-step
- epochs 30-70 → 8-step

Inherits v48's other deltas (hist=0, pred_residuals=false, decoder.context_patches=3). Saved as a separate file so v48's in-flight runs that re-read train_multiscale.yaml on each requeue are unaffected.

Rationale: 1-step training only ever sees ground-truth as input, but rollout sees the model's own predictions; backprop through multi-step composition forces f's iterated Jacobian to have stable spectral properties. Critical for our 10yr / ~730-step rollouts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
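The epoch-to-rollout-length mapping in this schedule can be sketched as a lookup. The function name is hypothetical, and the convention that a transition epoch is the first epoch at the new step count is inferred from the epoch ranges listed in the commit message.

```python
from bisect import bisect_right


def rollout_steps_for_epoch(epoch, steps=(1, 2, 4, 8), step_transition=(5, 15, 30)):
    """Return the number of rollout steps used during a 1-indexed epoch."""
    if len(steps) != len(step_transition) + 1:
        raise ValueError("steps needs exactly one more entry than step_transition")
    # Count how many transition epochs have been reached (inclusive).
    stage = bisect_right(step_transition, epoch)
    return steps[stage]


schedule = {epoch: rollout_steps_for_epoch(epoch) for epoch in (1, 4, 5, 14, 15, 29, 30, 70)}
```

Using bisect_right makes the transition epoch itself belong to the new stage, so epoch 5 already trains with 2-step rollouts.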

The conditional APPTAINERENV_ propagation worked on first launch but not on slurm requeues: slurm doesn't reliably re-export user-shell vars when restarting a job. Both v48 attempts that auto-requeued segfaulted at NCCL/CUDA init for this reason. Hardcode the gr101/gr102 NCCL workarounds directly in the sbatch script so they survive every requeue regardless of submitting-shell state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ise)

slurm_apptainer_train.sbatch computes NAME=$(date +%Y-%m-%d)-$NAME_SUFFIX when NAME is unset. Each requeue or chain-link runs the sbatch fresh and recomputes NAME from today's date, creating a NEW run dir per day and losing access to the previous checkpoint. Fix: launch_kr1_train.sh now exports NAME once at submission, fixing the date for the entire lifetime of the chain. Override via NAME=... env var to resume into a specific existing dir. This caused us to silently lose epoch 10 progress (val 0.255) when the chain link ran on a different day from submission and started fresh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
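The difference between recomputing NAME on every run and pinning it at submission can be sketched like this. It is a Python illustration of logic that actually lives in shell scripts; resolve_run_name, the suffix, and the dates are hypothetical.

```python
import datetime


def resolve_run_name(environ, name_suffix, today=None):
    """Use NAME from the environment if set, else derive it from the date."""
    name = environ.get("NAME")
    if name:
        return name
    today = today or datetime.date.today()
    # Same shape as $(date +%Y-%m-%d)-$NAME_SUFFIX.
    return f"{today.isoformat()}-{name_suffix}"


submit_day = datetime.date(2026, 5, 12)
requeue_day = datetime.date(2026, 5, 13)

# Buggy behaviour: NAME is unset, so a requeue on a later day gets a new run dir.
first_run = resolve_run_name({}, "kr1_example", today=submit_day)
drifted = resolve_run_name({}, "kr1_example", today=requeue_day)

# Fixed behaviour: the launcher exports NAME once and every requeue reuses it.
pinned = resolve_run_name({"NAME": first_run}, "kr1_example", today=requeue_day)
```

Pinning the name in the environment is what lets a resumed job find the checkpoint directory created by the first attempt.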

h200_courant is contended (QOSGrpGRES kills evals at ~35s before rollout begins). docs/torch.md examples use rtx6000_lzanna with gres=gpu:rtx6000:1 (the lab-priority partition); follow that pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Training used hist=0; without this override the eval model is built with the default hist=1, doubling decoder channels (77 -> 154) and failing state_dict load with shape mismatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
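The channel arithmetic behind this mismatch can be written out as a small check. It assumes the channel count scales as base * (hist + 1), which matches the 77 -> 154 figures in the commit message; the function names are hypothetical.

```python
def stacked_channels(base_channels, hist):
    """Channels when the current state is stacked with `hist` earlier states."""
    if hist < 0:
        raise ValueError("hist must be non-negative")
    return base_channels * (hist + 1)


def check_hist_matches(train_hist, eval_hist, base_channels=77):
    """Raise if the eval model would be built with a different channel count."""
    expected = stacked_channels(base_channels, train_hist)
    built = stacked_channels(base_channels, eval_hist)
    if expected != built:
        raise ValueError(
            f"checkpoint expects {expected} channels, eval model has {built}"
        )
    return built


channels_hist0 = stacked_channels(77, 0)
channels_hist1 = stacked_channels(77, 1)
```

The eval override therefore has to mirror whatever hist value the checkpoint was trained with.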

v49 NaN'd at the 1->2-step curriculum transition (epoch 20, iter 1) with pred_residuals=true. Going conservative for v50: residuals=false to avoid the v47 sawtooth, keep hist=1 to avoid v48's climatology collapse, restore v48's LR. PerScaleSnapshotValidateAggregator now also routes each scale through a StdRatioAggregator so val/{H}x{W}/std_ratio/{var} actually surfaces under multi-scale FOMO training (it was a no-op in v49 because we only wired it into the base ValidateAggregator path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The PhysicsNeMo container bakes WANDB_GIT_COMMIT/WANDB_GIT_REMOTE_URL into the image at build time and also sets WANDB_DISABLE_GIT=true, so wandb forever reports the SHA from the image build rather than the host commit that's actually executing (via the bind-mounted src/ + configs/). Override both env vars from the host's REPO_DIR via APPTAINERENV_* so wandb labels each run with the real running commit. Appends " (dirty: ...)" in the slurm stdout banner when the working tree has uncommitted changes (wandb's git_commit field stays the clean SHA for clickable GitHub links). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

v50 cleared the 1→2-step transition cleanly but hung 30min mid-epoch 20 — classic THP defrag stall (sysadmin's v26 defrag=never had been reverted on gr101/gr102). HPC@ has re-applied defrag=never; the sbatch banner now prints the live THP setting so future post-mortems can confirm at a glance. Model delta vs v50: decoder.context_patches 3 → null. Everything else identical so the v51 vs v50 training curve is directly comparable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

v51 trained with hist=1; the existing hist=0 override (added for v48) would build the eval model with half the decoder channels the checkpoint expects and crash on state_dict load.

Updated:
- CKPT_PATH → v51 ema_ckpt.pt (best val 0.240 at epoch 29)
- NAME_SUFFIX → kr1_v51_halfdeg_eval_ema
- ARGS hist 0 → 1, with a comment explaining the contract

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Enables resume launches (PREEMPTIBLE=1 WALLTIME=48:00:00 NAME=<existing dir>) without script edits. Also moves EXTRA_SBATCH_ARGS to the end of the sbatch call so explicit user overrides (e.g. --dependency=afterany:<jobid>) take precedence over script defaults. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

v51 resume chain reached epoch 48 with val 0.234 (down from 0.240 at the walltime-killed first run, epoch 29). EMA checkpoint at that path has been overwritten in-place during the resume chain — point the eval at the same file with a unique NAME_SUFFIX so we don't clobber the earlier ema_ep29 eval run directory or confuse them in wandb. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alxmrs and others added 28 commits May 12, 2026 16:14

KR1 launch/profile scripts and slurm scaffolding

8333328

# Conflicts: # scripts/slurm_apptainer_train.sbatch # Conflicts: # .gitignore # Conflicts: # .gitignore

Compact multiscale zarr data pipeline

c9f9a37

datasets: prefetch, executor sharing, compact path optimizations

0a92e78

# Conflicts: # src/ocean_emulators/datasets.py

Distributed training: Gloo barrier, sampler fixes, metric sync fixes

6f741d0

FOMO multiscale config: v27 batch_size=2 + larger processor

a5a8565

v28: re-launch v27 (prefetch + bs=2 + bigger processor) for GPU util baseline

7b65273

v29: bs=1 to fit 1/4° in 48GB (v28 OOM at 51GB peak)

7004cbc

v30: restore perceiver latent_dim=64/num_latents=256 to raise GPU util (v29 ~27%)

7b0bbe5

v31: rollout steps 4->2 to cut per-iter Python overhead (v30 GPU util ~36%)

079df74

v32: widen processor ch_width [380,480,520]->[512,640,768] (v31 util ~46%)

7c64a41

v33: DDP no_sync() on accumulation microbatches + grad_accum=2 (v32 regressed to 36%)

ee22927

v34: gate per-channel all_reduce + wandb log to every 5 iters (v33 still ~36%)

1c38a98

v40: launch kr1 multi-scale FOMO run

b83cf72

Fix --data.num_workers flag; bump kr1 to v42

e81b485

Container c4e7738 nests the flag under data.loading.num_workers (train.py rejected --data.num_workers in job 6796531 after argparse validation). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Bump kr1 to v43 (v42 hit intermittent UCX segfault)

f0a366a

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Bump kr1 to v44 (resume from v43 after 24h walltime, 19 epochs done)

313b2cb

Bump kr1 to v45 (resume from v43, v44 crashed UCX segfault in 2min)

bc7c20f

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Shrink KR1 halfdeg eval to 4 CPUs / 64G mem

ddb1b46

Group QOS CPU cap is saturated; at 8 CPUs the scheduler estimated a 15h wait. Eval is single-GPU with one batch at a time — 4 dataloader workers is plenty. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

alxmrs and others added 5 commits May 12, 2026 16:18

Update the current sbatch script to be preemptable (have correct comments + reentry flag) + make sure that the proper exit codes are set (first, to capture the signal code, then to actually exit with the right return code).

d8639e3

Fixes #632.

b02f043

alxmrs force-pushed the u/alxmrs/experiments/kr1-v2 branch from 94cfd59 to f0c1515 Compare May 12, 2026 23:18

alxmrs and others added 24 commits May 15, 2026 17:44

Compact multiscale zarr data pipeline

5a54cc4

Updating training to be preemptable on torch.

552ed0f

This updates the train script, our torch docs and the sbatch script to allow us to make use of preemptable computing resources on torch. This PR is more of a refresh than anything else.

Inter epoch checkpointing flag + updated script to better reflect sbatch comments to turn on preemption.

b197cce

Updated comments on v49 experiment.

35fa799

Add license

1b86d46

Std aggregator.

4d9b7e4

Merge branch 'main' of github.com:Open-Athena/Ocean_Emulator into u/alxmrs/experiments/kr1-v2

b04792d

alxmrs wants to merge 74 commits into main from u/alxmrs/experiments/kr1-v2

2 participants
