U/alxmrs/experiments/kr1 v2#732

Draft
alxmrs wants to merge 74 commits into main from u/alxmrs/experiments/kr1-v2

alxmrs commented May 4, 2026
Member

No description provided.

alxmrs commented May 4, 2026
Member Author

Looking at wandb, it seems like there are serious bugs in either the rollout or the model:
[Image: Screenshot 2026-05-04 at 11:52:56 AM]

[Image: Screenshot 2026-05-04 at 12:05:37 PM]

[Image: Screenshot 2026-05-04 at 12:07:05 PM]

The second image suggests the bug is in the rollout specifically, though I don't know how the _gen plots are calculated. If I understand the third image (the _target) correctly, the model may be fine and the rollout is what is broken.

jder commented May 5, 2026
Member

@alxmrs I believe "_target" graphs are just "the gold data we are expecting to predict". The "_gen" plots are the (autoregressive) model outputs

alxmrs and others added 28 commits May 12, 2026 16:14
# Conflicts:
#	scripts/slurm_apptainer_train.sbatch

# Conflicts:
#	.gitignore

# Conflicts:
#	.gitignore
# Conflicts:
#	src/ocean_emulators/datasets.py
Add per_scale_batch_size config field threaded through the equivalence
group batch samplers. For the KR1 multi-scale run set [1, 2, 4] across
[¼°, ½°, 1°] so the lower-res scales (which had 27 GB of headroom on
RTX6000) amortize Python/kernel-launch overhead and lift GPU utilization
past the v31 ~46% plateau. ¼° stays at bs=1 (memory-bound). Revert
gradient_accumulation_steps to 1 since v33's no_sync experiment showed
DDP allreduce wasn't the bottleneck.
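
A minimal sketch of how a per-group batch size could be threaded through a grouped batch sampler. The helper `group_batches`, the group keys, and the sample counts are hypothetical and are not the repository's sampler API; only the [1, 2, 4] mapping across the three scales comes from the message above.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Union


def group_batches(
    sample_groups: Iterable[Hashable],
    group_order: List[Hashable],
    batch_size: Union[int, List[int]],
) -> Iterator[List[int]]:
    """Yield index batches that never mix groups (hypothetical helper).

    batch_size is either one int for every group or one int per entry
    of group_order.
    """
    if isinstance(batch_size, int):
        sizes = {g: batch_size for g in group_order}
    else:
        if len(batch_size) != len(group_order):
            raise ValueError("need exactly one batch size per group")
        sizes = dict(zip(group_order, batch_size))

    # Collect the sample indices that belong to each group
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, group in enumerate(sample_groups):
        by_group[group].append(idx)

    # Chunk each group's indices by that group's own batch size
    for group in group_order:
        indices = by_group[group]
        step = sizes[group]
        for start in range(0, len(indices), step):
            yield indices[start:start + step]


# Illustrative dataset: 2 quarter-degree, 4 half-degree, 4 one-degree samples
groups = ["quarter"] * 2 + ["half"] * 4 + ["one"] * 4
batches = list(group_batches(groups, ["quarter", "half", "one"], [1, 2, 4]))
```

Passing a plain `int` keeps the uniform behaviour, so a config value of `null` can fall back to a single shared batch size.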

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
v35 with [1,2,4] crashed with CUDA illegal memory access during a gemm
at n=2211840, which corresponds to the 1° batch=4 forward pass through
the FOMO perceiver. Drop the 1° batch back to 2 — still doubles
throughput on ½° and 1° vs. uniform bs=1 without tripping the size
limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
v35 ([1,2,4]) and v36 ([1,2,2]) both hit a CUDA illegal memory access
right after model construction, before iter 1, with no Python frame in
the C++ stack. Force synchronous launches so the next failure points at
the actual Python call site.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
v37 still produced an async-only stack because apptainer strips host
env vars unless prefixed APPTAINERENV_. Use the prefix so the flag
actually reaches the python process inside the container.
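
As a rough illustration of the prefixing described above, the sketch below copies selected host variables to `APPTAINERENV_`-prefixed names before a container launch. The helper name and the example environment are hypothetical; the launch scripts in this branch do this in shell, not Python.

```python
def with_apptainer_prefix(env, names):
    """Return a copy of env with APPTAINERENV_<name> set for each name present.

    Names missing from env are skipped, so nothing is forwarded as empty.
    """
    out = dict(env)
    for name in names:
        if name in env:
            out[f"APPTAINERENV_{name}"] = env[name]
    return out


# Illustrative host environment; only the listed names are forwarded
host_env = {"CUDA_LAUNCH_BLOCKING": "1", "HOME": "/home/user"}
container_env = with_apptainer_prefix(
    host_env, ["CUDA_LAUNCH_BLOCKING", "NCCL_P2P_DISABLE"]
)
```

The resulting mapping could be handed to `subprocess.run(..., env=container_env)` when invoking the container runtime.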

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
v35-v38 all crashed with CUDA illegal memory access at bs>1 in any
group. CUDA_LAUNCH_BLOCKING (passed via APPTAINERENV_) couldn't pin
the originating kernel because the abort fires inside ~TensorImpl()
during GC, well after the offending kernel was queued — likely a
latent bs>1 bug in FOMO's encoder/perceiver that v34 never exercised.

Reverting per_scale_batch_size to null to ship the v31 baseline. The
sampler infrastructure (config field, group-key→bs mapping, sampler
support for int|list[int]) stays in place for a future debug pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The torchrun invocation inside the apptainer bash -c block was supposed
to be captured into TRAIN_CMD="...", but the opening `TRAIN_CMD=\"` was
dropped in 5a59ac4, leaving only the closing `\"` on line 379. The
dangling quote opened an unterminated string in the inner bash, causing
a syntax error ~line 65 of the inner script and immediate job failure
(see slurm-6428130: 7s failure, never reached Python).

Restore the `TRAIN_CMD=\"` assignment so the later `\${TRAIN_CMD} &`
branches actually have something to invoke. Switch `[@]` → `[*]` since
we're stringifying the array for the variable assignment.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Container c4e7738 nests the flag under data.loading.num_workers (train.py
rejected --data.num_workers in job 6796531 after argparse validation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single-scale 1/2° rollout evaluation of the multi-scale FOMO model
trained in part 1 (v43, epoch 19). Uses the eval_multiscale machinery
from #652 with halfdeg as the first (evaluated) data source; the other
two scales remain present so the encoder builds with the correct max
grid for the Perceiver rotary positional embedding.

boundary_vars_key matches training config (tau_hfds) so the v43 EMA
checkpoint loads without channel-count mismatch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Group QOS CPU cap is saturated; at 8 CPUs the scheduler estimated a
15h wait. Eval is single-GPU with one batch at a time — 4 dataloader
workers is plenty.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The default landing partition (rtx6000_lzanna) has GrpCPUs=128 and is
fully consumed by torch_pr_144_general jobs with ~46h remaining. The
unrestricted rtx6000 partition has the same nodes available and 0 CPUs
in use. Explicit pinning gets the eval running immediately.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Account is restricted to rtx6000_lzanna for RTX6000 jobs (despite
AllowAccounts=ALL on other partitions, sbatch rejects them with
'partition not valid for this job'). That QOS is fully consumed by
torch_pr_144_general jobs with ~46h remaining.

h200_courant is accessible to our account, has 414 free CPUs out of
768, and a test-submit estimates start in ~2h. Eval is GPU-agnostic
(same checkpoint loads on any CUDA device), so the change is safe.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The eval container resolves CONFIG paths to /workspace/configs/...,
which is baked at build time. Adding new configs (e.g. the halfdeg
eval config) required a container rebuild before the job could find
them. Binding ${REPO_DIR}/configs over /workspace/configs lets a
freshly-pulled checkout drive the run without rebuilding.

The training sbatch already does this implicitly via /workspace
binding patterns; this brings eval to parity.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alxmrs and others added 5 commits May 12, 2026 16:18
…ents + reentry flag) + make sure that the proper exit codes are set (first, to capture the signal code, then to actually exit with the right return code).
Mask variables (mask_0..mask_18) match _var_name_encode_level's
_[0-9]+ regex and were forcing _is_compact to return False for
compact-form data. Without this filter, DataSource.filter() takes the
non-compact branch which tries data["uo_0"] on a dataset whose only
prognostic variables are uo, vo, thetao, so, zos -> KeyError.
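
The filter can be pictured roughly as follows. This is a hedged sketch: `is_compact`, its `ignore_prefixes` parameter, and the end-anchored form of the regex are assumptions for illustration, not the actual `_is_compact` implementation.

```python
import re

# Assumed shape of the level-suffix check; the source only quotes `_[0-9]+`
LEVEL_SUFFIX = re.compile(r"_[0-9]+$")


def is_compact(var_names, ignore_prefixes=("mask_",)):
    """True when no non-ignored variable carries a per-level suffix like uo_0."""
    candidates = [n for n in var_names if not n.startswith(ignore_prefixes)]
    return not any(LEVEL_SUFFIX.search(n) for n in candidates)


# Compact-form dataset: prognostic variables plus per-level masks
compact_vars = ["uo", "vo", "thetao", "so", "zos"] + [f"mask_{i}" for i in range(19)]
```

With the mask names ignored the dataset is classified as compact; with an empty `ignore_prefixes` the `mask_N` names trip the suffix check, which is the failure mode the message describes.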

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a means/stds dataset mixes depth-resolved variables (with a lev
dim) and surface variables, plain to_array().reshape(-1) broadcasts
surface vars across all levels, producing too many channels (95
instead of 77 for thermo_dynamic_all). _flatten_var_lev keeps depth
vars per-level and surface vars as one channel each, matching the
prognostic tensor channel layout.

Restores the fix from 604a047 that 776072b (partial reversion of
data.py) rolled back. Multi-scale 8048549 crashed at the post-Epoch-1
validation unnormalize step without it.
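
The channel arithmetic can be checked with a small pure-Python sketch. It assumes 19 depth levels (inferred from mask_0..mask_18) and four depth-resolved variables plus zos; `flatten_var_lev` here is an illustrative stand-in for `_flatten_var_lev`, which operates on a real dataset rather than on dicts.

```python
def flatten_var_lev(values):
    """Flatten {var: per-level list or surface scalar} into (name, value) channels."""
    channels = []
    for name, value in values.items():
        if isinstance(value, (list, tuple)):
            # Depth-resolved variable: one channel per level
            channels.extend((f"{name}_{lev}", v) for lev, v in enumerate(value))
        else:
            # Surface variable: exactly one channel
            channels.append((name, value))
    return channels


N_LEV = 19  # assumption, inferred from mask_0..mask_18
means = {v: [0.0] * N_LEV for v in ("uo", "vo", "thetao", "so")}
means["zos"] = 0.0

channels = flatten_var_lev(means)
# Broadcasting every variable across all levels would instead give:
broadcast_count = len(means) * N_LEV
```

Under these assumptions the per-variable flattening yields 4 x 19 + 1 = 77 channels, while broadcasting yields 5 x 19 = 95, matching the counts quoted in the message.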

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-scale loss accumulator stores running sums on CPU
(batch.loss.detach().cpu()) to avoid pinning GPU memory, but get_logs
then calls all_reduce_mean. The default DDP process group is NCCL,
which does not accept CPU tensors. Move the local mean to the current
CUDA device before reducing, then back to CPU for the float cast.

Caught when multi-scale 8048549 ran for 46 minutes through Epoch 1
training + validation, then crashed in get_logs with:
RuntimeError: No backend type associated with device type cpu.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alxmrs force-pushed the u/alxmrs/experiments/kr1-v2 branch from 94cfd59 to f0c1515 on May 12, 2026 23:18
alxmrs and others added 24 commits May 15, 2026 17:44
This updates the train script, our torch docs and the sbatch script to allow us to make use of preemptable computing resources on torch. This PR is more of a refresh than anything else.
Configured directly in YAML (not via CLI overrides):
- model.yaml: pred_residuals false (was true), decoder.context_patches 3 (was 1)
- train_multiscale.yaml: data.hist 0 (was default 1)

Launch script: bump NAME_SUFFIX -> v48, PREEMPTIBLE=1 (use new PR #626
preempt-resume code), wandb group kr1_v48, drop CLI hparam overrides.

Rationale:
- hist=0: ocean state (T,S,u,v) is approximately Markovian; history adds
  channels without strong physics justification.
- pred_residuals=false: v47 1-yr rollout showed sawtooth oscillation that
  amplifies over time, classic failure mode of unstable residual predictor.
- decoder.context_patches=3: time-mean thetao plot from v47 rollout shows
  visible decoder-window seam (~72x120 px rectangle). Bumping context
  rings 1->3 grows window data context from 14x14=196 -> 18x18=324 latent
  tokens, sharing more context across adjacent windows.
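
The token counts in the last bullet are consistent with a square window whose side grows by two tokens per context ring. The sketch below assumes a 12-token core side, which is inferred from the 14x14 and 18x18 figures rather than read from the model config; `window_tokens` is a hypothetical helper.

```python
def window_tokens(context_patches, core_side=12):
    """Latent tokens seen by one decoder window with the given context rings.

    core_side=12 is an assumption chosen so that 1 ring gives 14x14 and
    3 rings give 18x18, as quoted above.
    """
    side = core_side + 2 * context_patches
    return side * side


tokens_v47 = window_tokens(1)  # 14 * 14
tokens_v48 = window_tokens(3)  # 18 * 18
```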

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #626 sbatch dropped the src+configs binds; without them the container
uses its stale baked-in code, so source-only changes (incl. the preempt
code we just added) wouldn't take effect. Restore the bind pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #626 added a bind for ${RUN_DIR}/wandb but didn't create it; apptainer
errors with "mount source doesnt exist" at startup, before the trainer
runs (so preempt-resume can't kick in).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #626 attempted to set Comment="preemption=yes;requeue=true" via
scontrol update from inside the job, but slurm rejects this for non-admin
users with "Unspecified error". The comment must be set at submit time
to opt into the Torch cluster's preemption-friendly partition behavior.
Without it, slurm cancels (not requeues) on QOS-priority preemption,
which is what killed v47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lab-priority partition rtx6000_lzanna has PreemptMode=OFF — meaning
slurm won't auto-requeue cancelled jobs even with --requeue + Comment.
The shared rtx6000 partition is configured with PreemptMode=REQUEUE
(verified via scontrol show partition). Switching there sacrifices a
small amount of priority for working auto-requeue on preemption.

With intra-epoch checkpoints (CHECKPOINT_BATCH_INTERVAL=100, ~3 min
cadence), worst-case loss per preemption is ~3 min of compute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The launch script exports NCCL_P2P_DISABLE, NCCL_IB_DISABLE, UCX_TLS,
NCCL_NET etc. as workarounds for gr101/gr102 NCCL/UCX segfaults.
Apptainer drops these by default; the APPTAINERENV_ prefix forwards
them into the container.

Without this, 8237766 ran 2h on first attempt (lucky), got preempted,
auto-requeued (Restarts=1, working as designed), but segfaulted 14s
into the resumed attempt at NCCL init.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
steps: [1, 2, 4, 8] with step_transition: [5, 15, 30]:
  epochs 1-4   → 1-step rollout (initial fitting)
  epochs 5-14  → 2-step
  epochs 15-29 → 4-step
  epochs 30-70 → 8-step

Inherits v48's other deltas (hist=0, pred_residuals=false,
decoder.context_patches=3). Saved as a separate file so v48's
in-flight runs that re-read train_multiscale.yaml on each requeue
are unaffected.

Rationale: 1-step training only ever sees ground-truth as input,
but rollout sees the model's own predictions; backprop through
multi-step composition forces f's iterated Jacobian to have stable
spectral properties. Critical for our 10yr / ~730-step rollouts.
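
The schedule above can be sketched as a lookup from epoch to rollout length. This assumes 1-indexed epochs and that each entry of `step_transition` is the first epoch of the next stage, which is what the table implies; `rollout_steps` is a hypothetical helper, not the trainer's own code.

```python
from bisect import bisect_right


def rollout_steps(epoch, steps=(1, 2, 4, 8), step_transition=(5, 15, 30)):
    """Return the number of rollout steps used at a given 1-indexed epoch."""
    if len(steps) != len(step_transition) + 1:
        raise ValueError("steps needs one more entry than step_transition")
    # Count how many transitions have already happened at this epoch
    stage = bisect_right(step_transition, epoch)
    return steps[stage]


schedule = {epoch: rollout_steps(epoch) for epoch in (1, 4, 5, 14, 15, 29, 30, 70)}
```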

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The conditional APPTAINERENV_ propagation worked on first launch but not
on slurm requeues: slurm doesn't reliably re-export user-shell vars when
restarting a job. Both v48 attempts that auto-requeued segfaulted at
NCCL/CUDA init for this reason.

Hardcode the gr101/gr102 NCCL workarounds directly in the sbatch script
so they survive every requeue regardless of submitting-shell state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ise)

slurm_apptainer_train.sbatch computes NAME=$(date +%Y-%m-%d)-$NAME_SUFFIX
when NAME is unset. Each requeue or chain-link runs the sbatch fresh and
recomputes NAME from today's date, creating a NEW run dir per day and
losing access to the previous checkpoint.

Fix: launch_kr1_train.sh now exports NAME once at submission, fixing the
date for the entire lifetime of the chain. Override via NAME=... env var
to resume into a specific existing dir.

This caused us to silently lose epoch 10 progress (val 0.255) when the
chain link ran on a different day from submission and started fresh.
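
A small sketch of the resolve-once behaviour, written in Python for illustration only (the real logic lives in the shell launch script). The helper name, the suffix, and the dates are hypothetical.

```python
import datetime as dt


def resolve_run_name(env, name_suffix, today=None):
    """Reuse NAME when it is already set, else build <YYYY-MM-DD>-<suffix>."""
    if env.get("NAME"):
        return env["NAME"]
    today = today or dt.date.today()
    return f"{today:%Y-%m-%d}-{name_suffix}"


# At submission: NAME is unset, so it is computed once and exported
submit_env = {}
submit_env["NAME"] = resolve_run_name(
    submit_env, "kr1_example", today=dt.date(2026, 5, 15)
)

# A requeue on a later day sees the exported NAME and keeps the same run dir
requeue_name = resolve_run_name(
    submit_env, "kr1_example", today=dt.date(2026, 5, 16)
)
```

Setting `NAME` explicitly before the first call corresponds to resuming into an existing directory.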

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
h200_courant is contended (QOSGrpGRES kills evals at ~35s before
rollout begins). docs/torch.md examples use rtx6000_lzanna with
gres=gpu:rtx6000:1 (the lab-priority partition); follow that pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Training used hist=0; without this override the eval model is built
with the default hist=1, doubling decoder channels (77 -> 154) and
failing state_dict load with shape mismatch.
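
The mismatch follows from how the channel count scales with history. The sketch below assumes each history step adds one full copy of the 77 prognostic channels; that matches the 77 -> 154 figure for hist=1, but the general formula and the helper name are assumptions.

```python
def decoder_channels(n_prognostic, hist):
    """Channel count when the current state plus `hist` past states are stacked."""
    return n_prognostic * (hist + 1)


trained = decoder_channels(77, hist=0)       # training config
eval_default = decoder_channels(77, hist=1)  # eval without the override
shapes_match = trained == eval_default
```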

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v49 NaN'd at the 1->2-step curriculum transition (epoch 20, iter 1) with
pred_residuals=true. Going conservative for v50: residuals=false to avoid
the v47 sawtooth, keep hist=1 to avoid v48's climatology collapse, restore
v48's LR.

PerScaleSnapshotValidateAggregator now also routes each scale through a
StdRatioAggregator so val/{H}x{W}/std_ratio/{var} actually surfaces under
multi-scale FOMO training (it was a no-op in v49 because we only wired it
into the base ValidateAggregator path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PhysicsNeMo container bakes WANDB_GIT_COMMIT/WANDB_GIT_REMOTE_URL into
the image at build time and also sets WANDB_DISABLE_GIT=true, so wandb
forever reports the SHA from the image build rather than the host commit
that's actually executing (via the bind-mounted src/ + configs/).

Override both env vars from the host's REPO_DIR via APPTAINERENV_* so wandb
labels each run with the real running commit. Appends " (dirty: ...)"
in the slurm stdout banner when the working tree has uncommitted changes
(wandb's git_commit field stays the clean SHA for clickable GitHub links).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v50 cleared the 1→2-step transition cleanly but hung 30min mid-epoch 20 —
classic THP defrag stall (sysadmin's v26 defrag=never had been reverted on
gr101/gr102). HPC@ has re-applied defrag=never; the sbatch banner now
prints the live THP setting so future post-mortems can confirm at a glance.

Model delta vs v50: decoder.context_patches 3 → null. Everything else
identical so the v51 vs v50 training curve is directly comparable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v51 trained with hist=1; the existing hist=0 override (added for v48) would
double the decoder channel count and crash on state_dict load. Updated:
- CKPT_PATH → v51 ema_ckpt.pt (best val 0.240 at epoch 29)
- NAME_SUFFIX → kr1_v51_halfdeg_eval_ema
- ARGS hist 0 → 1, with a comment explaining the contract

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Enables resume launches (PREEMPTIBLE=1 WALLTIME=48:00:00 NAME=<existing dir>)
without script edits. Also moves EXTRA_SBATCH_ARGS to the end of the sbatch
call so explicit user overrides (e.g. --dependency=afterany:<jobid>) take
precedence over script defaults.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v51 resume chain reached epoch 48 with val 0.234 (down from 0.240 at the
walltime-killed first run, epoch 29). EMA checkpoint at that path has been
overwritten in-place during the resume chain — point the eval at the same
file with a unique NAME_SUFFIX so we don't clobber the earlier ema_ep29 eval
run directory or confuse them in wandb.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 participants