U/alxmrs/experiments/kr1 v2#732
Draft
alxmrs wants to merge 74 commits into
@alxmrs I believe the "_target" graphs are just the gold data we are expecting to predict. The "_gen" plots are the (autoregressive) model outputs.
# Conflicts: # scripts/slurm_apptainer_train.sbatch # Conflicts: # .gitignore # Conflicts: # .gitignore
# Conflicts: # src/ocean_emulators/datasets.py
Add per_scale_batch_size config field threaded through the equivalence group batch samplers. For the KR1 multi-scale run set [1, 2, 4] across [¼°, ½°, 1°] so the lower-res scales (which had 27 GB of headroom on RTX6000) amortize Python/kernel-launch overhead and lift GPU utilization past the v31 ~46% plateau. ¼° stays at bs=1 (memory-bound). Revert gradient_accumulation_steps to 1 since v33's no_sync experiment showed DDP allreduce wasn't the bottleneck. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
v35 with [1,2,4] crashed with CUDA illegal memory access during a gemm at n=2211840, which corresponds to the 1° batch=4 forward pass through the FOMO perceiver. Drop the 1° batch back to 2 — still doubles throughput on ½° and 1° vs. uniform bs=1 without tripping the size limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
v35 ([1,2,4]) and v36 ([1,2,2]) both hit a CUDA illegal memory access right after model construction, before iter 1, with no Python frame in the C++ stack. Force synchronous launches so the next failure points at the actual Python call site. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
v37 still produced an async-only stack because apptainer strips host env vars unless prefixed APPTAINERENV_. Use the prefix so the flag actually reaches the python process inside the container. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
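A minimal sketch of the prefixing described above, written in Python for illustration only (the real change lives in the sbatch script). The helper name `forward_to_container` is hypothetical; the only behavior taken from the commit message is that a host variable needs an `APPTAINERENV_`-prefixed copy to reach the process inside the container.

```python
APPTAINER_PREFIX = "APPTAINERENV_"


def forward_to_container(host_env: dict, names: list) -> dict:
    """Return a copy of host_env with APPTAINERENV_-prefixed copies of `names`.

    Hypothetical helper: variables that are not set on the host are skipped
    rather than forwarded as empty strings.
    """
    out = dict(host_env)
    for name in names:
        if name in host_env:
            out[APPTAINER_PREFIX + name] = host_env[name]
    return out


if __name__ == "__main__":
    env = forward_to_container(
        {"CUDA_LAUNCH_BLOCKING": "1"},
        ["CUDA_LAUNCH_BLOCKING"],
    )
    # The prefixed copy is what the containerized python process would see.
    print(env["APPTAINERENV_CUDA_LAUNCH_BLOCKING"])
```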
v35-v38 all crashed with CUDA illegal memory access at bs>1 in any group. CUDA_LAUNCH_BLOCKING (passed via APPTAINERENV_) couldn't pin the originating kernel because the abort fires inside ~TensorImpl() during GC, well after the offending kernel was queued — likely a latent bs>1 bug in FOMO's encoder/perceiver that v34 never exercised. Reverting per_scale_batch_size to null to ship the v31 baseline. The sampler infrastructure (config field, group-key→bs mapping, sampler support for int|list[int]) stays in place for a future debug pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
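For readers unfamiliar with the sampler plumbing that stays in place, here is a rough sketch of how a `per_scale_batch_size` value of `null`, an int, or a list could be resolved into a group-key → batch-size mapping. The function name, the group-key strings, and the default of 1 are assumptions for illustration, not the repository's actual API.

```python
from typing import Dict, Optional, Sequence, Union

BatchSpec = Optional[Union[int, Sequence[int]]]


def resolve_batch_sizes(
    group_keys: Sequence[str],
    per_scale_batch_size: BatchSpec,
    default: int = 1,
) -> Dict[str, int]:
    """Map each equivalence-group key to a batch size (hypothetical helper)."""
    if per_scale_batch_size is None:
        # null in the config: every group falls back to the uniform default.
        return {key: default for key in group_keys}
    if isinstance(per_scale_batch_size, int):
        # A single int applies to all groups.
        return {key: per_scale_batch_size for key in group_keys}
    sizes = list(per_scale_batch_size)
    if len(sizes) != len(group_keys):
        raise ValueError(
            f"expected {len(group_keys)} batch sizes, got {len(sizes)}"
        )
    return dict(zip(group_keys, sizes))
```

With three groups ordered ¼°, ½°, 1°, passing `[1, 2, 4]` reproduces the v35 setting and passing `None` reproduces the shipped v31 baseline.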
The torchrun invocation inside the apptainer bash -c block was supposed to be captured into TRAIN_CMD="...", but the opening `TRAIN_CMD=\"` was dropped in 5a59ac4, leaving only the closing `\"` on line 379. The dangling quote opened an unterminated string in the inner bash, causing a syntax error ~line 65 of the inner script and immediate job failure (see slurm-6428130: 7s failure, never reached Python). Restore the `TRAIN_CMD=\"` assignment so the later `\${TRAIN_CMD} &` branches actually have something to invoke. Switch `[@]` → `[*]` since we're stringifying the array for the variable assignment. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Container c4e7738 nests the flag under data.loading.num_workers (train.py rejected --data.num_workers in job 6796531 after argparse validation). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single-scale 1/2° rollout evaluation of the multi-scale FOMO model trained in part 1 (v43, epoch 19). Uses the eval_multiscale machinery from #652 with halfdeg as the first (evaluated) data source; the other two scales remain present so the encoder builds with the correct max grid for the Perceiver rotary positional embedding. boundary_vars_key matches training config (tau_hfds) so the v43 EMA checkpoint loads without channel-count mismatch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Group QOS CPU cap is saturated; at 8 CPUs the scheduler estimated a 15h wait. Eval is single-GPU with one batch at a time — 4 dataloader workers is plenty. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The default landing partition (rtx6000_lzanna) has GrpCPUs=128 and is fully consumed by torch_pr_144_general jobs with ~46h remaining. The unrestricted rtx6000 partition has the same nodes available and 0 CPUs in use. Explicit pinning gets the eval running immediately. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Account is restricted to rtx6000_lzanna for RTX6000 jobs (despite AllowAccounts=ALL on other partitions, sbatch rejects them with 'partition not valid for this job'). That QOS is fully consumed by torch_pr_144_general jobs with ~46h remaining. h200_courant is accessible to our account, has 414 free CPUs out of 768, and a test-submit estimates start in ~2h. Eval is GPU-agnostic (same checkpoint loads on any CUDA device), so the change is safe. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The eval container resolves CONFIG paths to /workspace/configs/...,
which is baked at build time. Adding new configs (e.g. the halfdeg
eval config) required a container rebuild before the job could find
them. Binding ${REPO_DIR}/configs over /workspace/configs lets a
freshly-pulled checkout drive the run without rebuilding.
The training sbatch already does this implicitly via /workspace
binding patterns; this brings eval to parity.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ents + reentry flag) + make sure that the proper exit codes are set (first, to capture the signal code, then to actually exit with the right return code).
Mask variables (mask_0..mask_18) match _var_name_encode_level's _[0-9]+ regex and were forcing _is_compact to return False for compact-form data. Without this filter, DataSource.filter() takes the non-compact branch which tries data["uo_0"] on a dataset whose only prognostic variables are uo, vo, thetao, so, zos -> KeyError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
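The failure mode can be reproduced in a few lines. This is a sketch under assumptions: `is_compact` here is a stand-in for the repository's `_is_compact`, the `mask_` prefix filter is one plausible way to express the fix, and the regex is anchored at the end of the name, which the commit message does not specify.

```python
import re

# Assumed form of the level-suffix pattern mentioned in the commit message.
LEVEL_SUFFIX = re.compile(r"_[0-9]+$")


def is_compact(var_names, ignore_prefixes=("mask_",)):
    """Return True when no non-ignored variable carries a per-level suffix.

    Hypothetical stand-in for _is_compact: names such as "uo_0" mark the
    non-compact layout, while "uo" with a lev dimension is compact.
    """
    prefixes = tuple(ignore_prefixes)
    candidates = [n for n in var_names if not n.startswith(prefixes)]
    return not any(LEVEL_SUFFIX.search(n) for n in candidates)


if __name__ == "__main__":
    names = ["uo", "vo", "thetao", "so", "zos"] + [f"mask_{i}" for i in range(19)]
    print(is_compact(names))                      # masks filtered out
    print(is_compact(names, ignore_prefixes=()))  # masks trip the regex
```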
When a means/stds dataset mixes depth-resolved variables (with a lev dim) and surface variables, plain to_array().reshape(-1) broadcasts surface vars across all levels, producing too many channels (95 instead of 77 for thermo_dynamic_all). _flatten_var_lev keeps depth vars per-level and surface vars as one channel each, matching the prognostic tensor channel layout. Restores the fix from 604a047 that 776072b (partial reversion of data.py) rolled back. Multi-scale 8048549 crashed at the post-Epoch-1 validation unnormalize step without it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
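The 95-versus-77 channel counts follow from simple arithmetic, sketched below in plain Python rather than xarray. The assumptions are labeled in the code: 19 levels (inferred from `mask_0..mask_18`), four depth-resolved variables (`uo`, `vo`, `thetao`, `so`) and one surface variable (`zos`). Both function names are illustrative and are not the repository's `_flatten_var_lev`.

```python
N_LEV = 19  # assumption: inferred from mask_0..mask_18


def flatten_var_lev(stats):
    """One channel per level for depth variables, one channel per surface variable."""
    channels = []
    for value in stats.values():
        if isinstance(value, (list, tuple)):
            channels.extend(float(v) for v in value)  # depth-resolved variable
        else:
            channels.append(float(value))             # surface variable
    return channels


def broadcast_flatten(stats, n_lev):
    """Mimic the buggy path: surface variables are repeated across all levels."""
    channels = []
    for value in stats.values():
        if isinstance(value, (list, tuple)):
            channels.extend(float(v) for v in value)
        else:
            channels.extend([float(value)] * n_lev)
    return channels


if __name__ == "__main__":
    stats = {name: [0.0] * N_LEV for name in ("uo", "vo", "thetao", "so")}
    stats["zos"] = 0.0
    print(len(flatten_var_lev(stats)), len(broadcast_flatten(stats, N_LEV)))
```

Under these assumptions the per-level path yields 4 × 19 + 1 = 77 channels and the broadcast path yields 5 × 19 = 95, matching the numbers in the commit message.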
The per-scale loss accumulator stores running sums on CPU (batch.loss.detach().cpu()) to avoid pinning GPU memory, but get_logs then calls all_reduce_mean. The default DDP process group is NCCL, which does not accept CPU tensors. Move the local mean to the current CUDA device before reducing, then back to CPU for the float cast. Caught when multi-scale 8048549 ran for 46 minutes through Epoch 1 training + validation, then crashed in get_logs with: RuntimeError No backend type associated with device type cpu. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
94cfd59 to
f0c1515
Compare
This updates the train script, our torch docs and the sbatch script to allow us to make use of preemptable computing resources on torch. This PR is more of a refresh than anything else.
…tch comments to turn on preemption.
Configured directly in YAML (not via CLI overrides):
- model.yaml: pred_residuals false (was true), decoder.context_patches 3 (was 1)
- train_multiscale.yaml: data.hist 0 (was default 1)
Launch script: bump NAME_SUFFIX -> v48, PREEMPTIBLE=1 (use new PR #626 preempt-resume code), wandb group kr1_v48, drop CLI hparam overrides.
Rationale:
- hist=0: ocean state (T,S,u,v) is approximately Markovian; history adds channels without strong physics justification.
- pred_residuals=false: v47 1-yr rollout showed sawtooth oscillation that amplifies over time, classic failure mode of unstable residual predictor.
- decoder.context_patches=3: time-mean thetao plot from v47 rollout shows visible decoder-window seam (~72x120 px rectangle). Bumping context rings 1->3 grows window data context from 14x14=196 -> 18x18=324 latent tokens, sharing more context across adjacent windows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
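The token counts quoted for `decoder.context_patches` are consistent with each context ring adding one latent token on every side of a 12x12 core window. That core size is an inference from 14 - 2 and 18 - 6, not something the commit states, and `window_tokens` is a hypothetical helper.

```python
def window_tokens(core_side: int, context_rings: int) -> int:
    """Latent tokens visible to one decoder window.

    Assumes each ring of context extends the square window by one token on
    every side, so the side length grows by 2 per ring.
    """
    side = core_side + 2 * context_rings
    return side * side


if __name__ == "__main__":
    CORE_SIDE = 12  # assumption inferred from the 14x14 and 18x18 figures
    print(window_tokens(CORE_SIDE, 1), window_tokens(CORE_SIDE, 3))
```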
PR #626 sbatch dropped the src+configs binds; without them the container uses its stale baked-in code, so source-only changes (incl. the preempt code we just added) wouldn't take effect. Restore the bind pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #626 added a bind for ${RUN_DIR}/wandb but didn't create it; apptainer errors with "mount source doesn't exist" at startup, before the trainer runs (so preempt-resume can't kick in). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #626 attempted to set Comment="preemption=yes;requeue=true" via scontrol update from inside the job, but slurm rejects this for non-admin users with "Unspecified error". The comment must be set at submit time to opt into the Torch cluster's preemption-friendly partition behavior. Without it, slurm cancels (not requeues) on QOS-priority preemption, which is what killed v47. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lab-priority partition rtx6000_lzanna has PreemptMode=OFF, meaning slurm won't auto-requeue cancelled jobs even with --requeue + Comment. The shared rtx6000 partition is configured with PreemptMode=REQUEUE (verified via scontrol show partition). Switching there sacrifices a small amount of priority for working auto-requeue on preemption. With intra-epoch checkpoints (CHECKPOINT_BATCH_INTERVAL=100, ~3 min cadence), worst-case loss per preemption is ~3 min of compute. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The launch script exports NCCL_P2P_DISABLE, NCCL_IB_DISABLE, UCX_TLS, NCCL_NET etc. as workarounds for gr101/gr102 NCCL/UCX segfaults. Apptainer drops these by default; the APPTAINERENV_ prefix forwards them into the container. Without this, 8237766 ran 2h on first attempt (lucky), got preempted, auto-requeued (Restarts=1, working as designed), but segfaulted 14s into the resumed attempt at NCCL init. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
steps: [1, 2, 4, 8] with step_transition: [5, 15, 30]:
- epochs 1-4 → 1-step rollout (initial fitting)
- epochs 5-14 → 2-step
- epochs 15-29 → 4-step
- epochs 30-70 → 8-step
Inherits v48's other deltas (hist=0, pred_residuals=false, decoder.context_patches=3). Saved as a separate file so v48's in-flight runs that re-read train_multiscale.yaml on each requeue are unaffected.
Rationale: 1-step training only ever sees ground-truth as input, but rollout sees the model's own predictions; backprop through multi-step composition forces f's iterated Jacobian to have stable spectral properties. Critical for our 10yr / ~730-step rollouts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
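The epoch → rollout-length schedule above can be expressed as a lookup. This is a sketch assuming 1-indexed epochs and that a transition epoch is the first epoch of the new step count, as the listed ranges imply; `rollout_steps` is a hypothetical name, not the trainer's function.

```python
from bisect import bisect_right
from typing import Sequence


def rollout_steps(epoch: int, steps: Sequence[int], step_transition: Sequence[int]) -> int:
    """Return the number of rollout steps to train with at `epoch`.

    `step_transition[i]` is taken to be the first epoch that uses `steps[i + 1]`.
    """
    if len(steps) != len(step_transition) + 1:
        raise ValueError("steps must have exactly one more entry than step_transition")
    # Count how many transitions have already happened at this epoch.
    return steps[bisect_right(list(step_transition), epoch)]


if __name__ == "__main__":
    for epoch in (1, 4, 5, 14, 15, 29, 30, 70):
        print(epoch, rollout_steps(epoch, [1, 2, 4, 8], [5, 15, 30]))
```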
The conditional APPTAINERENV_ propagation worked on first launch but not on slurm requeues: slurm doesn't reliably re-export user-shell vars when restarting a job. Both v48 attempts that auto-requeued segfaulted at NCCL/CUDA init for this reason. Hardcode the gr101/gr102 NCCL workarounds directly in the sbatch script so they survive every requeue regardless of submitting-shell state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ise) slurm_apptainer_train.sbatch computes NAME=$(date +%Y-%m-%d)-$NAME_SUFFIX when NAME is unset. Each requeue or chain-link runs the sbatch fresh and recomputes NAME from today's date, creating a NEW run dir per day and losing access to the previous checkpoint. Fix: launch_kr1_train.sh now exports NAME once at submission, fixing the date for the entire lifetime of the chain. Override via NAME=... env var to resume into a specific existing dir. This caused us to silently lose epoch 10 progress (val 0.255) when the chain link ran on a different day from submission and started fresh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
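The fix amounts to resolving the run name once and reusing it. The Python sketch below mirrors that logic for illustration; the actual change is in the shell launch script, and `resolve_run_name` plus the example dates are hypothetical.

```python
import datetime as dt


def resolve_run_name(env: dict, name_suffix: str, today: dt.date) -> str:
    """Reuse an exported NAME if present, otherwise derive one from today's date."""
    name = env.get("NAME")
    if name:
        return name
    return f"{today:%Y-%m-%d}-{name_suffix}"


if __name__ == "__main__":
    env = {}
    # Submission day: compute the name once and export it for the whole chain.
    env["NAME"] = resolve_run_name(env, "v48", dt.date(2025, 1, 1))
    # A requeue on a later day now lands in the same run directory.
    print(resolve_run_name(env, "v48", dt.date(2025, 1, 2)))
```

Without the export step, the second call would derive a new name from the later date, which is the behavior that lost the epoch 10 checkpoint.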
h200_courant is contended (QOSGrpGRES kills evals at ~35s before rollout begins). docs/torch.md examples use rtx6000_lzanna with gres=gpu:rtx6000:1 (the lab-priority partition); follow that pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Training used hist=0; without this override the eval model is built with the default hist=1, doubling decoder channels (77 -> 154) and failing state_dict load with shape mismatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
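A small pre-flight check makes the hist contract explicit before a checkpoint load is attempted. It assumes the decoder channel count scales as `n_channels * (hist + 1)`, which is consistent with the 77 -> 154 figures above but is not confirmed against the model code; both function names are hypothetical.

```python
def decoder_channels(n_channels: int, hist: int) -> int:
    """Channels the decoder is built with, assuming one extra copy per history step."""
    return n_channels * (hist + 1)


def check_hist_matches(ckpt_channels: int, n_channels: int, hist: int) -> None:
    """Raise early, with a readable message, instead of failing inside state_dict load."""
    built = decoder_channels(n_channels, hist)
    if built != ckpt_channels:
        raise ValueError(
            f"eval hist={hist} builds {built} decoder channels, "
            f"but the checkpoint has {ckpt_channels}"
        )


if __name__ == "__main__":
    check_hist_matches(ckpt_channels=77, n_channels=77, hist=0)  # passes silently
    print(decoder_channels(77, 1))
```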
v49 NaN'd at the 1->2-step curriculum transition (epoch 20, iter 1) with
pred_residuals=true. Going conservative for v50: residuals=false to avoid
the v47 sawtooth, keep hist=1 to avoid v48's climatology collapse, restore
v48's LR.
PerScaleSnapshotValidateAggregator now also routes each scale through a
StdRatioAggregator so val/{H}x{W}/std_ratio/{var} actually surfaces under
multi-scale FOMO training (it was a no-op in v49 because we only wired it
into the base ValidateAggregator path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lxmrs/experiments/kr1-v2
The PhysicsNeMo container bakes WANDB_GIT_COMMIT/WANDB_GIT_REMOTE_URL into the image at build time and also sets WANDB_DISABLE_GIT=true, so wandb forever reports the SHA from the image build rather than the host commit that's actually executing (via the bind-mounted src/ + configs/). Override both env vars from the host's REPO_DIR via APPTAINERENV_* so wandb labels each run with the real running commit. Appends " (dirty: ...)" in the slurm stdout banner when the working tree has uncommitted changes (wandb's git_commit field stays the clean SHA for clickable GitHub links). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v50 cleared the 1→2-step transition cleanly but hung 30min mid-epoch 20 — classic THP defrag stall (sysadmin's v26 defrag=never had been reverted on gr101/gr102). HPC@ has re-applied defrag=never; the sbatch banner now prints the live THP setting so future post-mortems can confirm at a glance. Model delta vs v50: decoder.context_patches 3 → null. Everything else identical so the v51 vs v50 training curve is directly comparable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v51 trained with hist=1; the existing hist=0 override (added for v48) would build the decoder with a mismatched channel count (half of what the checkpoint expects) and crash on state_dict load. Updated:
- CKPT_PATH → v51 ema_ckpt.pt (best val 0.240 at epoch 29)
- NAME_SUFFIX → kr1_v51_halfdeg_eval_ema
- ARGS hist 0 → 1, with a comment explaining the contract
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Enables resume launches (PREEMPTIBLE=1 WALLTIME=48:00:00 NAME=<existing dir>) without script edits. Also moves EXTRA_SBATCH_ARGS to the end of the sbatch call so explicit user overrides (e.g. --dependency=afterany:<jobid>) take precedence over script defaults. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v51 resume chain reached epoch 48 with val 0.234 (down from 0.240 at the walltime-killed first run, epoch 29). EMA checkpoint at that path has been overwritten in-place during the resume chain — point the eval at the same file with a unique NAME_SUFFIX so we don't clobber the earlier ema_ep29 eval run directory or confuse them in wandb. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>



No description provided.