Major update: new models, MoE performance, LoRA, and optimizers #9
Squashed port of internal PR #24 (mabraham/grpo):
- Add dr-grpo loss
- Add items back
- Rename metrics log
- Add tests and rename
- Adjust tests
- Import assert
All-reduce tokens_per_expert across the data-parallel group before computing the Switch Transformer auxiliary loss so the loss reflects the global routing distribution, not just the local micro-batch. Closes #39 (cherry picked from commit dd602d1ae4702dfaf37b14922cbf91ebfcbc0e07)
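The fix above can be sketched numerically. `all_reduce_sum` and `switch_aux_loss` are illustrative stand-ins (not the repo's API) for `dist.all_reduce` on `tokens_per_expert` and the Switch Transformer auxiliary loss `num_experts * sum_i f_i * P_i`, where `f_i` is the dispatch fraction and `P_i` the mean router probability:

```python
# Hypothetical sketch of the aux-loss fix: tokens_per_expert is summed
# across data-parallel ranks so the loss sees the global routing
# distribution, not one micro-batch's skewed local counts.

def all_reduce_sum(per_rank_counts):
    """Stand-in for dist.all_reduce on tokens_per_expert: elementwise sum."""
    return [sum(col) for col in zip(*per_rank_counts)]

def switch_aux_loss(tokens_per_expert, router_prob_per_expert):
    """num_experts * sum_i f_i * P_i, with f_i the dispatch fraction."""
    total = sum(tokens_per_expert)
    num_experts = len(tokens_per_expert)
    return num_experts * sum(
        (t / total) * p for t, p in zip(tokens_per_expert, router_prob_per_expert)
    )

# Two DP ranks whose local micro-batches route very differently:
rank_counts = [[8, 0], [0, 8]]   # tokens_per_expert on rank 0 and rank 1
probs = [0.9, 0.1]               # mean router probability per expert

local_loss_rank0 = switch_aux_loss(rank_counts[0], probs)  # overstates imbalance
global_counts = all_reduce_sum(rank_counts)                # [8, 8]
global_loss = switch_aux_loss(global_counts, probs)        # reflects balanced routing
```

With only the local view, rank 0 believes all traffic goes to expert 0; the reduced counts show the global distribution is balanced.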
(cherry picked from commit ce41728320ba988724c95653b426c81ee6074b7b)
(cherry picked from commit 69ee83a5f59c8eaf8bc61f998f6d8808cc91bd0e)
* Fix TFLOPs accounting and improve wandb/startup metrics
  - Align MoE TFLOPs with logical model FLOPs and fix per-document seqlen² cost calculation in helper.py
  - Move wandb init earlier in bootstrap for startup metric capture
  - Add host inventory logging and memory snapshots during setup
  - Track repo commit hash in experiment metadata
  - Show tok/s instead of tflops in training progress bar
* Fix TFLOPs overcounting: embedding is a lookup, not a matmul
  embed_lm_N was computed as vocab_size * hidden_size * 2, counting both the embedding table and lm_head as matmuls. But embedding forward is a table lookup (zero multiply-accumulate ops) — only lm_head is a real matmul. This matches torchtitan's approach of explicitly excluding nn.Embedding parameters from the FLOPs count.
  Fix: change embed_lm_N = vocab_size * hidden_size * 2 to embed_lm_N = vocab_size * hidden_size (lm_head only)
  For Qwen3-32B at seq_len=128k, this removes ~600 TFLOPS of phantom FLOPs per step, giving more accurate MFU reporting. Applied across all model types: deepseek_v3, qwen3_moe, qwen3_5, qwen3_5_moe, qwen2/qwen3/llama, qwen2_vl.
(cherry picked from commit 4abe7983ba02b15a3526c80c93d1d936764396e0)
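As a back-of-envelope check of the overcounting fix (numbers are illustrative Qwen3-32B-like sizes, not exact repo constants), using the standard 6N FLOPs-per-token approximation (2N forward + 4N backward):

```python
# Sketch: only lm_head is a matmul, so only it contributes
# vocab_size * hidden_size parameters to the per-token FLOPs count.
vocab_size, hidden_size = 151_936, 5_120   # illustrative Qwen3-ish sizes

embed_lm_N_old = vocab_size * hidden_size * 2   # buggy: embedding + lm_head
embed_lm_N_new = vocab_size * hidden_size       # fixed: lm_head only

# Phantom FLOPs per token from the removed term (6N fwd+bwd approximation):
phantom_flops_per_token = 6 * (embed_lm_N_old - embed_lm_N_new)
```

At 128k-token steps the phantom term scales linearly with tokens, which is where the ~600 TFLOPs-per-step figure above comes from.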
Bump max_retries from 3 to 30 and sleep from 2s to 5s to handle slow NFS propagation across nodes in multi-node training setups. (cherry picked from commit c05f962cdee2632c46321abd477833612112a4d7)
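The retry pattern being tuned here can be sketched as follows; `wait_for_file` is a hypothetical name, not the repo's helper:

```python
# Minimal sketch: poll a path that may lag behind on NFS, with the
# retry count and sleep interval raised for multi-node propagation delays.
import os
import time

def wait_for_file(path, max_retries=30, sleep_s=5.0):
    """Return True once `path` is visible to this node, retrying to ride
    out slow NFS propagation; False if it never appears."""
    for _attempt in range(max_retries):
        if os.path.exists(path):
            return True
        time.sleep(sleep_s)
    return False
```

With max_retries=30 and sleep_s=5.0 this waits up to ~150 s, versus the previous ~6 s (3 x 2 s).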
* Add pre-commit hooks and apply formatting across codebase
  - Add .pre-commit-config.yaml with trailing-whitespace, EOF fixer, YAML/TOML checks, ruff (fix-only + format), and codespell hooks
  - Add .codespellrc with ignore list for ML domain terms (dout, te, etc.)
  - Apply ruff formatting and import sorting to all source files
  - Fix typos caught by codespell (orgin, paramters, gurantee, etc.)
* Add CI lint workflow for ruff and codespell
* Simplify CI lint to reuse pre-commit config
* Update CONTRIBUTING.md and docs with pre-commit setup instructions
* Add pre-commit cache to CI lint workflow
* Add Claude code review workflow for PRs
* Use API key auth for Claude code review (no app install needed)
* Fix permissions: allow writing PR comments for code review
* test: trigger claude code review
* Add workflow_dispatch trigger for manual testing
* Rename lint.yml to pre-commit.yml
* Add auto-labeler and cancel-on-merge workflows
* Simplify claude code review: use direct prompt instead of plugin
* Add Claude Code GitHub Action workflow
* Remove workflow files moved to separate PR (#66)
  claude-code-review, cancel-on-merge, and labeler workflows are now in the qywu/add-github-workflows branch.
* Remove claude.yml, moved to PR #66
* Pin pre-commit and action versions, use .[lint] for installation
  - Pin pre-commit>=4.5,<5 in pyproject.toml lint dependency group
  - Use `pip install -e ".[lint]"` in CONTRIBUTING.md and CI workflow
  - Pin GitHub Actions to commit SHAs for reproducibility
  - Rename pre-commit.yml to lint.yml to match workflow name
* Use uv to install pre-commit in CI instead of pip install -e .[lint]
  Avoids installing the full xorl package (torch, flash-attn, etc.) on GitHub runners just to run linting. Uses uv tool install for a fast, isolated pre-commit installation.
(cherry picked from commit 921c56747747d4f25e4ab2012a0ac254a060d39c)
* Improve dummy dataset generation and packing - Update dummy data pipeline for better benchmark coverage - Improve dataset packing logic for local testing * Fix pre-commit formatting in packing.py (cherry picked from commit 96e918147fc6e2c23d13f96d601f42437b67beec)
* Fuse MoE expert gate and up projections * Fix EP MoE weight sync gathering * Fix weight sync readiness tracking (cherry picked from commit 3ade637fe432c005f1b613a6d56423e1bb7d17fd)
* Improve RMSNorm modes and fused residual path (#37)
* Improve RMSNorm modes and fused residual path
* Support native and compile RMSNorm for Qwen3.5
* Improve RMSNorm benchmark and Qwen3.5 fusion
* Remove quack RMSNorm mode support
* Update server docs for rmsnorm_mode
* Remove stale quack benchmark tags
* Fix pre-commit lint and formatting issues
* Fix nested config flattening to propagate all fields
  The load_server_arguments() nested-config path previously cherry-picked a hardcoded subset of model/train keys, silently dropping fields like rmsnorm_mode, ep_dispatch, router_fp32, and all lora config. Replace with generic flattening that copies all keys from each section, with explicit remapping only for worker.* prefixes and lora.exclude_modules -> qlora_exclude_modules. New fields added to ServerArguments + to_config_dict() are now picked up automatically.
* Add missing fields to to_config_dict() for complete round-trip
  model_name, pp_variable_seq_lengths, log_level, and sync_inference_method were not emitted by to_config_dict(), causing them to be silently lost when round-tripping through nested YAML configs via load_server_arguments().
(cherry picked from commit a77acd4802f9f2121a55dedd9d578997e2834ab3)
…#82) * Use rank-local node cache dirs for Triton and Quack * Share rank-local cache setup across entrypoints (cherry picked from commit 49af7821117750a1d84b8daf59c1bd2b9822d9af)
* DeepEP improvements: train_router fix, NFS retry, benchmark infra (#41)
* Add local benchmark configs/jobs and improve DeepEP async combine
  Benchmark infra:
  - Add 1-node and 2-node EP16 benchmark configs for alltoall and deepep
  - Fix 2-node k8s jobs: POD_NAME downward API for correct node_rank, hostIPC, privileged, convert StatefulSet to two coordinated Jobs (backoffLimit=0)
  - Set rmsnorm_mode=compile in all benchmark configs
  DeepEP improvements:
  - Default async_combine=True to overlap combine comm with shared expert compute
  - scatter_add_ → index_add_ in unpermute step
  - Remove redundant .contiguous() before combine
* Revert DeepEP async_combine changes (slower on H100)
* Add 4-node EP32 Qwen3-235B-A22B benchmark configs (deepep + alltoall)
* Default train_router=False for MoE, fix packing NFS retry, add benchmark scripts
  - Set train_router default to False so routing weights are detached before expert computation. Router is trained only via auxiliary losses, ensuring consistent convergence between DeepEP and AllToAll dispatch backends. DeepEP's custom autograd cannot propagate gradients through routing weights (buffer communication breaks the autograd graph), causing worse convergence when train_router=True.
  - Increase packing bins cache retry from 10x2s to 30x5s for multi-node NFS visibility delays.
  - Add local benchmark configs and slurm scripts for 30B 2-node and 235B 4-node benchmarks (deepep, alltoall, split_broadcast, A/B test).
  - Add NFS parallel read throughput benchmark (1-rank and 32-rank variants).
  - Add DeepEP vs AllToAll precision comparison test.
* Assert train_router=False in MoE forward
  DeepEP cannot propagate gradients through routing weights, so train_router=True would silently produce different convergence between dispatch backends. Fail fast instead.
* Remove unnecessary benchmark files and test configs
  Clean up AB test configs, split_broadcast variants, NFS bench scripts, and precision test that were used during development.
* Remove benchmark results from repo
* Move slurm scripts to experiments/local_benchmark/slurm/
* Remove packing.py NFS retry change (moved to separate PR)
* Remove experiments/ (moved to local-benchmark PR)
* Restore experiments/ to match main
* Make train_router explicit for DeepEP
* Align train_router defaults with PR intent
---------
Co-authored-by: kiddyboots216 <apanda@together.ai>
(cherry picked from commit 956da30f6542e4f85d700241d1ca871c4e4a1f0b)
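The fail-fast check described above can be sketched like this; `check_router_config` is a hypothetical name, not the repo's assertion:

```python
# Sketch: DeepEP's buffer communication breaks the autograd graph through
# routing weights, so train_router=True would silently converge differently
# from the AllToAll backend. Reject the combination up front.

def check_router_config(train_router, dispatch_backend):
    """Raise early rather than let backends diverge silently."""
    if dispatch_backend == "deepep" and train_router:
        raise ValueError(
            "train_router must be False with the DeepEP backend: gradients "
            "cannot propagate through routing weights"
        )
    return True
```

With train_router=False the router still learns, but only through the auxiliary losses, which both dispatch backends compute identically.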
(cherry picked from commit e3f720dcb52f87f997be264941c20a1b46c4f1c5)
* Respect deferred QLoRA MoE expert loading * Format qwen3 moe checkpoint handler * Filter skipped weights in rank0 broadcast load path * Update broadcast loader tests for weight-load group (cherry picked from commit 0184c0a8feb27e7dd18c1937d4d774a3d526210d)
* Add SignSGD optimizer support * Remove import churn from SignSGD PR * Trim unrelated SignSGD diff (cherry picked from commit 2607fc6db06cfa09244610562ef35ae1c885c555)
* Move MoE routing weights before down_proj * Fix EP routing score dispatch and gradients * Apply Ruff formatting for EP routing score changes (cherry picked from commit 67a2c921544fee18534343e234985a63b343657e)
* feat: add Qwen3.5 MoE support and fix multi-node training stability
  - Add Qwen3.5-35B-A3B and 397B-A17B server configs (full + LoRA)
  - Fix weight loading deadlock: remove Gloo broadcast from _try_load_state_dict
  - Fix MoE numeric stability: FP32 scatter_add accumulation, chunked DeepEP combine
  - Fix Ulysses SP position_ids generation for correct RoPE on ranks > 0
  - Handle variable experts-per-tok in routing replay (Qwen3.5 uses topk=10)
  - Fix LoRA adapter DTensor shard boundary, add fast checkpoint save path
  - Increase timeouts for large model init and weight sync
* Fix fused gate_up_proj OOM: fused forward GEMM + in-place backward dgrad + Muon split
  MoE memory optimization for fused gate_up_proj [E, H, 2I]:
  - Fused forward GEMM: single x @ gate_up_proj instead of 2 separate GEMMs, zero .contiguous() copies, save_for_backward stores param reference (zero-copy)
  - In-place backward dgrad: accumulate 2nd dgrad into grad_permute_tokens via c= parameter (ACCUMULATE_TO_C), avoiding 7 GiB temp allocation
  - Kernel fix: cast ACCUMULATE_TO_C loaded values to float32 for tl.dot compat
  - Muon split: detect fused [E, H, 2I] params and run Newton-Schulz on each [E, H, I] half independently (4x faster, same math)
  - Warmup: pre-compile ACCUMULATE_TO_C=True kernel variant before training
  - EP compute: fused interface (gate_up_proj, down_proj, intermediate_size), quack/native adapters for backward compat
  - Checkpoint handler: fix internal pattern transpose for HF Qwen3.5 weights
  - Config: moe_act mode, no autotune, 60min NCCL timeout
* Adjust group gemm to avoid explicit memory copy
* Fuse EP backward GEMMs and eliminate .contiguous() weight copies
  Backward FC1: 4 split GEMMs + 2 weight copies → 2 fused GEMMs. Expand kernel warmup to all backward variants to avoid compilation OOM. Add XORL_TRITON_NO_AUTOTUNE=1 to skip autotune benchmarking on large models.
* Fix ruff formatting
* Fix review feedback: LoRA save gather bug and Muon fused_split heuristic
  - Remove incorrect EP gather in LoRA checkpoint save: lora_params already stores global-shape tensors, gather was duplicating data and dst=0 as world rank could hang on non-root EP groups.
  - Replace shape-based fused_split heuristic with explicit param name check: build _fused_gate_up_ids set from param names containing "gate_up_proj" as @kiddyboots216 suggests.
  - Add 397B benchmark config
* Fix ruff-format: break long lines in optimizer.py
* Fix train_router not propagated to MoE subclasses, causing crash with deepep, and fix issue mentioned by Ashwinee
* Merge main: resolve MoE conflicts, combine fused gate_up with expert_scores
(cherry picked from commit b5631452f53de8231784cc3047267ebd5b6d8eec)
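The Muon split logic described above can be sketched at the shape level. `split_fused_gate_up` is a hypothetical helper illustrating the explicit name-based check that replaced the shape heuristic:

```python
# Sketch: a fused expert weight of shape [E, H, 2I] is split into two
# [E, H, I] halves so Newton-Schulz runs on each projection independently,
# gated by an explicit "gate_up_proj" name check rather than shape guessing.

def split_fused_gate_up(shape, fused_param_names, name):
    """Return the shapes Newton-Schulz should see for param `name`."""
    if name in fused_param_names:   # explicit name check, not a shape heuristic
        e, h, two_i = shape
        assert two_i % 2 == 0, "fused dim must be 2 * intermediate_size"
        return [(e, h, two_i // 2), (e, h, two_i // 2)]
    return [shape]

fused = {"model.layers.0.mlp.experts.gate_up_proj"}
halves = split_fused_gate_up(
    (128, 2048, 1536), fused, "model.layers.0.mlp.experts.gate_up_proj"
)
```

A name-based set avoids misclassifying any unrelated parameter that happens to have an even last dimension.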
* Enforce import-outside-top-level with targeted lazy imports * Use real imports for PLC0415 fixes * Move safe PLC0415 imports to module scope * Apply ruff formatting * Make pyiceberg warning import non-optional * Revert "Make pyiceberg warning import non-optional" This reverts commit 833fc321cdc917349f693243e8994f8b4a06ee36. * Fix Ruff import spacing in direct_train (cherry picked from commit c02dbf69e91f6bcb277e13915b8d3ec939cfc84c)
* Add moe_shared_lora and moe_hybrid_shared_lora to offline Trainer
  The MoE-aware LoRA injection (shared and hybrid_shared modes) was only available through the server's model_builder path. This adds the same support to the offline Trainer and direct_train so YAML configs can use:
    lora:
      moe_shared_lora: true          # all experts share both lora_A and lora_B
      moe_hybrid_shared_lora: true   # gate/up share lora_A, down shares lora_B
  Changes:
  - Add moe_shared_lora / moe_hybrid_shared_lora fields to LoRAArguments
  - Trainer._inject_lora: route to inject_lora_into_model_with_moe when flags are set
  - direct_train: same injection logic
  - Both save paths pass moe_hybrid_shared_lora to save_lora_checkpoint
* Remove unsupported moe_shared_lora mode
* Address PR #63 review: docs sync and load_lora_checkpoint docstring
  - Remove stale moe_shared_lora references from server.md and lora.mdx
  - Document that load_lora_checkpoint does not support hybrid-shared MoE LoRA checkpoints (key remapping + re-stacking not implemented)
* Fix ruff-format: line wrap and blank line
* Remove unused imports flagged by ruff
---------
Co-authored-by: kiddyboots216 <apanda@together.ai>
(cherry picked from commit c4829b24d574cc314670057a9699e822f6c80c63)
(cherry picked from commit df96b96182d022fe0ccbcf2bba39ef85c0596bb6)
(cherry picked from commit 51a2879488913b87dce85b15009e35f2af590b13)
* Fix package import cycles for LoRA and runner modules * Fix missing Optional import in weight sync backend (cherry picked from commit 1b654738a6cf229aabe28e2b40523d91465f3f12)
…EADME (#57)
* docs: add DeepEP multi-node prerequisites and cluster setup
  - Document nvidia_peermem requirement for GPUDirect RDMA — without it NVSHMEM cannot register GPU buffers with IB HCAs and DeepEP crashes with SIGABRT at the first dispatch
  - Document IBGDA driver settings (NVreg_EnableStreamMemOPs, PeerMappingOverride) with initramfs rebuild and reboot steps
  - Add troubleshooting entries for SIGABRT/num_recv_tokens=-1 and IBGDA init failures
  - Add 2-node EP16 dummy benchmark configs (deepep and alltoall)
  - Add slurm_train_30b_2node.sh launch script with corrected NVSHMEM lib path detection (use __path__ instead of __file__ for namespace packages)
  - Add enable_ibgda_reboot.sh for per-node IBGDA kernel module setup
* docs: remove DeepEP install section from README
* docs: consolidate DeepEP install and multi-node prerequisites under single section
* docs: point DeepEP install to upstream repo instead of internal wheel
* docs: remove multi-node setup section from installation page
* docs: add link to installation guide in quickstart
* docs: simplify README, link to docs for features and details
* revert: remove examples and scripts from PR
* docs: use standard GitHub Pages URL (org/repo format)
* docs: update README — XoRL logo, remove tagline
* docs: add overview section and improve README layout
* docs: add emojis to README section headings
* docs: fix clone URL to togethercomputer/xorl-internal
* docs: rename DEVELOPMENT.md to CONTRIBUTING.md, add contributing section to README
* docs: add supported models section to README
* docs: add emojis to header nav links
* add xorl-client and xorl-sglang as git submodules
* docs: improve installation guide, add xorl-client and xorl-sglang documentation
  - Add xorl stack overview table (xorl, xorl-client, xorl-sglang) to README
  - Add conda and uv install options for both base and sglang environments
  - Add pyproject.sglang.toml for unified install with PyTorch 2.9.1
  - Add xorl-client as a git dependency in pyproject.toml
  - Expand xorl-client docs: client classes, key features, quick example
  - Expand xorl-sglang docs: detailed diff from upstream SGLang covering weight sync endpoints, MoE routing export, numerical alignment, batch-invariant mode, and bug fixes
  - Update submodule install instructions to reference submodules/ paths
* docs: fix repo table column spacing in README
* docs: add server training examples (no_robot_sft, password_memorization)
* docs: add full stack launch guide for training + inference servers
* docs: add RL loop illustration and rename to "Server Training for RL"
* docs: replace ASCII diagram with HTML illustration for RL architecture
* docs: reorganize server training section into sub-sections
  - Add RL architecture SVG diagram
  - Reorganize sidebar: Training Server, Client SDK, Examples as sub-sections
  - Move architecture.mdx -> training-server/launching.mdx
  - Move api-reference.md -> training-server/api-reference.md
  - Split client-sdk into overview, training-loop, loss-functions pages
  - Split examples into sft-no-robots, password-memorization pages
  - Merge rl-training.mdx content into client-sdk pages
  - Shorten overview to concise entry point with links to sub-pages
* docs: add recommended reading section with RL and system design papers
* Fix doc hallucinations: nonexistent config, invalid loss_fn, wrong API pattern, port inconsistency
* Fix server training docs examples
* Fix remaining server training docs issues
---------
Co-authored-by: zzz0906 <zzz879978@outlook.com>
Co-authored-by: Ashwinee Panda <apanda@together.ai>
(cherry picked from commit 6e0d8d440a3b5cb975dc8a732d855e68c7e39b39)
* Fix nccl weight sync rendezvous reuse * Default weight sync rendezvous to ephemeral ports (cherry picked from commit 101334268440cc57b874c41c7a2c3b2f18ac28cf)
(cherry picked from commit c49840b4e90ca895de3b224eeed0654e9a6c80a6)
* Fix causal attention FLOP accounting * Remove unsupported FLOPs model aliases (cherry picked from commit cd140158137790ac61f9466c8dab3ebc5564aaad)
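The causal FLOP accounting can be sketched with an illustrative formula (not the repo's exact counter): a causal mask means only the lower triangle of the seq x seq score matrix is computed, halving the quadratic term relative to full bidirectional attention.

```python
# Sketch: forward attention matmul FLOPs. QK^T and AV each cost roughly
# 2 * seq^2 * hidden multiply-adds; the causal mask halves the score work.

def attn_matmul_flops(seq_len, hidden_size, causal=True):
    full = 2 * 2 * seq_len * seq_len * hidden_size  # QK^T + AV, 2 ops per MAC
    return full // 2 if causal else full
```

The halving matters most at long context, where the seq² term dominates per-layer cost.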
* Add Gram Newton-Schulz backend for Muon (#97)
* Add Gram Newton-Schulz backend for Muon
* Fix QLoRA import cycles for benchmark entrypoints
* Prefer installed quack for Muon Gram-NS
* Use upstream quack for Muon Gram-NS
* Batch Gram Newton-Schulz updates in Muon
* Add Muon dtype controls and split optimizer modules
* Add Muon transient update dtype control
* Add Muon force momentum path flag
* Fix Muon dtype handling for server and Gram-NS
* Fallback to torch matmuls for unsupported Quack dtypes
* Reconcile Muon Gram-NS batching changes
* Format Muon batched NS list comprehension
* Batch grouped Gram-NS by matrix shape
(cherry picked from commit 37205d38c4f51053b33a42970264a0e04ec900f4)
* Fix Qwen3.5-35B MoE routing and fused_recurrent kwargs bug
Two fixes:
1. **norm_topk_prob default** (root cause of K3 ~40): `norm_topk_prob=None` in the HF checkpoint was interpreted as `False` by xorl, skipping routing weight normalization. HF always normalizes top-k weights. Fix: default to `True` when the config value is `None`. Result: K3 drops from 38.76 to 0.003.
2. **fused_recurrent kwargs bug**: `fused_recurrent_gated_delta_rule()` passed keyword arguments to `autograd.Function.apply()`, which doesn't accept them, crashing any sequence ≤ 64 tokens.
Closes #117
(cherry picked from commit e60e8817a0b9ca6bcd9762bdf8abf3c6505d4b5f)
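The norm_topk_prob fix can be sketched as follows; `normalize_topk` is a hypothetical helper showing the None-means-True semantics, not the repo's code:

```python
# Sketch: HF checkpoints may store `norm_topk_prob: null`. That must be
# treated as True, i.e. the top-k routing weights are renormalized to sum
# to 1 per token, matching HF behavior.

def normalize_topk(weights, norm_topk_prob=None):
    if norm_topk_prob is None:   # the fix: None from the config means True
        norm_topk_prob = True
    if norm_topk_prob:
        s = sum(weights)
        return [w / s for w in weights]
    return weights

w = normalize_topk([0.30, 0.20, 0.10])   # top-3 router scores before renorm
```

Interpreting None as False skips the division entirely, which scales every expert contribution down and compounds across layers.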
…mory trade-off (#85)
* Unified gradient_checkpointing_method: +12-22% MoE throughput
  Replace fragmented gradient checkpointing config (recompute_modules + moe_checkpoint_method) with a single gradient_checkpointing_method field:
  - recompute_full_layer (default): recompute entire decoder layer
  - recompute_before_dispatch: checkpoint attn+router, keep dispatch+expert+combine (+12%)
  - no_recompute: no recomputation, max throughput (+22%)
  Key changes:
  - MoEGradientCheckpointingLayer base class: provides _pre_dispatch_forward and _moe_forward so new MoE models only implement _pre_mlp_forward
  - MoEBlock.route() + forward_experts_only(): extract routing logic into reusable methods, eliminating 40-line duplication in decoder layers
  - Fix routing replay: cache routing_weights alongside selected_experts for deterministic checkpoint recompute (fixes NCCL timeout with recompute_full_layer + EP)
  - Remove moe_act kernel variants (QuackEPGroupGemmMoeAct, TritonEPGroupGemmMoeAct, native moe_act): benchmarked no measurable impact with recompute_before_dispatch
  - Remove sonic dead code, quack kernel warmup
  - Works with both alltoall and DeepEP backends
  - Docs updated across 6 pages
  Benchmarks: Qwen3-Coder-30B-A3B, quack + alltoall, 1-node EP8, H100 80GB, 32k seq
  - recompute_full_layer: 17,600 tok/s, 37.5 GB peak
  - recompute_before_dispatch: 19,990 tok/s, 44.5 GB peak (+13.6%)
  - no_recompute: 21,800 tok/s, 54.8 GB peak (+23.9%)
  Convergence verified: all methods identical step-1 loss, monotonic convergence over 50 steps (spread ~0.36 at step 50, normal fp non-determinism).
* Fix EnvironMeter kwarg: gc_enabled -> gradient_checkpointing_enabled
  The EnvironMeter.__init__ renamed `gc_enabled` to `gradient_checkpointing_enabled` and added a new `gradient_checkpointing_method` parameter. Update the trainer call site to use the new kwarg names so XorlFlopsCounter receives the correct gradient checkpointing configuration.
* Flatten hidden_states before route() in _pre_dispatch_forward
  In the selective-checkpoint path (recompute_before_dispatch), _pre_dispatch_forward calls self.mlp.route() with the 3-D (batch, seq, hidden) tensor from _pre_mlp_forward. However, route() expects a flattened (num_tokens, hidden_dim) tensor because it runs softmax routing. Flatten before calling route to match the expected shape; forward_experts_only already handles the 3-D input separately.
* Accept expert_scores in native/quack EP compute adapters
  The EP forward path now passes 6 arguments to compute_fn including expert_scores. The native and quack adapter wrappers still only accepted 5 positional args, causing a TypeError at call time. Add expert_scores as an optional kwarg to both wrappers to match the fused call signature from experts.py.
* Fix MoE decoder layer attention-weight indexing in model loops
  _moe_forward returns (hidden_states,) or (hidden_states, router_logits) and never includes attention weights. When output_attentions=True the model loop was reading layer_outputs[1] as self-attention weights, but that slot actually held router_logits (when output_router_logits was also True). Use a None placeholder for attention weights instead, since the MoE selective-checkpoint path does not expose per-layer attention matrices. Fixes both Qwen3Moe and Qwen3.5Moe model classes.
* Fix: pass expert_scores through native/quack EP wrappers
  Previous fix accepted the parameter but silently dropped it. Both native_ep_compute and QuackEPGroupGemm.apply already support expert_scores — the fused wrappers just weren't forwarding it.
---------
Co-authored-by: qywu <qywu@together.ai>
(cherry picked from commit b859ea2877e52c63f33f97b76d5405fed16bbc69)
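The attention-weight indexing fix can be sketched with a hypothetical unpacking helper (names illustrative; the repo does this inline in the model loop):

```python
# Sketch: _moe_forward-style outputs are (hidden_states,) or
# (hidden_states, router_logits) and never contain attention weights, so
# the model loop must emit None for attention rather than read slot 1.

def unpack_moe_layer_outputs(layer_outputs, output_router_logits):
    hidden_states = layer_outputs[0]
    attn_weights = None  # selective-checkpoint path exposes no attention matrices
    router_logits = layer_outputs[1] if output_router_logits else None
    return hidden_states, attn_weights, router_logits
```

Before the fix, enabling both output_attentions and output_router_logits made the loop hand router logits to callers expecting attention maps.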
…ibuted training hang (#136) * barrier bug fix * Update auto.py * Remove unused torch.distributed import from packing.py * trigger CI (cherry picked from commit 2e75acf14024f713b618114f6fe90b18de88ce1d)
* Replace manual gate/up split + SiLU + mul with fused silu_and_mul kernel in TritonEPGroupGemm forward/backward (fewer intermediates, one kernel)
  - Fix quack/native EP adapters to forward expert_scores to downstream (was silently dropped, causing incorrect routing score application)
  - Fix quack/native moe_act adapters to accept expert_scores param (was TypeError on 6-arg call from Experts._ep_forward)
  - Add adapter-level regression tests for expert_scores forwarding
* Fix lint: remove moe_act refs from test, fix ruff formatting
  moe_act was removed in PR #85. Update test to only check EP_EXPERT_COMPUTE adapters (quack/native). Fix assert formatting for ruff.
  Made-with: Cursor
* Fix ruff lint: remove unused var, fix import order, format
  - Remove unused `I = ctx.intermediate_size` in TritonEPGroupGemm.backward
  - Move import_utils import to top-level in test file
  - Fix import sorting and trailing blank lines
  Made-with: Cursor
(cherry picked from commit 7989ab38a6fd7a16c1a88918ab717650ecec6f07)
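For reference, the unfused computation that the silu_and_mul kernel replaces looks like this (a plain-Python sketch of the math, not the Triton kernel):

```python
import math

# Sketch: the gate_up activation holds [gate | up] concatenated along the
# last dim. The unfused path splits it, applies SiLU to the gate half, and
# multiplies elementwise; the fused kernel does all three in one pass.

def silu(x):
    return x / (1.0 + math.exp(-x))

def silu_and_mul(gate_up_row):
    """gate_up_row = [gate | up]; returns silu(gate) * up."""
    i = len(gate_up_row) // 2
    gate, up = gate_up_row[:i], gate_up_row[i:]
    return [silu(g) * u for g, u in zip(gate, up)]

out = silu_and_mul([0.0, 1.0, 2.0, 3.0])  # gate=[0, 1], up=[2, 3]
```

Fusing avoids materializing the split gate/up halves and the SiLU intermediate, which is where the "fewer intermediates" win comes from.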
Implement Llama (covers Llama 2, 3, 3.1, 3.2, 3.3) as a new model family following the established Qwen3 pattern.
Key differences from Qwen3:
- No QK normalization (replaced with nn.Identity)
- No sliding window attention
- GQA with 8 KV heads (vs Qwen3's 32 MHA heads)
- mlp_bias config support
- Llama 3 defaults (128K vocab, rope_theta=500K, rms_norm_eps=1e-5)
Verified on 8x H100:
- Correctness: cosine sim 0.998 vs HF, 100% token agreement
- FSDP2: 69K tok/s, 30.42 GB peak VRAM
- TP4: 10.91->7.43 loss, 23 GB peak VRAM
- PP2: 10.91->8.90 loss, 37.67 GB peak VRAM
- LoRA (rank 16): 10.91->9.98 loss, 21.42 GB peak VRAM
- QLoRA NVFP4: 13.82->10.67 loss, 21.56 GB peak VRAM
(cherry picked from commit 6f3c5814cd81b7be9bd67490439a89342561ad45)
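The GQA difference called out above can be sketched at the shape level (illustrative head counts; larger Llama variants pair 8 KV heads with more query heads):

```python
# Sketch: in grouped-query attention, each KV head serves a group of
# n_heads // n_kv_heads query heads, shrinking the KV cache accordingly.

def gqa_group_size(n_heads, n_kv_heads):
    """Query heads sharing each KV head (MHA when n_kv_heads == n_heads)."""
    assert n_heads % n_kv_heads == 0, "query heads must divide evenly"
    return n_heads // n_kv_heads
```

For example, 32 query heads with 8 KV heads gives groups of 4 and a 4x smaller KV cache than MHA at the same hidden size.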
* feat(distributed): implement torchtitan eFSDP design for EP+FSDP2
Aligns the EP+FSDP2 implementation with torchtitan's eFSDP design, which
separates the DP-shard dimension into two named sub-meshes and guarantees
correct gradient handling regardless of optimizer choice.
Key changes:
**torch_parallelize.py — _build_ep_param_groups()**
Add this helper and call it at the end of parallelize_model_fsdp2() when
EP is enabled. Previously _ep_param_groups was only set by the EP-aware
optimizer builder, so ep_fsdp2_clip_grad_norm silently fell back to the
non-EP path when using a different optimizer or before optimizer init.
Now it is always set right after FSDP wrapping.
Classification: _skip_fsdp params -> ep group; DTensors on ep_fsdp mesh
-> ep group; everything else -> non_ep group.
**parallel_state.py — torchtitan-aligned mesh properties**
Add dp_shard_mod_ep_mesh/size and dp_shard_in_ep_mesh/size using
torchtitan's naming convention:
  dp_shard_mod_ep = full DP shard mesh (all ep_size x ep_fsdp ranks); FSDP mesh for non-expert params
  dp_shard_in_ep = sub-mesh within one EP group (ep_fsdp ranks); FSDP mesh for expert params
**trainer.py — peak memory logging**
Always print [PEAK_MEMORY] peak_alloc_gb=... to stdout in _finalize().
Parseable by benchmark scripts.
**scripts/benchmark_ep_fsdp_30b.py — eFSDP benchmark**
New script testing EP=1/2/4/8 on 8 GPUs with a 30B-proxy MoE config.
Reports tok/s, TFLOPs/GPU, peak GPU memory, grad norm, speedup per EP.
Supports --model-size full for actual Qwen3-30B-A3B runs on a cluster.
* fix(distributed): align eFSDP gradient division with torchtitan design
Expert gradients were being divided by ep_size twice: once via
set_gradient_divide_factor(ep_size) in torch_parallelize.py and again
via mul_(1/ep_size) in clip_grad_norm.py. This caused ep_size²
over-division of expert gradients, destabilising training at large EP.
torchtitan's design (disable_fsdp_gradient_division) sets the divide
factor to 1.0 for ALL modules including experts, relying on the loss
normalisation for uniform gradient handling. xorl now matches this.
Also adds EP=8+eFSDP=2 benchmark config/k8s manifests and fixes the
EP=16 reference config (correct dataset_prepared_path on shared PVC,
non-login shell to avoid .bashrc env issues in K8s pods).
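The double-division can be checked with simple arithmetic; `expert_grad_scale` is a hypothetical helper modeling the net scaling an expert gradient receives:

```python
# Sketch: FSDP divides reduced gradients by its divide factor, and the old
# clip path multiplied expert grads by 1/ep_size again. The net scale should
# be 1.0 (loss normalisation handles averaging), not 1/ep_size**2.

def expert_grad_scale(fsdp_divide_factor, clip_path_scale):
    """Net multiplier applied to an expert gradient after reduction."""
    return (1.0 / fsdp_divide_factor) * clip_path_scale

ep_size = 8
buggy = expert_grad_scale(ep_size, 1.0 / ep_size)  # divided twice
fixed = expert_grad_scale(1.0, 1.0)                # factor 1.0 everywhere
```

At ep_size=8 the buggy path shrinks expert gradients by 64x relative to non-expert gradients, which is the instability seen at large EP.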
* fix(moe): FP32 scatter_add in MoE combine for EP-size-independent precision
BF16 scatter_add_ in the MoE combine step produces order-dependent
rounding when accumulating top-K expert outputs per token. Different EP
sizes create different accumulation orders (different GPU placements),
causing routing divergence that cascades through 48 MoE layers.
Changes:
- utils.py: FP32 scatter_add in alltoall combine (eliminates 100% of
EP-size divergence for alltoall dispatch, verified bit-identical)
- deepep.py: FP32 scatter_add + index_add in DeepEP combine/backward
(reduces divergence; residual diff from DeepEP's two-stage architecture)
- torch_parallelize.py: improved comments explaining why factor=1.0 is
correct for both expert and non-expert FSDP gradient division
- clip_grad_norm.py: add per-group (expert vs non-expert) gradient norm
logging for debugging gradient scale issues
- trainer.py: add diagnostic tools (_save_moe_diagnostics,
_save_gradient_diagnostics, _save_logprob_diagnostics) for comparing
MoE routing indices, hidden states, and gradients across EP configs
Verification:
- alltoall + FP32 scatter_add: 0.0000% output diff, 100% routing match
across EP4/EP8/EP16 at all 48 layers (bit-identical)
- DeepEP + FP32 scatter_add: 0.29% residual diff at layer 0 from
NVSHMEM pre-accumulation of partial sums (architectural limitation)
- Expert gradient norms identical across EP configs (5.647 vs 5.650)
- No measurable throughput overhead from FP32 scatter_add
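A toy model of the order-dependent rounding this fixes: `round_sig` is a stand-in for a low-precision cast (keeping a few significant digits, loosely analogous to BF16's short mantissa), not the real format.

```python
from math import floor, log10

def round_sig(x, sig=2):
    """Stand-in for a low-precision cast: keep `sig` significant digits."""
    if x == 0.0:
        return 0.0
    return round(x, sig - 1 - floor(log10(abs(x))))

def combine_lowp(expert_outputs):
    """BF16-like scatter_add: rounds after every accumulation step,
    so the result depends on arrival order."""
    acc = 0.0
    for v in expert_outputs:
        acc = round_sig(acc + v)
    return acc

def combine_fp32(expert_outputs):
    """FP32 scatter_add: accumulate exactly, cast once at the end."""
    return round_sig(sum(expert_outputs))

ep8_order = [1.0, 0.04, 0.04]    # top-K partial outputs, one GPU placement
ep16_order = [0.04, 0.04, 1.0]   # same values, different arrival order
```

Different EP sizes permute the accumulation order, so the low-precision combine diverges between configs while the wide accumulator does not, which is the mechanism behind the EP4/EP8/EP16 bit-identical result above.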
* feat(moe): configurable DeepEP combine dtype (DEEPEP_COMBINE_DTYPE)
Add support for FP16/FP32 combine precision in DeepEP dispatch, controlled
by the DEEPEP_COMBINE_DTYPE env var ("bf16", "fp16", "fp32").
Changes:
- deepep.py: add _get_combine_dtype() cached helper; cast scatter_add
output to the configured combine dtype before buffer.combine()
- deepep.py: cast combined_x back to model dtype after combine receives
- trainer.py: refactor MoE diagnostics to use hooks on the training
forward pass instead of a separate forward (avoids BF16/FP16 combine
dtype conflict that caused hangs)
Benchmark results (235B, 8-node, EP8+eFSDP8):
BF16 combine: 313 TFLOPs, 18.5s/step (baseline)
FP16 combine: 305 TFLOPs, 19.0s/step (+2.7%)
FP32 combine: 306 TFLOPs, 18.9s/step (+2.2%)
Precision (30B, 2-node, layer 0 output diff EP8 vs EP16):
BF16: 0.292% (16382 tokens differ)
FP16: 0.091% (16382 tokens, 3.2× better)
FP32: 0.000015% (2 tokens, 19500× better)
Requires DeepEP built with FP16/FP32 combine kernel support
(nv_half + float template instantiation in SWITCH_TYPES macro).
* refactor(moe): remove DEEPEP_COMBINE_DTYPE env var
Stick with BF16 combine for DeepEP. The FP16/FP32 combine options
showed minimal benefit at 235B scale (FP16: +24% overhead on EP16,
FP32: crashed EP16) while the FP32 scatter_add already handles
precision for the alltoall path.
The DeepEP combine still uses FP32 scatter_add on the expert GPU
before casting to BF16 for NVSHMEM transfer.
* Add benchmark configs, k8s manifests, diagnostics arg, and eFSDP grad tests
- Add save_step_diagnostics training arg for correctness verification
- Add Qwen3-235B 8-node and Qwen3-Coder-30B 2-node benchmark configs
(EP/eFSDP variants: alltoall, deepep, fp32, correctness, long)
- Add corresponding k8s manifests with NCCL IB and nccl node-group tolerations
- Update existing configs for single-step diagnostics runs
- Add eFSDP gradient scaling tests
* cleanup: remove diagnostics, debug logging, and excess benchmark configs
- Remove save_step_diagnostics arg and all diagnostic methods from trainer
- Remove debug gradient scale logging from clip_grad_norm
- Remove test scripts and benchmark script
- Remove diagnostic/long/fp32/alltoall config variants (keep core EP configs)
- Revert debug settings in base configs (max_steps back to 50)
* fix: merge resolution, train_router default, lint fixes
- Add missing weighted scatter line dropped during merge
- Default MoEBlock.train_router=False (matches args default, fixes deepep)
- Move peak memory logging to debug level
- Remove unused get_lm_head_weight import (ruff)
* style: remove redundant FSDPModule import (ruff)
* Fix DeepEP double-scoring and remove dead PEAK_MEMORY log
deepep.py: Remove score multiplication from _FusedUnpermuteAndCombine.forward().
Router weights are already applied by the expert compute function
(triton/native backends multiply by expert_scores). The previous code
double-applied scores (output * scores^2) and the backward path didn't
account for the factor, creating a gradient mismatch. Keep FP32
accumulation improvement.
trainer.py: Remove [PEAK_MEMORY] debug log that was gated behind
DEBUG level but the default is INFO, making it dead code.
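The double-scoring bug reduces to a small arithmetic sketch (hypothetical functions standing in for the expert compute and combine steps):

```python
# Sketch: expert compute already multiplies its output by the router score,
# so multiplying again in combine scales the result by score**2 and leaves
# the backward pass inconsistent with forward.

def expert_compute(x, score):
    return x * score  # scores applied here, exactly once

def combine(outputs, scores, apply_scores_again=False):
    if apply_scores_again:  # the removed, buggy behavior
        outputs = [o * s for o, s in zip(outputs, scores)]
    return sum(outputs)

scores = [0.6, 0.4]
outs = [expert_compute(2.0, s) for s in scores]
correct = combine(outs, scores)                         # 2.0 * (0.6 + 0.4)
buggy = combine(outs, scores, apply_scores_again=True)  # 2.0 * (0.36 + 0.16)
```

With normalized top-k scores the correct combine recovers the token's value; the buggy path weights experts by squared scores.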
* Address review: train_router default, double-scoring, stale env var
1. MoEBlock constructor default back to train_router=True so direct
callers (tests) still exercise router gradient propagation. Config-
level default remains False via getattr(config, "train_router", False).
2. Remove score multiplication from _FusedUnpermuteAndCombine.forward()
— expert compute already applies scores, this was double-counting.
Backward didn't account for the factor either (gradient mismatch).
3. Remove dead [PEAK_MEMORY] debug log (gated behind DEBUG, default is
INFO).
4. Remove DEEPEP_COMBINE_DTYPE from k8s manifests — env var was removed
in this branch, so these configs no longer represent distinct settings.
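The two-level default in point 1 can be sketched as follows (build_moe_block is a hypothetical wrapper; the real call site differs):

```python
class MoEBlock:
    # Constructor default stays True so direct callers (e.g. tests)
    # still exercise router gradient propagation.
    def __init__(self, train_router: bool = True):
        self.train_router = train_router

def build_moe_block(config) -> MoEBlock:
    # Config-level default remains False when the field is absent.
    return MoEBlock(train_router=getattr(config, "train_router", False))
```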
* Fix LoRA EP path: apply router scores after compute
LoRA EP compute functions (triton/native) don't accept expert_scores,
unlike the non-LoRA versions which multiply by scores internally.
After removing the score multiply from _FusedUnpermuteAndCombine,
the LoRA path had no score application at all.
Fix: apply scores in MoEExpertsLoRA._ep_forward() after compute
returns, matching the non-LoRA path where scores are applied inside
the compute function. Both paths now apply scores exactly once.
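A minimal sketch of the fixed LoRA path (lora_ep_forward is a hypothetical stand-in for MoEExpertsLoRA._ep_forward()):

```python
import torch

def lora_ep_forward(compute_fn, x: torch.Tensor,
                    expert_scores: torch.Tensor) -> torch.Tensor:
    # LoRA EP compute functions do not accept expert_scores, so the
    # scores must be applied exactly once here, after compute returns.
    out = compute_fn(x)
    return out * expert_scores[:, None]
```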
* Remove user-specific paths from benchmark configs and k8s manifests
Replace hardcoded /home/qywu paths with generic /workspace paths
matching the existing manifest conventions. Remove dataset_prepared_path
(not used in other configs) and use relative output_dir paths.
* Add review fixes: tests, ruff, _skip_fsdp guard, rename ep_outside → ep_intranode
- Fix ruff PLC0415: add noqa to DTensor import in _build_ep_param_groups
- Add _skip_fsdp + ep_fsdp_size > 1 assertion (missing eFSDP gradient sync)
- Add TestEPLoRARouterScores: verify score application for alltoall/deepep
contexts, no-scores identity, and gradient flow
- Add test_ep_clip_grad_norm.py: 16 tests covering param classification,
norm computation, clipping, no double-division, _skip_fsdp end-to-end
- Rename ep_outside → ep_intranode (inverted semantics, default True):
clearer name for "EP all-to-all stays within the node (NVLink)"
* Apply ruff-format to new test files
* Block all LoRA experts (not just QLoRA) with ep_fsdp_size > 1
LoRA + EP + eFSDP gradient interaction has not been validated.
Guard both regular LoRA (MoEExpertsLoRA) and QLoRA (_skip_fsdp)
expert modules when ep_fsdp_size > 1 until correctness is verified.
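A sketch of the guard (attribute names assumed; the real check lives in the trainer's argument validation):

```python
def validate_ep_config(args) -> None:
    # LoRA + EP + eFSDP gradient interaction is unvalidated: block both
    # regular LoRA and QLoRA expert modules when ep_fsdp_size > 1.
    if args.ep_fsdp_size > 1 and (args.lora or args.qlora):
        raise ValueError(
            "LoRA/QLoRA experts are not supported with ep_fsdp_size > 1 "
            "until gradient correctness is verified"
        )
```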
(cherry picked from commit d4c08c06f960a47c093e5ae64afb785b17eb8dfc)
The Llama 3 LoRA/QLoRA example configs listed unfused HF-style projection
names (q_proj, k_proj, v_proj, gate_proj, up_proj), but the model keeps
qkv_proj and gate_up_proj fused by default (merge_qkv=True). The LoRA
matcher is literal, so only o_proj and down_proj resolved and the rest
were silently skipped. Switch both configs to the fused names, matching
the Qwen examples.
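With merge_qkv=True, only the fused projection names resolve against the model. A sketch of the corrected target list (the `lora_target_modules` key is an assumption, following common LoRA config conventions):

```yaml
lora_target_modules:
  - qkv_proj       # fused q/k/v projection (merge_qkv=True)
  - o_proj
  - gate_up_proj   # fused gate/up projection
  - down_proj
```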
GateUpMergeBuffer's pattern only matched .weight, so gate_proj.bias /
up_proj.bias passed through unmerged on load while save-side splitting
(substring match on .gate_up_proj.) already handled bias. Widen the
pattern to (weight|bias) and key pending state by (prefix, param_type),
mirroring QKVMergeBuffer. Other consumers gate the match behind
key.endswith(".weight"), so the widened pattern is inert for them.
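A sketch of the widened matcher and the (prefix, param_type) keying (regex text is hypothetical; the real GateUpMergeBuffer pattern differs in detail):

```python
import re

# Before: only .weight rows were routed into the merge buffer,
# so gate_proj.bias / up_proj.bias passed through unmerged on load.
OLD_PATTERN = re.compile(r"\.(gate|up)_proj\.weight$")
# After: biases are merged too, keyed by (prefix, param_type).
NEW_PATTERN = re.compile(r"\.(gate|up)_proj\.(weight|bias)$")

def merge_key(key: str):
    """Return the pending-state key (prefix, param_type), or None."""
    m = NEW_PATTERN.search(key)
    if m is None:
        return None
    return key[: m.start()], m.group(2)
```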
(cherry picked from commit d020c39ae54bae6e8f228b490daebfc3136cdd42)
* Fix circular imports from c02dbf6 (lora/mapping, qlora)
Restore lazy imports that were incorrectly moved to top-level in
"Enforce import-outside-top-level" (#74), creating circular chains:
- lora/mapping.py: move MoEExperts import back into _ensure_moe_mapping()
- qlora/__init__.py: lazy __getattr__ for get_prequantized_exclude_modules
- qlora/utils.py: lazy-import detect_prequantized_* from buffers.py
Co-authored-by: zzz0906 <zzz879978@outlook.com>
* [Feat] Add Qwen2ForCausalLM support for Qwen2.5 models
Native xorl support for Qwen2/Qwen2.5 dense models (e.g. Qwen2.5-14B).
- Qwen2Attention subclasses MultiHeadAttention (attention_bias=True in
config handles QKV bias; only _init_sliding_window overridden)
- Qwen2Config: Qwen2.5-14B defaults, rope_theta-in-rope_scaling handling,
attention_bias=True, bos/eos token IDs
- auto.py: Qwen2 config loading in _load_local_xorl_config
- Checkpoint handler reuses Qwen3's (identical merge/split logic)
- Benchmark config for 1-node 8xH100 with Ulysses CP=8
Verified argmax-exact match vs HF Qwen2ForCausalLM on Qwen2.5-14B.
Tested: TP unfuse, PP=2, LoRA, QLoRA NVFP4.
Co-authored-by: zzz0906 <zzz879978@outlook.com>
* Update output_dir path in qwen2_5_14b_muon_1node.yaml
* Fix Qwen2 fused HF checkpoint compatibility
* Fix Qwen2 mixed sliding-window attention to match HF per-layer masking
Two correctness issues with eager attention on mixed full/sliding-window
Qwen2 configs:
1. The _init_sliding_window fallback (layer_idx >= max_window_layers)
could override explicit layer_types entries, making full_attention
layers use sliding window. Now treats layer_types as authoritative
when present.
2. Qwen2Model.forward built one global causal mask from
config.sliding_window and reused it for every layer. Now builds
separate full-attention and sliding-attention masks and selects per
layer via config.layer_types, matching HF's causal_mask_mapping
approach.
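The per-layer selection described above can be sketched as (function name hypothetical):

```python
def select_layer_mask(layer_idx, layer_types, max_window_layers,
                      full_mask, sliding_mask):
    # layer_types is authoritative when present; the
    # max_window_layers fallback applies only when it is absent.
    if layer_types is not None:
        kind = layer_types[layer_idx]
        return sliding_mask if kind == "sliding_attention" else full_mask
    return sliding_mask if layer_idx >= max_window_layers else full_mask
```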
* Simplify attention bias: use attention_bias for QKV, hardcode output bias=False
Remove attention_qkv_bias and attention_output_bias indirection from
MultiHeadAttention and Qwen2Config. All current models have output
bias=False; QKV bias is driven by config.attention_bias directly.
* Fix duplicate MoE import in lora/mapping.py after merge with main
---------
Co-authored-by: zzz0906 <zzz879978@outlook.com>
Co-authored-by: Ashwinee Panda <apanda@together.ai>
(cherry picked from commit 794a16cc5d1361161e39d37753a26f5a862f33f0)
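A minimal sketch of the simplified bias wiring (module shown standalone with a bare flag; the real MultiHeadAttention takes a config object):

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    # QKV bias driven directly by attention_bias; output projection
    # bias hardcoded to False, matching all current models.
    def __init__(self, hidden: int, num_heads: int, attention_bias: bool):
        super().__init__()
        assert hidden % num_heads == 0
        self.qkv_proj = nn.Linear(hidden, 3 * hidden, bias=attention_bias)
        self.o_proj = nn.Linear(hidden, hidden, bias=False)
```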
* Remove worker_port from inference endpoint registration
* style: ruff-format collapse requests.post call in password test
The multi-line add_inference_endpoint POST fits on one line under
ruff-format's line-length budget.
---------
Co-authored-by: Qingyang Wu <qingyang@together.ai>
(cherry picked from commit 9c168c2ff8cebc904a66d47058e9de16508fc0fe)
* Fix all CPU test failures and collection errors
- Add missing `List` import in `count_flops.py`
- Add `pytest-asyncio` to test dependencies
- Add `make_try_again_response` / `make_failed_response` helpers to `future_store.py`
- Fix `validate_model_id` import path in `test_checkpoint_paths.py`
- Guard `MOE_EXPERT_BACKENDS_MOE_ACT` import in `test_moe_act.py`
- Fix mock missing `ringattn_parallel_size` / `ulysses_parallel_size` in `test_packing.py`
- Fix HF datasets cache permission error in `test_shared.py`
- Fix triton EP calling convention and remove nonexistent MoeAct params in `test_ep_routing_scores.py`
- Remove unimplemented `test_moe_routing_cache.py` and unavailable flash-attn test
* style: fix ruff-format in test_attention.py
* refactor: move helper functions from future_store.py to test file
Move make_try_again_response and make_failed_response into the test file
instead of adding them to production code. Reverts the src change.
(cherry picked from commit b4a0aa8d01e2ed4cda1dc0eab7401a88079e823c)
(cherry picked from commit dadbd2c747c70c303e27fd6709cd3b6fa17867a9)
(cherry picked from commit 4a74e79af8f9750447a0e3dd47bcb1dc422e6440)
Or from source:

```bash
pip install git+https://github.com/togethercomputer/xorl-client.git
```
Static Code Analysis Risk: Vulnerable and Outdated Components - Python external installer
A Python package is being installed from a non-standard, potentially untrusted source instead of the official PyPI registry. The flagged command installs from a URL (including Git or SSH sources), a local filesystem path, an archive file (.tar.gz, .zip, .whl), or uses --find-links to pull from an alternative package index. Note that installations via a requirements file (pip install -r requirements.txt) are not flagged by this rule.
If an attacker controls or compromises the external source, they can inject malicious code into the package, leading to arbitrary code execution in your build or deployment environment. This risk is especially high in CI/CD pipelines where install commands often run with elevated privileges.
Recommendation: Install from PyPI with pinned versions by replacing the external source with pip install <package>==<version> so the package comes from the official registry. If stronger guarantees are needed, list dependencies in a requirements.txt with --require-hashes (e.g., pip install --require-hashes -r requirements.txt), noting that --require-hashes only works with requirements files, not inline pip install commands. If an external source is unavoidable, pin Git-based installs to a specific commit or tag (e.g., git+https://example.com/repo.git@<commit-hash>) and verify package integrity using --hash on each entry in your requirements file. Finally, consider setting up a private package index that mirrors only approved packages rather than pointing --find-links at external URLs, avoiding reliance on sources outside your organization's control.
Severity: High 🚨
Status: Open 🔴
Summary
39 commits spanning new model support, MoE performance, LoRA/QLoRA improvements, new optimizers and loss functions, server/weight-sync fixes, and test/tooling improvements.
Models
MoE
LoRA / QLoRA
Optimizers & loss functions
Attention & kernels
Distributed training
Weight sync & checkpointing
Server / API
Data
Quality, testing & tooling
Test plan