Highlights
- Add initial DeepSeek V4 support, covering Flash FSDP2 + EP training and tool-call parsing and cleanup, in #190 and #218
- Expand Qwen3.5 training with padding-free / packed-sequence support and Qwen3.5 MoE GatedDeltaNet sequence-parallel support in #186 and #222
- Add Gemma 4 multimodal training support in #199
- Strengthen LoRA training with rsLoRA for Multi-LoRA, FSDP2 support for Multi-LoRA SFT, and Expert Parallelism LoRA SFT examples for DeepSeek V4 and Qwen3.5 MoE in #187, #155, and #198
- Improve NPU acceleration and stability with fused operators, Qwen3.5 FLA patches, Group MatMul EP scoping, and sequence-parallel compatibility fixes in #194, #204, #205, #206, and #208
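
For readers new to padding-free / packed-sequence training, the general idea can be sketched as follows. This is a minimal illustration of the technique, not Twinkle's implementation: the function name `pack_sequences` and its return layout are hypothetical. Variable-length samples are concatenated into one flat sequence, position ids restart at each sample, and cumulative boundaries record where each sample begins and ends so attention can be kept from crossing samples.

```python
from itertools import accumulate


def pack_sequences(sequences):
    """Concatenate variable-length token lists without padding.

    Hypothetical helper for illustration only; not part of any library API.
    """
    # Flatten all samples into a single token stream.
    input_ids = [tok for seq in sequences for tok in seq]
    # Position ids restart at 0 for every sample.
    position_ids = [i for seq in sequences for i in range(len(seq))]
    # Cumulative sequence lengths mark sample boundaries: [0, len0, len0+len1, ...].
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return input_ids, position_ids, cu_seqlens


input_ids, position_ids, cu_seqlens = pack_sequences([[11, 12, 13], [21, 22], [31]])
print(input_ids)     # [11, 12, 13, 21, 22, 31]
print(position_ids)  # [0, 1, 2, 0, 1, 0]
print(cu_seqlens)    # [0, 3, 5, 6]
```

No pad tokens are stored, so the packed batch holds exactly as many tokens as the samples contain.
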
New Features
- Add padding-free and packed-sequence support for Qwen3.5 by @meichangsu1 in #186
- Add rsLoRA support to Multi-LoRA by @xichengpro in #187
- Add FSDP2 support for Multi-LoRA SFT by @kevssim in #155
- Add DeepSeek V4 Flash FSDP2 + EP training support by @meichangsu1 in #190
- Add NPU fused operators: RMSNorm, RoPE, SwiGLU, and SDPA by @ys2025-AI in #194
- Add multi-turn rollout support by @tastelikefeet in #193
- Add support for client-specified server-side checkpoint saving paths by @vx120 in #196
- Add LoRA SFT support for Expert Parallelism, with DeepSeek V4 and Qwen3.5 MoE examples by @kevssim in #198
- Add Qwen3.5 NPU FLA and fused-operator patches by @ys2025-AI in #204
- Add LoRA capacity query support by @kevssim in #201
- Optimize Native FSDP memory_efficient_init weight loading for multi-node EP/FSDP jobs and add multi-node scripts by @meichangsu1 in #207
- Add Gemma 4 support by @EvineR666 in #199
- Add DeepSeek V4 tool-call parsing and cleanup support by @meichangsu1 in #218
- Add Gemma 4 12B cookbook by @EvineR666 in #219
- Add automatic device detection by @vx120 in #220
- Add Qwen3.5 MoE GatedDeltaNet sequence-parallel support by @meichangsu1 in #222
- Refactor server configuration and observability by @Yunnglin in #210
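
As background for the rsLoRA entry above: rank-stabilized LoRA differs from standard LoRA only in how the adapter update is scaled. Standard LoRA multiplies the low-rank update by `alpha / r`, while rsLoRA uses `alpha / sqrt(r)` so the update does not shrink as quickly at higher ranks. The sketch below shows that difference; the function name `lora_scaling` and the `use_rslora` flag are illustrative assumptions, not Twinkle's actual interface.

```python
import math


def lora_scaling(alpha, rank, use_rslora=False):
    """Return the factor applied to the low-rank update B @ A.

    Hypothetical helper for illustration only.
    """
    if rank <= 0:
        raise ValueError("rank must be positive")
    if use_rslora:
        # rsLoRA: scale by alpha / sqrt(r).
        return alpha / math.sqrt(rank)
    # Standard LoRA: scale by alpha / r.
    return alpha / rank


print(lora_scaling(16, 64))                   # 0.25
print(lora_scaling(16, 64, use_rslora=True))  # 2.0
```
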
Bug Fixes
- Fix cache reset behavior for multimodal models by @hjh0119 in #189
- Fix Qwen3.5 GatedDeltaNet padding-free compatibility and create_causal_mask compatibility after cache_positions removal in transformers >5.3.0 by @meichangsu1 in #202
- Fix transformers 5.9 AttentionMask wrapper compatibility in sequence parallel by @ys2025-AI in #206
- Fix SP path overriding the NPU-patched chunk_gated_delta_rule by @ys2025-AI in #208
- Fix NPU Group MatMul patch scope so it only applies in EP scenarios by @0hujun in #205
- Fix adapter saving to use the MultiLora state dict by @meichangsu1 in #215
New Contributors
- @tpx818 made their first contribution in #65
- @wangxingjun778 made their first contribution in #68
- @hzher made their first contribution in #92
- @xichengpro made their first contribution in #123
- @vx120 made their first contribution in #118
- @0hujun made their first contribution in #183
- @a550580874 made their first contribution in #176
- @ys2025-AI made their first contribution in #194
- @EvineR666 made their first contribution in #199
Full Changelog: https://github.com/modelscope/twinkle/commits/v0.4.0