Release v0.4.0 · modelscope/twinkle

Add initial DeepSeek V4 support, covering Flash FSDP2 + EP training and tool-call parsing and cleanup, in #190 and #218
Expand Qwen3.5 training with padding-free / packed-sequence support and Qwen3.5 MoE GatedDeltaNet sequence-parallel support in #186 and #222
Add Gemma 4 multimodal training support in #199
Strengthen LoRA training with rsLoRA for Multi-LoRA, FSDP2 support for Multi-LoRA SFT, and Expert Parallelism LoRA SFT examples for DeepSeek V4 and Qwen3.5 MoE in #187, #155, and #198
Improve NPU acceleration and stability with fused operators, Qwen3.5 FLA patches, Group MatMul EP scoping, and sequence-parallel compatibility fixes in #194, #204, #205, #206, and #208
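
For readers new to the padding-free / packed-sequence idea mentioned above, here is a minimal sketch of the bookkeeping it usually involves. It is illustrative only and is not taken from twinkle; the function name `pack_sequences` and its return layout are assumptions made for this example. Variable-length samples are concatenated into one flat token list, position ids restart at each sample boundary, and cumulative sequence lengths mark where each sample begins and ends.

```python
from itertools import accumulate


def pack_sequences(seqs):
    """Concatenate variable-length token lists without padding (hypothetical helper)."""
    # Flatten all samples into a single token stream.
    tokens = [t for s in seqs for t in s]
    # Position ids restart at 0 for every sample so positions do not leak across samples.
    position_ids = [i for s in seqs for i in range(len(s))]
    # Cumulative lengths: sample k occupies tokens[cu_seqlens[k]:cu_seqlens[k + 1]].
    cu_seqlens = [0, *accumulate(len(s) for s in seqs)]
    return tokens, position_ids, cu_seqlens


tokens, position_ids, cu_seqlens = pack_sequences([[11, 12, 13], [21, 22], [31]])
print(cu_seqlens)    # [0, 3, 5, 6]
print(position_ids)  # [0, 1, 2, 0, 1, 0]
```

Variable-length attention kernels typically consume boundaries of this kind so that attention stays within each sample, which is why no padding tokens are needed.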

Add padding-free and packed-sequence support for Qwen3.5 by @meichangsu1 in #186
Add rsLoRA support to Multi-LoRA by @xichengpro in #187
Add FSDP2 support for Multi-LoRA SFT by @kevssim in #155
Add DeepSeek V4 Flash FSDP2 + EP training support by @meichangsu1 in #190
Add NPU fused operators: RMSNorm, RoPE, SwiGLU, and SDPA by @ys2025-AI in #194
Add multi-turn rollout support by @tastelikefeet in #193
Add support for client-specified server-side checkpoint save paths by @vx120 in #196
Add LoRA SFT support for Expert Parallelism, with DeepSeek V4 and Qwen3.5 MoE examples by @kevssim in #198
Add Qwen3.5 NPU FLA and fused-operator patches by @ys2025-AI in #204
Add LoRA capacity query support by @kevssim in #201
Optimize Native FSDP memory_efficient_init weight loading for multi-node EP/FSDP jobs and add multi-node scripts by @meichangsu1 in #207
Add Gemma 4 support by @EvineR666 in #199
Add DeepSeek V4 tool-call parsing and cleanup support by @meichangsu1 in #218
Add Gemma 4 12B cookbook by @EvineR666 in #219
Add automatic device detection by @vx120 in #220
Add Qwen3.5 MoE GatedDeltaNet sequence-parallel support by @meichangsu1 in #222
Refactor server configuration and observability by @Yunnglin in #210
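
As background for the rsLoRA item in the list above, the sketch below compares the standard LoRA scaling factor (`alpha / r`) with the rank-stabilized one (`alpha / sqrt(r)`). It is a generic illustration rather than twinkle's implementation; the helper `lora_scale` and its parameters are hypothetical names chosen for this example.

```python
import math


def lora_scale(alpha, rank, use_rslora=False):
    """Return the factor applied to the low-rank update B @ A (hypothetical helper)."""
    # Standard LoRA divides by the rank; rsLoRA divides by its square root,
    # so the update is not shrunk as aggressively when the rank grows.
    return alpha / math.sqrt(rank) if use_rslora else alpha / rank


for rank in (16, 64):
    print(rank, lora_scale(16, rank), lora_scale(16, rank, use_rslora=True))
```

With `alpha = 16`, raising the rank from 16 to 64 cuts the standard factor from 1.0 to 0.25, while the rank-stabilized factor only drops from 4.0 to 2.0.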

Fix cache reset behavior for multimodal models by @hjh0119 in #189
Fix Qwen3.5 GatedDeltaNet padding-free compatibility and create_causal_mask compatibility after cache_positions removal in transformers >5.3.0 by @meichangsu1 in #202
Fix transformers 5.9 AttentionMask wrapper compatibility in sequence parallel by @ys2025-AI in #206
Fix SP path overriding the NPU-patched chunk_gated_delta_rule by @ys2025-AI in #208
Fix NPU Group MatMul patch scope so it only applies in EP scenarios by @0hujun in #205
Fix adapter saving to use the MultiLora state dict by @meichangsu1 in #215

Full Changelog: https://github.com/modelscope/twinkle/commits/v0.4.0