v0.4.0

@tpx818 released this 16 Jun 08:57 · 43 commits to release/0.4.0 since this release

Highlights

  • Add initial DeepSeek V4 support, covering Flash FSDP2 + EP training and DeepSeek V4 tool-call parsing and cleanup in #190 and #218
  • Expand Qwen3.5 training with padding-free / packed-sequence support and Qwen3.5 MoE GatedDeltaNet sequence-parallel support in #186 and #222
  • Add Gemma 4 multimodal training support in #199
  • Strengthen LoRA training with rsLoRA for Multi-LoRA, FSDP2 support for Multi-LoRA SFT, and Expert Parallelism LoRA SFT examples for DeepSeek V4 and Qwen3.5 MoE in #187, #155, and #198
  • Improve NPU acceleration and stability with fused operators, Qwen3.5 FLA patches, Group MatMul EP scoping, and sequence-parallel compatibility fixes in #194, #204, #205, #206, and #208
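
For readers unfamiliar with the padding-free / packed-sequence idea mentioned above, the following is a minimal, framework-free sketch of the general technique: variable-length samples are concatenated into one flat sequence, position ids restart for each sample, and cumulative sequence lengths mark the boundaries so attention can stay within each sample. The function name `pack_sequences` and its return layout are assumptions made for illustration; they are not taken from this project's code.

```python
from itertools import accumulate


def pack_sequences(sequences):
    """Concatenate variable-length token lists without padding.

    Hypothetical helper for illustration only; not this project's API.
    """
    # Flatten all samples into a single token stream (no pad tokens).
    input_ids = [tok for seq in sequences for tok in seq]
    # Position ids restart at 0 at the start of every sample.
    position_ids = [i for seq in sequences for i in range(len(seq))]
    # Sample k occupies the half-open range [cu_seqlens[k], cu_seqlens[k + 1]).
    cu_seqlens = [0, *accumulate(len(seq) for seq in sequences)]
    return input_ids, position_ids, cu_seqlens


if __name__ == "__main__":
    ids, pos, cu = pack_sequences([[5, 6, 7], [8, 9]])
    print(ids, pos, cu)
```

Variable-length attention kernels typically consume boundaries in this cumulative form, which is why the sketch returns them alongside the flattened tokens.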

New Features

  • Add padding-free and packed-sequence support for Qwen3.5 by @meichangsu1 in #186
  • Add rsLoRA support to Multi-LoRA by @xichengpro in #187
  • Add FSDP2 support for Multi-LoRA SFT by @kevssim in #155
  • Add DeepSeek V4 Flash FSDP2 + EP training support by @meichangsu1 in #190
  • Add NPU fused operators: RMSNorm, RoPE, SwiGLU, and SDPA by @ys2025-AI in #194
  • Add multi-turn rollout support by @tastelikefeet in #193
  • Add support for client-specified checkpoint saving paths by @vx120 in #196
  • Add LoRA SFT support for Expert Parallelism, with DeepSeek V4 and Qwen3.5 MoE examples by @kevssim in #198
  • Add Qwen3.5 NPU FLA and fused-operator patches by @ys2025-AI in #204
  • Add LoRA capacity query support by @kevssim in #201
  • Optimize Native FSDP memory_efficient_init weight loading for multi-node EP/FSDP jobs and add multi-node scripts by @meichangsu1 in #207
  • Add Gemma 4 support by @EvineR666 in #199
  • Add DeepSeek V4 tool-call parsing and cleanup support by @meichangsu1 in #218
  • Add Gemma 4 12B cookbook by @EvineR666 in #219
  • Add automatic device detection by @vx120 in #220
  • Add Qwen3.5 MoE GatedDeltaNet sequence-parallel support by @meichangsu1 in #222
  • Refactor server configuration and observability by @Yunnglin in #210
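
As background for the rsLoRA entry above: rank-stabilized LoRA changes only the adapter scaling factor, from alpha / r to alpha / sqrt(r), so the update magnitude does not shrink as quickly when the rank grows. The sketch below shows that difference in isolation; `lora_scaling` and its parameters are hypothetical names chosen for illustration and do not reflect how this project exposes the option.

```python
import math


def lora_scaling(alpha, rank, use_rslora=False):
    """Return the factor applied to the low-rank update B @ A.

    Hypothetical helper for illustration only; not this project's API.
    """
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    if use_rslora:
        # rsLoRA: scale by alpha / sqrt(rank).
        return alpha / math.sqrt(rank)
    # Standard LoRA: scale by alpha / rank.
    return alpha / rank


if __name__ == "__main__":
    for r in (8, 64):
        print(r, lora_scaling(16, r), lora_scaling(16, r, use_rslora=True))
```

With alpha fixed, the standard factor falls by 8x when the rank goes from 8 to 64, while the rsLoRA factor falls by only about 2.8x.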

Bug Fixes

  • Fix cache reset behavior for multimodal models by @hjh0119 in #189
  • Fix Qwen3.5 GatedDeltaNet padding-free compatibility and create_causal_mask compatibility after cache_positions removal in transformers >5.3.0 by @meichangsu1 in #202
  • Fix transformers 5.9 AttentionMask wrapper compatibility in sequence parallel by @ys2025-AI in #206
  • Fix SP path overriding the NPU-patched chunk_gated_delta_rule by @ys2025-AI in #208
  • Fix NPU Group MatMul patch scope so it only applies in EP scenarios by @0hujun in #205
  • Fix adapter saving to use the MultiLora state dict by @meichangsu1 in #215

New Contributors

Full Changelog: https://github.com/modelscope/twinkle/commits/v0.4.0