Making an issue to track expected work for DeepSeek experimental:
1 - Integrate DeepGEMM support (contiguous) as an additional inference option - this uses groupwise/blockwise fp8 quantization (see the sketch after this list) - completed (#1124)
1A - add a Triton contiguous grouped GEMM (AMD compatible) - completed (#1154)
2 - refactor token processing to avoid code duplication - PR #1127
3 - add proper training loop support - initial working PR landed (see train_ds_real.py).
4 - need basic unit tests for check-ins
5 - review the AMD port of Symmetric Memory (PR merged into PyTorch core; need to verify it runs on AMD).
6 - finalize which grouped GEMMs we want to support long term (torch bf16 + DeepSeek's DeepGEMM for fp8? AMD?).
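For context on items 1 and 1A, two minimal sketches of the pieces involved. First, DeepGEMM-style blockwise fp8 quantization of a weight matrix, with one fp32 scale per 128x128 tile; the helper name, default block size, and shapes are illustrative, not the repo's actual API:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_blockwise_fp8(w: torch.Tensor, block: int = 128):
    # Hypothetical helper: quantize a 2D weight to fp8 with one fp32 scale
    # per (block x block) tile, as DeepGEMM-style kernels expect.
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    # View as (row_blocks, block, col_blocks, block) tiles.
    tiles = w.view(rows // block, block, cols // block, block)
    # One absmax-derived scale per tile, kept in fp32 for accuracy.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = (amax / FP8_MAX).float()
    w_fp8 = (tiles / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8.view(rows, cols), scale.view(rows // block, cols // block)

w = torch.randn(256, 512)
w_fp8, scales = quantize_blockwise_fp8(w)
print(w_fp8.dtype, scales.shape)  # torch.float8_e4m3fn (2, 4)
```

Second, a plain-PyTorch reference of what a "contiguous" grouped GEMM computes (the torch bf16 option in item 6): tokens sorted by expert into one contiguous activation tensor, multiplied against a per-expert weight. Real kernels (DeepGEMM, the Triton version from #1154) fuse this loop into a single launch; this is a sketch, not the actual kernel interface:

```python
def grouped_gemm_reference(x: torch.Tensor, w: torch.Tensor,
                           group_sizes: torch.Tensor) -> torch.Tensor:
    # x: (total_tokens, K), tokens for each expert stored contiguously.
    # w: (num_experts, K, N), one weight matrix per expert.
    # group_sizes: (num_experts,), token counts summing to total_tokens.
    out = x.new_empty(x.shape[0], w.shape[2])
    start = 0
    for g, size in enumerate(group_sizes.tolist()):
        out[start:start + size] = x[start:start + size] @ w[g]
        start += size
    return out

x = torch.randn(10, 64, dtype=torch.bfloat16)
w = torch.randn(3, 64, 32, dtype=torch.bfloat16)
out = grouped_gemm_reference(x, w, torch.tensor([4, 2, 4]))
print(out.shape)  # (10, 32)
```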
Updates -
- fix for the torch.group_gemm hang landed (#1166), so this now has full training support.
- torch._scaled_mm with wrappers via torchao, and thus fp8 rowwise, has been added for DeepSeek inference (#1142) - see the sketch below.
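As a rough illustration of that fp8 rowwise path, the call below goes straight at the private torch._scaled_mm op with per-row/per-column scales; in the actual integration the torchao wrappers produce these tensors. A sketch under stated assumptions: it needs a recent PyTorch and fp8-capable CUDA hardware (rowwise scales currently target H100-class GPUs).

```python
import torch

M, K, N = 16, 64, 32  # _scaled_mm wants dims in multiples of 16
a = torch.randn(M, K, device="cuda")
b = torch.randn(K, N, device="cuda")

# Rowwise recipe: one scale per row of A, one per column of B, in fp32.
fp8_max = torch.finfo(torch.float8_e4m3fn).max
scale_a = a.abs().amax(dim=1, keepdim=True).float() / fp8_max  # (M, 1)
scale_b = b.abs().amax(dim=0, keepdim=True).float() / fp8_max  # (1, N)

a_fp8 = (a / scale_a).to(torch.float8_e4m3fn)
# The second operand must be column-major; the transpose trick gives that layout.
b_fp8 = (b / scale_b).to(torch.float8_e4m3fn).t().contiguous().t()

out = torch._scaled_mm(a_fp8, b_fp8, scale_a=scale_a, scale_b=scale_b,
                       out_dtype=torch.bfloat16)
print(out.shape)  # (16, 32)
```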
7 - implement stats tracking for experts (exflow optimization) and subsequently more efficient expert placement. Update: initial token tracking is in place for topk==1; need to expand it to topk==6 (see the sketch after this list).
8 - large-scale training runs to prove out everything.
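For item 7, a minimal sketch of how the token tracking generalizes from topk==1 to topk==6: flatten the router's topk expert indices and bincount them, so the same counter works for any topk. Names and shapes are illustrative, not the tracker actually in the tree:

```python
import torch

NUM_EXPERTS = 64

def update_expert_counts(counts: torch.Tensor,
                         topk_indices: torch.Tensor) -> torch.Tensor:
    # counts: (NUM_EXPERTS,) running total of tokens routed to each expert.
    # topk_indices: (num_tokens, topk) expert ids from the router; flattening
    # before bincount makes topk == 1 and topk == 6 identical code paths.
    counts += torch.bincount(topk_indices.flatten(), minlength=NUM_EXPERTS)
    return counts

counts = torch.zeros(NUM_EXPERTS, dtype=torch.long)
topk_indices = torch.randint(0, NUM_EXPERTS, (1024, 6))  # e.g., topk == 6
counts = update_expert_counts(counts, topk_indices)
print(counts.sum())  # 1024 * 6 total expert assignments
```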