[DeepSeek MoE] current workstream planning #1125

@lessw2020

Description

Filing this issue to track the expected work for DeepSeek experimental:

1 - Integrate DeepGEMM (contiguous) support as an additional inference option; this uses groupwise/blockwise fp8 quantization (a minimal quantization sketch follows this list). Completed (#1124).
1A - Add a Triton contiguous grouped GEMM (AMD compatible). Completed (#1154).
2 - Refactor token processing to avoid code duplication. PR #1127.
3 - Add proper training loop support. Initial working PR landed (see train_ds_real.py).
4 - Add basic unit tests for check-ins.
5 - Review the AMD port of Symmetric Memory (PR merged into PyTorch core; still need to verify a run on AMD).
6 - Finalize which grouped GEMMs we want to support long term (torch bf16 + DeepSeek for fp8?). AMD?
Updates:
Fix for the torch grouped GEMM hang (#1166), so this now has full training support.
torch._scaled_mm with wrappers via torchAO, and thus fp8 rowwise, has been added for DeepSeek inference (#1142); a rowwise-scaling sketch is also included after this list.
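
As context for item 1, here is a minimal sketch of the groupwise/blockwise fp8 quantization the contiguous DeepGEMM path consumes: activations scaled per 1x128 group along the inner dimension, weights per 128x128 block. The function names and the 128 tile size are illustrative assumptions, not the exact code that landed in #1124.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3

def quantize_act_groupwise(x: torch.Tensor, group: int = 128):
    """Quantize activations per 1 x `group` tile along the inner (K) dim (illustrative)."""
    m, k = x.shape
    assert k % group == 0
    xg = x.view(m, k // group, group).float()
    scale = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / FP8_MAX
    x_fp8 = (xg / scale).to(torch.float8_e4m3fn).view(m, k)
    return x_fp8, scale.squeeze(-1)           # scales: [m, k // group]

def quantize_weight_blockwise(w: torch.Tensor, block: int = 128):
    """Quantize a weight matrix per `block` x `block` tile (illustrative)."""
    n, k = w.shape
    assert n % block == 0 and k % block == 0
    wb = w.view(n // block, block, k // block, block).float()
    scale = wb.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-4) / FP8_MAX
    w_fp8 = (wb / scale).to(torch.float8_e4m3fn).view(n, k)
    return w_fp8, scale.view(n // block, k // block)  # scales: [n // block, k // block]
```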
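And for the fp8 rowwise update (#1142), the rough shape of a rowwise-scaled matmul through torch._scaled_mm is sketched below. This only illustrates the pattern, not the torchAO wrapper that actually landed; torch._scaled_mm is a private API whose signature can shift between PyTorch versions, and it needs fp8-capable hardware to execute.

```python
import torch

def fp8_rowwise_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Rowwise-scaled fp8 matmul sketch: a is [M, K], b is [K, N].

    Requires fp8-capable hardware; shown only to illustrate rowwise scaling.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    # Per-row scale for a, per-column scale for b.
    a_scale = a.abs().amax(dim=1, keepdim=True).clamp(min=1e-4).float() / fp8_max  # [M, 1]
    b_scale = b.abs().amax(dim=0, keepdim=True).clamp(min=1e-4).float() / fp8_max  # [1, N]
    a_fp8 = (a / a_scale).to(torch.float8_e4m3fn)
    # _scaled_mm wants the second operand laid out column-major.
    b_fp8 = (b / b_scale).to(torch.float8_e4m3fn).t().contiguous().t()
    return torch._scaled_mm(
        a_fp8, b_fp8,
        scale_a=a_scale, scale_b=b_scale,
        out_dtype=torch.bfloat16,
    )
```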

7 - Implement stats tracking for experts (exflow optimization) and, from that, more efficient expert placement. Update: initial token tracking is in place for topk==1; it needs to be expanded to topk==6 (see the counting sketch below).
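
For item 7, going from topk==1 to topk==6 tracking mostly amounts to counting every (token, expert) assignment rather than only the top-1 choice. A minimal sketch, with a hypothetical function name rather than the existing tracker:

```python
import torch

def count_expert_tokens(topk_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Count how many token slots were routed to each expert.

    topk_ids: [num_tokens, top_k] expert indices from the router
              (top_k == 1 or 6 both work, since all assignments are flattened).
    Returns:  [num_experts] counts, usable as exflow-style placement stats.
    """
    return torch.bincount(topk_ids.reshape(-1), minlength=num_experts)

# Example: 4 tokens routed over 8 experts with top_k == 6.
router_logits = torch.randn(4, 8)
topk_ids = router_logits.topk(k=6, dim=-1).indices
stats = count_expert_tokens(topk_ids, num_experts=8)
```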

8 - Large-scale training runs to prove out everything.
