Making an issue to track expected work for DeepSeek experimental:
1 - Integrate DeepGEMM support (contiguous) as an additional inference option - this uses groupwise/blockwise fp8 quantization (see the sketch after this list) - completed (#1124)
1A - add a Triton contiguous grouped GEMM (AMD compatible) - completed (#1154)
2 - refactor token processing to avoid code duplication - PR #1127
3 - add proper training loop support - initial working PR landed (see train_ds_real.py).
4 - need basic unit tests for check-ins
5 - review the AMD port of Symmetric Memory (PR merged into PyTorch core; need to verify it runs on AMD).
6 - finalize which grouped GEMMs we want to support long term (torch bf16 + DeepSeek's DeepGEMM for fp8? AMD?).
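For context on items 1 and 1A, two minimal sketches of the pieces involved. First, DeepGEMM-style blockwise fp8 quantization of a weight matrix, with one fp32 scale per 128x128 tile; the helper name, default block size, and shapes are illustrative, not the repo's actual API:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_blockwise_fp8(w: torch.Tensor, block: int = 128):
    # Hypothetical helper: quantize a 2D weight to fp8 with one fp32 scale
    # per (block x block) tile, as DeepGEMM-style kernels expect.
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    # View as (row_blocks, block, col_blocks, block) tiles.
    tiles = w.view(rows // block, block, cols // block, block)
    # One absmax-derived scale per tile, kept in fp32 for accuracy.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = (amax / FP8_MAX).float()
    w_fp8 = (tiles / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8.view(rows, cols), scale.view(rows // block, cols // block)

w = torch.randn(256, 512)
w_fp8, scales = quantize_blockwise_fp8(w)
print(w_fp8.dtype, scales.shape)  # torch.float8_e4m3fn (2, 4)
```

Second, a plain-PyTorch reference of what a "contiguous" grouped GEMM computes (the torch bf16 option in item 6): tokens sorted by expert into one contiguous activation tensor, multiplied against a per-expert weight. Real kernels (DeepGEMM, the Triton version from #1154) fuse this loop into a single launch; this is a sketch, not the actual kernel interface:

```python
def grouped_gemm_reference(x: torch.Tensor, w: torch.Tensor,
                           group_sizes: torch.Tensor) -> torch.Tensor:
    # x: (total_tokens, K), tokens for each expert stored contiguously.
    # w: (num_experts, K, N), one weight matrix per expert.
    # group_sizes: (num_experts,), token counts summing to total_tokens.
    out = x.new_empty(x.shape[0], w.shape[2])
    start = 0
    for g, size in enumerate(group_sizes.tolist()):
        out[start:start + size] = x[start:start + size] @ w[g]
        start += size
    return out

x = torch.randn(10, 64, dtype=torch.bfloat16)
w = torch.randn(3, 64, 32, dtype=torch.bfloat16)
out = grouped_gemm_reference(x, w, torch.tensor([4, 2, 4]))
print(out.shape)  # (10, 32)
```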
Updates -
- fix for the torch.group_gemm hang landed (#1166), so this now has full training support.
- torch._scaled_mm with wrappers via torchao, and thus fp8 rowwise, has been added for DeepSeek inference (#1142) - see the sketch below.
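As a rough illustration of that fp8 rowwise path, the call below goes straight at the private torch._scaled_mm op with per-row/per-column scales; in the actual integration the torchao wrappers produce these tensors. A sketch under stated assumptions: it needs a recent PyTorch and fp8-capable CUDA hardware (rowwise scales currently target H100-class GPUs).

```python
import torch

M, K, N = 16, 64, 32  # _scaled_mm wants dims in multiples of 16
a = torch.randn(M, K, device="cuda")
b = torch.randn(K, N, device="cuda")

# Rowwise recipe: one scale per row of A, one per column of B, in fp32.
fp8_max = torch.finfo(torch.float8_e4m3fn).max
scale_a = a.abs().amax(dim=1, keepdim=True).float() / fp8_max  # (M, 1)
scale_b = b.abs().amax(dim=0, keepdim=True).float() / fp8_max  # (1, N)

a_fp8 = (a / scale_a).to(torch.float8_e4m3fn)
# The second operand must be column-major; the transpose trick gives that layout.
b_fp8 = (b / scale_b).to(torch.float8_e4m3fn).t().contiguous().t()

out = torch._scaled_mm(a_fp8, b_fp8, scale_a=scale_a, scale_b=scale_b,
                       out_dtype=torch.bfloat16)
print(out.shape)  # (16, 32)
```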
7 - implement stats tracking for experts (exflow optimization) and subsequently more efficient expert placement. Update: initial token tracking is in place for topk==1; need to expand it to topk==6 (see the sketch after this list).
8 - large-scale training runs to prove out everything.
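For item 7, a minimal sketch of how the token tracking generalizes from topk==1 to topk==6: flatten the router's topk expert indices and bincount them, so the same counter works for any topk. Names and shapes are illustrative, not the tracker actually in the tree:

```python
import torch

NUM_EXPERTS = 64

def update_expert_counts(counts: torch.Tensor,
                         topk_indices: torch.Tensor) -> torch.Tensor:
    # counts: (NUM_EXPERTS,) running total of tokens routed to each expert.
    # topk_indices: (num_tokens, topk) expert ids from the router; flattening
    # before bincount makes topk == 1 and topk == 6 identical code paths.
    counts += torch.bincount(topk_indices.flatten(), minlength=NUM_EXPERTS)
    return counts

counts = torch.zeros(NUM_EXPERTS, dtype=torch.long)
topk_indices = torch.randint(0, NUM_EXPERTS, (1024, 6))  # e.g., topk == 6
counts = update_expert_counts(counts, topk_indices)
print(counts.sum())  # 1024 * 6 total expert assignments
```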