
Conversation

@varun-sundar-rabindranath (Contributor)
The gpt-oss model has a hidden dim of 2880. At the moment, to transmit tensors of shape [M, 2880] with the DeepEP low-latency kernels, we pad them up to 4096, which is too much padding.

This PR adds 3072 to the hidden sizes supported by the low-latency kernels, so during model execution we only need to pad up to 3072.
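To illustrate the effect, here is a minimal sketch of the "pad up to the next supported hidden size" logic. The list of supported sizes and the helper name are illustrative assumptions, not DeepEP's actual API; check DeepEP's source for the real list.

```python
# Hypothetical supported hidden sizes (illustrative, not DeepEP's real list).
SUPPORTED_HIDDEN_SIZES = [2560, 4096, 5120, 7168]           # before this PR
SUPPORTED_WITH_3072 = sorted(SUPPORTED_HIDDEN_SIZES + [3072])  # after this PR

def padded_hidden(hidden: int, supported: list[int]) -> int:
    """Smallest supported hidden size >= the model's hidden dim."""
    return min(s for s in supported if s >= hidden)

# gpt-oss hidden dim is 2880:
print(padded_hidden(2880, SUPPORTED_HIDDEN_SIZES))  # 4096 -> 1216 padded columns
print(padded_hidden(2880, SUPPORTED_WITH_3072))     # 3072 -> only 192 padded columns
```

With 3072 available, each dispatched row carries 192 columns of padding instead of 1216, cutting the transmitted payload per token accordingly.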

Test

python3 test_low_latency.py --num-processes=2 --hidden 3072
Reserved 2 GPU(s): [7 3] for command execution
Allocating buffer size: 920.127872 MB ...
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:nvshmemt_init:1935: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibgda/ibgda.cpp:nvshmemt_init:3630: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:nvshmemt_init:1935: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibgda/ibgda.cpp:nvshmemt_init:3630: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:282: init failed for transport: IBGDA
[rank 0] Dispatch + combine bandwidth: 102.71 GB/s, avg_t=92.09 us, min_t=89.79 us, max_t=94.69 us
[rank 1] Dispatch + combine bandwidth: 102.51 GB/s, avg_t=92.27 us, min_t=89.95 us, max_t=94.46 us
[rank 1] Dispatch bandwidth: 71.05 GB/s, avg_t=45.44 us | Combine bandwidth: 142.25 GB/s, avg_t=43.80 us
[rank 0] Dispatch bandwidth: 68.22 GB/s, avg_t=47.33 us | Combine bandwidth: 148.64 GB/s, avg_t=41.91 us
[rank 1] Dispatch send/recv time: 26.28 + 6.98 us | Combine send/recv time: 21.07 + 9.52 us
[rank 0] Dispatch send/recv time: 43.67 + 7.09 us | Combine send/recv time: 36.00 + 9.58 us

Ref : vLLM PR using gpt-oss with DeepEP Low Latency kernels vllm-project/vllm#25997

Signed-off-by: Varun Sundar Rabindranath <[email protected]>

@mgoin left a comment:

+1 would be great to get this in 🙏

@sphish sphish merged commit 73b6ea4 into deepseek-ai:main Oct 17, 2025
1 check passed