Add Qwen3.5 FP8 B200 SGLang MTP config #898

Open
ankursingh-nv wants to merge 10 commits into main from qwen-sglang-b200-mtp-fp8

Conversation

@ankursingh-nv
Collaborator

@ankursingh-nv ankursingh-nv commented Mar 9, 2026

Summary

Adds a new benchmark configuration for Qwen3.5-397B-A17B FP8 on B200 using SGLang with MTP (Multi-Token Prediction) via EAGLE speculative decoding.

Changes

New benchmark script: benchmarks/single_node/qwen3.5_fp8_b200_mtp.sh

  • SGLang launch with FP8 quantization and FP8 E4M3 KV cache
  • EAGLE speculative decoding config: num-steps=3, draft-tokens=4, topk=1
  • FlashInfer + TRT-LLM backends (fp8-gemm-backend=flashinfer_trtllm, attention-backend=trtllm_mha, moe-runner-backend=flashinfer_trtllm)
  • FlashInfer allreduce fusion enabled
  • Adaptive scheduler recv interval (10 for CONC < 16, 30 for CONC >= 16)
  • Radix cache disabled, context length dynamically set from ISL + OSL
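
The adaptive recv interval and the derived context length from the list above could be sketched in shell as follows. This is a minimal sketch, not the PR's actual script: the helper names `recv_interval`, `context_length`, and `build_server_args` are hypothetical, the exact-sum context length is an assumption (the real script may add headroom), and the flag spellings are assumed from SGLang's server arguments as named in the summary, so they should be checked against `benchmarks/single_node/qwen3.5_fp8_b200_mtp.sh`.

```shell
# Sketch only: helper names are hypothetical, not taken from the PR's script.

# Scheduler recv interval: 10 at low concurrency, 30 once CONC >= 16.
recv_interval() {
  local conc="$1"
  if [ "$conc" -ge 16 ]; then
    echo 30
  else
    echo 10
  fi
}

# Context length derived from input + output sequence lengths.
# Assumption: an exact sum with no extra headroom.
context_length() {
  local isl="$1" osl="$2"
  echo $(( isl + osl ))
}

# Print the launch flags one token per line (does not start a server).
# Flag spellings are assumed from SGLang's server arguments; verify before use.
build_server_args() {
  local conc="$1" isl="$2" osl="$3"
  printf '%s\n' \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 4 \
    --speculative-eagle-topk 1 \
    --fp8-gemm-backend flashinfer_trtllm \
    --attention-backend trtllm_mha \
    --moe-runner-backend flashinfer_trtllm \
    --enable-flashinfer-allreduce-fusion \
    --scheduler-recv-interval "$(recv_interval "$conc")" \
    --disable-radix-cache \
    --context-length "$(context_length "$isl" "$osl")"
}
```

For example, `build_server_args 32 1024 1024` prints the flags with a recv interval of 30 and a context length of 2048. In a real launch these tokens would presumably be appended to a `python3 -m sglang.launch_server` invocation together with the model path and parallelism settings.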

Config entry: qwen3.5-fp8-b200-sglang-mtp in nvidia-master.yaml

  • Image: lmsysorg/sglang:v0.5.9-cu130
  • Model: Qwen/Qwen3.5-397B-A17B-FP8
  • Single-node, TP=4, EP=1
  • Concurrency sweep: 4–256 across all three sequence-length configs (1k/1k, 1k/8k, 8k/1k)
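
To make the size of the sweep concrete, the grid described above can be enumerated with a short shell sketch. The function name `sweep_grid` is hypothetical, and two points are assumptions not stated in the PR: that "1k" and "8k" mean 1024 and 8192 tokens, and that concurrency doubles from 4 up to 256. The actual schema of `nvidia-master.yaml` is not reproduced here.

```shell
# Sketch only: enumerates the (ISL, OSL, CONC) grid for the config entry.
# Assumptions: "1k" = 1024 tokens, "8k" = 8192 tokens,
# and concurrency doubles from 4 to 256.
sweep_grid() {
  local pair isl osl conc
  for pair in "1024 1024" "1024 8192" "8192 1024"; do
    isl="${pair% *}"   # text before the space
    osl="${pair#* }"   # text after the space
    conc=4
    while [ "$conc" -le 256 ]; do
      echo "isl=$isl osl=$osl conc=$conc tp=4 ep=1"
      conc=$(( conc * 2 ))
    done
  done
}
```

Under these assumptions the sweep has 7 concurrency levels for each of the 3 sequence-length configs, so `sweep_grid | wc -l` reports 21 benchmark points.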

Changelog entry in perf-changelog.yaml

  • Documents the new config addition with spec-decoding details

Test Plan

  • Run e2e benchmarks for qwen3.5-fp8-b200-sglang-mtp to validate server startup and benchmark completion
  • Verify MTP speculative decoding produces correct results across sequence lengths
  • Compare throughput/latency against the non-MTP qwen3.5-fp8-b200-sglang baseline
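
For the baseline comparison in the last step, a small helper can turn two throughput measurements into a speedup ratio. This is only a sketch: the function name `speedup` is hypothetical, and it assumes both numbers are in the same unit (for example tokens/s) and were measured at the same concurrency and sequence-length config.

```shell
# Sketch only: relative throughput of the MTP run versus the non-MTP baseline.
# Usage: speedup <mtp_throughput> <baseline_throughput>
speedup() {
  local mtp="$1" base="$2"
  awk -v m="$mtp" -v b="$base" 'BEGIN {
    if (b <= 0) {
      print "baseline must be positive" > "/dev/stderr"
      exit 1
    }
    printf "%.2f\n", m / b
  }'
}
```

A ratio above 1.00 means the MTP config was faster than the baseline at that point of the sweep; a ratio below 1.00 at high concurrency would be a signal that the speculative settings need tuning there.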

@github-actions
Contributor

github-actions bot commented Mar 9, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you


Contributor

@functionstackx functionstackx left a comment

.

@ankursingh-nv ankursingh-nv force-pushed the qwen-sglang-b200-mtp-fp8 branch 2 times, most recently from 5de6e8c to aeecb6b, on March 10, 2026 20:06
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

update pr number

fix max running request
@ankursingh-nv ankursingh-nv force-pushed the qwen-sglang-b200-mtp-fp8 branch from aeecb6b to 83fafdd on March 11, 2026 19:26
@cquil11
Collaborator

cquil11 commented Mar 18, 2026

@claude add PR description

@Klaud-Cold
Collaborator

Klaud-Cold commented Mar 18, 2026

Claude finished @cquil11's task in 49s.

Adding PR Description

  • Read changed files to understand the PR
  • Write and update PR description

PR description has been added. It covers the new benchmark script, the nvidia-master.yaml config entry, and the changelog update.

@ankursingh-nv ankursingh-nv enabled auto-merge (squash) March 18, 2026 23:59
@ankursingh-nv ankursingh-nv changed the title from "[WIP] Add Qwen3.5 FP8 B200 SGLang MTP config" to "Add Qwen3.5 FP8 B200 SGLang MTP config" on Mar 19, 2026
@ankursingh-nv
Collaborator Author

@jgangani @functionstackx can you guys please review and approve the PR?

@ankursingh-nv ankursingh-nv disabled auto-merge March 19, 2026 16:07
@jgangani
Collaborator

> @jgangani @functionstackx can you guys please review and approve the PR?

@Ankur-singh Looks good to me. It would be good to revisit later whether fewer than 4 EAGLE draft tokens work better at higher concurrencies.
