[NV - WIP] Qwen3.5 B200 SGLang FP4 configs #820
Conversation
@claude please add a PR description
Claude finished @kedarpotdar-nv's task in 40s — Adding PR Description
PR description has been added with a summary of all changes.

@claude please review the PR
Claude finished @kedarpotdar-nv's task in 2m 41s — Reviewing PR
Review:
| Area | Verdict |
|---|---|
| Config YAML | ✅ Good |
| Launch script | 🟡 Minor — missing --tokenizer-worker-num, hardcoded scheduler interval |
| Perf changelog | ✅ Good |
The script is functional as-is. The two items flagged (tokenizer workers and dynamic scheduler interval) are nice-to-haves that would align with the BF16 sibling script and could improve low-concurrency latency. Up to you whether to address now or in a follow-up.
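The two flagged nice-to-haves could be sketched as extra server args. This is a minimal sketch only: `--tokenizer-worker-num` comes from the review comment, but the worker count, the scheduler-interval flag name (`--scheduler-recv-interval`), the concurrency cutoff, and the interval values are all assumptions, not settings from this PR.

```shell
# Sketch of the two review suggestions (values are assumptions).
CONCURRENCY=${CONCURRENCY:-32}

# Parallelize tokenization, as in the BF16 sibling script (count assumed).
EXTRA_ARGS="--tokenizer-worker-num 4"

# Derive the scheduler interval from concurrency instead of hardcoding it,
# so low-concurrency runs poll more eagerly (threshold/values assumed).
if [ "$CONCURRENCY" -le 8 ]; then
  SCHEDULER_INTERVAL=1
else
  SCHEDULER_INTERVAL=30
fi
EXTRA_ARGS="$EXTRA_ARGS --scheduler-recv-interval $SCHEDULER_INTERVAL"

echo "$EXTRA_ARGS"
```

These would be appended to the existing server invocation; the point is only that the interval becomes a function of the sweep's concurrency rather than a constant.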
@functionstackx collaborating on this existing PR to get our FP8 and FP4 configs merged upstream in sgl-cookbook. sgl-project/sgl-cookbook#168 Thanks!
Thanks! Excited to see the continued focus on ensuring that end users in the ML community can benefit from these optimized flag recipes for selecting the correct MoE kernel backend and attention kernel backend.
/sweep run only evals

@kedarpotdar-nv Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22507799181
@kedarpotdar-nv some merge conflicts to resolve, then it looks ready to go?
@ankursingh-nv will take over |
Force-pushed from 86e3eec to 8d9f7c9
@claude resolve the merge conflicts & update the PRs to follow the updated conventions. remember that qwen3.5-fp4-b200-sglang should be at the bottom of perf-changelog.yaml, remember to delete 1k/8k from nvidia-master, remember to do the EVAL_CONTEXT_ARGS thing
Claude finished @functionstackx's task in 3m 16s — Resolving Merge Conflicts & Updating Conventions
Summary
- Resolved merge conflicts by rebasing cleanly.
- nvidia-master.yaml — added the config.
- benchmarks/single_node/qwen3.5_fp4_b200.sh — rewritten to follow updated conventions from the fp8 sibling script.
- perf-changelog.yaml — entry added at the bottom.
- Add qwen3.5-fp4-b200-sglang config to nvidia-master.yaml (1k1k, 8k1k)
- Add launch script following updated conventions (dynamic scheduler interval, EVAL_CONTEXT_ARGS, tokenizer-worker-num, conditional allreduce fusion)
- Add perf-changelog entry

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
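One convention named in the commit message, "conditional allreduce fusion", presumably means gating the fusion flag on whether more than one rank participates, since a single-rank run has no allreduce to fuse. A hypothetical sketch; the TP>1 condition is an assumption, not the PR's verified logic:

```shell
# Hypothetical gating of the allreduce-fusion flag (condition assumed).
TP=${TP:-8}
FUSION_ARGS=""
if [ "$TP" -gt 1 ]; then
  FUSION_ARGS="--enable-flashinfer-allreduce-fusion"
fi
echo "${FUSION_ARGS:-fusion disabled}"
```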
Force-pushed from e5c143d to e1e8cb2
@kedarpotdar-nv @ankursingh-nv the eval score is not good; it doesn't even pass a relaxed floor model eval threshold https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23698363526/job/69037542640?pr=820
+viz @Oseltamivir
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you

Summary
Add FP4 benchmark configuration and launch script for Qwen3.5-397B-A17B on NVIDIA B200 GPUs using SGLang.

Changes

New Benchmark Config (nvidia-master.yaml)
- Config: qwen3.5-fp4-b200-sglang
- Model: nvidia/Qwen3.5-397B-A17B-NVFP4
- Image: lmsysorg/sglang:v0.5.9-cu129-amd64
- 1k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–64), TP8/EP8 (conc 128)
- 1k8k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)
- 8k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)

New Launch Script (benchmarks/single_node/qwen3.5_fp4_b200.sh)
SGLang server configuration with:
- --quantization modelopt_fp4 with --fp4-gemm-backend flashinfer_cutlass
- --kv-cache-dtype fp8_e4m3
- --attention-backend trtllm_mha / --moe-runner-backend flashinfer_trtllm
- --enable-flashinfer-allreduce-fusion
- --chunked-prefill-size 32768 / --max-prefill-tokens 32768
- --disable-radix-cache
- --mem-fraction-static 0.85

Perf Changelog
- Added entry for the qwen3.5-fp4-b200-sglang config.
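The flags listed in the description can be assembled into a single server argument string. A sketch only: the flags themselves come from the PR, but the `sglang.launch_server` entrypoint shown in the comment and the TP/EP defaults are assumptions (the config sweeps several TP/EP combinations per sequence-length bucket).

```shell
# Sketch: assemble the FP4 server flags from the PR description.
# TP/EP defaults are illustrative only.
TP=${TP:-4}
EP=${EP:-1}
SERVER_ARGS="--model-path nvidia/Qwen3.5-397B-A17B-NVFP4 \
--tp-size $TP --ep-size $EP \
--quantization modelopt_fp4 --fp4-gemm-backend flashinfer_cutlass \
--kv-cache-dtype fp8_e4m3 \
--attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
--enable-flashinfer-allreduce-fusion \
--chunked-prefill-size 32768 --max-prefill-tokens 32768 \
--disable-radix-cache --mem-fraction-static 0.85"

# Assumed entrypoint, e.g.: python3 -m sglang.launch_server $SERVER_ARGS
echo "$SERVER_ARGS"
```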