22 changes: 22 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -2020,6 +2020,28 @@ qwen3.5-fp8-h200-sglang:
search-space:
- { tp: 8, ep: 8, conc-start: 4, conc-end: 64 }

qwen3.5-fp8-h200-sglang-mtp:
image: lmsysorg/sglang:v0.5.9-cu129-amd64
model: Qwen/Qwen3.5-397B-A17B-FP8
model-prefix: qwen3.5
runner: h200
precision: fp8
framework: sglang
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, ep: 8, conc-start: 4, conc-end: 128, spec-decoding: mtp }
- isl: 1024
osl: 8192
search-space:
- { tp: 8, ep: 8, conc-start: 4, conc-end: 128, spec-decoding: mtp }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, ep: 8, conc-start: 4, conc-end: 128, spec-decoding: mtp }
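The `conc-start`/`conc-end` fields above presumably bound a concurrency sweep for each sequence-length config. A minimal sketch of one plausible expansion is below; the doubling step is an assumption for illustration, not something this diff confirms about the harness:

```shell
#!/usr/bin/env bash
# Hypothetical expansion of conc-start=4, conc-end=128 into a doubling sweep.
# The doubling step size is an assumption; the real harness may step differently.
conc=4
end=128
while [ "$conc" -le "$end" ]; do
  echo "CONC=$conc"   # one benchmark invocation per concurrency level
  conc=$((conc * 2))
done
```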

glm5-fp8-h200-sglang:
image: lmsysorg/sglang:glm5-hopper
model: zai-org/GLM-5-FP8
90 changes: 90 additions & 0 deletions benchmarks/single_node/qwen3.5_fp8_h200_mtp.sh
@@ -0,0 +1,90 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME \
EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
MAX_SEQ_LEN=$((ISL + OSL + 20))

echo "CONC: $CONC, ISL: $ISL, OSL: $OSL, MAX_SEQ_LEN: $MAX_SEQ_LEN"

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
python3 -m sglang.launch_server \
--model "$MODEL" \
--host 0.0.0.0 \
--port "$PORT" \
--tp "$TP" \
--expert-parallel-size "$EP_SIZE" \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-flashinfer-allreduce-fusion \
--max-running-requests 128 \
--chunked-prefill-size 16384 \
--decode-log-interval 1 \
--mem-fraction-static 0.8 \
--cuda-graph-max-bs "$CONC" \
--context-length "$MAX_SEQ_LEN" \
--kv-cache-dtype fp8_e4m3 \
--quantization fp8 \
--attention-backend flashinfer \
--stream-interval 50 \
--tokenizer-worker-num 6 \
--mamba-ssm-dtype bfloat16 \
--disable-radix-cache \
--trust-remote-code \
--speculative-algorithm EAGLE \
--speculative-num-steps 2 \
--speculative-num-draft-tokens 3 \
--speculative-eagle-topk 1 \
> "$SERVER_LOG" 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
Comment on lines +67 to +77
🔴 The run_benchmark_serving call is missing --use-chat-template, which every other MTP benchmark script in the repo (6 out of 6) includes. Without this flag, MTP acceptance rates are artificially high because raw text without chat formatting special tokens is easier for the draft model to predict, producing misleading benchmark results. Add --use-chat-template after the --result-dir line to match the established pattern.

Extended reasoning...

What the bug is

The new qwen3.5_fp8_h200_mtp.sh benchmark script omits --use-chat-template from its run_benchmark_serving call (lines 69-79). This flag is present in every other MTP benchmark script in the repository.

Evidence of the pattern

All 6 existing single-node MTP benchmark scripts include --use-chat-template:

  • dsr1_fp8_b200_mtp.sh (line 108)
  • dsr1_fp4_b200_trt_mtp.sh (line 133)
  • dsr1_fp8_b200_trt_mtp.sh (line 143)
  • dsr1_fp8_h200_trt_mtp.sh (line 115)
  • dsr1_fp4_mi355x_atom_mtp.sh (line 71)
  • dsr1_fp8_mi355x_atom_mtp.sh (line 70)

Additionally, the multi-node AMD utility (bench.sh:60) adds this flag generically for ALL MTP benchmarks via [ "$IS_MTP" = "true" ] && echo "--use-chat-template", confirming this is a model-agnostic requirement, not DeepSeek-specific.

Root cause

The script was likely copied from the non-MTP qwen3.5_fp8_h200.sh (which correctly omits the flag since MTP acceptance rates are irrelevant without speculative decoding) but failed to add --use-chat-template as all other MTP scripts do.

Step-by-step proof of impact

  1. The benchmark runs with EAGLE speculative decoding enabled (--speculative-algorithm EAGLE, lines 55-58).
  2. run_benchmark_serving sends prompts to the server. Without --use-chat-template, raw text is sent without chat formatting special tokens.
  3. The draft model finds raw text easier to predict than properly formatted chat messages (which contain special tokens like <|im_start|>, <|im_end|>, etc.).
  4. This results in artificially higher MTP acceptance rates.
  5. The benchmark reports misleadingly optimistic throughput numbers that won't reflect real-world chat serving performance.
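To make the acceptance-rate argument concrete, here is a minimal illustration of what chat formatting adds to a raw prompt. The marker strings are illustrative Qwen-style tokens only; real code should obtain the template from the model's tokenizer (e.g. `tokenizer.apply_chat_template` in transformers), not from a hand-rolled helper like this:

```python
# Illustrative only: a hand-rolled Qwen-style chat wrapper, NOT the model's
# real template. It shows the special-token boundaries the draft model must
# predict across when --use-chat-template is enabled.
def wrap_chat(user_msg: str) -> str:
    return (
        "<|im_start|>user\n"
        + user_msg
        + "<|im_end|>\n<|im_start|>assistant\n"
    )

raw = "Summarize the benchmark results."
formatted = wrap_chat(raw)
# Continuing raw text is easier for the draft model than predicting
# across these special-token boundaries, inflating acceptance rates.
print(formatted)
```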

How to fix

Add --use-chat-template to the run_benchmark_serving call, e.g. after --result-dir /workspace/. This is a one-line addition that aligns the script with every other MTP benchmark in the repository.

--use-chat-template \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT" --concurrent-requests $CONC
append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
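The EAGLE flags in the script cap the draft at 2 steps and 3 draft tokens per verification step. Under a common simplified model of chain speculation (a constant per-token acceptance probability `p`, with draft token `i` counted only if all earlier drafts in the chain were accepted), the expected tokens emitted per target-model step can be sketched as follows; `p = 0.7` is a made-up illustration value, not a measured acceptance rate:

```python
def expected_tokens_per_step(k: int, p: float) -> float:
    """Expected tokens per target-model step with k chained draft tokens.

    One token always comes from the target model's verification pass; draft
    token i (1-based) survives only if all i predictions up to and including
    it were accepted, so it contributes p**i in expectation.
    """
    return sum(p ** i for i in range(k + 1))

# k=3 mirrors --speculative-num-draft-tokens 3; p=0.7 is illustrative.
print(round(expected_tokens_per_step(3, 0.7), 3))  # 2.533
```

This is why inflated acceptance rates (e.g. from skipping the chat template) translate directly into inflated throughput numbers: the reported tokens/step grows with `p`.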
6 changes: 6 additions & 0 deletions perf-changelog.yaml
@@ -979,3 +979,9 @@
- "Benchmark script: benchmarks/single_node/glm5_fp8_h200.sh"
- "Tool-call-parser glm47, reasoning-parser glm45, mem-fraction-static 0.85"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/914

- config-keys:
- qwen3.5-fp8-h200-sglang-mtp
description:
- "Add Qwen3.5-397B-A17B-FP8 H200 SGLang MTP (EAGLE speculative decoding)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921
🟡 Nit: The pr-link for the new qwen3.5-fp8-h200-sglang-mtp entry uses a placeholder /pull/XXX instead of /pull/921. Please update before merging.

Extended reasoning...

Bug Description

The new perf-changelog entry added at line 987 for qwen3.5-fp8-h200-sglang-mtp uses a placeholder PR link:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

instead of the actual PR number:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921

Code Path

The diff adds a new changelog block at the end of perf-changelog.yaml (lines 982-987). Every other entry in the file that was finalized has a concrete PR number in its pr-link field, making this an outlier that needs updating.

Pre-existing Context

There are several other pre-existing XXX placeholders in the file (e.g., for glm5-fp8-mi355x-sglang, dsr1-fp8-h200-sglang, minimaxm2.5-fp8-h200-vllm, qwen3.5-bf16-mi325x-sglang, qwen3.5-fp8-mi325x-sglang). However, those are from other PRs and outside the scope of this change. This PR should fix its own entry.

Impact

The impact is low — this is a metadata/documentation field, not functional code. The placeholder link would point to a nonexistent or incorrect pull request page, making it harder for someone reviewing the changelog to trace the entry back to its source PR.

Suggested Fix

Replace XXX with 921 on line 987:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921

Given that the PR title is [WIP], this is likely a known TODO that the author plans to fix before final merge. Flagging it here as a reminder.
