Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you |
|
|
||
| - config-keys: | ||
| - dsr1-fp4-mi355x-sglang-disagg-mtp | ||
| - dsr1-fp8-mi355x-sglang-disagg-mtp | ||
| description: | ||
| - "test" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX |
There was a problem hiding this comment.
🟡 The new perf-changelog entry has placeholder values: pr-link is pull/XXX (should be pull/917), description is just "test" instead of describing the actual change (adding --speculative-num-draft-tokens 3 to MTP flags), and the entry overall is not ready for merge. Please update the PR link, write a meaningful description, and verify the config-keys cover all affected benchmarks.
Extended reasoning...
What the bug is
The perf-changelog entry added at lines 964-970 contains three placeholder values that are clearly not production-ready:
pr-link: pull/XXX— This is a placeholder URL. Every changelog entry needs a working link to its PR. The correct value should bepull/917.description: "test"— Every other entry in perf-changelog.yaml has a meaningful description of what changed. This should describe the actual change: adding--speculative-num-draft-tokens 3to MTP flags for DeepSeek models in the AMD multi-node disagg config.config-keys— Lists onlydsr1-fp4-mi355x-sglang-disagg-mtpanddsr1-fp8-mi355x-sglang-disagg-mtp. The models.yaml diff modifiesmtp_flagsfor all 6 DeepSeek model entries (V3, V3-0324, R1, R1-0528, R1-0528-MXFP4-Preview, R1-0528-MXFP4).
Config-keys scope analysis
One counterargument is that mtp_flags are only applied when DECODE_MTP_SIZE > 0 (i.e., only MTP-enabled benchmark configs consume them), so non-MTP disagg configs are unaffected. This is a valid point — the two listed config-keys may indeed be the only currently existing benchmark configurations that use these flags. If no other disagg-mtp benchmark configs exist for DeepSeek V3/V3-0324, then the config-keys list could be correct. The author should verify whether any other MTP-enabled benchmarks reference these model entries.
Step-by-step proof
- Open
perf-changelog.yamland look at line 970:pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. - This is PR test #917, so the link should end in
/pull/917. - Look at line 968:
description: - "test". Compare with any other entry, e.g., the entry for PR H200 MINIMAX vLLM extend configs #869 which reads"Extend MiniMax M2.5 FP8 single-node config for H200 with vLLM v0.16.0 (TP8)". The description"test"provides zero information. - The
models.yamldiff shows--speculative-num-draft-tokens 3added to all 6 DeepSeek modelmtp_flagsentries, but the changelog only references 2 disagg-mtp config-keys.
Impact
If merged as-is, the changelog would contain a broken PR link (pull/XXX leads to a 404), a meaningless description ("test"), and potentially incomplete config-key coverage. Anyone reviewing the performance changelog would not understand what changed or be able to trace it back to the PR.
How to fix
- Change
pr-linktohttps://github.com/SemiAnalysisAI/InferenceX/pull/917. - Replace the description with something like:
"Add --speculative-num-draft-tokens 3 to MTP flags for all DeepSeek models in AMD multi-node disagg config". - Verify config-keys: confirm whether the two listed disagg-mtp configs are the only MTP-enabled benchmarks consuming these flags, and add any missing ones.
Note: there are 5 pre-existing pull/XXX entries in the file (e.g., H200 SGLang v0.5.9 entry, GLM-5, Qwen3.5 entries), so this is an established pattern for WIP entries. The PR title being "test" suggests this may be a draft, but these placeholders should be resolved before merge.
| DeepSeek-V3: | ||
| base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --ep-dispatch-algorithm fake --load-balance-method round_robin --kv-cache-dtype fp8_e4m3 --attention-backend aiter --disaggregation-transfer-backend mori" | ||
| mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1" | ||
| mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3" |
There was a problem hiding this comment.
🟡 Adding --speculative-num-draft-tokens 3 to mtp_flags here is a no-op because server.sh:289 already appends --speculative-num-draft-tokens $((decode_mtp_size + 1)) after expanding MODEL_MTP_FLAGS. Since SGLang uses argparse (last value wins), this YAML value is always silently overridden. Remove the flag from either the YAML or the dynamic computation in server.sh to avoid the duplication.
Extended reasoning...
What the bug is
This PR adds --speculative-num-draft-tokens 3 to the mtp_flags field for all 6 DeepSeek model configs in models.yaml. However, the consumer of this field — build_server_config() in server.sh — already appends its own --speculative-num-draft-tokens dynamically, causing the flag to appear twice on the final command line.
The specific code path
At server.sh:289, when MTP is enabled (decode_mtp_size > 0), the function constructs:
mtp_config="${MODEL_MTP_FLAGS} --speculative-num-steps ${decode_mtp_size} --speculative-num-draft-tokens $((decode_mtp_size + 1))"MODEL_MTP_FLAGS is populated from the YAML mtp_flags field, which after this PR expands to:
--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3
So the resulting mtp_config string becomes:
--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 --speculative-num-steps 2 --speculative-num-draft-tokens 3
The flag --speculative-num-draft-tokens appears at positions from both YAML and the dynamic computation.
Step-by-step proof with a concrete example
- User sets
DECODE_MTP_SIZE=2(the common case). MODEL_MTP_FLAGS="--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3"(from YAML).server.sh:289computes$((decode_mtp_size + 1))=3.- Final
mtp_config="--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 --speculative-num-steps 2 --speculative-num-draft-tokens 3". - Both values happen to be
3, so no functional difference — but the duplication is still present.
Now consider DECODE_MTP_SIZE=1:
MODEL_MTP_FLAGSstill contains--speculative-num-draft-tokens 3(hardcoded in YAML).server.sh:289computes$((1 + 1))=2.- Final command line:
--speculative-num-draft-tokens 3 ... --speculative-num-draft-tokens 2. - SGLang’s argparse sees both, last value wins → uses
2, silently ignoring the YAML’s3.
Why existing code doesn’t prevent it
There is no deduplication logic in server.sh. The script simply concatenates MODEL_MTP_FLAGS with the dynamically computed flags. The YAML schema documentation does not warn against including flags that server.sh already manages.
Impact
The YAML change is effectively dead code for the server.sh code path (which is the only consumer of mtp_flags — confirmed via grep). For DECODE_MTP_SIZE=2 there is no behavioral difference, but the duplication is misleading: it gives the false impression that the YAML controls this parameter, when in reality server.sh always overrides it. For any other MTP size, the values would silently conflict.
How to fix
Remove --speculative-num-draft-tokens 3 from the mtp_flags in models.yaml (preferred, since the dynamic computation in server.sh correctly derives it from decode_mtp_size). Alternatively, remove the dynamic --speculative-num-draft-tokens computation from server.sh:289 if the intent is to let each model config control the value.
|
@billishyahao counterintuitively, although theoretical modelling shows that AMD's MTP=2/3 should performance between than AMD's MTP=1, in the current stack, AMD's MTP=3 performs worse than MTP=1. But since ur boss is saying AMD should go back to MTP=2/3, we can change it to that...
|

No description provided.