Skip to content
Open

test #917

Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 6 additions & 6 deletions benchmarks/multi_node/amd_utils/models.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@

DeepSeek-V3:
base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --ep-dispatch-algorithm fake --load-balance-method round_robin --kv-cache-dtype fp8_e4m3 --attention-backend aiter --disaggregation-transfer-backend mori"
mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1"
mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Adding --speculative-num-draft-tokens 3 to mtp_flags here is a no-op because server.sh:289 already appends --speculative-num-draft-tokens $((decode_mtp_size + 1)) after expanding MODEL_MTP_FLAGS. Since SGLang uses argparse (last value wins), this YAML value is always silently overridden. Remove the flag from either the YAML or the dynamic computation in server.sh to avoid the duplication.

Extended reasoning...

What the bug is

This PR adds --speculative-num-draft-tokens 3 to the mtp_flags field for all 6 DeepSeek model configs in models.yaml. However, the consumer of this field — build_server_config() in server.sh — already appends its own --speculative-num-draft-tokens dynamically, causing the flag to appear twice on the final command line.

The specific code path

At server.sh:289, when MTP is enabled (decode_mtp_size > 0), the function constructs:

mtp_config="${MODEL_MTP_FLAGS} --speculative-num-steps ${decode_mtp_size} --speculative-num-draft-tokens $((decode_mtp_size + 1))"

MODEL_MTP_FLAGS is populated from the YAML mtp_flags field, which after this PR expands to:
--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3

So the resulting mtp_config string becomes:
--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 --speculative-num-steps 2 --speculative-num-draft-tokens 3

The flag --speculative-num-draft-tokens appears at positions from both YAML and the dynamic computation.

Step-by-step proof with a concrete example

  1. User sets DECODE_MTP_SIZE=2 (the common case).
  2. MODEL_MTP_FLAGS = "--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3" (from YAML).
  3. server.sh:289 computes $((decode_mtp_size + 1)) = 3.
  4. Final mtp_config = "--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 --speculative-num-steps 2 --speculative-num-draft-tokens 3".
  5. Both values happen to be 3, so no functional difference — but the duplication is still present.

Now consider DECODE_MTP_SIZE=1:

  1. MODEL_MTP_FLAGS still contains --speculative-num-draft-tokens 3 (hardcoded in YAML).
  2. server.sh:289 computes $((1 + 1)) = 2.
  3. Final command line: --speculative-num-draft-tokens 3 ... --speculative-num-draft-tokens 2.
  4. SGLang’s argparse sees both, last value wins → uses 2, silently ignoring the YAML’s 3.

Why existing code doesn’t prevent it

There is no deduplication logic in server.sh. The script simply concatenates MODEL_MTP_FLAGS with the dynamically computed flags. The YAML schema documentation does not warn against including flags that server.sh already manages.

Impact

The YAML change is effectively dead code for the server.sh code path (which is the only consumer of mtp_flags — confirmed via grep). For DECODE_MTP_SIZE=2 there is no behavioral difference, but the duplication is misleading: it gives the false impression that the YAML controls this parameter, when in reality server.sh always overrides it. For any other MTP size, the values would silently conflict.

How to fix

Remove --speculative-num-draft-tokens 3 from the mtp_flags in models.yaml (preferred, since the dynamic computation in server.sh correctly derives it from decode_mtp_size). Alternatively, remove the dynamic --speculative-num-draft-tokens computation from server.sh:289 if the intent is to let each model config control the value.

dp_flags: "--moe-a2a-backend mori --deepep-mode normal --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head"
prefill:
mem_fraction_static: 0.8
Expand Down Expand Up @@ -70,7 +70,7 @@ DeepSeek-V3:

DeepSeek-V3-0324:
base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --ep-dispatch-algorithm fake --load-balance-method round_robin --kv-cache-dtype fp8_e4m3 --attention-backend aiter --disaggregation-transfer-backend mori"
mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1"
mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3"
dp_flags: "--moe-a2a-backend mori --deepep-mode normal --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head"
prefill:
mem_fraction_static: 0.8
Expand Down Expand Up @@ -101,7 +101,7 @@ DeepSeek-V3-0324:

DeepSeek-R1:
base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --ep-dispatch-algorithm fake --load-balance-method round_robin --kv-cache-dtype fp8_e4m3 --attention-backend aiter --disaggregation-transfer-backend mori"
mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1"
mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3"
dp_flags: "--moe-a2a-backend mori --deepep-mode normal --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head"
prefill:
mem_fraction_static: 0.8
Expand Down Expand Up @@ -132,7 +132,7 @@ DeepSeek-R1:

DeepSeek-R1-0528:
base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --ep-dispatch-algorithm fake --load-balance-method round_robin --kv-cache-dtype fp8_e4m3 --attention-backend aiter --disaggregation-transfer-backend mori"
mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1"
mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3"
dp_flags: "--moe-a2a-backend mori --deepep-mode normal --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head"
prefill:
mem_fraction_static: 0.8
Expand Down Expand Up @@ -163,7 +163,7 @@ DeepSeek-R1-0528:

DeepSeek-R1-0528-MXFP4-Preview:
base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --ep-dispatch-algorithm fake --load-balance-method round_robin --kv-cache-dtype fp8_e4m3 --attention-backend aiter --disaggregation-transfer-backend mori"
mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1"
mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3"
dp_flags: "--moe-a2a-backend mori --deepep-mode normal --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head"
prefill:
mem_fraction_static: 0.8
Expand Down Expand Up @@ -194,7 +194,7 @@ DeepSeek-R1-0528-MXFP4-Preview:

DeepSeek-R1-0528-MXFP4:
base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --ep-dispatch-algorithm fake --load-balance-method round_robin --kv-cache-dtype fp8_e4m3 --attention-backend aiter --disaggregation-transfer-backend mori"
mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1"
mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1 --speculative-num-draft-tokens 3"
dp_flags: "--moe-a2a-backend mori --deepep-mode normal --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head"
prefill:
mem_fraction_static: 0.8
Expand Down
7 changes: 7 additions & 0 deletions perf-changelog.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -961,3 +961,10 @@
description:
- "Extend MiniMax M2.5 FP8 single-node config for H200 with vLLM v0.16.0 (TP8)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/869

- config-keys:
- dsr1-fp4-mi355x-sglang-disagg-mtp
- dsr1-fp8-mi355x-sglang-disagg-mtp
description:
- "test"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
Comment on lines +964 to +970
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new perf-changelog entry has placeholder values: pr-link is pull/XXX (should be pull/917), description is just "test" instead of describing the actual change (adding --speculative-num-draft-tokens 3 to MTP flags), and the entry overall is not ready for merge. Please update the PR link, write a meaningful description, and verify the config-keys cover all affected benchmarks.

Extended reasoning...

What the bug is

The perf-changelog entry added at lines 964-970 contains three placeholder values that are clearly not production-ready:

  1. pr-link: pull/XXX — This is a placeholder URL. Every changelog entry needs a working link to its PR. The correct value should be pull/917.
  2. description: "test" — Every other entry in perf-changelog.yaml has a meaningful description of what changed. This should describe the actual change: adding --speculative-num-draft-tokens 3 to MTP flags for DeepSeek models in the AMD multi-node disagg config.
  3. config-keys — Lists only dsr1-fp4-mi355x-sglang-disagg-mtp and dsr1-fp8-mi355x-sglang-disagg-mtp. The models.yaml diff modifies mtp_flags for all 6 DeepSeek model entries (V3, V3-0324, R1, R1-0528, R1-0528-MXFP4-Preview, R1-0528-MXFP4).

Config-keys scope analysis

One counterargument is that mtp_flags are only applied when DECODE_MTP_SIZE > 0 (i.e., only MTP-enabled benchmark configs consume them), so non-MTP disagg configs are unaffected. This is a valid point — the two listed config-keys may indeed be the only currently existing benchmark configurations that use these flags. If no other disagg-mtp benchmark configs exist for DeepSeek V3/V3-0324, then the config-keys list could be correct. The author should verify whether any other MTP-enabled benchmarks reference these model entries.

Step-by-step proof

  1. Open perf-changelog.yaml and look at line 970: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
  2. This is PR test #917, so the link should end in /pull/917.
  3. Look at line 968: description: - "test". Compare with any other entry, e.g., the entry for PR H200 MINIMAX vLLM extend configs  #869 which reads "Extend MiniMax M2.5 FP8 single-node config for H200 with vLLM v0.16.0 (TP8)". The description "test" provides zero information.
  4. The models.yaml diff shows --speculative-num-draft-tokens 3 added to all 6 DeepSeek model mtp_flags entries, but the changelog only references 2 disagg-mtp config-keys.

Impact

If merged as-is, the changelog would contain a broken PR link (pull/XXX leads to a 404), a meaningless description ("test"), and potentially incomplete config-key coverage. Anyone reviewing the performance changelog would not understand what changed or be able to trace it back to the PR.

How to fix

  • Change pr-link to https://github.com/SemiAnalysisAI/InferenceX/pull/917.
  • Replace the description with something like: "Add --speculative-num-draft-tokens 3 to MTP flags for all DeepSeek models in AMD multi-node disagg config".
  • Verify config-keys: confirm whether the two listed disagg-mtp configs are the only MTP-enabled benchmarks consuming these flags, and add any missing ones.

Note: there are 5 pre-existing pull/XXX entries in the file (e.g., H200 SGLang v0.5.9 entry, GLM-5, Qwen3.5 entries), so this is an established pattern for WIP entries. The PR title being "test" suggests this may be a draft, but these placeholders should be resolved before merge.