Skip to content

Add MiniMax-M3 NVFP4 B200 single-node vLLM benchmark (EAGLE3 spec decode)#1933

Open
Ankur-singh wants to merge 5 commits into
mainfrom
minimaxm3-fp4-b200-vllm-mtp
Open

Add MiniMax-M3 NVFP4 B200 single-node vLLM benchmark (EAGLE3 spec decode)#1933
Ankur-singh wants to merge 5 commits into
mainfrom
minimaxm3-fp4-b200-vllm-mtp

Conversation

@Ankur-singh

Copy link
Copy Markdown
Collaborator

Adds the minimaxm3-fp4-b200-vllm-mtp config: MiniMax-M3 NVFP4 (nvidia/MiniMax-M3-NVFP4) single-node aggregated vLLM on B200 (runner: b200-dgxc) with EAGLE3 speculative decoding (spec-decoding: mtp, 3 draft tokens via Inferact/MiniMax-M3-EAGLE3).

  • Config: nvidia-master.yaml entry (fp4 / vllm / runner b200-dgxc), every search-space row spec-decoding: mtp; sweeps tp 4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-512.
  • Recipe: benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200_mtp.sh — overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve; --speculative-config EAGLE3; --block-size 128 (MSA), --language-model-only; prompts routed through the chat template.
  • Weights: target pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 — added a minimaxm3 && fp4 branch to launch_b200-dgxc.sh. The EAGLE3 draft (not staged) is fetched next to the target weights, as in the fp8 b200 MTP recipe.
  • perf-changelog entry appended.

… decode

New minimaxm3-fp4-b200-vllm-mtp config (fp4 vLLM aggregated on b200-dgxc with
EAGLE3 speculative decoding, 3 draft tokens via Inferact/MiniMax-M3-EAGLE3). The
benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4
support, commit 6c08558) before serve and routes prompts through the chat template.
Target weights are pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 (added a
minimaxm3-fp4 MODEL_PATH branch to launch_b200-dgxc.sh); the EAGLE3 draft is
fetched next to the target weights.
Comment on lines +26 to +32
VLLM_DIR=$(python3 -c "import vllm, os; print(os.path.dirname(vllm.__file__))")
for f in \
model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
model_executor/layers/quantization/modelopt.py \
model_executor/layers/quantization/utils/flashinfer_utils.py
do
curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The patch step (lines 26-34) downloads three vLLM source files via curl and runs a Python import as a sanity check, but neither step has a fail-exit guard and the script does not set -e (nor does benchmark_lib.sh). If a download fails (network blip, GitHub raw outage, commit moved) or the import fails on a half-patched install, the script proceeds to vllm serve with broken/partially-patched code, producing a confusing serve-time error instead of a clean setup failure. Mirror the sibling minimaxm3_fp8_b200_mtp.sh:35 pattern: wrap the curl loop and the verify in || { echo "nvfp4 patch failed" >&2; exit 1; }.

Extended reasoning...

What the bug is

benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200_mtp.sh:26-34 overlays three files from vllm-project/vllm PR #46380 (commit 6c08558) onto the installed vLLM package, then runs a single-import sanity check:

for f in \
  model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
  model_executor/layers/quantization/modelopt.py \
  model_executor/layers/quantization/utils/flashinfer_utils.py
do
  curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
done
python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"

Neither the loop body nor the python3 verification has a || exit 1 guard, and there is no set -e either in this script or inherited from benchmark_lib.sh (the only set -e in benchmark_lib.sh is locally scoped inside one helper function, not at file level).

Code path that triggers it

  1. curl -fsSL returns non-zero on HTTP error (e.g. raw.githubusercontent.com 5xx, the commit getting rewritten, the file getting renamed, or a network blip).
  2. Without set -e, the failed curl invocation does not abort the loop — it just moves on to the next iteration, leaving a stale or missing file behind.
  3. The python3 -c line then imports exactly one of the three patched modules (TrtLlmNvFp4ExpertsModular from trtllm_nvfp4_moe). If that one import succeeds but modelopt.py or flashinfer_utils.py was not refreshed, the sanity check still prints [nvfp4-patch] OK.
  4. Even when the import itself fails, python3 exits non-zero but the script proceeds straight to vllm serve.

Why existing code does not prevent it

  • set -e is not enabled at the script level (only set -x is set later, which is shell trace, not error-exit).
  • benchmark_lib.sh does not set -e globally; the only set -e is local to a helper function (around line 1270).
  • curl -f only affects curl's own exit code; it does not propagate to the shell unless combined with an exit guard.

Impact

A transient GH raw outage, a force-pushed commit, or a renamed file produces a confusing crash deep inside vllm serve (e.g. "unrecognised quantization config" or an ImportError from inside the engine) instead of a clean [nvfp4-patch] failed at script startup. Diagnosis costs an extra round-trip through engine logs for what is fundamentally a setup-time download failure.

How to fix

Mirror the sibling minimaxm3_fp8_b200_mtp.sh:35 pattern:

for f in \
  model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
  model_executor/layers/quantization/modelopt.py \
  model_executor/layers/quantization/utils/flashinfer_utils.py
do
  curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}" \
    || { echo "[nvfp4-patch] curl failed for ${f}" >&2; exit 1; }
done
python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')" \
  || { echo "[nvfp4-patch] import verify failed" >&2; exit 1; }

Step-by-step proof

Concrete failure walk-through: suppose raw.githubusercontent.com returns a 503 for modelopt.py (transient outage):

  1. Iteration 1: curl for trtllm_nvfp4_moe.py succeeds.
  2. Iteration 2: curl for modelopt.py exits non-zero, but with no set -e and no || exit, the shell loop advances. The stale (pre-patch) modelopt.py from the base image remains on disk.
  3. Iteration 3: curl for flashinfer_utils.py succeeds.
  4. The verify line imports TrtLlmNvFp4ExpertsModular — which was successfully overwritten — and prints [nvfp4-patch] OK. The half-patched state goes undetected.
  5. vllm serve starts, hits the unpatched modelopt.py when loading the NVFP4 quant config, and crashes with an opaque engine error.

With the fix, step 2 immediately prints [nvfp4-patch] curl failed for model_executor/layers/quantization/modelopt.py and exits 1 — a one-line setup failure instead of an engine-internal crash. Severity is nit because the script still ultimately errors; the failure is just less localised and slower to diagnose.

Comment thread perf-changelog.yaml Outdated
- "Image vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-7a67223; benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve; prompts routed through the chat template"
- "Target weights pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 (added minimaxm3-fp4 MODEL_PATH branch to launch_b200-dgxc.sh); EAGLE3 draft fetched next to the target weights; --block-size 128 (MSA), --language-model-only"
- "Sweeps tp 4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-512"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new perf-changelog entry at perf-changelog.yaml:4194 has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX — the XXX is a literal placeholder that was never substituted. Every other entry in this file uses the real PR number, so this will ship as a broken link in the released changelog. Trivial fix: replace XXX with 1933 before merge.

Extended reasoning...

What the bug is: The perf-changelog.yaml entry added by this PR (lines 4187-4194) ends with:\n\nyaml\n pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX\n\n\nThe XXX is a literal three-character placeholder that should have been substituted with the actual PR number before the change was pushed. This PR is #1933 (per the PR metadata).\n\nWhy this is a defect: perf-changelog.yaml is a structured, machine- and human-readable record of perf changes. Every other entry in the file uses the actual numeric PR id, e.g. line 4185 ends with pull/1927, line 4176 with pull/1762, and earlier entries reference 1865, 1706, etc. Anyone (or anything) consuming this file — release notes generators, the GitHub UI when the markdown is rendered, humans clicking the link — will land on https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which is not a valid PR. GitHub will return a 404.\n\nWhy existing code doesn't prevent it: There is no schema validation on pr-link that enforces a numeric trailing component (the value is just a string), and the PR was opened from a branch where the author hand-edited the changelog entry from a template and forgot to fill in the placeholder. CI passing doesn't catch this kind of metadata typo.\n\nImpact: Once merged, the changelog will contain a permanently broken link. The fix is more disruptive after merge (requires a follow-up PR) than now (one character change to the diff).\n\nStep-by-step proof:\n1. Open perf-changelog.yaml after this PR merges.\n2. Scroll to the bottom (lines 4187-4194), the new minimaxm3-fp4-b200-vllm-mtp entry.\n3. Read pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.\n4. Click / curl that URL → GitHub returns 404 because XXX is not a valid PR number.\n5. Compare against the immediately preceding entry on line 4185, pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1927 — that one resolves correctly.\n\nHow to fix: In the diff, change line 4194 from:\nyaml\n pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX\n\nto:\nyaml\n pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1933\n

@github-actions

Copy link
Copy Markdown
Contributor

@functionstackx functionstackx left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. fix patchwork as discussed in slack
  2. missing vllm recipes

The vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 image bakes in
MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380), so the EAGLE3
benchmark script no longer overwrites vLLM files at runtime.
@github-actions

Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants