
[Don't Merge] Update cli args qwen #946

Closed
zhentaocc wants to merge 4 commits into SemiAnalysisAI:main from zhentaocc:update_cli_args_qwen

Conversation

@zhentaocc
Collaborator

No description provided.

Contributor

@claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@zhentaocc zhentaocc force-pushed the update_cli_args_qwen branch 2 times, most recently from 7992757 to a8cf15f Compare March 25, 2026 19:54
@zhentaocc zhentaocc marked this pull request as draft March 26, 2026 06:26
@functionstackx
Contributor

To double check — @chunfangamd, is @zhentaocc part of AMD? Can you confirm internally? If so, please add him to the upstream repo. It's a better developer experience to create branches in upstream than to work from forks; for forks, for example, we can use the sweep-enabled label to validate PRs.

@zhentaocc
Collaborator Author

zhentaocc commented Mar 30, 2026

BF16 local results

Concurrency 64, 1k-in/1k-out: throughput 501.29 → 661.46 tokens/s/GPU, a 31.95% boost. @functionstackx

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     640       
Benchmark duration (s):                  222.97    
Total input tokens:                      590851    
Total input text tokens:                 590851    
Total generated tokens:                  589052    
Total generated tokens (retokenized):    587369    
Request throughput (req/s):              2.87      
Input token throughput (tok/s):          2649.88   
Output token throughput (tok/s):         2641.81   
Peak output token throughput (tok/s):    3137.00   
Peak concurrent requests:                80        
Total token throughput (tok/s):          5291.69   
Concurrency:                             62.41     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   21744.73  
Median E2E Latency (ms):                 21584.66  
P90 E2E Latency (ms):                    24098.07  
P99 E2E Latency (ms):                    25117.02  
---------------Time to First Token----------------
Mean TTFT (ms):                          500.13    
Median TTFT (ms):                        475.10    
P99 TTFT (ms):                           1085.17   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.11     
Median TPOT (ms):                        23.18     
P99 TPOT (ms):                           24.63     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           23.11     
Median ITL (ms):                         20.92     
P95 ITL (ms):                            22.33     
P99 ITL (ms):                            106.41    
Max ITL (ms):                            1403.09   
==================================================
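The 31.95% boost quoted above follows directly from the two per-GPU throughput figures; a quick check:

```python
# Sanity-check of the BF16 throughput gain quoted in this comment
# (501.29 -> 661.46 tokens/s/GPU).
before = 501.29  # baseline tokens/s/GPU
after = 661.46   # tokens/s/GPU with the updated CLI args

boost_pct = (after / before - 1) * 100
print(f"{boost_pct:.2f}% boost")  # -> 31.95% boost
```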

Chen, Todd added 3 commits March 30, 2026 03:22

* Added CONTEXT_LENGTH and MAX_PREFILL_TOKENS variables for better configuration.
* Updated the launch_server command with new options: --tokenizer-worker-num, --enable-aiter-allreduce-fusion, --cuda-graph-max-bs, --context-length, --disable-radix-cache, --max-prefill-tokens, and --scheduler-recv-interval.
* … benchmark configurations for MI355X, enhancing performance with updated CLI arguments.
* ….yaml to v0.5.9, ensuring compatibility with recent changes.
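Taken together, the commit messages above imply a launch command along these lines. This is a hedged sketch only: the model path and every flag value below are illustrative assumptions, not the PR's actual configuration.

```shell
# Illustrative sketch only — values are assumptions, not the PR's settings.
CONTEXT_LENGTH=8192
MAX_PREFILL_TOKENS=8192

python3 -m sglang.launch_server \
  --model-path <model-path> \
  --tokenizer-worker-num 2 \
  --enable-aiter-allreduce-fusion \
  --cuda-graph-max-bs 64 \
  --context-length "${CONTEXT_LENGTH}" \
  --disable-radix-cache \
  --max-prefill-tokens "${MAX_PREFILL_TOKENS}" \
  --scheduler-recv-interval 1
```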
@zhentaocc
Collaborator Author

zhentaocc commented Mar 30, 2026

FP8 local test results

Concurrency 64, 1k-in/1k-out: throughput 708.75 tokens/s/GPU. @functionstackx

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     640       
Benchmark duration (s):                  209.64    
Total input tokens:                      590851    
Total input text tokens:                 590851    
Total generated tokens:                  589052    
Total generated tokens (retokenized):    554942    
Request throughput (req/s):              3.05      
Input token throughput (tok/s):          2818.38   
Output token throughput (tok/s):         2809.80   
Peak output token throughput (tok/s):    3682.00   
Peak concurrent requests:                82        
Total token throughput (tok/s):          5628.19   
Concurrency:                             62.31     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   20411.98  
Median E2E Latency (ms):                 20282.34  
P90 E2E Latency (ms):                    23190.49  
P99 E2E Latency (ms):                    26606.04  
---------------Time to First Token----------------
Mean TTFT (ms):                          455.92    
Median TTFT (ms):                        423.01    
P99 TTFT (ms):                           1617.35   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.71     
Median TPOT (ms):                        21.62     
P99 TPOT (ms):                           26.91     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           21.71     
Median ITL (ms):                         19.30     
P95 ITL (ms):                            20.90     
P99 ITL (ms):                            89.05     
Max ITL (ms):                            3259.61   
==================================================
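For comparing the BF16 and FP8 runs, the fixed-width `key:   value` layout of these sglang serving-benchmark reports is easy to parse. A minimal sketch — the helper name `parse_benchmark` is my own, not part of sglang:

```python
import re

def parse_benchmark(report: str) -> dict:
    """Parse 'key:   value' lines from an sglang serving-benchmark report."""
    results = {}
    for line in report.splitlines():
        # Key, a colon, at least two alignment spaces, then the value.
        m = re.match(r"^(.+?):\s{2,}(\S+)\s*$", line)
        if m:
            results[m.group(1).strip()] = m.group(2)
    return results

# A few lines from the FP8 report above as a sample.
sample = """\
Backend:                                 sglang
Successful requests:                     640
Mean TTFT (ms):                          455.92"""

parsed = parse_benchmark(sample)
print(parsed["Mean TTFT (ms)"])  # -> 455.92
```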

@zhentaocc
Collaborator Author

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang

@github-actions
Contributor

@zhentaocc Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23735484968
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang
Pinned ref: fa3b1fb
Approval: not required (trusted collaborator).

@zhentaocc zhentaocc marked this pull request as ready for review March 30, 2026 08:33
Contributor

@claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@zhentaocc
Collaborator Author

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang

@github-actions
Contributor

@zhentaocc Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23735797269
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang
Pinned ref: f0fd6c9
Approval: not required (trusted collaborator).


2 participants