
[Don't Merge] Update cli args qwen #946

Closed
zhentaocc wants to merge 4 commits into SemiAnalysisAI:main from zhentaocc:update_cli_args_qwen

Conversation

@zhentaocc
Collaborator

No description provided.

Contributor

@claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@zhentaocc zhentaocc force-pushed the update_cli_args_qwen branch 2 times, most recently from 7992757 to a8cf15f Compare March 25, 2026 19:54
@zhentaocc zhentaocc marked this pull request as draft March 26, 2026 06:26
@functionstackx
Contributor

To double check — @chunfangamd, is @zhentaocc part of AMD? Can you confirm internally? If so, please add him to the upstream repo. It's a better developer experience to create branches in upstream than to work from forks; for forks, for example, we can use the sweep-enabled label to validate PRs.

@zhentaocc
Collaborator Author

zhentaocc commented Mar 30, 2026

BF16 local results

Concurrency 64, 1k-in/1k-out: throughput 501.29 → 661.46 tokens/s/GPU, a 31.95% boost. @functionstackx

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     640       
Benchmark duration (s):                  222.97    
Total input tokens:                      590851    
Total input text tokens:                 590851    
Total generated tokens:                  589052    
Total generated tokens (retokenized):    587369    
Request throughput (req/s):              2.87      
Input token throughput (tok/s):          2649.88   
Output token throughput (tok/s):         2641.81   
Peak output token throughput (tok/s):    3137.00   
Peak concurrent requests:                80        
Total token throughput (tok/s):          5291.69   
Concurrency:                             62.41     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   21744.73  
Median E2E Latency (ms):                 21584.66  
P90 E2E Latency (ms):                    24098.07  
P99 E2E Latency (ms):                    25117.02  
---------------Time to First Token----------------
Mean TTFT (ms):                          500.13    
Median TTFT (ms):                        475.10    
P99 TTFT (ms):                           1085.17   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.11     
Median TPOT (ms):                        23.18     
P99 TPOT (ms):                           24.63     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           23.11     
Median ITL (ms):                         20.92     
P95 ITL (ms):                            22.33     
P99 ITL (ms):                            106.41    
Max ITL (ms):                            1403.09   
==================================================
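The 31.95% boost quoted above follows directly from the two per-GPU throughput figures; a quick check:

```python
# Sanity-check of the BF16 throughput gain quoted in this comment
# (501.29 -> 661.46 tokens/s/GPU).
before = 501.29  # baseline tokens/s/GPU
after = 661.46   # tokens/s/GPU with the updated CLI args

boost_pct = (after / before - 1) * 100
print(f"{boost_pct:.2f}% boost")  # -> 31.95% boost
```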

Chen, Todd added 3 commits March 30, 2026 03:22

* Added CONTEXT_LENGTH and MAX_PREFILL_TOKENS variables for better configuration.
* Updated the launch_server command with new options: --tokenizer-worker-num, --enable-aiter-allreduce-fusion, --cuda-graph-max-bs, --context-length, --disable-radix-cache, --max-prefill-tokens, and --scheduler-recv-interval.
* … benchmark configurations for MI355X, enhancing performance with updated CLI arguments.
* ….yaml to v0.5.9, ensuring compatibility with recent changes.
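Taken together, the commit messages above imply a launch command along these lines. This is a hedged sketch only: the model path and every flag value below are illustrative assumptions, not the PR's actual configuration.

```shell
# Illustrative sketch only — values are assumptions, not the PR's settings.
CONTEXT_LENGTH=8192
MAX_PREFILL_TOKENS=8192

python3 -m sglang.launch_server \
  --model-path <model-path> \
  --tokenizer-worker-num 2 \
  --enable-aiter-allreduce-fusion \
  --cuda-graph-max-bs 64 \
  --context-length "${CONTEXT_LENGTH}" \
  --disable-radix-cache \
  --max-prefill-tokens "${MAX_PREFILL_TOKENS}" \
  --scheduler-recv-interval 1
```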
@zhentaocc
Collaborator Author

zhentaocc commented Mar 30, 2026

FP8 local test results

Concurrency 64, 1k-in/1k-out: throughput 708.75 tokens/s/GPU. @functionstackx

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     640       
Benchmark duration (s):                  209.64    
Total input tokens:                      590851    
Total input text tokens:                 590851    
Total generated tokens:                  589052    
Total generated tokens (retokenized):    554942    
Request throughput (req/s):              3.05      
Input token throughput (tok/s):          2818.38   
Output token throughput (tok/s):         2809.80   
Peak output token throughput (tok/s):    3682.00   
Peak concurrent requests:                82        
Total token throughput (tok/s):          5628.19   
Concurrency:                             62.31     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   20411.98  
Median E2E Latency (ms):                 20282.34  
P90 E2E Latency (ms):                    23190.49  
P99 E2E Latency (ms):                    26606.04  
---------------Time to First Token----------------
Mean TTFT (ms):                          455.92    
Median TTFT (ms):                        423.01    
P99 TTFT (ms):                           1617.35   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.71     
Median TPOT (ms):                        21.62     
P99 TPOT (ms):                           26.91     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           21.71     
Median ITL (ms):                         19.30     
P95 ITL (ms):                            20.90     
P99 ITL (ms):                            89.05     
Max ITL (ms):                            3259.61   
==================================================
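For comparing the BF16 and FP8 runs, the fixed-width `key:   value` layout of these sglang serving-benchmark reports is easy to parse. A minimal sketch — the helper name `parse_benchmark` is my own, not part of sglang:

```python
import re

def parse_benchmark(report: str) -> dict:
    """Parse 'key:   value' lines from an sglang serving-benchmark report."""
    results = {}
    for line in report.splitlines():
        # Key, a colon, at least two alignment spaces, then the value.
        m = re.match(r"^(.+?):\s{2,}(\S+)\s*$", line)
        if m:
            results[m.group(1).strip()] = m.group(2)
    return results

# A few lines from the FP8 report above as a sample.
sample = """\
Backend:                                 sglang
Successful requests:                     640
Mean TTFT (ms):                          455.92"""

parsed = parse_benchmark(sample)
print(parsed["Mean TTFT (ms)"])  # -> 455.92
```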

@zhentaocc
Collaborator Author

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang

@github-actions
Contributor

@zhentaocc Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23735484968
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang
Pinned ref: fa3b1fb
Approval: not required (trusted collaborator).

@zhentaocc zhentaocc marked this pull request as ready for review March 30, 2026 08:33
Contributor

@claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@zhentaocc
Collaborator Author

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang

@github-actions
Contributor

@zhentaocc Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23735797269
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys qwen3.5-bf16-mi355x-sglang qwen3.5-fp8-mi355x-sglang
Pinned ref: f0fd6c9
Approval: not required (trusted collaborator).


2 participants