Conversation
@CUHKSZzxy CUHKSZzxy commented May 9, 2025

Objective

Align with the vLLM v1 metrics system and beyond. Key alignments:

  • Monotonic Timestamps:
    -- Uses time.perf_counter() for interval calculations (avoids clock-drift issues).
  • Metric Types:
    -- Gauges: active requests, cache usage, etc.
    -- Counters: token totals, request success / failure counts, etc.
    -- Histograms: TTFT (time to first token), TPOT (time per output token, i.e., inter-token latency), end-to-end latency, etc.
  • Metrics Publishing:
    -- CLI logging
    -- Prometheus & Grafana
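As a stdlib-only illustration of how the three metric types behave (the actual implementation uses prometheus_client objects; the class and metric semantics below are sketched, and the example values are invented):

```python
import bisect

class Gauge:
    """A value that can go up or down, e.g. the number of running requests."""
    def __init__(self):
        self.value = 0.0
    def set(self, v):
        self.value = v

class Counter:
    """A monotonically increasing total, e.g. generated tokens."""
    def __init__(self):
        self.value = 0.0
    def inc(self, n=1.0):
        assert n >= 0, "counters only go up"
        self.value += n

class Histogram:
    """Observations sorted into buckets by upper bound (le), e.g. TTFT seconds."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot is +Inf
    def observe(self, v):
        self.counts[bisect.bisect_left(self.buckets, v)] += 1

running = Gauge(); running.set(3)     # gauge: 3 requests in flight right now
tokens = Counter(); tokens.inc(128)   # counter: 128 tokens generated so far
ttft_hist = Histogram([0.05, 0.1, 0.5, 1.0])
ttft_hist.observe(0.042)              # lands in the <=0.05 bucket
```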

We only record critical timestamps and events inside the engine process, without further processing there. Heavyweight metric calculation and publishing are kept out of the main loop to minimize overhead.
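A minimal sketch of that split, assuming a hypothetical RequestTimings holder (not lmdeploy's actual class names): the hot path only stamps time.perf_counter() values, and consumers derive intervals later.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RequestTimings:
    """Hot path stores monotonic timestamps only; derivation happens elsewhere."""
    arrival: float = field(default_factory=time.perf_counter)
    first_token: Optional[float] = None
    finished: Optional[float] = None

    def on_first_token(self):
        # Called once in the engine loop: just record, never compute.
        if self.first_token is None:
            self.first_token = time.perf_counter()

    def on_finish(self):
        self.finished = time.perf_counter()

# Cold path (CLI logger / Prometheus exporter), outside the main loop:
def ttft(t: RequestTimings) -> float:
    return t.first_token - t.arrival

def e2e_latency(t: RequestTimings) -> float:
    return t.finished - t.arrival
```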

For convenient Grafana visualization and usage, we align with SGLang.

TODO

  • Refactor: After the MP engine lands, things change ... the global singleton context will be local to each process. Carrying the information from the engine to the async engine seems the most convenient and least error-prone way to do it; otherwise we may perform IPC frequently.
  • Refactor: 1. Avoid parameter passing (singleton context), 2. Reduce computation overhead (CPU overhead is high, but can be solved with the MP engine)
  • Refactor: Decouple prometheus_client, only install / import when needed
  • Update: Add user guide
  • Refactor: Reduce messy parameters, pack things into a class
  • Feature: Grafana visualization
  • Feature: Expert information collections (deferred in another PR)
  • Refactor: Minimize the modifications to async engine generate() and engine _async_loop_main()
  • Fix: Use time.perf_counter()

Usage

Start the server with --enable-metrics

lmdeploy serve api_server Qwen/Qwen2.5-7B-Instruct --enable-metrics
  • Metrics Publishing - Logging
    With --enable-metrics, key metrics (e.g., finished / unfinished / running / waiting requests, token throughputs, cache usage) are printed to the terminal every 10 seconds.
    [screenshot: cli_log]

  • Metrics Publishing - Prometheus & Grafana
    -- Raw Metrics
    Access the raw Prometheus metrics via http://localhost:23333/metrics/ . You can also curl the metrics endpoint (curl http://localhost:23333/metrics/) to view the raw Prometheus results. No extra setup is required for this step.
    [screenshot: prometheus]
    -- Prometheus Panel
    Access the Prometheus panel via http://localhost:9090 (9090 is the current default port for the Prometheus panel). This requires extra setup; please check the user guide for details.
    [screenshot: prometheus_panel]
    -- Grafana Panel
    Access the Grafana panel via http://localhost:3000 (3000 is the current default port for the Grafana panel). This requires extra setup; please check the user guide for details.
    [screenshot: grafana_panel]
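For a programmatic check of the raw endpoint, here is a small stdlib-only sketch; the simplified parser ignores the labels and timestamps that real exposition-format lines may carry, and the metric name in the test is invented.

```python
from urllib.request import urlopen

def fetch_metrics(url="http://localhost:23333/metrics/"):
    """Fetch the raw Prometheus text from a running api_server."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode()

def parse_samples(text):
    """Very simplified parser: skip '# HELP'/'# TYPE' comments and split
    each remaining 'name value' line. Real lines may also carry labels
    in {...} and timestamps, which this sketch does not handle."""
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, _, value = line.rpartition(" ")
            samples[name] = float(value)
    return samples
```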

Request Timeline

The following diagram depicts how we define and calculate time intervals during the request lifecycle, which adheres to vLLM.
[diagram: timeline]
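The diagram itself is not reproduced here, but the interval definitions can be sketched as follows (names and example values are illustrative; TPOT is shown with one common definition, averaging decode time over the tokens after the first):

```python
def intervals(arrival, first_token, finished, num_output_tokens):
    """Derive the timeline metrics from three monotonic timestamps (seconds)."""
    ttft = first_token - arrival        # time to first token
    e2e = finished - arrival            # end-to-end request latency
    # Inter-token latency: decode time averaged over the remaining tokens.
    tpot = (finished - first_token) / max(num_output_tokens - 1, 1)
    return ttft, tpot, e2e

# e.g. arrival at t=0 s, first token at 0.08 s, 101 tokens done at 2.08 s
ttft, tpot, e2e = intervals(0.0, 0.08, 2.08, 101)
```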

Performance Impacts

  • Conclusion

Tested with Qwen2.5-0.5B / Qwen2.5-7B / Qwen2.5-32B; no obvious performance impact. (Requires #3627)

Check the following tables for output throughput details. We conducted tests using 1,000 prompts, with input length 1k and output length 1k. Each model was tested three times to reduce the impact of performance fluctuations.

  • Qwen2.5-0.5B, TP1

    | W/O metrics (tokens/s) | W/ metrics (tokens/s) |
    | ---------------------- | --------------------- |
    | 20387                  | 20555                 |
    | 20341                  | 20877                 |
    | 20746                  | 20771                 |

  • Qwen2.5-7B, TP1

    | W/O metrics (tokens/s) | W/ metrics (tokens/s) |
    | ---------------------- | --------------------- |
    | 8836                   | 8721                  |
    | 8780                   | 8736                  |
    | 8800                   | 8723                  |

  • Qwen2.5-32B, TP2

    | W/O metrics (tokens/s) | W/ metrics (tokens/s) |
    | ---------------------- | --------------------- |
    | 3019                   | 3160                  |
    | 3167                   | 3165                  |
    | 3189                   | 3173                  |
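As a sanity check, averaging each configuration's three runs from the tables above shows the with/without-metrics difference stays within roughly ±1.3%:

```python
from statistics import mean

# Throughput numbers copied from the tables above: (without metrics, with metrics).
runs = {
    "Qwen2.5-0.5B TP1": ([20387, 20341, 20746], [20555, 20877, 20771]),
    "Qwen2.5-7B TP1":   ([8836, 8780, 8800],    [8721, 8736, 8723]),
    "Qwen2.5-32B TP2":  ([3019, 3167, 3189],    [3160, 3165, 3173]),
}

for name, (without, with_metrics) in runs.items():
    delta = (mean(with_metrics) - mean(without)) / mean(without) * 100
    print(f"{name}: {mean(without):.0f} -> {mean(with_metrics):.0f} tok/s ({delta:+.2f}%)")
```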

Related Issues & PR

CUHKSZzxy added 2 commits May 9, 2025 20:38
Conflicts:
	lmdeploy/messages.py
	lmdeploy/pytorch/engine/engine.py
	lmdeploy/pytorch/engine/engine_instance.py
	lmdeploy/pytorch/messages.py
	lmdeploy/pytorch/paging/scheduler.py
@CUHKSZzxy CUHKSZzxy added the WIP label May 9, 2025
@CUHKSZzxy CUHKSZzxy removed the WIP label May 26, 2025
@CUHKSZzxy CUHKSZzxy marked this pull request as ready for review May 26, 2025 13:24
@CUHKSZzxy CUHKSZzxy removed the WIP label Jul 3, 2025
@lvhan028 lvhan028 mentioned this pull request Jul 7, 2025
  - job_name: lmdeploy
    static_configs:
      - targets:
          - '$host_ip:$api_server_port1' # <= Modify this
@RunningLeon RunningLeon (Collaborator) commented Jul 8, 2025

Can we configure all the DP server URLs here and show their data in the Grafana board?

@RunningLeon RunningLeon (Collaborator) left a comment
LGTM

@grimoire grimoire (Collaborator) left a comment
LGTM

@lvhan028 lvhan028 merged commit 1e8ce56 into InternLM:main Jul 9, 2025
5 checks passed

voycey commented Aug 5, 2025

I see this has been merged, but --enable-metrics is still not working?

@RunningLeon RunningLeon (Collaborator)

@voycey Hi, this feature only works for backend=pytorch, while the default backend is turbomind. Metrics for the turbomind backend will be added in another PR.


voycey commented Aug 5, 2025

Docs don't mention anything about this being limited to the PyTorch backend :(

Any ETA on Turbomind metrics? It's running incredibly fast for me and I would like to see what the tokens/second rate is; is there any other way?

@CUHKSZzxy CUHKSZzxy (Collaborator, Author)

@voycey
Thanks for your feedback; metrics support for turbomind is on the way and will be ready soon.
Check the following PR

@CUHKSZzxy CUHKSZzxy mentioned this pull request Aug 5, 2025

Labels

enhancement New feature or request


6 participants