Log per-server inference metrics by samsja · Pull Request #2650 · PrimeIntellect-ai/prime-rl

samsja · 2026-05-27T03:09:25Z

Summary

add per-server inference metric scopes under inference/server/<server>/...
add direct server liveness metrics via .../up
add KV cache hit aliases and remaining KV cache metrics
keep aggregate and P/D role-aware metrics intact

Testing

/shared/research-prod/prime-rl/.venv/bin/ruff check src/prime_rl/orchestrator/inference_metrics.py tests/unit/orchestrator/test_inference_metrics.py
PYTHONPATH=src:packages/prime-rl-configs/src /shared/research-prod/prime-rl/.venv/bin/python -m pytest tests/unit/orchestrator/test_inference_metrics.py

Note: uv run was not usable in the standalone /tmp worktree because workspace submodules were not initialized there, so I used the existing repo venv against the /tmp worktree.

Note

Low Risk
Observability-only changes to metrics polling and W&B logging; no changes to inference serving or training control paths.

Overview
Extends the orchestrator inference metrics collector so W&B gets per-server scopes alongside existing aggregate and prefill/decode role scopes.

Each admin client gets a stable server_XX_<host>_<port> name; metrics are logged under inference/server/<name>/... from that endpoint alone. inference/server/<name>/up is emitted for every configured server (1 when the latest poll returned metrics, 0 otherwise), including polls with no successful samples. Smoothed per-server keys are removed when a server stops responding so dashboards do not show stale values.
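As a rough illustration of the naming and liveness scheme described above, here is a minimal sketch. The helper names `server_name` and `up_metrics`, the two-digit index, and the host sanitization rule are assumptions made for this example; the PR text only specifies the `server_XX_<host>_<port>` pattern and the `inference/server/<name>/up` key.

```python
import re
from urllib.parse import urlparse


def server_name(index: int, base_url: str) -> str:
    """Hypothetical helper: build a stable server_XX_<host>_<port> name."""
    parsed = urlparse(base_url)
    # Assumption: non-alphanumeric host characters are replaced with "_".
    host = re.sub(r"[^0-9A-Za-z]+", "_", parsed.hostname or "unknown")
    port = parsed.port if parsed.port is not None else 0
    return f"server_{index:02d}_{host}_{port}"


def up_metrics(names: list[str], responded: set[str]) -> dict[str, float]:
    """Emit an `up` value for every configured server, even when none responded."""
    return {
        f"inference/server/{name}/up": 1.0 if name in responded else 0.0
        for name in names
    }


urls = ["http://10.0.0.1:8000", "http://10.0.0.2:8000"]
names = [server_name(i, url) for i, url in enumerate(urls)]
# Pretend only the first server answered the latest poll.
up = up_metrics(names, responded={names[0]})
```

Because `up_metrics` iterates over the configured names rather than over the poll results, a poll with no successful samples still yields a 0 for every server.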

KV cache naming is duplicated for convenience: prefix hit rates alias to kv_cache_hit_rate / cpu_kv_cache_hit_rate, and usage mean/max map to remaining capacity metrics (kv_cache_left_perc_*, cpu_kv_cache_left_perc_*) via 1 - usage.
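The aliasing could look roughly like the sketch below. The input key names (`prefix_cache_hit_rate`, `kv_cache_usage_mean`, and so on) and the choice to reuse the `mean`/`max` suffix on the output keys are assumptions for illustration; the description only names the output prefixes and the `1 - usage` rule.

```python
def add_kv_cache_aliases(scope: dict[str, float]) -> dict[str, float]:
    """Return a copy of `scope` with hit-rate aliases and remaining-capacity keys.

    Input key names are hypothetical; only keys present in `scope` are mapped.
    """
    out = dict(scope)
    for prefix in ("", "cpu_"):
        hit = scope.get(f"{prefix}prefix_cache_hit_rate")
        if hit is not None:
            # Alias: same value under a second, more convenient name.
            out[f"{prefix}kv_cache_hit_rate"] = hit
        for stat in ("mean", "max"):
            usage = scope.get(f"{prefix}kv_cache_usage_{stat}")
            if usage is not None:
                # Remaining capacity as a fraction: 1 - usage.
                out[f"{prefix}kv_cache_left_perc_{stat}"] = 1.0 - usage
    return out


aliased = add_kv_cache_aliases(
    {"prefix_cache_hit_rate": 0.5, "kv_cache_usage_mean": 0.25, "kv_cache_usage_max": 0.75}
)
```

Note that under this suffix convention the key derived from maximum usage holds the smallest remaining capacity seen, which is worth keeping in mind when reading dashboards.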

^{Reviewed by Cursor Bugbot for commit 6e1758e. Bugbot is set up for automated code reviews on this repo.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit 78dd289.}

cursor · 2026-05-27T03:10:10Z

+        """Do not keep logging stale per-server metrics when a server fails to respond."""
+        for key in list(smoothed_metrics):
+            if key.startswith("inference/server/") and key not in current_metrics:
+                del smoothed_metrics[key]


Stale drop omits active server metrics

Medium Severity

drop_stale_server_metrics removes any smoothed key under inference/server/ that is absent from the current poll’s metrics dict. build_scope_metrics only adds many per-server fields conditionally (throughput, histogram averages, KV cache stats, cache aliases), so a still-responding server can omit keys on a given cycle while remaining in active_server_names. Those smoothed series are then dropped from the W&B payload even though the server is up, causing gaps and misleading dashboards.
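One possible way to address this, sketched under assumptions, is to decide staleness per server rather than per key: a smoothed key is dropped only when its server name is absent from the set of servers that responded. The function name and signature below are hypothetical, and the sketch assumes the `up` keys are logged outside the smoothed dict so they are unaffected.

```python
def drop_metrics_for_inactive_servers(
    smoothed_metrics: dict[str, float],
    active_server_names: set[str],
) -> None:
    """Remove smoothed per-server keys only for servers that did not respond.

    Keys of a responding server are kept even if the current poll omitted them.
    """
    prefix = "inference/server/"
    for key in list(smoothed_metrics):
        if not key.startswith(prefix):
            continue  # aggregate and role-scoped keys are left alone
        # Key layout assumed: inference/server/<name>/<metric...>
        name = key[len(prefix):].split("/", 1)[0]
        if name not in active_server_names:
            del smoothed_metrics[key]


smoothed = {
    "inference/server/server_00_a_1/throughput": 1.0,
    "inference/server/server_01_b_2/throughput": 2.0,
    "inference/throughput": 3.0,
}
drop_metrics_for_inactive_servers(smoothed, active_server_names={"server_00_a_1"})
```

With this approach, a conditionally emitted field such as a histogram average keeps its last smoothed value while the server is up, and disappears only once the server itself stops responding.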

cursor Bot reviewed May 27, 2026

Log per-server inference metrics

6e1758e

samsja force-pushed the codex/per-server-inference-metrics branch from 78dd289 to 6e1758e · May 27, 2026 05:09

samsja wants to merge 1 commit into main from codex/per-server-inference-metrics
