Log per-server inference metrics #2650
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 78dd289.
    """Do not keep logging stale per-server metrics when a server fails to respond."""
    for key in list(smoothed_metrics):
        if key.startswith("inference/server/") and key not in current_metrics:
            del smoothed_metrics[key]
Stale drop omits active server metrics
Medium Severity
`drop_stale_server_metrics` removes any smoothed key under `inference/server/` that is absent from the current poll's metrics dict. `build_scope_metrics` only adds many per-server fields conditionally (throughput, histogram averages, KV cache stats, cache aliases), so a still-responding server can omit keys on a given cycle while remaining in `active_server_names`. Those smoothed series are then dropped from the W&B payload even though the server is up, causing gaps and misleading dashboards.
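One possible way to address this is to decide staleness per server rather than per key. The sketch below is illustrative only: the function name `drop_metrics_for_inactive_servers` and its signature are hypothetical (not taken from the PR), and it assumes the server name is the first path segment after `inference/server/` and contains no `/`.

```python
SERVER_PREFIX = "inference/server/"


def drop_metrics_for_inactive_servers(
    smoothed_metrics: dict[str, float],
    active_server_names: set[str],
) -> None:
    """Drop smoothed per-server keys only for servers that did not respond.

    Hypothetical helper: keys of a responding server are kept even when a
    conditional field is missing from the current poll.
    """
    for key in list(smoothed_metrics):
        if not key.startswith(SERVER_PREFIX):
            continue  # aggregate and role scopes are left untouched
        # Assumption: keys look like inference/server/<name>/<metric>
        server_name = key[len(SERVER_PREFIX):].split("/", 1)[0]
        if server_name not in active_server_names:
            del smoothed_metrics[key]
```

With this shape, a server that responds but omits, say, a throughput field on one cycle keeps its previously smoothed value, while a server missing from `active_server_names` has all of its smoothed keys removed.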
78dd289 to 6e1758e


Summary
inference/server/<server>/...
.../up

Testing
/shared/research-prod/prime-rl/.venv/bin/ruff check src/prime_rl/orchestrator/inference_metrics.py tests/unit/orchestrator/test_inference_metrics.py
PYTHONPATH=src:packages/prime-rl-configs/src /shared/research-prod/prime-rl/.venv/bin/python -m pytest tests/unit/orchestrator/test_inference_metrics.py

Note: `uv run` was not usable in the standalone `/tmp` worktree because workspace submodules were not initialized there, so I used the existing repo venv against the `/tmp` worktree.

Note
Low Risk
Observability-only changes to metrics polling and W&B logging; no changes to inference serving or training control paths.
Overview
Extends the orchestrator inference metrics collector so W&B gets per-server scopes alongside existing aggregate and prefill/decode role scopes.
Each admin client gets a stable `server_XX_<host>_<port>` name; metrics are logged under `inference/server/<name>/...` from that endpoint alone. `inference/server/<name>/up` is emitted for every configured server (1 when the latest poll returned metrics, 0 otherwise), including polls with no successful samples. Smoothed per-server keys are removed when a server stops responding so dashboards do not show stale values.

KV cache naming is duplicated for convenience: prefix hit rates alias to `kv_cache_hit_rate` / `cpu_kv_cache_hit_rate`, and usage mean/max map to remaining capacity metrics (`kv_cache_left_perc_*`, `cpu_kv_cache_left_perc_*`) via `1 - usage`.

Reviewed by Cursor Bugbot for commit 6e1758e. Bugbot is set up for automated code reviews on this repo.
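The naming, `up`, and alias rules from the Overview can be sketched roughly as follows. This is not the PR's code: all three function names are hypothetical, the two-digit zero-padded index and the host sanitization are guesses at what `server_XX_<host>_<port>` means, and the source keys (`prefix_cache_hit_rate`, `kv_cache_usage_mean`, and so on) and the reuse of the `mean`/`max` suffix on the `left_perc` keys are assumptions made for illustration.

```python
import re


def server_name(index: int, host: str, port: int) -> str:
    """Build a name shaped like server_XX_<host>_<port> (format assumed)."""
    safe_host = re.sub(r"[^A-Za-z0-9]+", "_", host).strip("_")
    return f"server_{index:02d}_{safe_host}_{port}"


def up_metrics(configured: list[str], responded: set[str]) -> dict[str, int]:
    """Emit up=1/0 for every configured server, even if none responded."""
    return {
        f"inference/server/{name}/up": int(name in responded)
        for name in configured
    }


def add_kv_cache_aliases(scope: dict[str, float]) -> dict[str, float]:
    """Return a copy of one scope's metrics with convenience aliases added."""
    out = dict(scope)
    # Hypothetical source keys for the prefix hit rates.
    hit_rate_aliases = (
        ("prefix_cache_hit_rate", "kv_cache_hit_rate"),
        ("cpu_prefix_cache_hit_rate", "cpu_kv_cache_hit_rate"),
    )
    for source_key, alias in hit_rate_aliases:
        if source_key in scope:
            out[alias] = scope[source_key]
    # Remaining capacity is 1 - usage; the suffix reuse is an assumption.
    for prefix in ("kv_cache", "cpu_kv_cache"):
        for stat in ("mean", "max"):
            usage_key = f"{prefix}_usage_{stat}"
            if usage_key in scope:
                out[f"{prefix}_left_perc_{stat}"] = 1.0 - scope[usage_key]
    return out
```

Because `up_metrics` iterates over the configured list rather than the responses, a poll with no successful samples still yields an explicit 0 for every server instead of a missing series.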