fix: synchronize ACP telemetry and refresh remote final state#2460
simonrosenberg wants to merge 2 commits into main
Conversation
- Python API breakage checks — ✅ PASSED
- REST API breakage checks (OpenAPI) — ❌ FAILED
all-hands-bot left a comment
Taste Rating: 🟡 Acceptable - Core telemetry fix is solid, but bundled breaking changes need attention
Verdict: ❌ Needs documentation - The synchronization fix is correct, but two undocumented breaking changes (retry removal + hook regression) should be called out in the PR description or split into separate PRs.
Key Insight: Moving cost recording to a single synchronized path after UsageUpdate receipt is the right fix for zero-cost telemetry. The per-session tracking is cleaner than global state. However, removing ~150 lines of retry logic and changing hook behavior are significant changes that deserve explicit justification.
```python
# PromptResponse, so under normal conditions the notification handler
# completes almost immediately. This timeout is a safety net for slow
# or remote servers.
_USAGE_UPDATE_TIMEOUT: float = float(
```
🔴 Critical - Undocumented Breaking Change: The PR completely removes connection retry logic (previously handled ConnectionError, BrokenPipeError, EOFError with 3 retries + exponential backoff). This timeout constant replacement is good, but the retry removal is undocumented.
Why this matters: Production ACP servers can have temporary network blips, container restarts, or connection resets. The old code tolerated these via retry. The new code will fail immediately.
What is needed: Either (1) explain in the PR description why retry is no longer needed, or (2) restore retry logic, or (3) split this removal into a separate PR with justification.
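For reference, the removed behavior the reviewer describes (3 retries on `ConnectionError`/`BrokenPipeError`/`EOFError` with exponential backoff) can be sketched as follows; the helper name and delay values are illustrative, not the PR's original code:

```python
import time


def call_with_retry(fn, retries=3, base_delay=0.5):
    """Retry transient connection failures with exponential backoff.

    Illustrative sketch of the removed retry logic: up to `retries`
    re-attempts on connection-level errors, doubling the delay each time.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except (ConnectionError, BrokenPipeError, EOFError):
            if attempt == retries:
                raise  # retries exhausted; surface the error
            time.sleep(base_delay * (2 ** attempt))
```

With such a wrapper, a temporary network blip or container restart is absorbed rather than failing the prompt immediately.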
```python
        elapsed: Wall-clock seconds for this prompt round-trip (optional).
        usage_update: The synchronized ACP UsageUpdate for this turn, if any.
    """
    if usage_update is not None and usage_update.cost is not None:
```
🟢 Acceptable: Single telemetry recording path is exactly what was needed. Cost/tokens/latency all processed in one place after synchronization eliminates the split-brain problem described in the issue. Per-session cost tracking (_last_cost_by_session) correctly handles multiple concurrent sessions.
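The per-session bookkeeping the reviewer praises can be sketched like this (class and method names are illustrative, not the PR's actual code): each session stores its own last cumulative cost, so the per-turn delta is computed independently for concurrent sessions.

```python
class CostTracker:
    """Illustrative sketch of per-session cost tracking (_last_cost_by_session).

    Each session keeps its own last-seen cumulative cost, so per-turn deltas
    for concurrent sessions never interfere with each other.
    """

    def __init__(self) -> None:
        self._last_cost_by_session: dict[str, float] = {}

    def record_turn(self, session_id: str, cumulative_cost: float) -> float:
        # Delta against this session's previous cumulative cost only.
        last = self._last_cost_by_session.get(session_id, 0.0)
        self._last_cost_by_session[session_id] = cumulative_cost
        return cumulative_cost - last
```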
```python
    self._session_id,
)
await _drain_notifications()
if self._client.get_turn_usage_update(self._session_id or "") is None:
```
🟢 Acceptable - Good Taste: This synchronization approach is correct. Preparing the event before prompt(), waiting for UsageUpdate notification if not already received, then processing in a single path eliminates the race condition. The 2.0s timeout is reasonable (server writes UsageUpdate before PromptResponse per ACP protocol). Configurable via env var is pragmatic.
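The synchronization step described above can be sketched with `asyncio.wait_for` on an event that the notification handler sets (the function and parameter names are illustrative, not the PR's actual API):

```python
import asyncio


async def wait_for_usage_update(event: asyncio.Event, timeout: float = 2.0) -> bool:
    """Wait up to `timeout` seconds for the UsageUpdate handler to set `event`.

    Sketch of the pattern: the event is prepared before prompt() is sent; after
    the PromptResponse arrives, the caller waits briefly for the notification.
    Returns False on timeout so the caller can log a warning and record
    whatever partial metrics are available.
    """
    try:
        await asyncio.wait_for(event.wait(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False
```

Since the server writes UsageUpdate before PromptResponse under normal conditions, the wait usually returns immediately; the timeout only matters for slow or buggy servers.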
```python
    HookEventType.USER_PROMPT_SUBMIT,
    HookEventType.STOP,
)
if any(hook_config.has_hooks_for_event(t) for t in unsupported):
```
🔴 Critical - Breaking Change: This changes hook behavior from "all hooks executed server-side" to "only SessionStart/SessionEnd supported locally, others emit warning but do not execute".
What breaks: Code relying on PRE_TOOL_USE, POST_TOOL_USE, USER_PROMPT_SUBMIT, or STOP hooks will silently stop working (just a warning in logs).
Why risky: This is a functional regression unrelated to the telemetry fix. It should be: (1) Documented in the PR description as a breaking change, OR (2) Split into a separate PR with justification for the regression.
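The warn-but-don't-execute path the reviewer flags can be sketched as follows; the enum values match the event names in the diff, but the helper and enum definitions here are illustrative, not the SDK's actual code:

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class HookEventType(Enum):
    SESSION_START = "session_start"
    SESSION_END = "session_end"
    PRE_TOOL_USE = "pre_tool_use"
    POST_TOOL_USE = "post_tool_use"
    USER_PROMPT_SUBMIT = "user_prompt_submit"
    STOP = "stop"


# Events the new code warns about instead of executing (per the review).
UNSUPPORTED = (
    HookEventType.PRE_TOOL_USE,
    HookEventType.POST_TOOL_USE,
    HookEventType.USER_PROMPT_SUBMIT,
    HookEventType.STOP,
)


def warn_on_unsupported_hooks(configured: set) -> list:
    """Report configured hook events that will NOT run; emit a single warning."""
    ignored = [t for t in UNSUPPORTED if t in configured]
    if ignored:
        logger.warning("Unsupported ACP hook events will not run: %s", ignored)
    return ignored
```

Note the failure mode the reviewer describes: the caller gets only a log line, so code that registered a `PRE_TOOL_USE` hook keeps running with the hook silently skipped.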
all-hands-bot left a comment
Additional feedback on test coverage:
```python
mock_response = MagicMock()
mock_response.usage = None
assert len(agent.llm.metrics.token_usages) == 0
```
🟡 Suggestion - Test Gap: The PR tests when UsageUpdate is completely absent (cost=None in fixture), but not when it times out after 2.0s. Consider adding a test that verifies the timeout path:

```python
def test_step_records_partial_metrics_on_usage_timeout(self, tmp_path):
    """Timeout waiting for UsageUpdate logs warning but records available token metrics."""
    # Setup: executor returns response but never populates turn_usage_updates
    # Assert: warning logged, token_usages recorded from PromptResponse, but costs remain at 0
```

This would verify graceful degradation when the ACP server is slow or buggy and UsageUpdate never arrives before the timeout.
Addressed the remaining review items:
Verification:
Fixes #2375
This implements the fix direction from the issue discussion:
- `ACPAgent` now waits for the `UsageUpdate` notification delivered via `session_update()` before recording cost/tokens/latency.
- `RemoteConversation` now refreshes the final remote state before `run()` returns.

Why:
Latest zero-cost ACP benchmark rows were caused by two separate correctness problems:
- Telemetry could be recorded before the `UsageUpdate` notification arrived, so cost was recorded as zero.
- `RemoteConversation` could return from REST fallback with stale cached state, leaving `conversation_stats` at zero even when the server had final stats.

Tests:
```shell
PYTHONPATH=/tmp/sdk-issue-2375/openhands-sdk${PYTHONPATH:+:$PYTHONPATH} pytest tests/sdk/agent/test_acp_agent.py tests/sdk/conversation/remote/test_remote_conversation.py
```

Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.13-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)
```shell
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:d61f7eb-python
```

Run
All tags pushed for this build
About Multi-Architecture Support
- Each variant tag (e.g. `d61f7eb-python`) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g. `d61f7eb-python-amd64`) are also available if needed