fix(grpc): eliminate request_id collision in host pending responses #7637

Open
octo-patch wants to merge 1 commit into microsoft:main from octo-patch:fix/issue-7016-grpc-request-id-collision

@octo-patch

Fixes #7016

Problem

Each GrpcWorkerAgentRuntime starts its own per-session request_id counter
from "1". When two different runtimes both send RPC requests to agents living
on the same target runtime, the host stored pending response futures under:

_pending_responses[target_client_id][request_id]

Because both senders start from "1", both inserts share the same key — the
second overwrites the first, and the subsequent pop() raises KeyError('1').

Concrete topology (from the issue):

  • Runtime-2 → relay_agent (hosted on Runtime-1): request_id = "1"
  • relay_agent → inner_agent (also on Runtime-1): request_id = "1" ← collision
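
The collision can be reproduced outside the runtime with a plain nested dict. This is a minimal sketch, not the servicer's actual code: the `pending` dict, the `"runtime-1"` client id, and the `reproduce_collision` helper are hypothetical stand-ins for `_pending_responses` and the host's request handling.

```python
import asyncio
from typing import Dict


async def reproduce_collision() -> str:
    loop = asyncio.get_running_loop()
    # Hypothetical stand-in for _pending_responses[target_client_id][request_id].
    pending: Dict[str, Dict[str, asyncio.Future]] = {}
    target = "runtime-1"

    # Runtime-2 -> relay_agent: the sender's counter yields request_id "1".
    outer = loop.create_future()
    pending.setdefault(target, {})["1"] = outer

    # relay_agent -> inner_agent: a different sender, but also request_id "1".
    inner = loop.create_future()
    pending.setdefault(target, {})["1"] = inner  # silently overwrites `outer`

    # The inner response arrives first and removes the only entry.
    pending[target].pop("1").set_result("inner reply")

    # The outer response arrives next and finds nothing under "1".
    try:
        pending[target].pop("1")
    except KeyError as exc:
        return repr(exc)
    return "no error"


result = asyncio.run(reproduce_collision())
print(result)
```

The second `pop` fails because both requests were filed under the same `(target, "1")` key, which matches the `KeyError('1')` reported in the issue.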

Solution

Before forwarding a request to the target runtime, the host now replaces the
sender's request_id with a uuid.uuid4() string. The UUID is used as the
key in _pending_responses so keys are globally unique regardless of how many
senders share the same counter values.

When the response arrives (carrying the UUID), the original request_id is
restored before the future is resolved, so the sender's own pending_requests
map can still match the response.
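
The remapping described above can be sketched roughly as follows. The `PendingResponses` class and its `register` and `resolve` methods are hypothetical names chosen for illustration; the real logic lives in `_process_request` and `_process_response`, and its exact shape is not shown in this description.

```python
import asyncio
import uuid
from typing import Any, Dict, List, Tuple


class PendingResponses:
    """Hypothetical stand-in for the host's pending-response bookkeeping."""

    def __init__(self) -> None:
        # target_client_id -> host UUID -> (future, sender's original request_id)
        self._pending: Dict[str, Dict[str, Tuple[asyncio.Future, str]]] = {}

    def register(self, target_client_id: str, original_request_id: str,
                 future: asyncio.Future) -> str:
        # The host-generated UUID replaces the sender's id in the forwarded request.
        host_request_id = str(uuid.uuid4())
        self._pending.setdefault(target_client_id, {})[host_request_id] = (
            future,
            original_request_id,
        )
        return host_request_id

    def resolve(self, target_client_id: str, host_request_id: str, payload: Any) -> None:
        # Look up by UUID, then hand the sender its own request_id back.
        future, original_request_id = self._pending[target_client_id].pop(host_request_id)
        future.set_result((original_request_id, payload))


async def demo() -> List[Tuple[str, str]]:
    loop = asyncio.get_running_loop()
    pending = PendingResponses()
    outer, inner = loop.create_future(), loop.create_future()

    # Both senders use request_id "1", yet the host keys do not clash.
    outer_id = pending.register("runtime-1", "1", outer)
    inner_id = pending.register("runtime-1", "1", inner)

    # Responses arrive in reverse order, as in the relay topology.
    pending.resolve("runtime-1", inner_id, "inner reply")
    pending.resolve("runtime-1", outer_id, "relay reply")
    return [await outer, await inner]


results = asyncio.run(demo())
```

Storing the original id next to the future keeps the translation local to the host, so neither the sender nor the target runtime needs to know that the id was swapped in transit.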

Changes are limited to _worker_runtime_host_servicer.py:

  • Add import uuid
  • In _process_request: generate a UUID, substitute it in the forwarded
    request, store (future, original_request_id) in _pending_responses
  • In _process_response: look up by UUID, restore original_request_id
  • Update type annotation and client-disconnect cleanup accordingly
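
Because each stored value is now a `(future, original_request_id)` tuple, any code that walks the map has to unpack it. The sketch below illustrates this for disconnect cleanup. `PendingMap` and `drop_client` are hypothetical names, and cancelling the outstanding futures is an assumption: this description does not say what the servicer does with them when a client disconnects.

```python
import asyncio
from typing import Dict, Tuple

# Assumed shape of the updated annotation:
# target_client_id -> host UUID -> (future, original request_id)
PendingMap = Dict[str, Dict[str, Tuple[asyncio.Future, str]]]


def drop_client(pending: PendingMap, client_id: str) -> int:
    """Remove every entry for a disconnected client and cancel unfinished futures."""
    cancelled = 0
    for future, _original_request_id in pending.pop(client_id, {}).values():
        if not future.done():
            future.cancel()  # assumption: the real cleanup may do something else
            cancelled += 1
    return cancelled


async def demo() -> Tuple[int, bool, bool]:
    loop = asyncio.get_running_loop()
    waiting = loop.create_future()
    finished = loop.create_future()
    finished.set_result("ok")
    pending: PendingMap = {
        "runtime-1": {"uuid-a": (waiting, "1"), "uuid-b": (finished, "2")}
    }
    count = drop_client(pending, "runtime-1")
    return count, waiting.cancelled(), "runtime-1" in pending


cleanup_result = asyncio.run(demo())
```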

Testing

Added test_cross_runtime_rpc_no_request_id_collision in
test_worker_runtime.py that reproduces the exact topology from the bug
report: an external runtime sends to a relay agent that in turn sends to an
inner agent on the same worker runtime, verifying the full chain completes
without KeyError.

…rosoft#7016)

Each GrpcWorkerAgentRuntime starts its own per-session request_id counter
from "1".  When multiple runtimes send RPC requests whose target agent lives
on the same worker, the host stored futures under
  _pending_responses[target_client_id][request_id]
so two in-flight requests with identical request_ids collided: the second
insert overwrote the first entry, and the subsequent pop() raised KeyError.

Fix: before forwarding a request to the target runtime, replace the sender's
request_id with a host-generated UUID.  The UUID is stored as the key in
_pending_responses.  When the response arrives (carrying the UUID as its
request_id), the original sender's request_id is restored so the sender's own
pending_requests map can still match it.

Add a regression test that reproduces the exact topology from the bug report:
runtime2 → relay_agent (on runtime1) → inner_agent (also on runtime1).

Co-Authored-By: Octopus <liyuan851277048@icloud.com>

Development

Successfully merging this pull request may close these issues.

GrpcWorkerAgentRuntimeHost : _pending_responses key collision when multiple senders target the same runtime with identical request_id → KeyError('1')
