Summary
Certain GAIA benchmark instances deterministically fail with Remote conversation got stuck across all retries, all SDK versions, and both Claude Sonnet 4.5 and Opus 4.6. The root cause is a conflict between the stuck detector's monologue threshold (3 consecutive agent MessageEvents) and extended thinking models that produce reasoning-only responses (no text, no tool calls) before their first action.
Affected instances
| Instance | Question | Sonnet (run 1) | Sonnet (run 2) | Sonnet (run 3) | Opus (run 4) | Opus (run 5) |
|---|---|---|---|---|---|---|
| 2d83110e | Reversed text puzzle | STUCK | STUCK | STUCK | STUCK | STUCK |
| 2a649bb1 | EC numbers for virus testing | PASS | STUCK | STUCK | STUCK | STUCK |
| 6359a0b1 | Green polygon area | FAIL | FAIL | FAIL | STUCK | STUCK |
| 983bba7c | Animals in alvei papers | PASS | PASS | FAIL | STUCK | PASS |
2d83110e fails in 100% of runs (5/5). 2a649bb1 fails in 4/5. The Opus runs are more affected (up to 4 stuck instances vs 1–2 for Sonnet).
Reproduction
Run the GAIA benchmark with Claude Sonnet 4.5 or Opus 4.6, extended thinking enabled, and the OpenHands agent. Instance 2d83110e fails every time.
Root cause
The code path (agent.py:298–382)
- The LLM returns a valid response (`chatcmpl-*` ID) containing only thinking/reasoning blocks: no text content, no tool calls. `agent.py:302–309` detects `has_reasoning` and/or `has_content`:

```python
has_reasoning = (
    message.responses_reasoning_item is not None
    or message.reasoning_content is not None
    or (message.thinking_blocks and len(message.thinking_blocks) > 0)
)
has_content = any(
    isinstance(c, TextContent) and c.text.strip() for c in message.content
)
```

- Since there are no tool calls, execution falls through to line 356. If neither reasoning nor content is found, a warning is logged, but in all cases a `MessageEvent` is emitted unconditionally (lines 359–372):

```python
msg_event = MessageEvent(source="agent", llm_message=message, ...)
on_event(msg_event)  # Emitted even when empty
```

- Since `has_content` is False, `execution_status` is not set to FINISHED (lines 379–382), so the agent loop continues and calls the LLM again.
- The LLM produces the same reasoning-only response. This repeats 3 times.
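The loop can be reproduced with a minimal simulation. All names below are illustrative stand-ins for the agent.py structures, not the real API; the point is only that an unconditional event emission plus a never-FINISHED status yields one empty agent message per LLM call:

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Simplified LLM response; the real message carries typed content blocks."""
    thinking_blocks: list = field(default_factory=list)
    content: list = field(default_factory=list)  # plain strings for simplicity
    tool_calls: list = field(default_factory=list)


def agent_step(message: Message, events: list) -> str:
    """Mirrors the reported agent.py flow: MessageEvent emitted unconditionally."""
    has_content = any(c.strip() for c in message.content)
    if message.tool_calls:
        return "acting"
    events.append(("agent", "message"))  # emitted even when the response is empty
    return "finished" if has_content else "running"


events = [("user", "message")]
empty = Message()  # reasoning-only response with thinking stripped: all fields empty
for _ in range(3):
    status = agent_step(empty, events)

agent_msgs = sum(1 for src, kind in events if src == "agent" and kind == "message")
print(agent_msgs, status)  # 3 running: three empty agent MessageEvents, loop still going
```

After three iterations the event log contains exactly the three consecutive empty agent messages that the stuck detector then counts.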
The stuck detection (stuck_detector.py:200–225)
- `_is_stuck_monologue()` counts consecutive `MessageEvent`s with `source="agent"`:

```python
for event in reversed(events):
    if isinstance(event, MessageEvent):
        if event.source == "agent":
            agent_message_count += 1
```

- The monologue threshold is 3 (`types.py:40–41`). After 3 empty `MessageEvent`s, the conversation is marked STUCK.
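A runnable sketch of how three empty agent messages trip the detector. This simplifies events to `(source, kind)` tuples instead of real event objects, so the content check that Option 3 below would add is deliberately absent:

```python
MONOLOGUE_THRESHOLD = 3  # per types.py:40-41


def is_stuck_monologue(events: list) -> bool:
    """Count trailing consecutive agent messages, newest first."""
    count = 0
    for source, kind in reversed(events):
        if kind == "message" and source == "agent":
            count += 1
            if count >= MONOLOGUE_THRESHOLD:
                return True
        else:
            break  # any other event interrupts the streak
    return False


events = [
    ("system", "system_prompt"),
    ("user", "message"),
    ("agent", "message"),  # empty reasoning-only response
    ("agent", "message"),
    ("agent", "message"),
]
print(is_stuck_monologue(events))  # True: three empty messages still count
```

Because the counter never inspects message content, a reasoning-only response is indistinguishable from a genuine monologue.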
The error propagation (remote_conversation.py:1035–1039)
- The remote client detects the STUCK status and raises `ConversationRunError("Remote conversation got stuck")`.
Why these specific instances
These questions cause the model to reason internally without producing a tool call:
- `2d83110e`: Reversed text (`.rewsna eht sa "tfel" drow eht fo etisoppo eht etirw`). The model decodes it in thinking but doesn't know which tool to call for a pure-text answer; it only needs to call `finish`, but extended thinking fills the response with reasoning instead.
- `2a649bb1`: Multi-step research question. The model plans which searches to run but doesn't commit to a tool call in the first 3 responses.
Evidence from conversation archives
Each retry shows the identical event sequence:
```
event-00000 SystemPromptEvent (agent)
event-00001 MessageEvent (user) — question delivered, content_len=1
event-00002 ConversationStateUpdateEvent — execution_status=running
event-00004 MessageEvent (agent) — content=[], thinking_blocks=[], resp_id=chatcmpl-... [+2.0s]
event-00005 MessageEvent (agent) — content=[], thinking_blocks=[], resp_id=chatcmpl-... [+1.6s]
event-00006 MessageEvent (agent) — content=[], thinking_blocks=[], resp_id=chatcmpl-... [+1.6s]
event-00007 ConversationStateUpdateEvent — execution_status=stuck
```
Each empty response has a unique chatcmpl-* ID (real LLM calls, not cached), arrives in ~1.5–2s (vs 4–94s for normal first responses), and the thinking_blocks field is [] in the archived events (likely stripped or encrypted).
Why retries don't help
The retry mechanism escalates resource_factor (1→2→4→8), which increases pod CPU/memory. But the failure is in the model's response pattern, not resource exhaustion. Each retry gets a fresh runtime, fresh session, fresh conversation — but the same system prompt + question always produces the same model behavior.
Proposed solutions
Option 1: Don't emit MessageEvent for empty responses (minimal change)
In agent.py, skip the on_event(msg_event) call when the response has no content and no reasoning. This prevents empty messages from counting toward the monologue threshold.
```python
# Lines 356-372: only emit if there's something to emit
if has_reasoning or has_content:
    msg_event = MessageEvent(source="agent", llm_message=message, ...)
    on_event(msg_event)
```

Option 2: Inject a nudge after empty responses (more robust)
When the LLM returns an empty response, append a system/user message reminding it to use a tool:
```python
if not has_reasoning and not has_content:
    logger.warning("LLM produced empty response - nudging to use tools")
    # Don't emit an empty MessageEvent; continue the loop instead
    # The next LLM call will include context about available tools
    continue
```

Option 3: Exclude empty MessageEvents from monologue detection (targeted fix)
In stuck_detector.py:200–225, skip MessageEvents that have no content:
```python
if isinstance(event, MessageEvent):
    if event.source == "agent":
        # Don't count empty messages toward monologue
        if not event.llm_message or not event.llm_message.content:
            continue
        agent_message_count += 1
```

Option 4: Increase monologue threshold for extended thinking (configuration)
A threshold of 3 is too aggressive for extended thinking models that may need several reasoning passes. Increasing it to 5–6 would give the model more room while still catching genuine monologue loops.
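One possible shape for such a configuration, sketched here as a hypothetical dataclass (`StuckDetectorConfig` and `effective_threshold` are invented names; `types.py` currently hardcodes the constant):

```python
from dataclasses import dataclass


@dataclass
class StuckDetectorConfig:
    """Hypothetical config object; the real types.py uses a fixed constant of 3."""
    monologue_threshold: int = 3
    extended_thinking: bool = False

    @property
    def effective_threshold(self) -> int:
        # Give extended-thinking models more reasoning-only passes before
        # declaring a monologue, without lowering an explicitly higher setting.
        if self.extended_thinking:
            return max(self.monologue_threshold, 5)
        return self.monologue_threshold


print(StuckDetectorConfig().effective_threshold)                        # 3
print(StuckDetectorConfig(extended_thinking=True).effective_threshold)  # 5
```

Keeping the default at 3 for non-thinking configs preserves current behavior, so only extended-thinking runs see the relaxed limit.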
Recommendation
Option 1 is the safest minimal fix — an empty response with no content, no reasoning, and no tool calls provides no value as a MessageEvent and should not be emitted. Combined with Option 4 (raising the threshold to 5 for extended thinking configs), this would eliminate the false positives without masking real stuck scenarios.