[Bug] Remote conversation got stuck: monologue detector false positive on extended thinking models #2482

@VascoSch92

Description

Summary

Certain GAIA benchmark instances deterministically fail with Remote conversation got stuck across all retries, all SDK versions, and both Claude Sonnet 4.5 and Opus 4.6. The root cause is a conflict between the stuck detector's monologue threshold (3 consecutive agent MessageEvents) and extended thinking models that produce reasoning-only responses (no text, no tool calls) before their first action.

Affected instances

Instance   Question                       Sonnet (run 1)  Sonnet (run 2)  Sonnet (run 3)  Opus (run 4)  Opus (run 5)
2d83110e   Reversed text puzzle           STUCK           STUCK           STUCK           STUCK         STUCK
2a649bb1   EC numbers for virus testing   PASS            STUCK           STUCK           STUCK         STUCK
6359a0b1   Green polygon area             FAIL            FAIL            FAIL            STUCK         STUCK
983bba7c   Animals in alvei papers        PASS            PASS            FAIL            STUCK         PASS

2d83110e fails in 100% of runs (5/5). 2a649bb1 fails in 4/5. The Opus runs are more affected (up to 4 stuck instances vs 1–2 for Sonnet).

Reproduction

Run the GAIA benchmark with the OpenHands agent and Claude Sonnet 4.5 or Opus 4.6 with extended thinking enabled; instance 2d83110e will fail every time.

Root cause

The code path (agent.py:298–382)

  1. The LLM returns a valid response (chatcmpl-* ID) containing only thinking/reasoning blocks — no text content, no tool calls.
  2. agent.py:302–309 detects has_reasoning and/or has_content:
    has_reasoning = (
        message.responses_reasoning_item is not None
        or message.reasoning_content is not None
        or (message.thinking_blocks and len(message.thinking_blocks) > 0)
    )
    has_content = any(
        isinstance(c, TextContent) and c.text.strip() for c in message.content
    )
  3. Since there are no tool calls, execution falls through to line 356. If neither reasoning nor content is found, a warning is logged — but a MessageEvent is emitted unconditionally either way (lines 359–372):
    msg_event = MessageEvent(source="agent", llm_message=message, ...)
    on_event(msg_event)  # Emitted even when empty
  4. Since has_content is False, execution_status is not set to FINISHED (lines 379–382), so the agent loop continues and calls the LLM again.
  5. The LLM produces the same reasoning-only response. This repeats 3 times.
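The fall-through described above can be sketched as a self-contained simulation (class and function names here are illustrative stand-ins, not the real agent.py internals): a reasoning-only response sets has_reasoning but not has_content, so a MessageEvent is emitted on every iteration and the loop never terminates on its own.

```python
from dataclasses import dataclass, field

@dataclass
class FakeResponse:
    """Stand-in for an LLM response (hypothetical, for illustration)."""
    text: str = ""
    thinking_blocks: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)

def step(response, events):
    """One agent-loop iteration; returns True when the loop should stop."""
    has_reasoning = bool(response.thinking_blocks)
    has_content = bool(response.text.strip())
    if response.tool_calls:
        return True  # tool-execution path, not relevant to this bug
    # Emitted unconditionally, exactly as the report describes:
    events.append(("agent", "MessageEvent"))
    return has_content  # only text content marks the run FINISHED

events = []
reasoning_only = FakeResponse(thinking_blocks=["<redacted reasoning>"])
for _ in range(3):
    assert step(reasoning_only, events) is False  # loop keeps going
print(len(events))  # 3 agent MessageEvents, enough to trip the detector
```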

The stuck detection (stuck_detector.py:200–225)

  1. _is_stuck_monologue() counts consecutive MessageEvents with source="agent":
    for event in reversed(events):
        if isinstance(event, MessageEvent):
            if event.source == "agent":
                agent_message_count += 1
  2. The monologue threshold is 3 (types.py:40–41). After 3 empty MessageEvents, the conversation is marked STUCK.
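The interaction between the counter and the threshold can be demonstrated in isolation (the event class below is a stand-in for the real SDK type): three consecutive agent MessageEvents — even ones with empty content — cross the threshold of 3.

```python
from dataclasses import dataclass

@dataclass
class MessageEvent:
    source: str
    content: list

MONOLOGUE_THRESHOLD = 3  # per types.py:40-41 as cited in the report

def is_stuck_monologue(events):
    """Count trailing consecutive agent MessageEvents, as described above."""
    count = 0
    for event in reversed(events):
        if isinstance(event, MessageEvent) and event.source == "agent":
            count += 1
        else:
            break  # any other event interrupts the monologue
    return count >= MONOLOGUE_THRESHOLD

history = [
    MessageEvent("user", ["question"]),
    MessageEvent("agent", []),  # reasoning-only, content stripped
    MessageEvent("agent", []),
    MessageEvent("agent", []),
]
print(is_stuck_monologue(history))  # True: empty messages still count
```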

The error propagation (remote_conversation.py:1035–1039)

  1. The remote client detects the STUCK status and raises ConversationRunError("Remote conversation got stuck").

Why these specific instances

These questions cause the model to reason internally without producing a tool call:

  • 2d83110e: Reversed text (.rewsna eht sa "tfel" drow eht fo etisoppo eht etirw). The model decodes it in thinking but doesn't know which tool to call for a pure-text answer — it just needs to call finish, but extended thinking fills the response with reasoning instead.
  • 2a649bb1: Multi-step research question. The model plans which searches to run but doesn't commit to a tool call in the first 3 responses.

Evidence from conversation archives

Each retry shows the identical event sequence:

event-00000  SystemPromptEvent (agent)
event-00001  MessageEvent (user) — question delivered, content_len=1
event-00002  ConversationStateUpdateEvent — execution_status=running
event-00004  MessageEvent (agent) — content=[], thinking_blocks=[], resp_id=chatcmpl-...  [+2.0s]
event-00005  MessageEvent (agent) — content=[], thinking_blocks=[], resp_id=chatcmpl-...  [+1.6s]
event-00006  MessageEvent (agent) — content=[], thinking_blocks=[], resp_id=chatcmpl-...  [+1.6s]
event-00007  ConversationStateUpdateEvent — execution_status=stuck

Each empty response has a unique chatcmpl-* ID (real LLM calls, not cached), arrives in ~1.5–2s (vs 4–94s for normal first responses), and the thinking_blocks field is [] in the archived events (likely stripped or encrypted).

Why retries don't help

The retry mechanism escalates resource_factor (1→2→4→8), which increases pod CPU/memory. But the failure is in the model's response pattern, not resource exhaustion. Each retry gets a fresh runtime, fresh session, fresh conversation — but the same system prompt + question always produces the same model behavior.
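A minimal sketch of this point (the model function is a deterministic stand-in, not a real API call): because the response pattern depends only on the prompt, escalating resource_factor across retries cannot change the outcome.

```python
def model_response(prompt, resource_factor):
    # Stand-in model: output depends only on the prompt,
    # never on pod CPU/memory (resource_factor is ignored).
    return "reasoning-only" if "2d83110e" in prompt else "tool-call"

prompt = "GAIA instance 2d83110e: reversed text puzzle"
# resource_factor schedule from the report: 1 -> 2 -> 4 -> 8
outcomes = [model_response(prompt, rf) for rf in (1, 2, 4, 8)]
print(outcomes)  # the same failure on every retry
```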

Proposed solutions

Option 1: Don't emit MessageEvent for empty responses (minimal change)

In agent.py, skip the on_event(msg_event) call when the response has no content and no reasoning. This prevents empty messages from counting toward the monologue threshold.

# Line 356-372: only emit if there's something to emit
if has_reasoning or has_content:
    msg_event = MessageEvent(source="agent", llm_message=message, ...)
    on_event(msg_event)

Option 2: Inject a nudge after empty responses (more robust)

When the LLM returns an empty response, append a system/user message reminding it to use a tool:

if not has_reasoning and not has_content:
    logger.warning("LLM produced empty response - nudging to use tools")
    # Don't emit the empty MessageEvent. Instead, append a reminder so the
    # next LLM call sees it (append_user_message is a hypothetical helper;
    # the real API for injecting a message may differ):
    append_user_message("Respond with a tool call or a final answer.")
    continue

Option 3: Exclude empty MessageEvents from monologue detection (targeted fix)

In stuck_detector.py:200–225, skip MessageEvents that have no content:

if isinstance(event, MessageEvent):
    if event.source == "agent":
        # Don't count empty messages toward monologue
        if not event.llm_message or not event.llm_message.content:
            continue
        agent_message_count += 1

Option 4: Increase monologue threshold for extended thinking (configuration)

A threshold of 3 is too aggressive for extended thinking models that may need several reasoning passes. Increasing it to 5–6 would give the model more room while still catching genuine monologue loops.

Recommendation

Option 1 is the safest minimal fix — an empty response with no content, no reasoning, and no tool calls provides no value as a MessageEvent and should not be emitted. Combined with Option 4 (raising the threshold to 5 for extended thinking configs), this would eliminate the false positives without masking real stuck scenarios.
