Summary
Certain GAIA benchmark instances deterministically fail with Remote conversation got stuck across all retries, all SDK versions, and both Claude Sonnet 4.5 and Opus 4.6. The root cause is a conflict between the stuck detector's monologue threshold (3 consecutive agent MessageEvents) and extended thinking models that produce reasoning-only responses (no text, no tool calls) before their first action.
Affected instances
| Instance | Question | Sonnet (run 1) | Sonnet (run 2) | Sonnet (run 3) | Opus (run 4) | Opus (run 5) |
|---|---|---|---|---|---|---|
| 2d83110e | Reversed text puzzle | STUCK | STUCK | STUCK | STUCK | STUCK |
| 2a649bb1 | EC numbers for virus testing | PASS | STUCK | STUCK | STUCK | STUCK |
| 6359a0b1 | Green polygon area | FAIL | FAIL | FAIL | STUCK | STUCK |
| 983bba7c | Animals in alvei papers | PASS | PASS | FAIL | STUCK | PASS |
2d83110e fails in 100% of runs (5/5). 2a649bb1 fails in 4/5. The Opus runs are more affected (up to 4 stuck instances vs 1–2 for Sonnet).
Reproduction
Run the GAIA benchmark with Claude Sonnet 4.5 or Opus 4.6, extended thinking enabled, and the OpenHands agent. Instance 2d83110e fails every time.
Root cause
The code path (agent.py:298–382)
- The LLM returns a valid response (`chatcmpl-*` ID) containing only thinking/reasoning blocks: no text content, no tool calls. `agent.py:302–309` detects `has_reasoning` and/or `has_content`:

```python
has_reasoning = (
    message.responses_reasoning_item is not None
    or message.reasoning_content is not None
    or (message.thinking_blocks and len(message.thinking_blocks) > 0)
)
has_content = any(
    isinstance(c, TextContent) and c.text.strip() for c in message.content
)
```

- Since there are no tool calls, execution falls through to line 356. If neither reasoning nor content is found, a warning is logged, but in all cases a `MessageEvent` is emitted unconditionally (lines 359–372):

```python
msg_event = MessageEvent(source="agent", llm_message=message, ...)
on_event(msg_event)  # Emitted even when empty
```

- Since `has_content` is False, `execution_status` is not set to FINISHED (lines 379–382), so the agent loop continues and calls the LLM again.
- The LLM produces the same reasoning-only response. This repeats 3 times.
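The loop can be reproduced with a minimal simulation. All names below are illustrative stand-ins for the agent.py structures, not the real API; the point is only that an unconditional event emission plus a never-FINISHED status yields one empty agent message per LLM call:

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Simplified LLM response; the real message carries typed content blocks."""
    thinking_blocks: list = field(default_factory=list)
    content: list = field(default_factory=list)  # plain strings for simplicity
    tool_calls: list = field(default_factory=list)


def agent_step(message: Message, events: list) -> str:
    """Mirrors the reported agent.py flow: MessageEvent emitted unconditionally."""
    has_content = any(c.strip() for c in message.content)
    if message.tool_calls:
        return "acting"
    events.append(("agent", "message"))  # emitted even when the response is empty
    return "finished" if has_content else "running"


events = [("user", "message")]
empty = Message()  # reasoning-only response with thinking stripped: all fields empty
for _ in range(3):
    status = agent_step(empty, events)

agent_msgs = sum(1 for src, kind in events if src == "agent" and kind == "message")
print(agent_msgs, status)  # 3 running: three empty agent MessageEvents, loop still going
```

After three iterations the event log contains exactly the three consecutive empty agent messages that the stuck detector then counts.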
The stuck detection (stuck_detector.py:200–225)
- `_is_stuck_monologue()` counts consecutive `MessageEvent`s with `source="agent"`:

```python
for event in reversed(events):
    if isinstance(event, MessageEvent):
        if event.source == "agent":
            agent_message_count += 1
```

- The monologue threshold is 3 (`types.py:40–41`). After 3 empty `MessageEvent`s, the conversation is marked STUCK.
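A runnable sketch of how three empty agent messages trip the detector. This simplifies events to `(source, kind)` tuples instead of real event objects, so the content check that Option 3 below would add is deliberately absent:

```python
MONOLOGUE_THRESHOLD = 3  # per types.py:40-41


def is_stuck_monologue(events: list) -> bool:
    """Count trailing consecutive agent messages, newest first."""
    count = 0
    for source, kind in reversed(events):
        if kind == "message" and source == "agent":
            count += 1
            if count >= MONOLOGUE_THRESHOLD:
                return True
        else:
            break  # any other event interrupts the streak
    return False


events = [
    ("system", "system_prompt"),
    ("user", "message"),
    ("agent", "message"),  # empty reasoning-only response
    ("agent", "message"),
    ("agent", "message"),
]
print(is_stuck_monologue(events))  # True: three empty messages still count
```

Because the counter never inspects message content, a reasoning-only response is indistinguishable from a genuine monologue.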
The error propagation (remote_conversation.py:1035–1039)
- The remote client detects the STUCK status and raises `ConversationRunError("Remote conversation got stuck")`.
Why these specific instances
These questions cause the model to reason internally without producing a tool call:
- `2d83110e`: Reversed text (`.rewsna eht sa "tfel" drow eht fo etisoppo eht etirw`). The model decodes it in thinking but doesn't know which tool to call for a pure-text answer; it only needs to call `finish`, but extended thinking fills the response with reasoning instead.
- `2a649bb1`: Multi-step research question. The model plans which searches to run but doesn't commit to a tool call in the first 3 responses.
Evidence from conversation archives
Each retry shows the identical event sequence:
```
event-00000 SystemPromptEvent (agent)
event-00001 MessageEvent (user) — question delivered, content_len=1
event-00002 ConversationStateUpdateEvent — execution_status=running
event-00004 MessageEvent (agent) — content=[], thinking_blocks=[], resp_id=chatcmpl-... [+2.0s]
event-00005 MessageEvent (agent) — content=[], thinking_blocks=[], resp_id=chatcmpl-... [+1.6s]
event-00006 MessageEvent (agent) — content=[], thinking_blocks=[], resp_id=chatcmpl-... [+1.6s]
event-00007 ConversationStateUpdateEvent — execution_status=stuck
```
Each empty response has a unique chatcmpl-* ID (real LLM calls, not cached), arrives in ~1.5–2s (vs 4–94s for normal first responses), and the thinking_blocks field is [] in the archived events (likely stripped or encrypted).
Why retries don't help
The retry mechanism escalates resource_factor (1→2→4→8), which increases pod CPU/memory. But the failure is in the model's response pattern, not resource exhaustion. Each retry gets a fresh runtime, fresh session, fresh conversation — but the same system prompt + question always produces the same model behavior.
Proposed solutions
Option 1: Don't emit MessageEvent for empty responses (minimal change)
In agent.py, skip the on_event(msg_event) call when the response has no content and no reasoning. This prevents empty messages from counting toward the monologue threshold.
```python
# Lines 356-372: only emit if there's something to emit
if has_reasoning or has_content:
    msg_event = MessageEvent(source="agent", llm_message=message, ...)
    on_event(msg_event)
```

Option 2: Inject a nudge after empty responses (more robust)
When the LLM returns an empty response, append a system/user message reminding it to use a tool:
```python
if not has_reasoning and not has_content:
    logger.warning("LLM produced empty response - nudging to use tools")
    # Don't emit an empty MessageEvent; continue the loop instead
    # The next LLM call will include context about available tools
    continue
```

Option 3: Exclude empty MessageEvents from monologue detection (targeted fix)
In stuck_detector.py:200–225, skip MessageEvents that have no content:
```python
if isinstance(event, MessageEvent):
    if event.source == "agent":
        # Don't count empty messages toward monologue
        if not event.llm_message or not event.llm_message.content:
            continue
        agent_message_count += 1
```

Option 4: Increase monologue threshold for extended thinking (configuration)
A threshold of 3 is too aggressive for extended thinking models that may need several reasoning passes. Increasing it to 5–6 would give the model more room while still catching genuine monologue loops.
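One possible shape for such a configuration, sketched here as a hypothetical dataclass (`StuckDetectorConfig` and `effective_threshold` are invented names; `types.py` currently hardcodes the constant):

```python
from dataclasses import dataclass


@dataclass
class StuckDetectorConfig:
    """Hypothetical config object; the real types.py uses a fixed constant of 3."""
    monologue_threshold: int = 3
    extended_thinking: bool = False

    @property
    def effective_threshold(self) -> int:
        # Give extended-thinking models more reasoning-only passes before
        # declaring a monologue, without lowering an explicitly higher setting.
        if self.extended_thinking:
            return max(self.monologue_threshold, 5)
        return self.monologue_threshold


print(StuckDetectorConfig().effective_threshold)                        # 3
print(StuckDetectorConfig(extended_thinking=True).effective_threshold)  # 5
```

Keeping the default at 3 for non-thinking configs preserves current behavior, so only extended-thinking runs see the relaxed limit.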
Recommendation
Option 1 is the safest minimal fix — an empty response with no content, no reasoning, and no tool calls provides no value as a MessageEvent and should not be emitted. Combined with Option 4 (raising the threshold to 5 for extended thinking configs), this would eliminate the false positives without masking real stuck scenarios.