
Fixes conversations that stop responding due to stale execution status #2466

Open
DoubleDensity wants to merge 2 commits into OpenHands:main from DoubleDensity:stuck_conversation_fix


@DoubleDensity

https://openhands-ai.slack.com/archives/C06U8UTKSAD/p1773351234338059?thread_ts=1773161499.244319&cid=C06U8UTKSAD

Fix was generated by OpenHands here:

https://app.all-hands.dev/conversations/9eaa41581dfa476aaca0009118cfd5be

Summary

This builds on the fix from @xingyaoww to resolve the issue of conversations that stop responding and require base_state.json to be cleared. It does so by directly clearing blocked conversation states.

#2384

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the github CI passing?

@xingyaoww
Collaborator

@OpenHands /codereview-roasted

@openhands-ai

openhands-ai bot commented Mar 16, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

Collaborator

@xingyaoww xingyaoww left a comment


🔴 Needs improvement — Violates fundamental principles

Taste Rating: 🔴

This change is well-intentioned but architecturally wrong. It solves a problem that was already solved by PR #2384, and in the process introduces a worse version of the fix that breaks state semantics.


[CRITICAL ISSUES]

[state.py, Lines 339–346] Data Structure / State Ownership: Unconditional terminal state reset on resume destroys meaningful state

The core problem: this resets FINISHED, ERROR, and STUCK to IDLE every time a conversation is loaded, regardless of whether the user actually wants to continue it.

  • FINISHED → IDLE on resume is wrong. FINISHED means "the agent completed its task." That's meaningful, persistent information. If a user resumes a conversation just to inspect it (read history, check results), you've silently nuked the completion state. The existing fix in local_conversation.py (from PR #2384) already handles FINISHED → IDLE correctly — it triggers only when a new user message arrives, which is an explicit signal to continue. This is the right semantic.

  • ERROR → IDLE on resume is dangerous. ERROR indicates something broke. Silently resetting it to IDLE on load masks the failure. PR #2384 already handles this properly — run() allows ERROR → RUNNING, letting the caller decide to retry. Don't make that decision for them in the state constructor.

  • STUCK → IDLE on resume is redundant. PR #2384 already added STUCK → IDLE on send_message() and STUCK → RUNNING on run(). Both transitions require an explicit user action. This is correct. Resetting at load time just adds a redundant layer.
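The transitions described above can be sketched as a toy model. This is a minimal illustration only: `ExecutionStatus` and `Conversation` here are hypothetical stand-ins, not the actual classes in state.py or local_conversation.py, and the sketch assumes only the transitions named in this review (FINISHED/STUCK → IDLE on `send_message()`, ERROR/STUCK → RUNNING on `run()`).

```python
from enum import Enum


class ExecutionStatus(Enum):
    """Hypothetical stand-in for the real status enum."""

    IDLE = "idle"
    RUNNING = "running"
    FINISHED = "finished"
    ERROR = "error"
    STUCK = "stuck"


class Conversation:
    """Toy model: status changes only in response to explicit user actions."""

    def __init__(self, status: ExecutionStatus = ExecutionStatus.IDLE) -> None:
        # Loading keeps whatever status was persisted; nothing is reset here.
        self.status = status
        self.messages: list[str] = []

    def send_message(self, text: str) -> None:
        # A new user message is an explicit signal to continue.
        if self.status in (ExecutionStatus.FINISHED, ExecutionStatus.STUCK):
            self.status = ExecutionStatus.IDLE
        self.messages.append(text)

    def run(self) -> None:
        # An explicit run lets the caller retry from ERROR or STUCK.
        if self.status in (
            ExecutionStatus.IDLE,
            ExecutionStatus.ERROR,
            ExecutionStatus.STUCK,
        ):
            self.status = ExecutionStatus.RUNNING


# Resuming a finished conversation for inspection leaves FINISHED intact.
resumed = Conversation(status=ExecutionStatus.FINISHED)
status_after_load = resumed.status

# Only a new message reopens it.
resumed.send_message("please continue")
status_after_message = resumed.status

# A failed conversation stays in ERROR until the caller explicitly retries.
failed = Conversation(status=ExecutionStatus.ERROR)
status_before_retry = failed.status
failed.run()
```

In this shape the loader has no opinion about execution status, so a read-only resume cannot change what the status means.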

[state.py] Breaking Change: Side-effect on load breaks read-only resume

Because _autosave_enabled is set to True on the very next line, the subsequent state.agent = agent assignment will trigger __setattr__ → autosave, persisting the IDLE status. This means merely loading a finished conversation changes its persisted state. That's a violation of the principle that reading should not have side effects.
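A minimal sketch of how this side effect arises, assuming an autosave hook in `__setattr__` similar to what is described above. The `State` class, the dict-backed `store`, and both loader functions are hypothetical and exist only to illustrate the ordering problem; they are not the code in state.py.

```python
class State:
    """Toy state object with an autosave hook (hypothetical, not the SDK class)."""

    def __init__(self, status: str, store: dict) -> None:
        # Bypass the hook while constructing the object.
        object.__setattr__(self, "_store", store)
        object.__setattr__(self, "_autosave_enabled", False)
        object.__setattr__(self, "status", status)
        object.__setattr__(self, "agent", None)

    def __setattr__(self, name: str, value: object) -> None:
        object.__setattr__(self, name, value)
        if name != "_autosave_enabled" and self._autosave_enabled:
            # Any attribute write persists the current status.
            self._store["status"] = self.status


def load_with_reset(store: dict, agent: str) -> State:
    """Mirrors the pattern under review: reset, enable autosave, assign agent."""
    state = State(store["status"], store)
    if state.status in ("finished", "error", "stuck"):
        state.status = "idle"  # autosave is still off, so not persisted yet
    state._autosave_enabled = True
    state.agent = agent  # this write triggers autosave and persists "idle"
    return state


def load_read_only(store: dict, agent: str) -> State:
    """Same loader without the reset: the persisted status is unchanged."""
    state = State(store["status"], store)
    state._autosave_enabled = True
    state.agent = agent  # autosave runs, but writes back the same status
    return state


store_a = {"status": "finished"}
load_with_reset(store_a, agent="demo-agent")

store_b = {"status": "finished"}
load_read_only(store_b, agent="demo-agent")
```

After these calls, `store_a` holds the reset status while `store_b` still holds the original one, even though neither caller sent a message or asked for a run.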


[IMPROVEMENT OPPORTUNITIES]

Redundant defense-in-depth that creates confusion:

We now have terminal state resets in THREE places:

  1. state.py::create() — this PR (on resume)
  2. local_conversation.py::send_message() — from PR #2384 (on new message)
  3. local_conversation.py::run() — from PR #2384 (on explicit run)

Places (2) and (3) are correct — they trigger on explicit user actions. Place (1) is implicit and unconditional. Having both creates ambiguity about which layer owns state transitions. The answer should be clear: local_conversation.py owns execution state transitions, state.py owns persistence. Don't mix them.


[TESTING GAPS]

No tests for the actual change. The PR adds 10 lines to state.py but zero test coverage for the new behavior. Specifically missing:

  • Test that FINISHED state survives resume when no new message is sent
  • Test that ERROR state is preserved on resume for inspection
  • Test that the state.py reset interacts correctly with the local_conversation.py resets
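As a rough starting point, the first two tests might look like the sketch below. It is written against a toy JSON-backed loader rather than the real API: `resume`, `send_message`, the `execution_status` key, and the file layout are all assumptions made for illustration, and the real tests would need to go through the actual state and conversation classes.

```python
import json
import tempfile
from pathlib import Path


def resume(path: Path) -> dict:
    """Hypothetical loader: reads persisted state without modifying it."""
    return json.loads(path.read_text())


def send_message(path: Path, state: dict, text: str) -> None:
    """Hypothetical explicit user action: reopens FINISHED/STUCK, then saves."""
    if state["execution_status"] in ("finished", "stuck"):
        state["execution_status"] = "idle"
    state["messages"].append(text)
    path.write_text(json.dumps(state))


def _persist(tmp: Path, status: str) -> Path:
    # Assumed file name and schema, for the sketch only.
    path = tmp / "base_state.json"
    path.write_text(json.dumps({"execution_status": status, "messages": []}))
    return path


def test_finished_survives_resume_without_message() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        path = _persist(Path(tmp), "finished")
        state = resume(path)
        assert state["execution_status"] == "finished"
        # The file on disk must be untouched by a read-only resume.
        assert json.loads(path.read_text())["execution_status"] == "finished"


def test_error_preserved_on_resume() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        path = _persist(Path(tmp), "error")
        state = resume(path)
        assert state["execution_status"] == "error"
        assert json.loads(path.read_text())["execution_status"] == "error"


def test_new_message_reopens_finished_conversation() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        path = _persist(Path(tmp), "finished")
        state = resume(path)
        send_message(path, state, "please continue")
        assert json.loads(path.read_text())["execution_status"] == "idle"
```

The third missing test (interaction between the two reset layers) is not sketched here because it depends on how the real classes are wired together.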

[PR DESCRIPTION]

No Evidence section. The PR links to a Slack thread and an OpenHands conversation, but doesn't demonstrate the actual fix working. Show a before/after of a stuck conversation being recovered.


VERDICT:
Needs rework. This change is unnecessary — PR #2384 already solved the stuck conversation problem at the right layer (explicit user actions in local_conversation.py). Adding an unconditional reset at the state loading layer breaks FINISHED semantics, creates redundant transition logic, and has no tests. The fix should be removed entirely.

KEY INSIGHT:
The right place to reset terminal states is in response to explicit user actions (send_message, run), not during state deserialization. PR #2384 already got this right.

@openhands-ai

openhands-ai bot commented Mar 16, 2026

Summary

I performed a Linus Torvalds-style roasted code review (/codereview-roasted) on PR #2466 as requested by the comment. No code changes were made — this was a review-only task.

Review Posted

The review was posted as a COMMENT on the PR at:
#2466 (review)

Key Findings

Taste Rating: 🔴 Needs improvement

The PR adds a 10-line change to state.py that unconditionally resets FINISHED, ERROR, and STUCK states to IDLE every time a conversation is resumed. The review identified four critical problems:

  1. Destroys meaningful state on resume — FINISHED means the agent completed its task. Resetting it to IDLE on every load breaks that semantic. Users who resume a conversation just to inspect it would silently lose the completion status.

  2. Side-effect on load — The reset happens right before _autosave_enabled = True, so the subsequent state.agent = agent assignment triggers autosave, meaning merely loading a finished conversation mutates and persists the state. Reading should not have write side effects.

  3. Redundant with already-merged PR #2384 ("fix: improve conversation resilience for long-running and resumed sessions"): that PR (merged March 11) already solved the stuck conversation problem correctly in local_conversation.py, where STUCK/FINISHED → IDLE transitions happen only in response to explicit user actions (send_message() and run()). This PR adds a third, unconditional reset point that creates ambiguity about which layer owns state transitions.

  4. No tests — Zero test coverage for the actual change.

Verdict: ❌ Needs rework — recommended removing the change entirely since PR #2384 already solved the problem at the correct architectural layer.

@DoubleDensity
Author

after additional testing I can confirm that this does not resolve the issue -- digging deeper for a better approach
