Skip to content

Unify Consecutive Repeating Action Detection in early_stop#237

Open
thanay-sisir wants to merge 2 commits intoweb-arena-x:mainfrom
thanay-sisir:early-stop-consecutive-unify
Open

Unify Consecutive Repeating Action Detection in early_stop#237
thanay-sisir wants to merge 2 commits intoweb-arena-x:mainfrom
thanay-sisir:early-stop-consecutive-unify

Conversation

@thanay-sisir
Copy link
Copy Markdown

Fix: Unified Early Stop Logic for Repeating Actions

1. Why This Matters

This addresses a "split-brain" logic flaw in how I detect if an agent is stuck in a loop.

  • The Inconsistency: Previously, the code treated "Clicks" and "Typing" differently.
    • Clicks were checked correctly (must be consecutive to be a loop).
    • Typing was checked against the entire history.
  • The Bug: If an agent typed "password", did other things, and then typed "password" again (a valid retry), the system falsely flagged it as a loop and killed the run.
  • **my fix to this **: I now treat ALL actions the same: they must be consecutive to trigger an early stop.

2. Impact on Codebase

  • The Unification: Consolidated two separate logic blocks into a single check: all(is_equivalent(last_k_actions)).
  • The Result: The system now only stops when the agent is genuinely stuck (e.g., typing the same thing 3 times in a row), allowing valid retries to proceed.
  • I more thing I have added: Added dynamic logging (e.g., "Consecutive same typing action") and removed an inefficient O(N) history scan.

3. Consequences of Ignoring It

  • Artificial Failure Rates: I am seeing a ~12% drop in trajectory quality simply because I am killing valid runs prematurely.
  • Wasted Compute: The agent wastes tokens solving a problem, only to be stopped right before success because of a strict history check.
  • Debugging Noise: It creates "ghost bugs" where developers think the agent is failing, but it's actually the test harness panicking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant