Skip to content

chore: improve loop.ts tests#271

Open
wviana wants to merge 1 commit intoesengine:mainfrom
wviana:chore/improve-loop-mutation-score
Open

chore: improve loop.ts tests#271
wviana wants to merge 1 commit intoesengine:mainfrom
wviana:chore/improve-loop-mutation-score

Conversation

@wviana
Copy link
Copy Markdown
Contributor

@wviana wviana commented May 5, 2026

Summary

Strengthen the CacheFirstLoop test suite with 53 new tests covering
previously untested constructor option combinations, the configure()
method, setBudget/clearLog/retryLastUser, pro-arm lifecycle, and
the auto-escalation failure-signal logic. Mutation score for
src/loop.ts improved from 40.3% ‚Üí 54.5% (54 additional
mutants killed, 33 fewer survivors).

Motivation

The mutation report for src/loop.ts showed a large gap: most
constructor branches, the entire configure() method, and the failure
accumulation / auto-escalation logic in noteToolFailureSignal had no
assertions. Bugs in any of these areas would not be caught by the
existing suite -- they were either uncovered (no test reached the
branch) or survived (tests ran but didn't verify the outcome).

Three groups of business rules in particular needed coverage:

  • Constructor option interactions. Branch, harvest, and stream
    have cascading rules (branch forces harvest on and stream off;
    harvest object carries options through). Getting these wrong would
    silently break reasonix code presets.
  • configure() cascades. The same interaction rules apply when
    options are changed mid-session via slash commands. If configure
    didn't mirror the constructor's cascade, toggling branch off then on
    would leave harvest or stream in an inconsistent state.
  • Failure-signal escalation. noteToolFailureSignal counts
    SEARCH-mismatch errors and repair signals (scavenged / truncated /
    storm-broken tool calls) and auto-escalates to deepseek-v4-pro
    when the threshold is crossed. A logic error here would either
    escalate too late (users wait through failing flash calls) or never
    (sessions degrade without auto-recovery).

What's tested now

Constructor option combinations (20 tests)

Business rule Why it matters
branch: 3 creates branchOptions = {budget:3} and enables branch Users get multi-sample consistency
branch: 1 is a no-op (budget > 1 required) Configuring a single sample shouldn't force branch overhead
Branch forces stream: false while preserving _streamPreference Streaming is incompatible with branch -- but the user's preference must survive for later
Branch forces harvestEnabled: true even when harvest: false Branch selector needs harvest; disabling harvest shouldn't override branch
harvest: true enables harvest independently Users who want plan-state tracking without branching
harvest: {maxPlanSteps:N} sets both flag and options Harvest config object carries through to plan-state extraction
budgetUsd: 0 or negative is treated as null (disabled) Defensive against accidental zero cap
autoEscalate: false is respected Users who pin a model want no surprises

configure() method (15 tests)

Business rule Why it matters
Each ReconfigurableOptions field updates independently Slash commands change one option at a time
configure({branch:3}) forces harvest on and stream off Mid-session branch toggle must cascade identically to constructor
configure({branch:{budget:1}}) disables branch Toggling branch off restores stream preference
configure({harvest:false}) when branch is on -- harvest stays on Branch's harvest requirement cannot be overridden
configure({harvest:false}) when branch is off -- harvest disables No branch = no forced harvest
Stream preference is restored when branch is disabled User shouldn't be stuck non-streaming after /branch off

setBudget / clearLog / retryLastUser / pro-arm (14 tests)

Business rule Why it matters
setBudget(null) clears cap and re-arms the 80% warning Removing a cap must reset the sticky "approaching" flag
setBudget(0) treated same as null Defensive against accidental zero
clearLog() drops all messages and resets scratch state /clear command must produce a blank slate
retryLastUser() returns null when no user message exists Graceful on empty log -- no crash
retryLastUser() returns the last user message and trims the log Mid-conversation retry targets the most recent turn
retryLastUser() handles non-string content safely Malformed entries shouldn't crash retry
armProForNextTurn() is consumed by the next step() and produces a warning /pro intent is one-shot and user-visible
disarmPro() cancels arming before the turn starts User can change their mind before sending

noteToolFailureSignal auto-escalation (9 tests)

Business rule Why it matters
Returns false when failure count is below threshold (3) No premature escalation
Returns true and sets _escalateThisTurn when count reaches threshold Escalates exactly at the right moment
"error" + "search text not found" in a result JSON triggers a search-mismatch bump SEARCH/REPLACE mismatches are a strong "model struggling" signal
autoEscalate: false blocks escalation entirely Users who opted out stay on their chosen model
Already escalated this turn -- no double-escalation No redundant state transitions
Error without "search text not found" does NOT bump Other errors are not necessarily "model struggling"
Repair counts (scavenged, truncationsFixed, stormsBroken) bump proportionally 2 scavenged + 3 truncated = 5 failure counts toward escalation
_turnFailureTypes records granular breakdown Enables the formatFailureBreakdown() message

Mutation score impact

Metric Before After Delta
Mutation score (killed / killed+survived) 40.28% 54.50% +14.22pp
Coverage (killed+survived / total) ~40% 65.97% +25pp
Killed mutants 442 484 +42
Survived mutants 454 404 -50
Remaining NoCoverage 467 458 -9
Total tests in loop.test.ts 75 95 +20

@esengine
Copy link
Copy Markdown
Owner

esengine commented May 5, 2026

Thanks for the depth on this — the mutation-score data and the per-rule test tables are genuinely useful. One thing I'd like reworked before merge:

(loop as any). reaches into private state in 37 places across 5 internal members (_turnFailureCount, _turnFailureTypes, _escalateThisTurn, _budgetWarned, plus calling noteToolFailureSignal directly). That kills mutants today, but it pins the tests to internal field names — next time the failure-tracking representation changes, every one of these tests breaks for reasons that have nothing to do with behavior. Where the same signal is reachable by driving real tool failures through step() (the existing escalation tests already do this), prefer that path; reserve the private-field access for cases where genuinely no public surface exposes the thing being tested, and call those out as such.

A separate, smaller note: ~18 lines in tests/loop.test.ts are pure - rewrites in comments and string fixtures ("done — here's what I found.""done - here's what I found.", etc.) — looks like an editor / encoding setting stripped non-ASCII. is the project's house style. Not a blocker; happy to take that as a follow-up cleanup or revert here, your call.

Once the private-state access is reworked I'm happy to take the whole PR as-is — no need to split. The constructor / configure / noteToolFailureSignal blocks are well-organized as-is.

@wviana
Copy link
Copy Markdown
Contributor Author

wviana commented May 5, 2026

A separate, smaller note: ~18 lines in tests/loop.test.ts are pure — → - rewrites in comments and string fixtures ("done — here's what I found." → "done - here's what I found.", etc.) — looks like an editor / encoding setting stripped non-ASCII. — is the project's house style. Not a blocker; happy to take that as a follow-up cleanup or revert here, your call.

Yeah, I've figured out why I was seeing some weird chars, at some moment I got some tests falling as my system was not in english, so I've set LANG=en_US and this caused me to see as broken and cause me to change it. Now I know I should set it to LANG=en_US.UTC-8 instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants