Strengthen the CacheFirstLoop test suite with 53 new tests covering
previously untested constructor option combinations, the configure()
method, setBudget/clearLog/retryLastUser, pro-arm lifecycle, and
the auto-escalation failure-signal logic. Mutation score for src/loop.ts improved from 40.3% → 54.5% (54 additional
mutants killed, 33 fewer survivors).
Motivation
The mutation report for src/loop.ts showed a large gap: most
constructor branches, the entire configure() method, and the failure
accumulation / auto-escalation logic in noteToolFailureSignal had no
assertions. Bugs in any of these areas would not be caught by the
existing suite -- they were either uncovered (no test reached the
branch) or survived (tests ran but didn't verify the outcome).
Three groups of business rules in particular needed coverage:
Constructor option interactions. Branch, harvest, and stream
have cascading rules (branch forces harvest on and stream off;
harvest object carries options through). Getting these wrong would
silently break reasonix code presets.
configure() cascades. The same interaction rules apply when
options are changed mid-session via slash commands. If configure
didn't mirror the constructor's cascade, toggling branch off then on
would leave harvest or stream in an inconsistent state.
Failure-signal escalation. noteToolFailureSignal counts
SEARCH-mismatch errors and repair signals (scavenged / truncated /
storm-broken tool calls) and auto-escalates to deepseek-v4-pro
when the threshold is crossed. A logic error here would either
escalate too late (users wait through failing flash calls) or never
(sessions degrade without auto-recovery).
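The escalation rule in the third group can be sketched in isolation. This is a minimal illustration, not the code in src/loop.ts: the FailureTracker class, its method names, and the pattern check used to spot a SEARCH mismatch are assumptions made for the example. Only the threshold of 3 and the autoEscalate flag come from this PR's test list.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the real src/loop.ts.
type FailureKind = "search-mismatch" | "scavenged" | "truncated" | "storm-broken";

const KINDS: FailureKind[] = ["search-mismatch", "scavenged", "truncated", "storm-broken"];

class FailureTracker {
  private count = 0;
  private escalated = false;
  private readonly byKind: Record<FailureKind, number> = {
    "search-mismatch": 0,
    scavenged: 0,
    truncated: 0,
    "storm-broken": 0,
  };
  private readonly threshold: number;
  private readonly autoEscalate: boolean;

  constructor(threshold = 3, autoEscalate = true) {
    this.threshold = threshold;
    this.autoEscalate = autoEscalate;
  }

  /** Record `n` failures of one kind; returns true once escalation should fire. */
  note(kind: FailureKind, n = 1): boolean {
    if (n > 0) {
      this.count += n;
      this.byKind[kind] += n;
    }
    if (this.autoEscalate && this.count >= this.threshold) {
      this.escalated = true;
    }
    return this.escalated;
  }

  /** Assumed rule: only an error payload carrying the marker text counts as a mismatch. */
  noteToolResult(resultJson: string): boolean {
    const isMismatch = /"error"/.test(resultJson) && /search text not found/.test(resultJson);
    return isMismatch ? this.note("search-mismatch") : this.escalated;
  }

  /** Human-readable breakdown of the kinds seen so far, in a fixed order. */
  breakdown(): string {
    return KINDS.filter((k) => this.byKind[k] > 0)
      .map((k) => `${k}: ${this.byKind[k]}`)
      .join(", ");
  }
}
```

With the default threshold, one SEARCH mismatch followed by two scavenged calls crosses the line on the third signal. A tracker built with autoEscalate set to false keeps counting but never reports escalation.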
What's tested now
Constructor option combinations (20 tests)
| Business rule | Why it matters |
|---|---|
| branch: 3 creates branchOptions = {budget:3} and enables branch | Users get multi-sample consistency |
| branch: 1 is a no-op (budget > 1 required) | Configuring a single sample shouldn't force branch overhead |
| Branch forces stream: false while preserving _streamPreference | Streaming is incompatible with branch -- but the user's preference must survive for later |
| Branch forces harvestEnabled: true even when harvest: false | |
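The cascade in these rules can be written as one pure function. The sketch below is an assumption-laden illustration rather than the real CacheFirstLoop constructor: the LoopOptions and ResolvedState shapes, the resolveOptions name, and the default of streaming being on are all hypothetical.

```typescript
// Hypothetical option shapes; the real CacheFirstLoop options may differ.
interface LoopOptions {
  branch?: number | { budget: number };
  harvest?: boolean | { maxPlanSteps: number };
  stream?: boolean;
}

interface ResolvedState {
  branchEnabled: boolean;
  branchOptions: { budget: number } | null;
  harvestEnabled: boolean;
  harvestOptions: { maxPlanSteps: number } | null;
  stream: boolean; // effective value after the cascade
  streamPreference: boolean; // what the user asked for, kept for when branch turns off
}

function resolveOptions(opts: LoopOptions): ResolvedState {
  const b = opts.branch;
  const budget = typeof b === "number" ? b : b !== undefined ? b.budget : 0;
  const branchEnabled = budget > 1; // a budget of 1 is a no-op

  const h = opts.harvest;
  const harvestOptions = typeof h === "object" ? h : null;
  const harvestRequested = h === true || harvestOptions !== null;

  // Assumption for this sketch: streaming defaults to on unless explicitly disabled.
  const streamPreference = opts.stream !== false;

  return {
    branchEnabled,
    branchOptions: branchEnabled ? { budget } : null,
    harvestEnabled: branchEnabled || harvestRequested, // branch forces harvest on
    harvestOptions,
    stream: branchEnabled ? false : streamPreference, // branch forces stream off
    streamPreference,
  };
}
```

One possible way to keep configure() aligned with the constructor is to run the same resolver over the merged options on every change, so that a harvest: false request made while branch is on still resolves to harvest enabled.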
Thanks for the depth on this — the mutation-score data and the per-rule test tables are genuinely useful. One thing I'd like reworked before merge:
(loop as any). reaches into private state in 37 places across 5 internal members (_turnFailureCount, _turnFailureTypes, _escalateThisTurn, _budgetWarned, plus calling noteToolFailureSignal directly). That kills mutants today, but it pins the tests to internal field names — next time the failure-tracking representation changes, every one of these tests breaks for reasons that have nothing to do with behavior. Where the same signal is reachable by driving real tool failures through step() (the existing escalation tests already do this), prefer that path; reserve the private-field access for cases where genuinely no public surface exposes the thing being tested, and call those out as such.
A separate, smaller note: ~18 lines in tests/loop.test.ts are pure — → - rewrites in comments and string fixtures ("done — here's what I found." → "done - here's what I found.", etc.) — looks like an editor / encoding setting stripped non-ASCII. — is the project's house style. Not a blocker; happy to take that as a follow-up cleanup or revert here, your call.
Once the private-state access is reworked I'm happy to take the whole PR as-is — no need to split. The constructor / configure / noteToolFailureSignal blocks are well-organized as-is.
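To illustrate the preference for the public path, here is a minimal sketch of a behavior-level escalation check. MiniLoop is a hypothetical stand-in, not the CacheFirstLoop API: its constructor, the Tool type, and the shape of the step() result are assumptions for the example. The point is only that the assertion reads the observable outcome and never touches the private counter.

```typescript
// Hypothetical stand-in for a loop; not the real CacheFirstLoop API.
type Tool = (input: string) => { ok: boolean; error?: string };

class MiniLoop {
  private failures = 0; // internal detail: tests should not depend on this name
  private readonly tool: Tool;
  private readonly threshold: number;

  constructor(tool: Tool, threshold = 3) {
    this.tool = tool;
    this.threshold = threshold;
  }

  /** Runs one tool call and reports which model tier the turn ends on. */
  step(input: string): { model: "flash" | "pro"; warning?: string } {
    const result = this.tool(input);
    if (!result.ok) {
      this.failures += 1;
    }
    if (this.failures >= this.threshold) {
      return { model: "pro", warning: `escalated after ${this.failures} failures` };
    }
    return { model: "flash" };
  }
}

// Drive real failures through step() and assert on what a caller can observe.
const alwaysFails: Tool = () => ({ ok: false, error: "search text not found" });
const demoLoop = new MiniLoop(alwaysFails);
const models = ["a", "b", "c"].map((input) => demoLoop.step(input).model);
// models is ["flash", "flash", "pro"]: the third failure crosses the threshold.
```

A test written this way keeps passing if the counter is renamed or replaced by a different representation, which is the property the (loop as any) access gives up.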
Yeah, I figured out why I was seeing some weird chars: at some point some tests were failing because my system was not in English, so I set LANG=en_US, which made those characters display as broken and led me to change them. Now I know I should set it to LANG=en_US.UTF-8 instead.
What's tested now
Constructor option combinations (20 tests)
branch: 3 creates branchOptions = {budget:3} and enables branch
branch: 1 is a no-op (budget > 1 required)
Branch forces stream: false while preserving _streamPreference
Branch forces harvestEnabled: true even when harvest: false
harvest: true enables harvest independently
harvest: {maxPlanSteps:N} sets both flag and options
budgetUsd: 0 or negative is treated as null (disabled)
autoEscalate: false is respected

configure() method (15 tests)
ReconfigurableOptions field updates independently
configure({branch:3}) forces harvest on and stream off
configure({branch:{budget:1}}) disables branch
configure({harvest:false}) when branch is on -- harvest stays on
configure({harvest:false}) when branch is off -- harvest disables
/branch off

setBudget / clearLog / retryLastUser / pro-arm (14 tests)
setBudget(null) clears cap and re-arms the 80% warning
setBudget(0) treated same as null
clearLog() drops all messages and resets scratch state (the /clear command must produce a blank slate)
retryLastUser() returns null when no user message exists
retryLastUser() returns the last user message and trims the log
retryLastUser() handles non-string content safely
armProForNextTurn() is consumed by the next step() and produces a warning (the /pro intent is one-shot and user-visible)
disarmPro() cancels arming before the turn starts

noteToolFailureSignal auto-escalation (9 tests)
false when failure count is below threshold (3)
true and sets _escalateThisTurn when count reaches threshold
"error" + "search text not found" in a result JSON triggers a search-mismatch bump
autoEscalate: false blocks escalation entirely
"search text not found" does NOT bump
Repair signals (scavenged, truncationsFixed, stormsBroken) bump proportionally
_turnFailureTypes records granular breakdown
formatFailureBreakdown() message

Mutation score impact