chore: improve loop.ts tests by wviana · Pull Request #271 · esengine/DeepSeek-Reasonix

wviana · 2026-05-05T17:19:15Z

Summary

Strengthen the CacheFirstLoop test suite with 53 new tests covering
previously untested constructor option combinations, the configure()
method, setBudget/clearLog/retryLastUser, pro-arm lifecycle, and
the auto-escalation failure-signal logic. Mutation score for
src/loop.ts improved from 40.3% ‚Üí 54.5% (54 additional
mutants killed, 33 fewer survivors).

Motivation

The mutation report for src/loop.ts showed a large gap: most
constructor branches, the entire configure() method, and the failure
accumulation / auto-escalation logic in noteToolFailureSignal had no
assertions. Bugs in any of these areas would not be caught by the
existing suite -- they were either uncovered (no test reached the
branch) or survived (tests ran but didn't verify the outcome).

Three groups of business rules in particular needed coverage:

Constructor option interactions. Branch, harvest, and stream
have cascading rules (branch forces harvest on and stream off;
harvest object carries options through). Getting these wrong would
silently break reasonix code presets.
configure() cascades. The same interaction rules apply when
options are changed mid-session via slash commands. If configure
didn't mirror the constructor's cascade, toggling branch off then on
would leave harvest or stream in an inconsistent state.
Failure-signal escalation. noteToolFailureSignal counts
SEARCH-mismatch errors and repair signals (scavenged / truncated /
storm-broken tool calls) and auto-escalates to deepseek-v4-pro
when the threshold is crossed. A logic error here would either
escalate too late (users wait through failing flash calls) or never
(sessions degrade without auto-recovery).

What's tested now

Constructor option combinations (20 tests)

Business rule	Why it matters
`branch: 3` creates `branchOptions = {budget:3}` and enables branch	Users get multi-sample consistency
`branch: 1` is a no-op (budget > 1 required)	Configuring a single sample shouldn't force branch overhead
Branch forces `stream: false` while preserving `_streamPreference`	Streaming is incompatible with branch -- but the user's preference must survive for later
Branch forces `harvestEnabled: true` even when `harvest: false`	Branch selector needs harvest; disabling harvest shouldn't override branch
`harvest: true` enables harvest independently	Users who want plan-state tracking without branching
`harvest: {maxPlanSteps:N}` sets both flag and options	Harvest config object carries through to plan-state extraction
`budgetUsd: 0` or negative is treated as `null` (disabled)	Defensive against accidental zero cap
`autoEscalate: false` is respected	Users who pin a model want no surprises

`configure()` method (15 tests)

Business rule	Why it matters
Each `ReconfigurableOptions` field updates independently	Slash commands change one option at a time
`configure({branch:3})` forces harvest on and stream off	Mid-session branch toggle must cascade identically to constructor
`configure({branch:{budget:1}})` disables branch	Toggling branch off restores stream preference
`configure({harvest:false})` when branch is on -- harvest stays on	Branch's harvest requirement cannot be overridden
`configure({harvest:false})` when branch is off -- harvest disables	No branch = no forced harvest
Stream preference is restored when branch is disabled	User shouldn't be stuck non-streaming after `/branch off`

`setBudget` / `clearLog` / `retryLastUser` / pro-arm (14 tests)

Business rule	Why it matters
`setBudget(null)` clears cap and re-arms the 80% warning	Removing a cap must reset the sticky "approaching" flag
`setBudget(0)` treated same as `null`	Defensive against accidental zero
`clearLog()` drops all messages and resets scratch state	`/clear` command must produce a blank slate
`retryLastUser()` returns `null` when no user message exists	Graceful on empty log -- no crash
`retryLastUser()` returns the last user message and trims the log	Mid-conversation retry targets the most recent turn
`retryLastUser()` handles non-string content safely	Malformed entries shouldn't crash retry
`armProForNextTurn()` is consumed by the next `step()` and produces a warning	`/pro` intent is one-shot and user-visible
`disarmPro()` cancels arming before the turn starts	User can change their mind before sending

`noteToolFailureSignal` auto-escalation (9 tests)

Business rule	Why it matters
Returns `false` when failure count is below threshold (3)	No premature escalation
Returns `true` and sets `_escalateThisTurn` when count reaches threshold	Escalates exactly at the right moment
`"error"` + `"search text not found"` in a result JSON triggers a `search-mismatch` bump	SEARCH/REPLACE mismatches are a strong "model struggling" signal
`autoEscalate: false` blocks escalation entirely	Users who opted out stay on their chosen model
Already escalated this turn -- no double-escalation	No redundant state transitions
Error without `"search text not found"` does NOT bump	Other errors are not necessarily "model struggling"
Repair counts (`scavenged`, `truncationsFixed`, `stormsBroken`) bump proportionally	2 scavenged + 3 truncated = 5 failure counts toward escalation
`_turnFailureTypes` records granular breakdown	Enables the `formatFailureBreakdown()` message

Mutation score impact

Metric	Before	After	Delta
Mutation score (killed / killed+survived)	40.28%	54.50%	+14.22pp
Coverage (killed+survived / total)	~40%	65.97%	+25pp
Killed mutants	442	484	+42
Survived mutants	454	404	-50
Remaining NoCoverage	467	458	-9
Total tests in loop.test.ts	75	95	+20

esengine · 2026-05-05T21:55:28Z

Thanks for the depth on this — the mutation-score data and the per-rule test tables are genuinely useful. One thing I'd like reworked before merge:

(loop as any). reaches into private state in 37 places across 5 internal members (_turnFailureCount, _turnFailureTypes, _escalateThisTurn, _budgetWarned, plus calling noteToolFailureSignal directly). That kills mutants today, but it pins the tests to internal field names — next time the failure-tracking representation changes, every one of these tests breaks for reasons that have nothing to do with behavior. Where the same signal is reachable by driving real tool failures through step() (the existing escalation tests already do this), prefer that path; reserve the private-field access for cases where genuinely no public surface exposes the thing being tested, and call those out as such.

A separate, smaller note: ~18 lines in tests/loop.test.ts are pure — → - rewrites in comments and string fixtures ("done — here's what I found." → "done - here's what I found.", etc.) — looks like an editor / encoding setting stripped non-ASCII. — is the project's house style. Not a blocker; happy to take that as a follow-up cleanup or revert here, your call.

Once the private-state access is reworked I'm happy to take the whole PR as-is — no need to split. The constructor / configure / noteToolFailureSignal blocks are well-organized as-is.

wviana · 2026-05-05T22:11:53Z

A separate, smaller note: ~18 lines in tests/loop.test.ts are pure — → - rewrites in comments and string fixtures ("done — here's what I found." → "done - here's what I found.", etc.) — looks like an editor / encoding setting stripped non-ASCII. — is the project's house style. Not a blocker; happy to take that as a follow-up cleanup or revert here, your call.

Yeah, I've figured out why I was seeing some weird chars, at some moment I got some tests falling as my system was not in english, so I've set LANG=en_US and this caused me to see as broken and cause me to change it. Now I know I should set it to LANG=en_US.UTC-8 instead.

chore: improve loop.ts tests

4aef923

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: improve loop.ts tests#271

chore: improve loop.ts tests#271
wviana wants to merge 1 commit intoesengine:mainfrom
wviana:chore/improve-loop-mutation-score

wviana commented May 5, 2026

Uh oh!

esengine commented May 5, 2026 •

edited

Loading

Uh oh!

wviana commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wviana commented May 5, 2026

Summary

Motivation

What's tested now

Constructor option combinations (20 tests)

configure() method (15 tests)

setBudget / clearLog / retryLastUser / pro-arm (14 tests)

noteToolFailureSignal auto-escalation (9 tests)

Mutation score impact

Uh oh!

esengine commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wviana commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`configure()` method (15 tests)

`setBudget` / `clearLog` / `retryLastUser` / pro-arm (14 tests)

`noteToolFailureSignal` auto-escalation (9 tests)

esengine commented May 5, 2026 •

edited

Loading