Summary
--fixture today verifies one direction: the checked-in golden fixtureSource passes every gate without calling a model. It never verifies the other direction — that a case can reject a plausible-but-wrong answer. A case whose requiredSourcePatterns + exact stdout still pass on a near-miss is a non-discriminating eval: a silent false positive that inflates live pass rates.
Why it matters
The eval prompt intentionally omits Zero syntax (README), and #104 shows models clustering on PAR100/IMP001 — a syntax-acquisition cliff before typecheck. As skill/feedback work closes that cliff, models start emitting structurally-close-but-wrong Zero, which is exactly where weak gates leak. Each case should be proven to discriminate before it gates a frontier run.
Proposed shape (matches existing conventions)
negativeFixtures?: {
  label: string;
  source: string; // a near-miss
  expectGate: "stdout" | "pattern" | "check" | "run";
}[];
--fixture then asserts, per case: the golden passes and every negative fails — reusing sourcePatternFailures and the existing stdout/run comparison — reporting which gate caught each negative, or erroring if one slips through (that case's checks are too weak).
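A minimal sketch of that fixture-mode loop, to make the contract concrete. Only negativeFixtures, label, source, and expectGate come from the proposal; GateRunner, NegativeResult, and checkNegatives are hypothetical names, and the runner stands in for whatever the harness already does with sourcePatternFailures and the stdout/run comparison. It also assumes that a negative caught by a gate other than its expectGate counts as a failure, which is one of the points the expectGate contract would need to settle.

```typescript
// Gate names are taken from the proposed expectGate union.
type Gate = "stdout" | "pattern" | "check" | "run";

interface NegativeFixture {
  label: string;
  source: string; // a near-miss
  expectGate: Gate;
}

// Hypothetical stand-in for the existing harness checks: returns the first
// gate that rejects the source, or null if the source passes every gate.
type GateRunner = (source: string) => Gate | null;

interface NegativeResult {
  label: string;
  caughtBy: Gate | null;
  ok: boolean;
  message: string;
}

// For each negative, record which gate caught it and flag two failure modes:
// it slipped through every gate, or it was caught by an unexpected gate.
function checkNegatives(
  negatives: NegativeFixture[],
  firstFailingGate: GateRunner,
): NegativeResult[] {
  return negatives.map((neg): NegativeResult => {
    const caughtBy = firstFailingGate(neg.source);
    if (caughtBy === null) {
      return {
        label: neg.label,
        caughtBy,
        ok: false,
        message: `${neg.label}: slipped through every gate (checks too weak)`,
      };
    }
    if (caughtBy !== neg.expectGate) {
      // Assumption: a gate mismatch is an error rather than a warning.
      return {
        label: neg.label,
        caughtBy,
        ok: false,
        message: `${neg.label}: expected ${neg.expectGate}, caught by ${caughtBy}`,
      };
    }
    return {
      label: neg.label,
      caughtBy,
      ok: true,
      message: `${neg.label}: caught by ${caughtBy}`,
    };
  });
}
```

Returning a result per negative, instead of throwing on the first miss, would let --fixture report every weak case in a single run.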
Starter negatives
hello-world: drop the trailing \n → must fail exact-stdout.
fibonacci: remove the fib(10) == 55 line → must fail the missing requiredSourcePattern, proving the per-value asserts are load-bearing.
scale-multi-command-cli: return x - y from multiply, keep help correct → must fail the multiply 6 7 → 42 runCheck, proving multi-route coverage isn't met by the happy path.
Scope
Pure addition: a new optional field plus a small fixture-mode loop, with zero change to live scoring or the sandbox path. This sits entirely in the TS harness and eval methodology.
Happy to open a WIP PR with the field + the three base-case negatives, or keep this as an RFC first if you'd rather lock the expectGate contract before code. Preference?