Discrimination checks for --fixture: verify a case rejects near-misses, not only that the golden passes #403

@Alemusica

Summary

--fixture today verifies one direction: the checked-in golden fixtureSource passes every gate without calling a model. It never verifies the other direction — that a case can reject a plausible-but-wrong answer. A case whose requiredSourcePatterns + exact stdout still pass on a near-miss is a non-discriminating eval: a silent false positive that inflates live pass rates.

Why it matters

The eval prompt intentionally omits Zero syntax (README), and #104 shows models clustering on PAR100/IMP001 — a syntax-acquisition cliff before typecheck. As skill/feedback work closes that cliff, models start emitting structurally-close-but-wrong Zero, which is exactly where weak gates leak. Each case should be proven to discriminate before it gates a frontier run.

Proposed shape (matches existing conventions)

negativeFixtures?: {
  label: string;
  source: string;        // a near-miss
  expectGate: "stdout" | "pattern" | "check" | "run";
}[];

--fixture then asserts, per case, that the golden passes and every negative fails, reusing sourcePatternFailures and the existing stdout/run comparison. It reports which gate caught each negative, and errors if one slips through, since that means the case's checks are too weak.
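
A minimal sketch of that per-case loop, to make the contract concrete. Only negativeFixtures, label, source, and expectGate come from the proposal; GateResult, RunGates, DiscriminationReport, and checkNegatives are hypothetical names, and runGates stands in for whatever the harness already does to evaluate a source against its gates. Treating "caught by a different gate than expectGate" as a failure is an assumption and is exactly the part of the contract still open for discussion.

```typescript
// Hypothetical sketch: not the harness's actual API.
type Gate = "stdout" | "pattern" | "check" | "run";

interface NegativeFixture {
  label: string;
  source: string; // a near-miss
  expectGate: Gate;
}

interface GateResult {
  gate: Gate;
  passed: boolean;
}

// Stand-in for the existing gate pipeline (assumed): one result per gate,
// in the order the harness evaluates them.
type RunGates = (source: string) => GateResult[];

interface DiscriminationReport {
  label: string;
  caughtBy: Gate | null; // null means the negative passed every gate
  ok: boolean;
  message: string;
}

function checkNegatives(
  negatives: NegativeFixture[],
  runGates: RunGates,
): DiscriminationReport[] {
  return negatives.map((neg) => {
    // The first failing gate is the one that "caught" this negative.
    const firstFailure = runGates(neg.source).find((r) => !r.passed);
    const caughtBy: Gate | null = firstFailure ? firstFailure.gate : null;

    if (caughtBy === null) {
      return {
        label: neg.label,
        caughtBy,
        ok: false,
        message: `negative "${neg.label}" passed every gate: checks are too weak`,
      };
    }
    if (caughtBy !== neg.expectGate) {
      // Assumption: a mismatch is reported as a failure, not a warning.
      return {
        label: neg.label,
        caughtBy,
        ok: false,
        message: `negative "${neg.label}" caught by ${caughtBy}, expected ${neg.expectGate}`,
      };
    }
    return {
      label: neg.label,
      caughtBy,
      ok: true,
      message: `negative "${neg.label}" caught by ${caughtBy}`,
    };
  });
}
```

Fixture mode would fail a case when any report has ok set to false, and print message either way so the catching gate is visible in the output.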

Starter negatives

  • hello-world: drop the trailing \n → must fail exact-stdout.
  • fibonacci: remove the fib(10) == 55 line → must fail the missing requiredSourcePattern, proving the per-value asserts are load-bearing.
  • scale-multi-command-cli: return x - y from multiply, keep help correct → must fail the multiply 6 7 → 42 runCheck, proving multi-route coverage isn't met by the happy path.
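
To illustrate why the hello-world negative is useful, here is a small sketch contrasting an exact stdout comparison with a whitespace-trimming one. The expected string and both helper names are hypothetical placeholders, not the case's real values. The point is only that the missing-newline near-miss separates the two: a gate that trims would let it through.

```typescript
// Hypothetical expected output; the real hello-world case defines its own.
const expectedStdout = "hello\n";

// Near-miss: same text, trailing newline dropped.
const nearMissStdout = "hello";

// Exact comparison, as the proposal assumes the stdout gate performs.
function exactStdoutPasses(actual: string, expected: string): boolean {
  return actual === expected;
}

// A looser comparison, shown only for contrast.
function trimmedStdoutPasses(actual: string, expected: string): boolean {
  return actual.trim() === expected.trim();
}

const caughtByExact = !exactStdoutPasses(nearMissStdout, expectedStdout);
const caughtByTrimmed = !trimmedStdoutPasses(nearMissStdout, expectedStdout);

console.log({ caughtByExact, caughtByTrimmed });
```

With exact comparison the near-miss is rejected; with the trimmed comparison it is not, which is the kind of silent weakening a checked-in negative would surface.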

Scope

Pure addition: a new optional field plus a small fixture-mode loop, with zero change to live scoring or the sandbox path. This sits entirely in the TS harness and eval methodology.

Happy to open a WIP PR with the field + the three base-case negatives, or keep this as an RFC first if you'd rather lock the expectGate contract before code. Preference?
