Discrimination checks for `--fixture`: verify a case rejects near-misses, not only that the golden passes

## Summary

`--fixture` today verifies one direction: the checked-in golden `fixtureSource` *passes* every gate without calling a model. It never verifies the other direction: that a case can *reject* a plausible-but-wrong answer. A case whose `requiredSourcePatterns` and exact-stdout checks still pass on a near-miss is a non-discriminating eval: a silent false positive that inflates live pass rates.

## Why it matters

The eval prompt intentionally omits Zero syntax (per the README), and #104 shows models clustering on PAR100/IMP001, a syntax-acquisition cliff that sits before typecheck. As skill/feedback work closes that cliff, models will start emitting structurally-close-but-wrong Zero, which is exactly where weak gates leak. Each case should be proven to discriminate before it gates a frontier run.

## Proposed shape (matches existing conventions)

```ts
negativeFixtures?: {
  label: string;
  source: string;        // a near-miss
  expectGate: "stdout" | "pattern" | "check" | "run";
}[];
```

`--fixture` would then assert, per case, that the golden passes **and** every negative fails, reusing `sourcePatternFailures` and the existing stdout/run comparison. It reports which gate caught each negative, and errors if one slips through, since that means the case's checks are too weak.
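A rough sketch of what that fixture-mode loop could look like. Only `negativeFixtures`, `label`, `source`, `expectGate`, `fixtureSource` and the four gate names come from this proposal; `FixtureCase`, `GateRunner`, `checkDiscrimination` and the toy runner are hypothetical stand-ins for the real harness internals (which would wrap `sourcePatternFailures` and the stdout/run comparison). Whether a mismatch against `expectGate` should be a hard error is also still open, so the sketch reports it separately from a true slip-through.

```typescript
type Gate = "stdout" | "pattern" | "check" | "run";

interface NegativeFixture {
  label: string;
  source: string; // a near-miss
  expectGate: Gate;
}

// Hypothetical, trimmed-down case shape for illustration only.
interface FixtureCase {
  id: string;
  fixtureSource: string;
  negativeFixtures?: NegativeFixture[];
}

// Hypothetical: returns the first gate a source fails, or null if it passes every gate.
type GateRunner = (c: FixtureCase, source: string) => Gate | null;

interface NegativeResult {
  label: string;
  caughtBy: Gate | null;
  slipped: boolean;         // no gate rejected the near-miss
  matchesExpected: boolean; // the rejecting gate is the one the author named
}

function checkDiscrimination(c: FixtureCase, firstFailingGate: GateRunner): NegativeResult[] {
  // Direction 1 (existing behaviour): the golden must pass every gate.
  const goldenFailure = firstFailingGate(c, c.fixtureSource);
  if (goldenFailure !== null) {
    throw new Error(`${c.id}: golden fixture failed the ${goldenFailure} gate`);
  }
  // Direction 2 (proposed): every negative must be rejected by some gate.
  return (c.negativeFixtures ?? []).map((neg) => {
    const caughtBy = firstFailingGate(c, neg.source);
    return {
      label: neg.label,
      caughtBy,
      slipped: caughtBy === null,
      matchesExpected: caughtBy === neg.expectGate,
    };
  });
}

// Toy runner and toy sources (plain strings, not real Zero) to exercise the loop.
const toyRunner: GateRunner = (_c, source) => {
  if (!source.includes("assert")) return "pattern";
  if (!source.endsWith("\n")) return "stdout";
  return null;
};

const demo: FixtureCase = {
  id: "toy",
  fixtureSource: "assert ok\n",
  negativeFixtures: [
    { label: "missing assert", source: "ok\n", expectGate: "pattern" },
    { label: "no trailing newline", source: "assert ok", expectGate: "stdout" },
    { label: "too weak to catch", source: "assert wrong\n", expectGate: "run" },
  ],
};

const results = checkDiscrimination(demo, toyRunner);
for (const r of results) {
  console.log(`${r.label}: ${r.slipped ? "SLIPPED THROUGH" : `caught by ${r.caughtBy}`}`);
}
```

The third toy negative is there on purpose: it passes every toy gate, which is the "checks are too weak" condition the real `--fixture` run would surface as an error.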

## Starter negatives

- `hello-world`: drop the trailing `\n` → must fail exact-stdout.
- `fibonacci`: remove the `fib(10) == 55` line → must fail on the now-unmatched `requiredSourcePatterns` entry, proving the per-value asserts are load-bearing.
- `scale-multi-command-cli`: return `x - y` from `multiply`, keep `help` correct → must fail the `multiply 6 7 → 42` runCheck, proving multi-route coverage isn't met by the happy path.

## Scope

Pure addition: a new optional field plus a small fixture-mode loop, with zero change to live scoring or the sandbox path. This sits entirely in the TS harness and eval methodology.

Happy to open a WIP PR with the field + the three base-case negatives, or keep this as an RFC first if you'd rather lock the `expectGate` contract before code. Preference?

