fix(voice): main python-ci red — stale feature-file contract counts (108→127) after #561

**`main` python-ci is red and blocking CI for all scenario PRs.** Two voice feature-file contract tests assert hardcoded scenario counts that the test suite has since outgrown.

## Expected

`tests/voice/test_feature_file_contract.py` passes on `main`; `python-ci` is green so PRs can merge against a passing baseline.

## Actual

Two tests fail:
- `test_feature_file_declares_expected_scenario_count` — asserts `len(scenarios) == 108`, but the feature files now contain **127**.
- `test_tag_split_matches_prove_it_report` — asserts `(unit, integration, e2e) == (75, 8, 25)`, but the actual split is **(79, 13, 35)**.

Both assertion messages prescribe the remedy: update the count(s) in the test **and** regenerate `docs/proposals/issue-350-prove-it-report.md`.

## Reproduction

- On `main` (latest `1c5a66c7`), run the python-ci `test (3.12)` job — or `cd python && uv run pytest tests/voice/test_feature_file_contract.py`.
- Root cause: the voice/ts-parity work that landed in **PR #561** (merge commit `5847c4b4`, 2026-06-04 15:10Z) added voice scenarios (ts-pipecat / LiveKit adapter scenarios) without updating the hardcoded counts in the contract test or the prove-it report.
- main went red at `5847c4b4` and has stayed red across the subsequent merges (`3765f3c5`, `1c5a66c7`) — ~3.5h with no fix in flight.

## Investigation

**Root cause confirmed (BLUF):** `python/tests/voice/test_feature_file_contract.py` hardcodes two stale snapshots of `specs/voice-agents.feature` — a total scenario count (`108`) and a per-tag split (`75 @unit / 8 @integration / 25 @e2e`). **PR #561 (`5847c4b4`) grew the feature file from 108 → 127 scenarios** without touching these constants, so two assertions now fail and `main` python-ci is red. The fix is purely **updating the constants (and their explanatory docstrings) to the live values** `127` and `(79, 13, 35)`. No production code is involved — this is a guard test that drifted behind the contract it guards.

**Phase 2 early-exit:** unambiguous defect, high confidence — both failures reproduced locally with exact actuals matching the issue body, and the 108→127 delta is git-bisected to a single commit. Phase 4 strategy enumeration was not exhausted because the defect is a literal stale-constant mismatch, not a design choice.

### Evidence (run this turn, worktree `issue609/...`, HEAD `1c5a66c7`)

- **Failure 1** — `test_feature_file_declares_expected_scenario_count`:
  `AssertionError: Expected 108 scenarios; found 127. … assert 127 == 108`
- **Failure 2** — `test_tag_split_matches_prove_it_report`:
  `AssertionError: Expected 75 @unit / 8 @integration / 25 @e2e; found 79 / 13 / 35. … assert (79, 13, 35) == (75, 8, 25)`
- **Split is internally consistent:** `79 + 13 + 35 = 127` — the new total and the new split agree. (Every scenario is tagged exactly one of the three; `test_every_scenario_is_tagged_unit_integration_or_e2e` passes, confirming no untagged or double-counted scenarios.)
- **Bisect of the 108→127 jump:**
  `git show 5847c4b4~1:specs/voice-agents.feature | grep -c "^\s*Scenario:"` → **108**;
  `git show 5847c4b4:specs/voice-agents.feature | grep -c …` → **127**. The break landed entirely in PR #561.
- **Why the constants weren't updated in #561:** the test file's last three touches are `#594` (`f8c56219`), `#497` (`bb4ff9bb`), `#355` (`128ac947`, original add) — **#561 is not among them.** #561 added TS-parity scenarios to the *shared* feature file (its commit body cites "13 scenarios bound … via vitest-cucumber", "+5 @ts-effects", "+7 adapters") but never reconciled the Python-side count guard.

### Key finding that refines the fix surface

**The prove-it report the test docstrings tell you to "regenerate" does not exist.** Both assertion messages and the `test_tag_split_matches_prove_it_report` name reference `docs/proposals/issue-350-prove-it-report.md`. That file:
- is **absent from the working tree, this branch, `origin/main`, and all of `--all` history** (`git cat-file -e origin/main:docs/proposals/issue-350-prove-it-report.md` → "does not exist"; `git log --all --diff-filter=A -- '**/issue-350-prove-it-report.md'` → empty);
- in fact `docs/proposals/` **is not a tracked directory at all** in the working tree — a *sibling* doc `docs/proposals/issue-350-ralph-real-transports.md` is referenced by `python/scenario/voice/adapters/twilio.py:17` but is **also absent**.

So the issue body's prescribed remedy ("update the count(s) **and** regenerate the prove-it report") is half-moot: **there is no report to regenerate.** The references to it in the test are dangling. This is why the ACs below treat the report as "if-present" and add a regression AC to forbid re-introducing a stale constant without a single source of truth.

### Strategies considered

| Strategy | Result |
|---|---|
| Update the two hardcoded constants + their docstring breakdowns to `127` / `(79,13,35)` | **Works** — directly clears both failures; matches the live, internally-consistent contract. Minimal surface. |
| Regenerate `docs/proposals/issue-350-prove-it-report.md` per the docstrings | **Not applicable** — file never existed; nothing to regenerate. References are dangling. |
| Derive the counts dynamically (compute from the parsed feature, drop the hardcoded literals) | **Out of scope / changes intent.** The test is *deliberately* a tripwire: it forces a human decision whenever the contract changes (docstring: "Any change to this count must be a deliberate contract update"). Auto-deriving would make it tautological and silently absorb future scenario drift — the opposite of its purpose. Noted for the implementer to reject unless the user re-scopes. |

### Findings for the implementer

- **Touchpoints to change (all in `python/tests/voice/test_feature_file_contract.py`):**
  1. L62 assertion `== 108` → `== 127`; L63 message string `Expected 108`.
  2. L121 assertion `== (75, 8, 25)` → `== (79, 13, 35)`; L122 message `Expected 75 @unit / 8 @integration / 25 @e2e`.
  3. The **docstrings** (L47–59 and L108–113) narrate the old breakdown ("108 scenarios: 83 original + 4 + 12 + 3 + 3 + 3", "75 @unit / 8 @integration / 25 @e2e … Demo recordings add 1 @unit + 2 @integration …"). These are now wrong and must be updated so the next reader isn't misled. The exact +19 / +(4,5,10) attribution to #561's scenario set should be reflected, or the docstring simplified to state the current contract without a stale derivation.
- **Dangling-report references (L59, L65, L124, and the test *name* `test_tag_split_matches_prove_it_report`):** decide with the user whether to (a) drop the "regenerate the prove-it report" instruction since the file doesn't exist, or (b) keep the reference as an aspirational TODO. Do **not** silently leave assertion messages telling future engineers to update a file that isn't there — that's how this drift started.
- **Ruled out:** dynamic count derivation (changes the test's tripwire intent — see table); touching the feature file (the 127 scenarios are correct; the *test* is stale, not the spec); touching any production `scenario/` code (none involved).
- **No migration / data / external-call surface.** Pure test-constant edit; rollback is a one-line git revert. Rollback AC omitted (no irreversible operation).

### Caveats

- Verify the live numbers at fix time with a fresh `uv run pytest tests/voice/test_feature_file_contract.py` — if another PR lands between now and the fix and changes the feature file again, the constants must match *that* HEAD, not the 127/(79,13,35) snapshot captured here.
- The two raw `grep -c "@integration"` / `grep -c "@e2e"` line-counts over the feature file give 14 / 39, which differ from the test's 13 / 35. The discrepancy is expected: the test counts *scenarios carrying the tag* via the Gherkin parser, while `grep -c` counts *line occurrences* (feature/Rule-level tags, multi-tag lines, or `@e2e` appearing in tags like `@ts-e2e`). **The test's parser-based numbers `(79, 13, 35)` are authoritative** — do not "correct" them to the grep values.

## Acceptance Criteria

- **AC1** — `cd python && uv run pytest tests/voice/test_feature_file_contract.py` exits `0` with all 5 tests passing (previously red on `main`). Evidence: pytest stdout showing `5 passed`.
- **AC2** — `test_feature_file_declares_expected_scenario_count` asserts the **live** scenario count of `specs/voice-agents.feature` (currently 127), not a stale literal. Evidence: the assertion's expected value equals `grep -cE "^\s*Scenario:" specs/voice-agents.feature` at the fix HEAD; pytest green.
- **AC3** — `test_tag_split_matches_prove_it_report` asserts the live per-tag split (currently `(79, 13, 35)`) and the three sub-counts sum to the total in AC2. Evidence: assertion tuple == parser-computed `(unit, integration, e2e)`; pytest green.
- **AC4** — The docstrings in both updated tests describe the **current** contract (no leftover "108"/"75/8/25" derivations that contradict the new assertions). Evidence: `grep -nE "108|75 @unit|\(75, 8, 25\)" python/tests/voice/test_feature_file_contract.py` returns no matches anywhere in the file (the file has no other legitimate use of those literals at fix HEAD — confirmed; a file-wide grep is therefore an exact proxy for "no stale derivation in the two tests").

### Consequence & failure-mode coverage

- **AC5 (regression — adjacent guards still hold):** The other three contract tests (`test_feature_file_parses_cleanly`, `test_every_scenario_has_at_least_one_given_and_one_then`, `test_every_scenario_is_tagged_unit_integration_or_e2e`) continue to pass unchanged — the fix touches only the two count assertions, not the parsing or tagging guards. Evidence: full-file pytest run shows all 5 green; `git diff` touches only the two count tests + their docstrings.
- **AC6 (regression — CI baseline restored):** `python-ci` `test (3.12)` job is green on the PR branch, so the gate that #561/#604/#386 left red is cleared for subsequent PRs. Evidence: GitHub checks rollup on the PR showing python-ci ✅.
- **AC7 (downstream / dangling-reference reconciliation):** The fix does not leave assertion messages or test names instructing engineers to update `docs/proposals/issue-350-prove-it-report.md` while that file is absent from the repo. The references are **removed or softened** so the test no longer prescribes regenerating a nonexistent file. **Creating a stub report to satisfy the reference is explicitly rejected** — that would re-introduce the original drift (a file the test points at that nobody maintains), the same anti-pattern AC8 rejects for dynamic derivation. Evidence: `grep -rn "issue-350-prove-it-report.md" python/tests/voice/test_feature_file_contract.py` shows every remaining mention (if any) is descriptive prose, not an actionable "regenerate this file" instruction; no new file is created under `docs/proposals/` (shown in `git diff --stat`).
- **AC8 (failure-mode — the tripwire still fires on real drift):** The test remains a deliberate tripwire, not auto-derived: any scenario-count delta in `specs/voice-agents.feature` without a matching constant update still fails `test_feature_file_declares_expected_scenario_count` with the actionable message "If this is an intentional contract change, update the count …". Evidence: a local mutation check confirms the test fails with that message on a count delta, then the mutation is reverted. (Proof step only; no committed change — mechanism left to the prover.)

## Plan

**Approach:** A two-constant test fix plus a dangling-reference cleanup, all in one file (`python/tests/voice/test_feature_file_contract.py`). Re-derive the live numbers from `specs/voice-agents.feature` at fix HEAD, update the two stale assertions (`108`→live total, `(75,8,25)`→live split) and their explanatory docstrings, soften the references to the nonexistent prove-it report so the test no longer prescribes regenerating a file that isn't in the repo, then verify the full 5-test suite green locally and via python-ci. **No production code changes; no feature-file changes** — the 127 scenarios are correct, the *guard* drifted.

**Sequencing:**
1. **Re-derive at HEAD.** Run `uv run pytest tests/voice/test_feature_file_contract.py` and capture the failure actuals; cross-check the total against `grep -cE "^\s*Scenario:" specs/voice-agents.feature`. This is the source of truth for the constants — not the 127/(79,13,35) snapshot in the Investigation, which could be stale if another PR landed. *Completion signal:* two confirmed actual tuples that are internally consistent (split sums to total). (Satisfies the AC2/AC3 "live count" requirement at the right HEAD.)
2. **Update the two constants + their docstrings.** Edit the `== 108` assertion + its message, the `== (75, 8, 25)` assertion + its message, and the two docstrings so their narrated breakdowns describe the current contract (no leftover "108 / 75-8-25" derivation). *Completion signal:* `git diff` shows only those two tests + docstrings changed. (AC2, AC3, AC4.)
3. **Reconcile the dangling prove-it-report references.** Soften/remove the "regenerate `docs/proposals/issue-350-prove-it-report.md`" instructions in the two assertion messages and docstrings so they no longer prescribe touching a file absent from the repo. Do **not** create a stub report. *Completion signal:* `grep -rn "issue-350-prove-it-report.md"` in the test shows only descriptive prose (if anything); `git diff --stat` creates no file under `docs/proposals/`. (AC7.)
4. **Verify + prove the tripwire.** Run the full file → 5 passed. Run a throwaway mutation (append a dummy `Scenario:` to the feature file) to confirm the count test still fails with its actionable message, then revert. Confirm python-ci `test (3.12)` green on the PR. *Completion signal:* `5 passed` locally; mutation reverted; CI rollup ✅. (AC1, AC5, AC6, AC8.)

**Key decisions:**
- **Keep the hardcoded literals; do NOT auto-derive the counts.** The test is a deliberate tripwire — its docstring says "Any change to this count must be a deliberate contract update." Auto-deriving would make it tautological and silently swallow future scenario drift (which is exactly the failure mode AC8 guards against). Rationale recorded so a reviewer doesn't propose the "cleaner" auto-derive.
- **Soften the prove-it-report references rather than create the report.** The file never existed in any history; creating a stub would re-introduce the original drift (a referenced-but-unmaintained file). Reword to descriptive prose instead.
- **Re-derive numbers at fix HEAD, not from this issue body.** Guards against a concurrent feature-file change between investigation and fix.

**Risks / open questions:**
- *A concurrent PR changes `specs/voice-agents.feature` before this lands* → constants must match the new HEAD. Mitigated by Phase 1 re-derivation; the ACs are phrased as "the live count," not the literal 127.
- *The exact docstring rewording* (full +19/(4,5,10) attribution vs. a simplified current-contract statement) is a style choice left to `/spec`/implementation; either satisfies AC4.

**Out of scope:**
- Auto-deriving the counts / restructuring the test into a dynamic check (changes the tripwire's intent — explicitly rejected above).
- Touching `specs/voice-agents.feature` (the 127 scenarios are correct).
- The *second* dangling proposal-doc reference at `python/scenario/voice/adapters/twilio.py:17` (`issue-350-ralph-real-transports.md`, also absent) — same class, but production code and unrelated to the CI-red failure. Tracked as a separate follow-up issue.
- Creating `docs/proposals/issue-350-prove-it-report.md` (rejected — would re-introduce drift).

---
*Classification: Bug*
*Status: investigation + plan complete — ACs validated (ac-reviewer: zero Must-Fix). Proceeding to /spec.*



Strategy	Result
Update the two hardcoded constants + their docstring breakdowns to `127` / `(79,13,35)`	Works — directly clears both failures; matches the live, internally-consistent contract. Minimal surface.
Regenerate `docs/proposals/issue-350-prove-it-report.md` per the docstrings	Not applicable — file never existed; nothing to regenerate. References are dangling.
Derive the counts dynamically (compute from the parsed feature, drop the hardcoded literals)	Out of scope / changes intent. The test is deliberately a tripwire: it forces a human decision whenever the contract changes (docstring: "Any change to this count must be a deliberate contract update"). Auto-deriving would make it tautological and silently absorb future scenario drift — the opposite of its purpose. Noted for the implementer to reject unless the user re-scopes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(voice): main python-ci red — stale feature-file contract counts (108→127) after #561 #609

Expected

Actual

Reproduction

Investigation

Evidence (run this turn, worktree `issue609/...`, HEAD `1c5a66c7`)

Key finding that refines the fix surface

Strategies considered

Findings for the implementer

Caveats

Acceptance Criteria

Consequence & failure-mode coverage

Plan

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

fix(voice): main python-ci red — stale feature-file contract counts (108→127) after #561 #609

Description

Expected

Actual

Reproduction

Investigation

Evidence (run this turn, worktree issue609/..., HEAD 1c5a66c7)

Key finding that refines the fix surface

Strategies considered

Findings for the implementer

Caveats

Acceptance Criteria

Consequence & failure-mode coverage

Plan

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Evidence (run this turn, worktree `issue609/...`, HEAD `1c5a66c7`)