Skip to content

fix(voice): main python-ci red — stale feature-file contract counts (108→127) after #561 #609

@drewdrewthis

Description

@drewdrewthis

main python-ci is red and blocking CI for all scenario PRs. Two voice feature-file contract tests assert hardcoded scenario counts that the test suite has since outgrown.

Expected

tests/voice/test_feature_file_contract.py passes on main; python-ci is green so PRs can merge against a passing baseline.

Actual

Two tests fail:

  • test_feature_file_declares_expected_scenario_count — asserts len(scenarios) == 108, but the feature files now contain 127.
  • test_tag_split_matches_prove_it_report — asserts (unit, integration, e2e) == (75, 8, 25), but the actual split is (79, 13, 35).

Both assertion messages prescribe the remedy: update the count(s) in the test and regenerate docs/proposals/issue-350-prove-it-report.md.

Reproduction

  • On main (latest 1c5a66c7), run the python-ci test (3.12) job — or cd python && uv run pytest tests/voice/test_feature_file_contract.py.
  • Root cause: the voice/ts-parity work that landed in PR feat(typescript-sdk): voice agent testing — consolidated clean stack #561 (merge commit 5847c4b4, 2026-06-04 15:10Z) added voice scenarios (ts-pipecat / LiveKit adapter scenarios) without updating the hardcoded counts in the contract test or the prove-it report.
  • main went red at 5847c4b4 and has stayed red across the subsequent merges (3765f3c5, 1c5a66c7) — ~3.5h with no fix in flight.

Investigation

Root cause confirmed (BLUF): python/tests/voice/test_feature_file_contract.py hardcodes two stale snapshots of specs/voice-agents.feature — a total scenario count (108) and a per-tag split (75 @unit / 8 @integration / 25 @e2e). PR #561 (5847c4b4) grew the feature file from 108 → 127 scenarios without touching these constants, so two assertions now fail and main python-ci is red. The fix is purely updating the constants (and their explanatory docstrings) to the live values 127 and (79, 13, 35). No production code is involved — this is a guard test that drifted behind the contract it guards.

Phase 2 early-exit: unambiguous defect, high confidence — both failures reproduced locally with exact actuals matching the issue body, and the 108→127 delta is git-bisected to a single commit. Phase 4 strategy enumeration was not exhausted because the defect is a literal stale-constant mismatch, not a design choice.

Evidence (run this turn, worktree issue609/..., HEAD 1c5a66c7)

Key finding that refines the fix surface

The prove-it report the test docstrings tell you to "regenerate" does not exist. Both assertion messages and the test_tag_split_matches_prove_it_report name reference docs/proposals/issue-350-prove-it-report.md. That file:

  • is absent from the working tree, this branch, origin/main, and all of --all history (git cat-file -e origin/main:docs/proposals/issue-350-prove-it-report.md → "does not exist"; git log --all --diff-filter=A -- '**/issue-350-prove-it-report.md' → empty);
  • in fact docs/proposals/ is not a tracked directory at all in the working tree — a sibling doc docs/proposals/issue-350-ralph-real-transports.md is referenced by python/scenario/voice/adapters/twilio.py:17 but is also absent.

So the issue body's prescribed remedy ("update the count(s) and regenerate the prove-it report") is half-moot: there is no report to regenerate. The references to it in the test are dangling. This is why the ACs below treat the report as "if-present" and add a regression AC to forbid re-introducing a stale constant without a single source of truth.

Strategies considered

Strategy Result
Update the two hardcoded constants + their docstring breakdowns to 127 / (79,13,35) Works — directly clears both failures; matches the live, internally-consistent contract. Minimal surface.
Regenerate docs/proposals/issue-350-prove-it-report.md per the docstrings Not applicable — file never existed; nothing to regenerate. References are dangling.
Derive the counts dynamically (compute from the parsed feature, drop the hardcoded literals) Out of scope / changes intent. The test is deliberately a tripwire: it forces a human decision whenever the contract changes (docstring: "Any change to this count must be a deliberate contract update"). Auto-deriving would make it tautological and silently absorb future scenario drift — the opposite of its purpose. Noted for the implementer to reject unless the user re-scopes.

Findings for the implementer

  • Touchpoints to change (all in python/tests/voice/test_feature_file_contract.py):
    1. L62 assertion == 108== 127; L63 message string Expected 108.
    2. L121 assertion == (75, 8, 25)== (79, 13, 35); L122 message Expected 75 @unit / 8 @integration / 25 @e2e.
    3. The docstrings (L47–59 and L108–113) narrate the old breakdown ("108 scenarios: 83 original + 4 + 12 + 3 + 3 + 3", "75 @Unit / 8 @integration / 25 @e2e … Demo recordings add 1 @Unit + 2 @integration …"). These are now wrong and must be updated so the next reader isn't misled. The exact +19 / +(4,5,10) attribution to feat(typescript-sdk): voice agent testing — consolidated clean stack #561's scenario set should be reflected, or the docstring simplified to state the current contract without a stale derivation.
  • Dangling-report references (L59, L65, L124, and the test name test_tag_split_matches_prove_it_report): decide with the user whether to (a) drop the "regenerate the prove-it report" instruction since the file doesn't exist, or (b) keep the reference as an aspirational TODO. Do not silently leave assertion messages telling future engineers to update a file that isn't there — that's how this drift started.
  • Ruled out: dynamic count derivation (changes the test's tripwire intent — see table); touching the feature file (the 127 scenarios are correct; the test is stale, not the spec); touching any production scenario/ code (none involved).
  • No migration / data / external-call surface. Pure test-constant edit; rollback is a one-line git revert. Rollback AC omitted (no irreversible operation).

Caveats

  • Verify the live numbers at fix time with a fresh uv run pytest tests/voice/test_feature_file_contract.py — if another PR lands between now and the fix and changes the feature file again, the constants must match that HEAD, not the 127/(79,13,35) snapshot captured here.
  • The two raw grep -c "@integration" / grep -c "@e2e" line-counts over the feature file give 14 / 39, which differ from the test's 13 / 35. The discrepancy is expected: the test counts scenarios carrying the tag via the Gherkin parser, while grep -c counts line occurrences (feature/Rule-level tags, multi-tag lines, or @e2e appearing in tags like @ts-e2e). The test's parser-based numbers (79, 13, 35) are authoritative — do not "correct" them to the grep values.

Acceptance Criteria

  • AC1cd python && uv run pytest tests/voice/test_feature_file_contract.py exits 0 with all 5 tests passing (previously red on main). Evidence: pytest stdout showing 5 passed.
  • AC2test_feature_file_declares_expected_scenario_count asserts the live scenario count of specs/voice-agents.feature (currently 127), not a stale literal. Evidence: the assertion's expected value equals grep -cE "^\s*Scenario:" specs/voice-agents.feature at the fix HEAD; pytest green.
  • AC3test_tag_split_matches_prove_it_report asserts the live per-tag split (currently (79, 13, 35)) and the three sub-counts sum to the total in AC2. Evidence: assertion tuple == parser-computed (unit, integration, e2e); pytest green.
  • AC4 — The docstrings in both updated tests describe the current contract (no leftover "108"/"75/8/25" derivations that contradict the new assertions). Evidence: grep -nE "108|75 @unit|\(75, 8, 25\)" python/tests/voice/test_feature_file_contract.py returns no matches anywhere in the file (the file has no other legitimate use of those literals at fix HEAD — confirmed; a file-wide grep is therefore an exact proxy for "no stale derivation in the two tests").

Consequence & failure-mode coverage

  • AC5 (regression — adjacent guards still hold): The other three contract tests (test_feature_file_parses_cleanly, test_every_scenario_has_at_least_one_given_and_one_then, test_every_scenario_is_tagged_unit_integration_or_e2e) continue to pass unchanged — the fix touches only the two count assertions, not the parsing or tagging guards. Evidence: full-file pytest run shows all 5 green; git diff touches only the two count tests + their docstrings.
  • AC6 (regression — CI baseline restored): python-ci test (3.12) job is green on the PR branch, so the gate that feat(typescript-sdk): voice agent testing — consolidated clean stack #561/fix(voice/#602): migrate OpenAIRealtimeAgentAdapter to GA Realtime wire protocol #604/chore(main): release javascript 0.4.12 #386 left red is cleared for subsequent PRs. Evidence: GitHub checks rollup on the PR showing python-ci ✅.
  • AC7 (downstream / dangling-reference reconciliation): The fix does not leave assertion messages or test names instructing engineers to update docs/proposals/issue-350-prove-it-report.md while that file is absent from the repo. The references are removed or softened so the test no longer prescribes regenerating a nonexistent file. Creating a stub report to satisfy the reference is explicitly rejected — that would re-introduce the original drift (a file the test points at that nobody maintains), the same anti-pattern AC8 rejects for dynamic derivation. Evidence: grep -rn "issue-350-prove-it-report.md" python/tests/voice/test_feature_file_contract.py shows every remaining mention (if any) is descriptive prose, not an actionable "regenerate this file" instruction; no new file is created under docs/proposals/ (shown in git diff --stat).
  • AC8 (failure-mode — the tripwire still fires on real drift): The test remains a deliberate tripwire, not auto-derived: any scenario-count delta in specs/voice-agents.feature without a matching constant update still fails test_feature_file_declares_expected_scenario_count with the actionable message "If this is an intentional contract change, update the count …". Evidence: a local mutation check confirms the test fails with that message on a count delta, then the mutation is reverted. (Proof step only; no committed change — mechanism left to the prover.)

Plan

Approach: A two-constant test fix plus a dangling-reference cleanup, all in one file (python/tests/voice/test_feature_file_contract.py). Re-derive the live numbers from specs/voice-agents.feature at fix HEAD, update the two stale assertions (108→live total, (75,8,25)→live split) and their explanatory docstrings, soften the references to the nonexistent prove-it report so the test no longer prescribes regenerating a file that isn't in the repo, then verify the full 5-test suite green locally and via python-ci. No production code changes; no feature-file changes — the 127 scenarios are correct, the guard drifted.

Sequencing:

  1. Re-derive at HEAD. Run uv run pytest tests/voice/test_feature_file_contract.py and capture the failure actuals; cross-check the total against grep -cE "^\s*Scenario:" specs/voice-agents.feature. This is the source of truth for the constants — not the 127/(79,13,35) snapshot in the Investigation, which could be stale if another PR landed. Completion signal: two confirmed actual tuples that are internally consistent (split sums to total). (Satisfies the AC2/AC3 "live count" requirement at the right HEAD.)
  2. Update the two constants + their docstrings. Edit the == 108 assertion + its message, the == (75, 8, 25) assertion + its message, and the two docstrings so their narrated breakdowns describe the current contract (no leftover "108 / 75-8-25" derivation). Completion signal: git diff shows only those two tests + docstrings changed. (AC2, AC3, AC4.)
  3. Reconcile the dangling prove-it-report references. Soften/remove the "regenerate docs/proposals/issue-350-prove-it-report.md" instructions in the two assertion messages and docstrings so they no longer prescribe touching a file absent from the repo. Do not create a stub report. Completion signal: grep -rn "issue-350-prove-it-report.md" in the test shows only descriptive prose (if anything); git diff --stat creates no file under docs/proposals/. (AC7.)
  4. Verify + prove the tripwire. Run the full file → 5 passed. Run a throwaway mutation (append a dummy Scenario: to the feature file) to confirm the count test still fails with its actionable message, then revert. Confirm python-ci test (3.12) green on the PR. Completion signal: 5 passed locally; mutation reverted; CI rollup ✅. (AC1, AC5, AC6, AC8.)

Key decisions:

  • Keep the hardcoded literals; do NOT auto-derive the counts. The test is a deliberate tripwire — its docstring says "Any change to this count must be a deliberate contract update." Auto-deriving would make it tautological and silently swallow future scenario drift (which is exactly the failure mode AC8 guards against). Rationale recorded so a reviewer doesn't propose the "cleaner" auto-derive.
  • Soften the prove-it-report references rather than create the report. The file never existed in any history; creating a stub would re-introduce the original drift (a referenced-but-unmaintained file). Reword to descriptive prose instead.
  • Re-derive numbers at fix HEAD, not from this issue body. Guards against a concurrent feature-file change between investigation and fix.

Risks / open questions:

  • A concurrent PR changes specs/voice-agents.feature before this lands → constants must match the new HEAD. Mitigated by Phase 1 re-derivation; the ACs are phrased as "the live count," not the literal 127.
  • The exact docstring rewording (full +19/(4,5,10) attribution vs. a simplified current-contract statement) is a style choice left to /spec/implementation; either satisfies AC4.

Out of scope:

  • Auto-deriving the counts / restructuring the test into a dynamic check (changes the tripwire's intent — explicitly rejected above).
  • Touching specs/voice-agents.feature (the 127 scenarios are correct).
  • The second dangling proposal-doc reference at python/scenario/voice/adapters/twilio.py:17 (issue-350-ralph-real-transports.md, also absent) — same class, but production code and unrelated to the CI-red failure. Tracked as a separate follow-up issue.
  • Creating docs/proposals/issue-350-prove-it-report.md (rejected — would re-introduce drift).

Classification: Bug
Status: investigation + plan complete — ACs validated (ac-reviewer: zero Must-Fix). Proceeding to /spec.

Metadata

Metadata

Assignees

Labels

P1 - highHigh priority, work on this soonblockedWorkflow: blockedbugSomething isn't working

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions