main python-ci is red and blocking CI for all scenario PRs. Two voice feature-file contract tests assert hardcoded scenario counts that the test suite has since outgrown.
Expected
tests/voice/test_feature_file_contract.py passes on main; python-ci is green so PRs can merge against a passing baseline.
Actual
Two tests fail:
test_feature_file_declares_expected_scenario_count — asserts len(scenarios) == 108, but the feature files now contain 127.
test_tag_split_matches_prove_it_report — asserts (unit, integration, e2e) == (75, 8, 25), but the actual split is (79, 13, 35).
Both assertion messages prescribe the remedy: update the count(s) in the test and regenerate docs/proposals/issue-350-prove-it-report.md.
Reproduction
On main (latest 1c5a66c7), run the python-ci test (3.12) job — or cd python && uv run pytest tests/voice/test_feature_file_contract.py.
Root cause: the voice/ts-parity work that landed in PR feat(typescript-sdk): voice agent testing — consolidated clean stack #561 (merge commit 5847c4b4, 2026-06-04 15:10Z) added voice scenarios (ts-pipecat / LiveKit adapter scenarios) without updating the hardcoded counts in the contract test or the prove-it report.
main went red at 5847c4b4 and has stayed red across the subsequent merges (3765f3c5, 1c5a66c7) — ~3.5h with no fix in flight.
Investigation
Root cause confirmed (BLUF): python/tests/voice/test_feature_file_contract.py hardcodes two stale snapshots of specs/voice-agents.feature: a total scenario count (108) and a per-tag split (75 @unit / 8 @integration / 25 @e2e). PR #561 (5847c4b4) grew the feature file from 108 → 127 scenarios without touching these constants, so two assertions now fail and main python-ci is red. The fix is purely updating the constants (and their explanatory docstrings) to the live values 127 and (79, 13, 35). No production code is involved; this is a guard test that drifted behind the contract it guards.
Phase 2 early-exit: unambiguous defect, high confidence — both failures reproduced locally with exact actuals matching the issue body, and the 108→127 delta is git-bisected to a single commit. Phase 4 strategy enumeration was not exhausted because the defect is a literal stale-constant mismatch, not a design choice.
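Evidence (run this turn, worktree issue609/..., HEAD 1c5a66c7)
test_feature_file_declares_expected_scenario_count: AssertionError: Expected 108 scenarios; found 127. … assert 127 == 108
test_tag_split_matches_prove_it_report: AssertionError: Expected 75 @unit / 8 @integration / 25 @e2e; found 79 / 13 / 35. … assert (79, 13, 35) == (75, 8, 25)
git show 5847c4b4~1:specs/voice-agents.feature | grep -c "^\s*Scenario:" → 108; git show 5847c4b4:specs/voice-agents.feature | grep -c … → 127. The break landed entirely in PR #561.
#594 (f8c56219), #497 (bb4ff9bb), #355 (128ac947, original add): PR #561 is not among them. PR #561 added TS-parity scenarios to the shared feature file (its commit body cites "13 scenarios bound … via vitest-cucumber", "+5 @ts-effects", "+7 adapters") but never reconciled the Python-side count guard.
Split is internally consistent: 79 + 13 + 35 = 127, so the new total and the new split agree. (Every scenario is tagged exactly one of the three; test_every_scenario_is_tagged_unit_integration_or_e2e passes, confirming no untagged or double-counted scenarios.)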
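The consistency claim, and the growth attributed to PR #561, can be checked with a few lines of arithmetic. This is a sketch using only the numbers quoted in this issue; the variable names are illustrative and do not come from the test file.

```python
# Numbers are taken from the issue text; the variable names are illustrative.
old_total, new_total = 108, 127
old_split = (75, 8, 25)   # (unit, integration, e2e) before PR #561
new_split = (79, 13, 35)  # (unit, integration, e2e) at HEAD 1c5a66c7

# Each snapshot is internally consistent: the split sums to the total.
assert sum(old_split) == old_total
assert sum(new_split) == new_total

# Per-tag growth and overall growth between the two snapshots.
delta_split = tuple(new - old for new, old in zip(new_split, old_split))
delta_total = new_total - old_total

print(delta_split, delta_total)  # (4, 5, 10) 19
assert sum(delta_split) == delta_total
```

The per-tag deltas sum to the total delta, which is the +19 / +(4,5,10) attribution mentioned elsewhere in this issue.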
Key finding that refines the fix surface
The prove-it report the test docstrings tell you to "regenerate" does not exist. Both assertion messages and the test_tag_split_matches_prove_it_report name reference docs/proposals/issue-350-prove-it-report.md. That file:
is absent from the working tree, this branch, origin/main, and all of --all history (git cat-file -e origin/main:docs/proposals/issue-350-prove-it-report.md → "does not exist"; git log --all --diff-filter=A -- '**/issue-350-prove-it-report.md' → empty);
in fact docs/proposals/ is not a tracked directory at all in the working tree; a sibling doc docs/proposals/issue-350-ralph-real-transports.md is referenced by python/scenario/voice/adapters/twilio.py:17 but is also absent.
So the issue body's prescribed remedy ("update the count(s) and regenerate the prove-it report") is half-moot: there is no report to regenerate. The references to it in the test are dangling. This is why the ACs below treat the report as "if-present" and add a regression AC to forbid re-introducing a stale constant without a single source of truth.
Strategies considered
| Strategy | Result |
| --- | --- |
| Update the two hardcoded constants + their docstring breakdowns to 127 / (79,13,35) | Works: directly clears both failures; matches the live, internally consistent contract. Minimal surface. |
| Regenerate docs/proposals/issue-350-prove-it-report.md per the docstrings | Not applicable: file never existed; nothing to regenerate. References are dangling. |
| Derive the counts dynamically (compute from the parsed feature, drop the hardcoded literals) | Out of scope / changes intent. The test is deliberately a tripwire: it forces a human decision whenever the contract changes (docstring: "Any change to this count must be a deliberate contract update"). Auto-deriving would make it tautological and silently absorb future scenario drift, the opposite of its purpose. Noted for the implementer to reject unless the user re-scopes. |
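To make the "tautological" objection concrete, here is a minimal sketch contrasting a hardcoded tripwire with an auto-derived check. It assumes a simplified regex counter in place of the real Gherkin parser, and every name in it (count_scenarios, EXPECTED_SCENARIOS, and so on) is hypothetical rather than taken from the test file.

```python
import re


def count_scenarios(feature_text: str) -> int:
    # Hypothetical stand-in for the Gherkin parser: counts "Scenario:" headers.
    return len(re.findall(r"^\s*Scenario:", feature_text, flags=re.MULTILINE))


EXPECTED_SCENARIOS = 2  # hardcoded contract value (127 in the real test)


def tripwire_passes(feature_text: str) -> bool:
    # Compares the live count against a literal that a human must update.
    return count_scenarios(feature_text) == EXPECTED_SCENARIOS


def auto_derived_passes(feature_text: str) -> bool:
    # Derives the expectation from the same input it checks, so it is always true.
    expected = count_scenarios(feature_text)
    return count_scenarios(feature_text) == expected


baseline = "Feature: demo\n  Scenario: a\n  Scenario: b\n"
drifted = baseline + "  Scenario: c\n"

print(tripwire_passes(baseline), tripwire_passes(drifted))          # True False
print(auto_derived_passes(baseline), auto_derived_passes(drifted))  # True True
```

Only the hardcoded variant notices the added scenario, which is the behaviour the docstring asks for.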
Findings for the implementer
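Touchpoints to change (all in python/tests/voice/test_feature_file_contract.py):
Scenario-count assertion: == 108 → == 127; L63 message string "Expected 108".
Tag-split assertion: == (75, 8, 25) → == (79, 13, 35); L122 message "Expected 75 @unit / 8 @integration / 25 @e2e".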
The docstrings (L47–59 and L108–113) narrate the old breakdown ("108 scenarios: 83 original + 4 + 12 + 3 + 3 + 3", "75 @unit / 8 @integration / 25 @e2e … Demo recordings add 1 @unit + 2 @integration …"). These are now wrong and must be updated so the next reader isn't misled. Either reflect the exact +19 / +(4,5,10) attribution to PR #561's scenario set, or simplify the docstring to state the current contract without a stale derivation.
Dangling-report references (L59, L65, L124, and the test name test_tag_split_matches_prove_it_report): decide with the user whether to (a) drop the "regenerate the prove-it report" instruction since the file doesn't exist, or (b) keep the reference as an aspirational TODO. Do not silently leave assertion messages telling future engineers to update a file that isn't there; that is how this drift started.
Ruled out: dynamic count derivation (changes the test's tripwire intent — see table); touching the feature file (the 127 scenarios are correct; the test is stale, not the spec); touching any production scenario/ code (none involved).
No migration / data / external-call surface. Pure test-constant edit; rollback is a one-line git revert. Rollback AC omitted (no irreversible operation).
Caveats
Verify the live numbers at fix time with a fresh uv run pytest tests/voice/test_feature_file_contract.py — if another PR lands between now and the fix and changes the feature file again, the constants must match that HEAD, not the 127/(79,13,35) snapshot captured here.
The two raw grep -c "@integration" / grep -c "@e2e" line counts over the feature file give 14 / 39, which differ from the test's 13 / 35. The discrepancy is expected: the test counts scenarios carrying the tag via the Gherkin parser, while grep -c counts every line that contains the text, which can include feature/Rule-level tag lines and longer tags that merely start with the same characters. (A literal "@e2e" pattern does not match @ts-e2e, so that tag cannot account for the difference.) The test's parser-based numbers (79, 13, 35) are authoritative; do not "correct" them to the grep values.
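The gap between a line count and a scenario-level tag count can be reproduced on a toy input. The sketch below is illustrative only: the feature text, the @e2e-smoke tag, and both helper functions are hypothetical, and the tag counter is a simplified stand-in for the Gherkin parser used by the real test.

```python
FEATURE = """\
Feature: illustrative only
  # @e2e scenarios below need a live transport

  @unit
  Scenario: first
    Given a thing
    Then it works

  @e2e
  Scenario: second
    Given a thing
    Then it works

  @integration @e2e-smoke
  Scenario: third
    Given a thing
    Then it works
"""


def grep_c(text: str, needle: str) -> int:
    """Mimic `grep -c`: count the lines that contain the needle."""
    return sum(1 for line in text.splitlines() if needle in line)


def scenarios_tagged(text: str, tag: str) -> int:
    """Count scenarios whose own tag line carries exactly `tag`."""
    count = 0
    pending_tags: set[str] = set()
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("@"):
            pending_tags |= set(line.split())
        elif line.startswith("Scenario:"):
            count += tag in pending_tags
            pending_tags = set()
    return count


print(grep_c(FEATURE, "@e2e"), scenarios_tagged(FEATURE, "@e2e"))  # 3 1
```

Here the comment line and the longer @e2e-smoke tag both inflate the line count, while only one scenario actually carries @e2e.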
Acceptance Criteria
AC1 — cd python && uv run pytest tests/voice/test_feature_file_contract.py exits 0 with all 5 tests passing (previously red on main). Evidence: pytest stdout showing 5 passed.
AC2 — test_feature_file_declares_expected_scenario_count asserts the live scenario count of specs/voice-agents.feature (currently 127), not a stale literal. Evidence: the assertion's expected value equals grep -cE "^\s*Scenario:" specs/voice-agents.feature at the fix HEAD; pytest green.
AC3 — test_tag_split_matches_prove_it_report asserts the live per-tag split (currently (79, 13, 35)) and the three sub-counts sum to the total in AC2. Evidence: assertion tuple == parser-computed (unit, integration, e2e); pytest green.
AC4 — The docstrings in both updated tests describe the current contract (no leftover "108"/"75/8/25" derivations that contradict the new assertions). Evidence: grep -nE "108|75 @unit|\(75, 8, 25\)" python/tests/voice/test_feature_file_contract.py returns no matches anywhere in the file (the file has no other legitimate use of those literals at fix HEAD — confirmed; a file-wide grep is therefore an exact proxy for "no stale derivation in the two tests").
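For reviewers who prefer to script the AC4 evidence, the grep can be mirrored in Python. This is a sketch, not part of the repository: stale_lines and the two sample strings are hypothetical, and only the regular expression is copied from the AC.

```python
import re

# Same alternation as the grep -nE pattern quoted in AC4.
STALE = re.compile(r"108|75 @unit|\(75, 8, 25\)")


def stale_lines(source: str) -> list[tuple[int, str]]:
    """Return (1-based line number, line) pairs that still mention a stale value."""
    return [
        (number, line)
        for number, line in enumerate(source.splitlines(), start=1)
        if STALE.search(line)
    ]


# Hypothetical excerpts standing in for the test file before and after the fix.
before = 'assert len(scenarios) == 108, "Expected 108 scenarios"\nassert split == (75, 8, 25)\n'
after = 'assert len(scenarios) == 127, "Expected 127 scenarios"\nassert split == (79, 13, 35)\n'

print(len(stale_lines(before)), len(stale_lines(after)))  # 2 0
```

Like the grep, the bare 108 alternative would also match inside a longer number, which is acceptable only because the file has no other use of those digits.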
Consequence & failure-mode coverage
AC5 (regression — adjacent guards still hold): The other three contract tests (test_feature_file_parses_cleanly, test_every_scenario_has_at_least_one_given_and_one_then, test_every_scenario_is_tagged_unit_integration_or_e2e) continue to pass unchanged — the fix touches only the two count assertions, not the parsing or tagging guards. Evidence: full-file pytest run shows all 5 green; git diff touches only the two count tests + their docstrings.
AC6: The python-ci test (3.12) job is green on the PR branch, so the gate that #561 / #604 / #386 left red is cleared for subsequent PRs. Evidence: GitHub checks rollup on the PR showing python-ci ✅.
AC7 (downstream / dangling-reference reconciliation): The fix does not leave assertion messages or test names instructing engineers to update docs/proposals/issue-350-prove-it-report.md while that file is absent from the repo. The references are removed or softened so the test no longer prescribes regenerating a nonexistent file. Creating a stub report to satisfy the reference is explicitly rejected — that would re-introduce the original drift (a file the test points at that nobody maintains), the same anti-pattern AC8 rejects for dynamic derivation. Evidence: grep -rn "issue-350-prove-it-report.md" python/tests/voice/test_feature_file_contract.py shows every remaining mention (if any) is descriptive prose, not an actionable "regenerate this file" instruction; no new file is created under docs/proposals/ (shown in git diff --stat).
AC8 (failure-mode — the tripwire still fires on real drift): The test remains a deliberate tripwire, not auto-derived: any scenario-count delta in specs/voice-agents.feature without a matching constant update still fails test_feature_file_declares_expected_scenario_count with the actionable message "If this is an intentional contract change, update the count …". Evidence: a local mutation check confirms the test fails with that message on a count delta, then the mutation is reverted. (Proof step only; no committed change — mechanism left to the prover.)
Plan
Approach: A two-constant test fix plus a dangling-reference cleanup, all in one file (python/tests/voice/test_feature_file_contract.py). Re-derive the live numbers from specs/voice-agents.feature at fix HEAD, update the two stale assertions (108→live total, (75,8,25)→live split) and their explanatory docstrings, soften the references to the nonexistent prove-it report so the test no longer prescribes regenerating a file that isn't in the repo, then verify the full 5-test suite green locally and via python-ci. No production code changes; no feature-file changes — the 127 scenarios are correct, the guard drifted.
Sequencing:
Re-derive at HEAD. Run uv run pytest tests/voice/test_feature_file_contract.py and capture the failure actuals; cross-check the total against grep -cE "^\s*Scenario:" specs/voice-agents.feature. This is the source of truth for the constants — not the 127/(79,13,35) snapshot in the Investigation, which could be stale if another PR landed. Completion signal: two confirmed actual tuples that are internally consistent (split sums to total). (Satisfies the AC2/AC3 "live count" requirement at the right HEAD.)
Update the two constants + their docstrings. Edit the == 108 assertion + its message, the == (75, 8, 25) assertion + its message, and the two docstrings so their narrated breakdowns describe the current contract (no leftover "108 / 75-8-25" derivation). Completion signal: git diff shows only those two tests + docstrings changed. (AC2, AC3, AC4.)
Reconcile the dangling prove-it-report references. Soften/remove the "regenerate docs/proposals/issue-350-prove-it-report.md" instructions in the two assertion messages and docstrings so they no longer prescribe touching a file absent from the repo. Do not create a stub report. Completion signal: grep -rn "issue-350-prove-it-report.md" in the test shows only descriptive prose (if anything); git diff --stat creates no file under docs/proposals/. (AC7.)
Verify + prove the tripwire. Run the full file → 5 passed. Run a throwaway mutation (append a dummy Scenario: to the feature file) to confirm the count test still fails with its actionable message, then revert. Confirm python-ci test (3.12) green on the PR. Completion signal: 5 passed locally; mutation reverted; CI rollup ✅. (AC1, AC5, AC6, AC8.)
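One way to rehearse the throwaway mutation without touching the tracked feature file is to mutate a temporary copy. The sketch below is a variation on the step above, not the prescribed mechanism: it uses a regex counter as a stand-in for the Gherkin parser, all names are hypothetical, and it only shows that the scenario count moves. The prover still has to run pytest against a mutated file to see the actual assertion message.

```python
import re
import tempfile
from pathlib import Path


def count_scenarios(path: Path) -> int:
    # Simplified stand-in for the Gherkin parser used by the real test.
    return len(re.findall(r"^\s*Scenario:", path.read_text(), flags=re.MULTILINE))


def mutation_changes_count(feature: Path) -> bool:
    """Append a dummy scenario to a throwaway copy and report whether the count moved."""
    baseline = count_scenarios(feature)
    with tempfile.TemporaryDirectory() as tmp:
        mutated = Path(tmp) / feature.name
        dummy = "\n  Scenario: dummy mutation\n    Given nothing\n    Then nothing\n"
        mutated.write_text(feature.read_text() + dummy)
        return count_scenarios(mutated) == baseline + 1


# Demo on a tiny hypothetical feature file; the original is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.feature"
    demo.write_text("Feature: demo\n  Scenario: a\n    Given x\n    Then y\n")
    detected = mutation_changes_count(demo)
    unchanged = count_scenarios(demo) == 1

print(detected, unchanged)  # True True
```

Working on a copy removes the need for a revert step, at the cost of having to point the test at the copy when the real assertion message is wanted.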
Key decisions:
Keep the hardcoded literals; do NOT auto-derive the counts. The test is a deliberate tripwire — its docstring says "Any change to this count must be a deliberate contract update." Auto-deriving would make it tautological and silently swallow future scenario drift (which is exactly the failure mode AC8 guards against). Rationale recorded so a reviewer doesn't propose the "cleaner" auto-derive.
Soften the prove-it-report references rather than create the report. The file never existed in any history; creating a stub would re-introduce the original drift (a referenced-but-unmaintained file). Reword to descriptive prose instead.
Re-derive numbers at fix HEAD, not from this issue body. Guards against a concurrent feature-file change between investigation and fix.
Risks / open questions:
A concurrent PR changes specs/voice-agents.feature before this lands → constants must match the new HEAD. Mitigated by the first sequencing step (re-derive at HEAD); the ACs are phrased as "the live count," not the literal 127.
The exact docstring rewording (full +19/(4,5,10) attribution vs. a simplified current-contract statement) is a style choice left to /spec/implementation; either satisfies AC4.
Out of scope:
Auto-deriving the counts / restructuring the test into a dynamic check (changes the tripwire's intent — explicitly rejected above).
Touching specs/voice-agents.feature (the 127 scenarios are correct).
The second dangling proposal-doc reference at python/scenario/voice/adapters/twilio.py:17 (issue-350-ralph-real-transports.md, also absent) — same class, but production code and unrelated to the CI-red failure. Tracked as a separate follow-up issue.
Creating docs/proposals/issue-350-prove-it-report.md (rejected — would re-introduce drift).
Classification: Bug
Status: investigation + plan complete; ACs validated (ac-reviewer: zero Must-Fix). Proceeding to /spec.