Skip to content

fix(voice/ts): explicit EL ConvAI turn-commit so scripted next-turn receive re-engages#596

Open
drewdrewthis wants to merge 175 commits into
mainfrom
fix/567-el-convai-next-turn
Open

fix(voice/ts): explicit EL ConvAI turn-commit so scripted next-turn receive re-engages#596
drewdrewthis wants to merge 175 commits into
mainfrom
fix/567-el-convai-next-turn

Conversation

@drewdrewthis
Copy link
Copy Markdown
Collaborator

@drewdrewthis drewdrewthis commented Jun 1, 2026

Why

Addresses the core of #567 (the next-turn / post-interrupt receiveAudio timeout). The ElevenLabs hosted ConvAI adapter timed out on receiveAudio for scripted next-turn (and post-interrupt) cases: EL ConvAI 2.0's hybrid VAD + DL turn-detector doesn't treat a fixed ~333ms zero-byte silence tail as a deterministic end-of-turn on a scripted (non-mic) stream, so turn 2+ intermittently never re-engaged. This is the blocker that recently made a TS voice-demo attempt abandon hosted EL ConvAI and fall back to OpenAI Realtime — it's fixable, not an EL-side dead end.

Stacked on #561 (voice/372-refactor) — the voice stack isn't on main yet, so this targets that branch, not main. Merge after / into #561.

What changed

  • Explicit turn-commit over silence-VAD. Verified against both official EL SDKs that ConvAI exposes no audio-flush/commit event; user_message ({"type":"user_message","text":<transcript>}) is the only client→server event that deterministically forces an agent response without mic-VAD (and is the SDK's own wire shape, so it won't 400 the socket).
  • New adapter options turnCommitMode: "text" | "silence" (default "text") + configurable silenceTailBytes. Text mode sends only the user_message commit (audio is still recorded locally; EL echoes user_transcript, so lastUserTranscript/observability stay intact). "silence" preserves the legacy path; "text" with no transcript falls back to silence.
  • Untouched: ping/pong, transcript capture, agent_response_correction, drainPendingWaiters, capabilities, the wsFactory test seam.

Test plan

  • Unit (src/voice/__tests__/elevenlabs-turn-commit.test.ts, injectable fake socket): scripted 2nd user turn drives a 2nd receiveAudio resolution; exact user_message shape asserted; post-interrupt re-engages; both silence fallbacks. 7 pass; full voice suite 179 pass (21 files); examples typecheck clean.

How I can prove I was successful

  • Live, real EL socket, ≥2 exchanges, 3 consecutive clean runs (no retries, no timeouts): examples/vitest/tests/voice/elevenlabs-hosted.test.ts extended to greeting→user→agent→user→agent→judge. success=true, 5 segments, 3 agent / 2 user turns, 14.24s. The previously-timing-out 2nd turn now answers. (A first revision that streamed audio+text together flaked 2/3 — that drove the text-only commit; recorded here as the honest negative.)

Anything surprising?

  • Python parity (shipped, commit a878cc1): python/scenario/voice/adapters/elevenlabs.py had the identical bug (16000-byte tail, no commit); ported the same user_message turn-commit + turn_commit_mode/silence_tail_bytes options, mirroring the TS change. Live-verified ≥2 exchanges against the real EL socket (agent answered turn 2); unit 7 pass / 74 no-regression. TS + Python ship together in this PR.
  • Remaining fix(typescript-sdk/voice): harden ElevenLabs ConvAI post-interrupt / next-turn receive #567 scope (NOT in this PR): un-gating elevenlabs_interruption (still RUN_EL_INTERRUPTION=1-gated) needs live post-interrupt verification not yet done — tracked as follow-up, which is why this PR addresses rather than fully closes fix(typescript-sdk/voice): harden ElevenLabs ConvAI post-interrupt / next-turn receive #567. (elevenlabs_hosted IS extended to ≥2 exchanges here.)
  • Not un-gated: elevenlabs_interruption (the issue's secondary ask) — the post-interrupt path is covered by the unit test and the same commit should fix it, but I didn't live-run it. Worth a follow-up.

drewdrewthis and others added 30 commits May 31, 2026 15:19
…state)

Engineering Design Record for the TypeScript voice port (#372): the
inside-the-box design the PRD (API proposal) never specified. Pairs the
module tree + per-module contract catalog (target vs as-built gap analysis
across the voice PR series) with ADR-002, which moves STT/TTS provider
state off a module-global singleton onto per-run ScenarioConfig.voice
(the only per-run carrier that reaches AgentAdapter.call), removes the
invented scenario.configure({stt}) surface, and standardizes one in-message
audio format (fixing a live WAV-vs-PCM decode mismatch).

Spec only — no runtime change. The clean voice stack is built against this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports python/scenario/voice/{tts,stt,_transcribe}.py to TypeScript and
exposes scenario.configure({ stt }) for swapping the default STT provider.

- voice/tts.ts: synthesize(text, voice, effectFn?) + LRU(64) keyed on
  sha256(text)+voice. Effects apply AFTER cache hit per the locked
  decision; raw text never reaches the cache payload.
- voice/stt.ts: STTProvider interface, OpenAISTTProvider default
  (gpt-4o-transcribe) with 25-minute chunking, ElevenLabsSTTProvider,
  setSttProvider / getSttProvider for swap. Pure-TS pcm16-to-wav
  encoder — no transcription-only ffmpeg dep.
- voice/transcribe.ts: transcribeSegments — post-hoc, idempotent
  per-segment, degrades gracefully when no provider is configured.
- config/configure.ts: scenario.configure({ stt }) entry point.

Tests in follow-up commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- tts.test.ts: cache key is (sha256(text), voice); effects apply AFTER
  cache hit (third call with new effect reads ORIGINAL cached PCM, not
  effect-baked bytes).
- stt.test.ts: default model = gpt-4o-transcribe; provider swap via
  setSttProvider; STTProvider interface minimal (no OpenAI types leak);
  >25-min audio splits into sub-chunks with concatenated transcripts.
- transcribe.test.ts: transcribeSegments fills missing transcripts in
  place, skips already-filled segments; missing STT degrades gracefully
  with a warning and never raises.
- configure.test.ts: scenario.configure({ stt }) round-trips a custom
  provider; null clears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cucumber

Retrofits PR #513's hand-rolled tests so the 7 scenarios they claim to
cover actually load and execute against specs/voice-agents.feature via
@amiceli/vitest-cucumber, matching the pattern landed by #517.

Scenarios tagged @ts-tts, @ts-stt, @ts-transcribe (domain-specific sub-tags
alongside @Unit) so each test file's includeTags filter targets exactly
the scenarios it owns without disturbing voice-contract-surface.test.ts
(which uses @ts-bound for the original 5 scenarios from PR1).

- tts.test.ts: loadFeature + describeFeature({ includeTags: ["ts-tts"] })
  binding "TTS cache key is (text, voice) only and effects apply after cache hit"
- stt.test.ts: loadFeature + describeFeature({ includeTags: ["ts-stt"] })
  binding 4 STT scenarios: default gpt-4o-transcribe, provider swap,
  minimal interface, >25-min chunking
- transcribe.test.ts: loadFeature + describeFeature({ includeTags: ["ts-transcribe"] })
  binding transcribe_segments fills-in-place + missing STT degrades gracefully

Refs #516 (spec-binding retrofit), #517 (PR-A, merged), #372 (slice plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two /review must-fixes:

1. transcribe.test.ts had `void transcribeSegments(...).then(expect...)`
   inside a synchronous Then callback. The promise resolved after the
   step completed, so any assertion failure was silently swallowed by
   vitest. Made the Then async and awaited the call directly.

2. Doc-comment headers in stt/tts/transcribe.test.ts incorrectly cited
   `@ts-bound`. Updated to cite each file's actual tag (`@ts-stt`,
   `@ts-tts`, `@ts-transcribe`) so the next reader doesn't get misled.
   Note: transcribe.test.ts header already said `@ts-transcribe`
   correctly; only stt.test.ts and tts.test.ts needed updating.

Reviewer convergence (3x on #1, 2x on #2): test + principles + hygiene
+ principles.

Refs #516, #517, #513.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…VAD fallback (WIP)

PR3 of N for #372. Builds on PR1 (#511) types.

- Port `python/scenario/voice/adapter.py` runtime to `voice/adapter.runtime.ts`:
  * `asyncio.Event` -> `AgentSpeakingEvent` (Promise + resolve ref)
  * `async with` -> explicit `startVoiceAdapters` / `stopVoiceAdapters`
  * Default `call()` body: send -> drain on tail silence -> record -> return
  * Hook fan-out for `onAudioChunk` / `onVoiceEvent`
- Port `python/scenario/voice/vad.py` -> `voice/vad.ts`:
  * `WebRTCVadFallback` with one-shot warning per adapter (matches Python
    `_warned_adapters` memoisation, no rate-limit regression)
  * Activates only when `adapter.capabilities.nativeVad === false`
  * Pure-TS RMS energy + hysteresis detector ships today; webrtcvad
    C-library build pipeline is the decision-pending item.
- Patch `execution/scenario-execution.ts`:
  * Implement `VoiceExecutorState` structurally (Decision 1(b) from #372)
  * Pick voice adapters at run start; connect inside try, disconnect in
    finally so the spec-148-145 "regardless of pass/fail/exception"
    contract holds.
  * Wire `onAudioChunk` / `onVoiceEvent` from `ScenarioConfig`.
- Add `voice/__tests__/fixtures/fake-adapter.ts`: in-memory adapter, no
  real transport. Tests use this exclusively.
- Tests (vitest, bound to `specs/voice-agents.feature`):
  * `adapter-lifecycle.test.ts` lines 138-145
  * `hooks.test.ts` lines 449-461
  * `vad-fallback.test.ts` lines 772-791

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… fail-on-call fixture

- ScenarioExecution.reset() recreated ScenarioExecutionState, losing the
  setExecutor linkage from the constructor. Voice adapters reaching
  input.scenarioState._executor would see null for the rest of the run,
  so hook fan-out / recorder never wrote into voice state. Re-attach in
  reset() so the linkage survives.
- FakeVoiceAdapter gains a failOnCall option — cleaner than spawning a
  second AGENT-role agent that would compete with the fake adapter for
  the agent() step (the executor picks the first role-matching agent).
- All 4 voice test files now green (21/21 voice tests, 381/381 total).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… vitest-cucumber

Retrofits PR #515's hand-rolled tests for adapter lifecycle, hooks, and
VAD fallback to actually load and execute specs/voice-agents.feature
via @amiceli/vitest-cucumber, matching the pattern landed by #517 and
#513.

Tags by test file (per-file tagging needed because vitest-cucumber v6
fails the suite for scenarios that match a file's includeTags but
aren't bound in that file):

- @ts-adapter: connect/disconnect fires per-scenario
- @ts-hooks: on_audio_chunk and on_voice_event fire
- @ts-vad: VAD fallback / native-VAD does not trigger / one-shot warning

Key implementation note: vitest-cucumber v6 runs each Given/When/Then
step as a separate vitest it(). Module-level beforeEach/afterEach hooks
fire around each step, not around the whole scenario. For scenarios that
need to assert on console.warn calls across step boundaries, the spy is
installed locally within the When step and captured warn messages are
carried via closure-scoped variables into Then/And — avoiding the
floating-promise and spy-reset antipatterns.

Refs #516 (spec-binding retrofit), #517 (PR-A, merged), #513 (PR-B,
ready for review), #372 (slice plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three /review must-fixes:

1. vad-fallback.test.ts: replaced the closure-capture spy pattern with
   the library's BeforeEachScenario/AfterEachScenario hooks. The
   coder's earlier workaround was based on the false belief that
   vitest-cucumber lacked scenario-level lifecycle hooks. The hooks
   exist (verified at @amiceli/vitest-cucumber 6.5.0
   describe-feature.js:311-322). BeforeEachScenario fires via
   beforeAll inside the scenario describe block — once per scenario,
   not per step. Spy is shared; capturedWarnCalls accumulates across
   steps within the same scenario. Removed ~28 lines of SPY STRATEGY
   prose comments.

2. hooks.test.ts: extracted the "throwing hook doesn't break scenario"
   check from inside the on_voice_event scenario's When step. It was
   asserting behavior the bound feature scenario didn't claim. Now a
   plain it() block outside describeFeature. Option (a) chosen: no
   spec scenario exists for this behavior in voice-agents.feature.

3. adapter-lifecycle.test.ts: split 5 sub-cases out of one packed And
   step. Kept only the happy-path disconnect assertion in the bound
   And step (disconnect fires once on success). Lifted fail/throw/
   multi-adapter/disconnect-swallow to 4 plain it() blocks. Option (b)
   chosen: specs/voice-agents.feature line 143 names the And step as a
   single AC ("regardless of pass/fail/exception") — the 4 sub-cases
   are implementation-level guarantees not individually specced.

Reviewer convergence: principles + test (3x). Refs #516, #517, #513, #515.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…udio messages (PR4 of N)

Ports the python voice path for simulator and judge to TypeScript:

- javascript/src/voice/messages.ts: createAudioMessage/extractAudio/
  messageHasAudio helpers using the local AudioMessageParam type.
  No openai package import — uses messages.types.ts (Decision 2(b)).
- javascript/src/agents/user-simulator-agent.ts: voice config triggers
  audio-message emission; per-step voice + per-step audio_effects +
  persona composition. stripAudioContent keeps LLM calls text-only.
- javascript/src/agents/judge/judge-agent.ts: JudgeAgent exported as class
  with static conversationHasAudio; effectiveIncludeAudio/Timeline/Traces
  helpers; auto-detect multimodal model via model name substrings;
  include_audio=false escape hatch.

13 scenarios bound to specs/voice-agents.feature via vitest-cucumber:
- 5 simulator scenarios (@ts-simulator)
- 7 judge scenarios (@ts-judge)
- 1 assistant-role scenario (@ts-assistant-role)

Tag convention: per-subject (@ts-simulator / @ts-judge / @ts-assistant-role)
instead of @ts-bound to avoid colliding with PR1's voice-contract-surface
test (which uses includeTags: ["ts-bound"] and would over-match new
scenarios). Per-file tagging is established by #513/#515; tag-convention
decision tracked at #523.

Refs #372 (slice plan), #517 (PR1 infra, merged), #513 (PR2, ready),

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… minor cleanups

/review surfaced 4 Must-Fix carry-forwards from prior PRs:

1. "Per-step voice override applies to only that step" scenario asserts
   no observable behavior — voiceStyle is set/cleared via setOneShotOverride
   but no TTS provider honors it. Spec retagged @todo (removed @ts-simulator)
   so future PRs that wire voiceStyle into _synthesize can re-bind. Test
   block removed. Honest absence beats paraphrase-as-binding. PR4 now binds
   12 scenarios (was 13).

2. voice-assistant-role.test.ts doc-comment claimed @integration but
   feature file tags @Unit. Fixed. Also fixed an internal comment that
   said "Python SDK" when the context was "TS SDK".

3. judge-voice.test.ts had 4-5 packed Then blocks (multi-model sub-cases
   stuffed into single bound Thens). Lifted sub-cases to plain it() blocks
   outside describeFeature; bound Thens now assert only spec-named behavior.

4. Hoisted mid-file zod import to top of judge-agent.ts.

Reviewer convergence: principles, hygiene, test. Refs #528, #516, #372.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… extensions (PR5 of N)

PR5 of the TS voice parity slice. Pure SDK orchestration — no external
service is touched, no UI runs. Wires the script-step DSL, interruption
config, recording runtime, and the optional ScenarioResult voice fields
behind the same contract surface the Python SDK already ships.

Adds:
  * javascript/src/script/voice-steps.ts — sleep, silence, audio, dtmf,
    interrupt (after-time + after-words), agent({ wait: false }),
    proceed({ interruptions, onTurn, onStep }), backgroundNoise.
    Imports from `@langwatch/scenario` script barrel as `voiceAgent` /
    `voiceProceed` so the existing positional `agent`/`proceed` stay
    untouched for callers.
  * javascript/src/voice/interruption.ts — InterruptionConfig class
    with shouldInterrupt / sampleDelay / pickRandomPhrase. RNG-pluggable
    so callers can pass a seeded PRNG for deterministic tests.
    CONTEXTUAL_PROMPT exported as a module-level constant.
  * javascript/src/voice/recording.runtime.ts — VoiceRecordingRuntime
    with WAV writer (native; canonical PCM16/24kHz/mono RIFF header) and
    MP3/OGG/FLAC via system ffmpeg subprocess. saveSegments() writes the
    segments dir + full.wav + JSON manifest. computeLatencyMetrics()
    aggregates avg/p50/p95 with ceiling-style p95.
  * ScenarioResult gains optional `audio`/`timeline`/`latency` fields —
    text-only runs leave them undefined (back-compat preserved).

Test files (all bound via vitest-cucumber against specs/voice-agents.feature):
  * src/script/__tests__/voice-steps.test.ts (11 scenarios, @ts-script-step)
  * src/voice/__tests__/interruption.test.ts (1 bound + 2 unit, @ts-interruption-cfg)
  * src/voice/__tests__/recording.runtime.test.ts (7 unit — not feature-bound)
  * src/voice/__tests__/result-extensions.test.ts (6 scenarios, @ts-result-ext)

Spec tags: @ts-script-step / @ts-interruption-cfg / @ts-result-ext sub-tags
scope each PR5 file's binding set; voice-contract-surface.test.ts now
uses excludeTags to keep ownership of the PR1 contract-surface set only.

Tsconfig: target=ES2022 so top-level await (vitest-cucumber pattern)
and `Set` iteration land without --downlevelIteration shims.

ffmpeg distribution decision pending — see PR body for options.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…faces

Addresses /review concerns on PR5:

- Lift voiceInterruptions + voiceBackgroundNoise onto VoiceExecutorState
  so voiceProceed/backgroundNoise write through the same typed contract
  the voice subsystem already commits to (Decision 1(b) of #372). Drops
  three `as unknown as { _voice* }` indirections from voice-steps.ts.
- Expose agentSpeakingEvent + streamingTranscript + sendDtmf on
  VoiceAgentAdapter as optional/abstractable members. dtmf() now calls
  adapter.sendDtmf() directly — adapters that claim capabilities.dtmf
  while skipping the method get a loud UnsupportedCapabilityError from
  the base class instead of a silent PCM synthesizer fallback.
- Add bounded timeout to waitForStreamingWords so a wedged adapter that
  never advances its transcript can't lock the script forever
  (mirrors waitForAgentSpeaking's pattern).
- audio() URL_LIKE error message no longer suggests "download the asset
  locally" when the input is already a file:// URI.
- recording.runtime.test.ts skips MP3 transcoding cleanly when ffmpeg is
  not on PATH (itIfFfmpeg guard).
- Drop the unused DTMF PCM-synth fallback now that capability-method
  coupling is enforced at the base class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s (PR6 of N)

Ports python/scenario/voice/effects/* to javascript/src/voice/effects/*:
- common.ts (EffectFn type, PCM16 <-> Int16Array helpers)
- noise.ts (backgroundNoise, static_, multipleVoices) + 5 bundled WAVs
- prosody.ts (lowVolume, highVolume, speakingFast, speakingSlow)
- quality.ts (phoneQuality via fft.js, lowQuality, packetLoss, echo, robotic, breakingUp)
- custom.ts (user-fn wrapper with type validation)
- index.ts barrel re-exporting static_ as static

Adds fft.js dep (FFT for phoneQuality bandpass). Updates tsup.config.ts
to cpSync src/voice/assets to dist/voice/assets; package.json files
includes src/voice/assets/** so WAVs ship in published npm package.
Bundle delta ~132KB (5 x 24KB WAVs + LICENSES) — under the 1MB budget.

Binds 5 scenarios in specs/voice-agents.feature with tag @ts-effects
(per-subject tag, NOT @ts-bound, to avoid collision with PR #517's
voice-contract-surface.test.ts that already owns @ts-bound; follows
PR #528 convention from issue #523).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review fanout flagged:
- effects unreachable via voice namespace (voice/index.ts had no re-export)
- TS2802 on [...BACKGROUND_PRESETS].sort() (Set iteration)
- require('fft.js') with manual type cast + eslint suppression
- conjugate-symmetry mirror hand-rolled instead of fft.completeSpectrum()
- 3 near-identical linearResample loops across noise/prosody/quality
- double static_/static export (pick one for the public name)

Fixes:
- voice/index.ts: export * as effects from './effects'
- effects.test.ts: regression assertion via voice namespace import
- noise.ts: Array.from() instead of spread; use linearResample helper
- quality.ts: import FFT from 'fft.js'; fft.completeSpectrum(); linearResample x2
- prosody.ts: linearResample helper
- common.ts: new linearResample(arr, newLen): Int16Array
- effects/index.ts: drop bare static_ re-export, keep only static alias
- effects.test.ts: JSDoc note that on_turn Scenario binding is a unit-level
  proxy for the runtime hook that lands in PR3 (#515)

pnpm -C javascript build: green
pnpm -C javascript test: 22 files / 392 tests pass
pnpm -C javascript typecheck: pre-existing TS1378 from PR #517 only; no
new errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…diom

Review nits from re-review of PR #537:
- public-API surface test asserted only 3 callables; iterate all 14 §4.5
  effects so a missing barrel re-export fails fast.
- prosody._resampleFactor wrapped linearResample with int16ToPcm16 while
  quality.lowQuality used `new Uint8Array(buf.buffer)`. The clip in
  int16ToPcm16 is a no-op on Int16Array input — use the zero-copy view
  in both places.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anded (PR7 of N)

PR7 of issue #372 — the first real voice transport. Ports three Python
adapters to TS and binds 7 scenarios in `specs/voice-agents.feature`.

What lands:

- `javascript/src/voice/adapters/elevenlabs.ts` — `ElevenLabsAgentAdapter`,
  the hosted ConvAI adapter. Connects to `wss://api.elevenlabs.io/v1/convai/conversation`
  via the `ws` package; PCM16/24kHz base64-over-JSON; full event handling
  (audio, ping, transcript, correction, init-metadata, interruption).
  Mirrors `python/scenario/voice/adapters/elevenlabs.py`.

- `javascript/src/voice/adapters/composable.ts` — `ComposableVoiceAgent` +
  `STTProvider` interface + `ElevenLabsSTTProvider` + inline `synthesize`
  helper (elevenlabs/ provider only — PR2 #513 supplies the rest). LLM is
  any ai-sdk `LanguageModel`. Mirrors `python/scenario/voice/adapters/composable.py`.

- `javascript/src/voice/adapters/eleven-labs-voice-agent.ts` —
  `ElevenLabsVoiceAgent`, the branded preset. Provider-typed options;
  defaults to `ElevenLabsSTTProvider` + `openai("gpt-5.4-mini")` +
  `elevenlabs/EXAVITQu4vr4xnSDxMaL` (Sarah — free-tier premade); each
  piece independently overridable. `eleven_v3` TTS model hardcoded for
  paralinguistic-marker support (per Python tts.py:107 comment).

Tests:

- `javascript/src/voice/adapters/__tests__/elevenlabs.test.ts` — 5 unit
  scenarios bound via `describeFeature(..., { includeTags: [["unit", "ts-elevenlabs"]] })`.
- `javascript/examples/vitest/tests/voice/elevenlabs-hosted.test.ts` — 2
  e2e scenarios env-gated on `ELEVENLABS_API_KEY` (+ `ELEVENLABS_AGENT_ID`
  for the hosted demo). Without keys, the suite cleanly skips.

Tag convention: `@ts-elevenlabs` (per-subject) rather than `@ts-bound` —
per the precedent from PRs #517 / #528 (`@ts-simulator`, `@ts-judge`,
`@ts-assistant-role`), per-subject tags avoid the `checkUncalledScenario`
collision with PR1's contract-surface test. See #523 for the
tag-convention decision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rotocol tests

Review pass on PR #536 surfaced four actionable concerns. Addressed:

- **#1 (blocking) — `connect()` left WS without `error`/`close` handlers
  after `onOpen` called `removeAllListeners()`.** An unhandled `error`
  on a Node EventEmitter crashes the process. Re-attach `message` +
  `error` + `close` listeners atomically post-open. The new `error`
  handler nulls `this.ws` so subsequent `sendAudio`/`receiveAudio`
  fail fast instead of writing to a dead socket. Pending receivers
  drain to empty `AudioChunk` so the executor unwinds rather than
  hanging.

- **#2 (blocking) — `onMessage` branches were untested.** Added 14
  wire-protocol unit tests (plain vitest, not cucumber-bound) covering:
  base64 PCM16 decode, odd-byte trim invariant, audio queue/waiter FIFO,
  ping → pong with `event_id`, ping defensive (no `event_id` skip),
  `user_transcript` capture, `agent_response` capture,
  `agent_response_correction` override, format-drift warning,
  interruption + unknown event swallow, non-JSON frames ignored,
  post-open socket error drain, socket close drain, and `receiveAudio`
  timeout.

- **#3 — Default LLM identifier was inlined in `eleven-labs-voice-agent.ts`,
  violating `voice-models.ts`'s self-declared single-source-of-truth
  contract.** Hoisted `COMPOSABLE_VOICE_LLM_MODEL` +
  `ELEVENLABS_DEFAULT_VOICE_ID` + `ELEVENLABS_TTS_MODEL` +
  `ELEVENLABS_STT_MODEL` into `voice-models.ts` (Python parity:
  `python/scenario/config/voice_models.py`). Adapters now import from
  there.

- **#6 — `receiveAudio` referenced `waiter` from inside the timer body
  before its `const` declaration.** Worked by event-loop ordering;
  fragile to refactor. Forward-declared `let timer` and put `waiter`
  ahead of the `setTimeout` so the dependency graph is explicit.

Tests: 411 / 22 files passing (previously 397 / 22; +14 wire-protocol tests).
Build: tsup CJS + ESM + DTS clean.

Deferred (intentional, tracked in PR body):
- #4/#5: inline `pcm16ToWavBytes` + `synthesize` helpers — duplicate-by-design
  with PR2 (#513); merge-order constraint.
- #7: `turnOutputEmitted` latch contract with PR3 executor — surface in
  PR3 review.
- #8: distinguish natural end-of-turn from socket close — design-level,
  needs PR3 design conversation.
- #9: `featurePath()` helper — extract once a 3rd test file would
  duplicate the climb.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r roles) (PR8 of N)

Port `python/scenario/voice/adapters/openai_realtime.py` to TypeScript at
`javascript/src/voice/adapters/openai-realtime.ts`. The adapter owns the
OpenAI Realtime wire protocol directly — the model IS the agent under
test (`role=AgentRole.AGENT`) or the voice-enabled user simulator
(`role=AgentRole.USER`, per §7.2 L1164-1171).

User-role critical path: scripted `user("text")` lines call `sendText`,
which emits `conversation.item.create` (`input_text` content) +
`response.create` directly. TTS is bypassed — the realtime model owns
prosody synthesis.

Wire-protocol behavior:
- WSS to `wss://api.openai.com/v1/realtime?model=<model>` via `ws`
- `session.update` post-connect (pcm16/24000 in/out, voice, instructions,
  tools, server-side VAD off so we own turn boundaries)
- `sendAudio` → `input_audio_buffer.append` (deferred commit)
- `receiveAudio` → commit + response.create on first call, loops over
  events until `response.audio.delta`; transcript deltas update
  `lastAgentTranscript`, Whisper user transcripts update
  `lastUserTranscript`
- `interrupt()` → `response.cancel` (first-class interrupt per §5.6)

Scenarios bound (`specs/voice-agents.feature`):
- @Unit @ts-openai-realtime — agent connect + user-simulator wiring
- @e2e @ts-openai-realtime-agent-demo — live agent-role round-trip
- @e2e @ts-openai-realtime-user-demo — live user-simulator with sendText

Per-subject tags avoid collision with PR1's `voice-contract-surface.test.ts`
which uses `includeTags: ["ts-bound"]` (single-axis OR). Dual-axis filters
`[["unit", "ts-openai-realtime"]]` keep unit binding tight.

Tests:
- `javascript/src/voice/adapters/__tests__/openai-realtime.test.ts` — 2
  @Unit scenarios driven against an in-process `ws` server (asserts
  wire-protocol shape, transcript accumulation, response.cancel,
  capability matrix). 7 step assertions pass.
- `javascript/examples/vitest/tests/voice/openai-realtime-agent.test.ts`
  — agent-role e2e demo, env-gated on `OPENAI_API_KEY` via
  `Scenario.skip`.
- `javascript/examples/vitest/tests/voice/openai-realtime-user.test.ts`
  — user-role e2e demo proving `sendText` is the TTS-free path.

Dependencies:
- Adds `ws` 8.20.1 + `@types/ws` 8.18.1 to the javascript workspace
  (Realtime WSS transport).

/browser-qa-against-prod evidence env-gated: `OPENAI_API_KEY` UNSET in
the grinder's environment so e2e demos report as skipped. CI gate runs
them when the secret is configured.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tools, sync disconnect)

Surfaced by /review skill (PR #535):

- **Sync disconnect:** `disconnect()` now eagerly rejects any in-flight
  `receiveAudio` waiter and flushes the event queue instead of relying on
  the async `close` handler. Prevents waiters from blocking past the close
  and stale-queued events from leaking into the next session.
- **API key validation:** `connect()` throws a named diagnostic when no
  key is set, instead of letting the request surface as a generic
  WebSocket 401.
- **`url` init knob:** `OpenAIRealtimeAgentAdapterInit.url` lets tests
  point at a loopback WS server without subclassing the adapter. The unit
  test now constructs the adapter directly — the `TestAdapter` subclass
  is gone.
- **Structural tool type:** `tools: unknown[]` → `RealtimeToolDef[]`
  (exported), so call-site typos surface at compile time. Sets the
  template for the four remaining adapter ports.
- **Single timeout site:** dropped the unreachable outer-loop deadline
  check in `receiveAudio` — `_nextEvent` already arms a per-iteration
  timer that fires the same error.
- **PCM16 truncate removed:** the AudioChunk constructor already enforces
  even-byte invariant; adapter-side truncation was belt-and-suspenders
  that would hide an upstream codec bug.
- **E2E agent demo:** moved the `expect(chunk).toBeInstanceOf(AudioChunk)`
  assertion from `When` into `Then` where it belongs.

Deferred (out-of-scope or PR3 territory):
- Logger surface for non-JSON frame drops (Python emits `logger.debug`;
  TS port has no logger yet — file when the SDK introduces one).
- `responseTimeout` / `responseTailSilence` / `responseMaxDuration` are
  inherited from `VoiceAgentAdapter` but inert until PR3 wires the
  executor. PR3 must consume them.

Gates re-validated: build green (CJS + ESM + DTS), 383/383 tests pass,
eslint clean on touched files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failure root cause: `AudioChunk`, `OpenAIRealtimeAgentAdapter`,
`OPENAI_REALTIME_MODEL`, `silentChunk` are exposed at the package root
via `export * as voice from "./voice"` — they're NOT named exports on
the root barrel. Direct named imports resolved to `undefined`, so
`expect(firstChunk).toBeInstanceOf(AudioChunk)` saw `undefined` and
`new OpenAIRealtimeAgentAdapter(...)` was a `TypeError`.

Switched both e2e demos to destructure from the `voice` namespace and
narrowed the local type aliases to `voice.AudioChunk` /
`voice.OpenAIRealtimeAgentAdapter`. Unit tests are unaffected — they
import from the local `../../index` re-export and never see the package
root.

CI was running the e2e demos because `OPENAI_API_KEY` IS configured in
the CI env. Locally the same path skips (key unset). The skip-path test
exit was a false positive — the actual binding consistency check needed
the run path to fire.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s it)

CI surfaced the real issue: the OpenAI Realtime endpoint at
`wss://api.openai.com/v1/realtime` is now GA and rejects the
`OpenAI-Beta: realtime=v1` opt-in with:

  The Realtime Beta API is no longer supported. Please use /v1/realtime
  for the GA API.

We were sending the header per Python parity (`python/scenario/voice/
adapters/openai_realtime.py`); the GA migration deprecates it. Dropped
the header and updated the file-level docstring to document the choice.

Python parity is intentionally broken here — Python adapter still sends
the Beta header and will hit the same error. Track for back-port to
keep the two SDKs aligned.

Local: 383/383 unit tests pass, build green. CI re-run pending; e2e
demos should now connect successfully against the GA endpoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI surfaced "Missing required parameter: 'session.type'" after the
Beta-header drop — the GA Realtime API restructured the session config
significantly (per RealtimeSessionCreateRequest in openai-node
realtime.ts).

Migrated session.update payload:
- session.type: "realtime" (required discriminator)
- session.model: passes the model id explicitly
- audio formats moved under session.audio.{input,output}.format as
  { type: "audio/pcm", rate: 24000 } objects
- voice moved under session.audio.output.voice
- transcription + turn_detection nested under session.audio.input

Unit test wire-shape assertions updated to match. Old shape fields
(input_audio_format, output_audio_format, top-level voice, top-level
turn_detection) are gone; the assertions now look at
audio.input.format, audio.output.voice, etc.

Python parity is intentionally broken here — the GA migration deprecates
the wire surface Python uses. Track for back-port to keep the SDKs
aligned. The Python adapter will hit the same error against the live
endpoint.

Local: 383/383 unit tests pass, build green (CJS + ESM + DTS).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CI issues after the GA wire-shape migration:

1. **Voice 'nova' is Beta-era, GA rejects it.** Supported voices are
   alloy/ash/ballad/coral/echo/sage/shimmer/verse/marin/cedar. Switched
   the user-role demo to `marin` (OpenAI's recommended modern voice).
   The BDD scenario text still names "nova" — that documents Python's
   parity intent; the test picks a valid GA voice.

2. **Agent-role demo deadlocks on silentChunk.** Sending 0.5s of silence
   to a Realtime session with `turn_detection: null` doesn't trigger the
   model; receiveAudio(20) times out and `chunk` stays null. The unit
   scenarios already prove the audio round-trip via a mock WS. The e2e
   demo's job is to prove live-endpoint connectivity, so rewrote it as
   a smoke test:
   - connect (GA handshake + session.update accepted)
   - interrupt (response.cancel round-trips against the live wire)
   - disconnect

   The Then assertion now verifies connectError is null and the
   capability matrix is published — wire health, not a model response.
   PR3 will drive real speech audio through the executor.

Local: 383/383 unit tests pass.
CI: receiveAudio timed out after 81s on the user-role e2e demo. Root
cause: GA renamed the streaming output events:

  Beta                              → GA
  response.audio.delta              → response.output_audio.delta
  response.audio.done               → response.output_audio.done
  response.audio_transcript.delta   → response.output_audio_transcript.delta
  response.audio_transcript.done    → response.output_audio_transcript.done

The Beta names are no longer emitted by the live endpoint, so the
receive loop never saw an audio frame.

Updated the event matcher to accept both names. The new GA name wins on
the live endpoint; the Beta alias keeps the existing unit tests (which
push the legacy event names) working without churn, and makes back-port
to any Beta-era endpoint trivial.

Local: 383/383 tests pass.
Ports python/scenario/voice/adapters/gemini_live.py →
javascript/src/voice/adapters/gemini-live.ts using @google/genai
(the new SDK; @google/generative-ai is the deprecated package).

- GeminiLiveAgentAdapter with capabilities matrix (streaming
  transcripts, native VAD, interruption, pcm16/16000 in,
  pcm16/24000 out)
- PCM16 24kHz↔16kHz resampler in pure JS (linear interpolation,
  no scipy)
- Callback-to-queue bridge mapping the SDK's onmessage callback
  onto an awaitable receiveAudio(timeout) contract
- @google/genai declared as optional peer dep; lazy-imported on
  connect() so the SDK ships without a hard Gemini coupling
- 2 @Unit scenarios (connect, capabilities matrix) bound via
  vitest-cucumber + 1 @e2e demo scenario (env-gated on
  GEMINI_API_KEY/GOOGLE_API_KEY)

Refs #372.
…f N)

Ports python/scenario/voice/adapters/{pipecat.py,_twilio_shared.py} to
TypeScript so voice scenarios can target a running Pipecat bot over the
Twilio Media Streams WS protocol. WebRTC transport is deferred and
raises PendingTransportError at connect() time.

New files
- src/voice/adapters/twilio-shared.ts — g711 µ-law 8 kHz ↔ PCM16 24 kHz
  codec + 24k/8k linear-interpolation resampler + Twilio Media Streams
  frame parser/builders. Reused by the upcoming TS Twilio adapter (PR11).
- src/voice/adapters/pipecat.ts — PipecatAgentAdapter speaking the
  synthetic connected/start handshake, 20 ms µ-law media frames, clear
  for first-class interrupt, mark "utterance_end" as end-of-turn signal.
- src/voice/adapters/pending-transport-error.ts — shared deferred-
  transport error class (parity with python _stub.PendingTransportError).
- src/voice/adapters/__tests__/twilio-shared-codec.test.ts — binds the
  two @ts-codec scenarios (round-trip fidelity + sample-rate conversion)
  plus plain-vitest edge-case tests.
- src/voice/adapters/__tests__/pipecat.test.ts — binds the three
  @ts-pipecat scenarios (WS round-trip, WebRTC PendingTransportError,
  clear-buffer interrupt) against a synchronous fake WebSocket.

Capabilities advertised
  streamingTranscripts=true, nativeVad=true, dtmf=false,
  interruption=true, input/outputFormats=[pcm16/24000, mulaw/8000].

Notes for reviewers
- 5 feature-file scenarios are bound (2 retagged, 3 new). Tag axis is
  @ts-pipecat / @ts-codec to match the @ts-<adapter> precedent set by
  PR #535 (OpenAI Realtime) and PR #536 (ElevenLabs).
- /browser-qa-against-prod is env-gated on SCENARIO_PIPECAT_QA_WS_URL.
  CI does not set the var; documented under "/browser-qa note" in the
  PR body. No script ships in this PR — adding one would require a
  user-owned bot endpoint we don't have.
- `ws` 8.20.1 + @types/ws 8.18.1 added as deps (matches PR #535).
- tsconfig.target=ES2022 added (matches PR #535).
…cases

Addresses 5 review concerns (review #540 synthesizer pass):
- #1 perf: receive-side mulaw buffer now stores Uint8Array slices, not
  number[]; bufferMulaw is O(1) per call instead of O(n) per byte.
- #2 docs: coerceFrameToText's 0x7b/0x5b heuristic is now documented as a
  known rare-collision risk (binary µ-law with first byte == { or [
  would mis-route to JSON parser and silently drop).
- #4 test pyramid: round-trip scenario re-tagged @Unit (FakeWebSocket =
  no network) — real-WSS @integration demo deferred behind env-gated
  bot endpoint per /browser-qa note.
- #5 coverage: 2 new edge-case tests for partial-buffer flush on
  bot-sent `stop` event and on socket-close.

Not addressed in this PR (filed as follow-up considerations):
- #3 vestigial audioFormat/sampleRate fields (inherited from Python parity)
- #6 DTMF/E.164 validation regex port (pre-requisite for PR11 Twilio)
- #8 extract TwilioMediaStreamsTransport helper (PR11 prep)
- #9 JSON-frame size cap (no regression vs main; same constraint as Python)
- #10 FakeWebSocket vs node:events (cosmetic)
…1 of N)

Ports python/scenario/voice/adapters/{twilio,_twilio_server,_twilio_shared}.py
to TypeScript:

- `twilio-shared.ts` — µ-law/PCM16 codec (8 kHz ↔ 24 kHz resample inline,
  no `audioop` in Node), Media Streams JSON frame parser/builders, E.164
  + DTMF validators, minimal Twilio REST client over fetch (no `twilio`
  npm SDK), HMAC-SHA1 signature verification.
- `twilio.ts` — `TwilioAgentAdapter` extending `VoiceAgentAdapter`.
  Capabilities: `inputFormats: ["mulaw/8000"]`, `outputFormats: ["mulaw/8000"]`,
  `interruption: true` (clear-buffer event), `dtmf: true`. Implements
  `placeCall`, `waitForCall`, `sendAudio`, `receiveAudio`, `sendDtmf`,
  and `interrupt`.
- `twilio-server.ts` — local HTTP + WS server (node `http` + `ws`) that
  impersonates Twilio's media-stream endpoint. Binds on an OS-assigned
  port (no hard-coded 8765). TwiML route returns `<Connect><Stream>` with
  the stream URL XML-escaped; signature gate fails closed.
- `twilio-tunnel.ts` — wraps `@ngrok/ngrok` (preferred) with a
  `localtunnel` fallback. Both are dynamic-imported as optional peer
  deps so they don't bloat the runtime bundle.

Scenarios bound in `specs/voice-agents.feature` via vitest-cucumber:

- `@integration @ts-bound @ts-twilio-proto` x3 — capabilities, JSON
  protocol parser, clear-buffer interrupt (twilio.test.ts).
- `@integration @ts-bound @ts-twilio-server` x2 — TwiML response shape +
  XML-escape, signature rejection (twilio-server.test.ts).
- `@e2e @ts-bound @ts-twilio-tunnel` x1 — tunnel exposes local server.
  Env-gated on NGROK_AUTHTOKEN (twilio-tunnel.test.ts).

Boy scout fixes in the same commit:

- `tsconfig.json` — added `target: "ES2022"` so `tsc --noEmit` accepts
  top-level await + iterators. Without this, `pnpm typecheck` is broken
  on `main` post #517 (the @ts-bound retrofit shipped top-level await
  but didn't update the target).
- `voice-contract-surface.test.ts` — narrowed `includeTags` from
  `["ts-bound"]` to `[["ts-bound", "ts-contract-surface"]]`. The
  retrofit's broad filter was destined to over-include any future
  `@ts-bound` scenario (PR-B/C/etc.); my Twilio scenarios surfaced the
  bug. Re-tagged the five contract-surface scenarios accordingly.
- `package.json` — added `ws@^8.20.1` runtime dep + `@types/ws` devDep.

Hazards documented in PR body:

- PR10 (Pipecat g711) hadn't pushed at branch time, so PR11 owns
  `twilio-shared.ts`. When PR10 lands, the two files reconcile (same
  module name and surface area).
- `@ngrok/ngrok` is a heavy native dep — kept optional and dynamic-
  imported so CI machines without NGROK_AUTHTOKEN don't pull it.
- Tunnel test is env-gated; CI does not exercise it.

Refs #372.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
drewdrewthis and others added 22 commits May 31, 2026 15:32
…i adapter

The TS Gemini Live adapter previously detected the spurious
'interrupted → turnComplete' pair after a barge-in and 'continue'd the
dequeue loop. Worked in isolation, but lost the race in the live
executor: the cancelled-turn boundary arrived after the bounded
interrupt() drain returned, so the demo compensated with two
scenario.agent() calls.

Port Python's session.receive() iterator-restart pattern (in-place):
on spurious-pair detection, reset detection state AND extend the
receiveAudio() deadline by SPURIOUS_PAIR_RECOVERY_MS (10s) so delayed
recovery audio is captured within the same call.

+1 unit test ('extends the deadline after the spurious pair') locks
in the fix — 600ms delay overflows the original 500ms budget but lands
within the extended 10s window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…io.agent()

The two-scenario.agent() workaround was the demo-level compensation for
the timing race fixed in the prior commit. With the iterator-restart
deadline-extension in receiveAudio(), one agent() call now captures the
recovery within the same call.

Mirrors python/examples/voice/gemini_live_interruption.py exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… UserSimulatorAgentWithVoice

- New canonical file: javascript/src/domain/agents/agent-shapes.ts
  - Uses generic { tts?: { voice?: string } } instead of VoiceConfig import
    so the file stays import-free of the voice layer (domain owns shapes,
    voice layer owns transport capabilities)
  - Adds UserSimulatorAgentWithVoice derived narrowed type (voice: string,
    not optional) so executors holding a isVoiceUserSim()-guarded reference
    get type-safe access to voice without a secondary null-check
- domain/agents/index.ts re-exports the new shapes (isRealtimeUserAgent,
  isVoiceUserSim, RealtimeUserAgent, VoiceUserSimulator,
  UserSimulatorAgentWithVoice)
- voice/agent-shapes.ts becomes a @deprecated re-export shim pointing at
  the new canonical location — keeps old import paths working without churn
- scenario-execution.ts updated to import from domain/agents/agent-shapes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Items addressed (items 1 and 6 skip no-touch files recording.types.ts,
gemini-live.ts):

- Item 2: collapse pipecat interrupted+discardingInboundAudio into single
  interruptPhase: "idle"|"interrupted" — the two flags were always set and
  cleared in lockstep; unifying eliminates the risk of one advancing while
  the other lags. Both consumer points (receiveAudio drain gate, bufferMulaw
  frame-discard gate) now check interruptPhase === "interrupted".

- Item 3: drop redundant bargeInDelayMs && prefix in fireUserInterrupt.
  TypeScript requires a null-coalescing approach (?? 0) since the field is
  number|undefined; simplified to (bargeInDelayMs ?? 0) > 0.

- Item 4: justify the Math.floor asymmetry between the two delaySeconds
  call sites. maybeScheduleInterruptedAgentTurn stores the value as integer
  ms (floor intentional); maybeInjectInterruption passes directly to
  setTimeout (fractional ms fine, Node clamps). Added a comment explaining
  the asymmetry so it no longer reads as a forgotten floor.

- Item 5: trim interrupt() JSDoc from multi-paragraph essay to enumerated
  side-effects list. The method name conveys intent; the side-effects list
  is the useful reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…med substeps

Reduces the ~115-line monolith to a ~20-line orchestrator by extracting
three private methods:

- resolveNextAgentForInlineBarge(): walks pendingRolesOnTurn to verify
  AGENT is the next runnable role and returns { idx, agent } or null.
  Replaces the inline bail/lookup phase (steps 1+guard checks).

- consumePendingRolesUntilAgent(agent): removes AGENT from
  pendingAgentsOnTurn and shifts pendingRolesOnTurn up to and including
  the AGENT slot so the next _step advances to JUDGE cleanly. Mirrors
  Python's while-loop.

- dispatchAgentBackground(idx): constructs the non-blocking task entry
  (adds .then(()=>undefined) to coerce callAgent's Promise<ScenarioResult|null>
  to Promise<void>), sets pendingAgentTask, and returns the entry for
  caller inspection. Mirrors Python's asyncio.create_task().

- prepareAndFireBargeIn(config, voiceUserSim, entry): samples delay,
  TTS phrase, fires the inline barge-in, records the user message.

Adds 6 targeted unit tests in proceed-interruptions.test.ts covering the
main branches of each substep (resolveNextAgentForInlineBarge returns
null when USER comes first / when queue is empty;
consumePendingRolesUntilAgent pops up to AGENT+leaves JUDGE;
dispatchAgentBackground sets pendingAgentTask and flips done=true).

Tests: 775 pass / 1 skipped (up from 769).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ent for barge-in reuse

- New file javascript/src/voice/utils.ts with export function sleep(ms):
  single canonical Promise-based sleep (ms <= 0 resolves immediately, no
  timer allocation). Issue #576 smell 1.

- pipecat.ts: imports sleep from ../utils; removes the local copy.

- voice-steps.ts: imports sleepMs from ../voice/utils; removes the local
  delay() function; all 5 delay() call sites delegate to sleepMs().

- adapter.runtime.ts: appendEvent() promoted from module-private to
  exported, with JSDoc. Single canonical writer for voice timeline events
  (push to voiceTimeline + voiceRecording.timeline + onVoiceEvent hook).
  Issue #576 smell 2.

- scenario-execution.ts:
  - imports appendEvent from adapter.runtime
  - imports sleep from voice/utils
  - 3 inline setTimeout-based sleeps replaced with sleep(ms)
  - 2 inline 3-part push/timeline/hook sequences replaced with
    appendEvent(this, event) — prevents event-shape drift by centralising
    the write in one place

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…under interruptOverrides bag

Replaces three formerly scattered @internal public fields with a single
named gateway:

  exec.interruptOverrides = { rng: () => 0 }

Fields in the bag:
- rng?: () => number  — replaces the public interruptRng field (now a
  private getter that reads interruptOverrides.rng ?? Math.random). Tests
  no longer need `as unknown as { interruptRng: () => number }`.
- waitForSpeechMs?: number  — seam for fireUserInterrupt's speech-wait
  bound (runtime path still uses the per-barge-in interruptWaitForSpeechMs
  field set by the interrupt() step; override bag is a fallback).
- bargeInDelayMs?: number  — seam for the post-speech barge-in delay
  (runtime path still uses the per-barge-in interruptBargeInDelayMs field
  set by prepareAndFireBargeIn; override bag is a fallback).

Test updates:
- proceed-interruptions.test.ts: 4 interruptRng cast sites → interruptOverrides
- proceed-interrupt.test.ts: 1 interruptRng cast site → interruptOverrides
- proceed-interrupt-errors.test.ts: 2 interruptRng + 1 interruptBargeInDelayMs
  cast sites → interruptOverrides / direct field access (field is still public)

Zero `as unknown as { interruptRng: ... }` casts remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mestampOnly; clear stale gemini-live future-fix comment

Closes #574 items 1 + 6 (the no-touch carve-outs from wave 2's commit
acad1da).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #582 (most sub-items). Deferred: cannedPhrases derivation from
InterruptionConfig — owned by parallel #583 work on random-interruptions.test.ts.

- PII-log redaction: gemini-live-interruption.test.ts recoveryTranscript
  log now reports length-only and is gated to the fired_after_speech branch
- WAV/LFS policy: decision captured in TESTING.md (trigger at 50 MB)
- Timeout tightening: proceed-interrupt.test.ts tests reduced 30 s → 5 s
- Behavioral test name: renamed mechanism-describing test to outcome-describing
- Factory test relocation: "exposes the value" moved to new
  agents/__tests__/user-simulator-agent.factory.test.ts
- beforeEach reset: gemini-live.test.ts standalone describe gains
  beforeEach(() => { captured.last = null }) and drops per-test manual resets
- bargeInDelayMs > 0 branch: two new focused tests in proceed-interrupt.test.ts
- Serializer round-trip: two new tests in recording.runtime.test.ts cover
  transcript_truncated=true and absence branches in saveSegments manifest

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ioPlayback })

Mirrors Python's ffmpeg/PyAudio approach: when audioPlayback is true,
each agent/user audio chunk is fanned out to a per-run AudioPlaybackSink
alongside the recording. Headless CI gracefully no-ops when no audio
device is available.

Re-activates the configure({ audioPlayback }) knob that was stored-but-not-
yet-consumed — it is no longer dead.

Implementation:
- new javascript/src/voice/playback.ts: AudioPlaybackSink class using the
  bundled ffmpeg-static binary piping PCM16 to platform audio driver
  (ALSA/AudioToolbox/DirectShow). Degrades gracefully: error or non-zero
  exit from ffmpeg emits one console.warn and no-ops sendChunk.
- VoiceExecutorState: add audioPlaybackSink optional field.
- adapter.runtime.ts fireAudioChunk: fan-out 2 — sends to sink alongside
  the existing onAudioChunk hook.
- scenario-execution.ts: construct sink when audioPlayback === true (per-run
  voice config wins over global configure() per ADR-002); close after
  stopVoiceAdapters in finally.
- configure.ts: update doc comment — no longer "stored-but-not-yet-consumed".

Tests:
- playback.test.ts: 8 unit tests for AudioPlaybackSink (subprocess mock) +
  2 executor-wiring tests (spawn called iff audioPlayback: true).

Device-bound caveat: cannot live-verify the sink without an audio device —
the unit tests use a mocked subprocess. E2E verification requires a host
with ALSA/AudioToolbox/DirectShow.

Closes #585.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…in playback.test

Two regressions from 05f9f4c:
- playback.ts:95,109 accessed proc.stdin/proc.stderr without null
  guards; child_process.spawn returns those as Readable|null when
  stdio is not configured as 'pipe'. Added guards with warn-once
  graceful degrade.
- playback.test.ts hit vi.mock hoisting: the mock factory referenced
  a module-level mockProc declared AFTER the vi.mock() call (which
  vitest hoists to the top). Switched to vi.hoisted() so mockProc is
  available at hoist-time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ttern

GeminiLiveAgentAdapter.dequeue() had a single-consumer resolveNext slot.
When interrupt() called dequeue() concurrently with an in-flight
receiveAudio() (during a barge-in), the second caller overwrote the
first caller's resolver → the in-flight receiveAudio() timed out with
TimeoutError → drainAgentResponse caught and broke → interrupted segment
ended truncated but no recovery captured.

Switch interrupt() to a non-competing abort-sentinel pattern: set a
_interruptPending flag + wake any in-flight dequeue with the sentinel.
receiveAudio()'s loop checks the flag at each iteration and returns the
cut-off sentinel promptly. interrupt() no longer competes on the queue.

New unit test locks in the fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cut-off

Now that the Gemini Live adapter's dequeue() concurrency race is fixed
(prior commit), the barge-in properly cuts mid-stream + recovery is captured.

Switch random_interruptions from Pipecat (burst-streams TTS, defeating
post-arrival cancel) to Gemini Live (realtime-streaming + server-side
cancel + now-fixed adapter).

Re-adds the median-shorter assertion (ratio < 0.8): the truncated
segment is now meaningfully shorter than the median agent reply.

Closes #583 — pending live verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
From the post-grind /review pass:

- CORRECTNESS: playback.ts close() hung if subprocess exited early.
  Set this._proc = null in exit handler (mirroring error handler) so
  the if (!this._proc) guard in close() short-circuits. +1 regression
  test. Also: proc.on → proc.once in close() to stop leaking listeners.
- HYGIENE: twilio.ts was the 5th sleep callsite the #576 consolidation
  missed — migrated to shared sleep() from voice/utils.
- HYGIENE: 2 test files (proceed-interrupt-errors, proceed-interrupt)
  used inline new Promise(setTimeout) instead of shared sleep().
- SECURITY: gemini-live.ts error message bound JSON.stringify(goAway)
  to 300 chars to prevent unbounded log growth from large server struct.
- TEST: proceed-interrupt-errors.test.ts 30s timeouts tightened to 5s
  (actual runtime ~200ms; matches sister proceed-interrupt.test.ts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (TS-only generator)

Audit feedback: repo-root scripts/ is for cross-language tooling; this
script writes only to javascript/src/voice/assets/noise/ and has zero
Python callers. Belongs in javascript/scripts/ alongside other TS-only
asset/build tooling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…by description)

Audit feedback: the doc reads "Engineering Design Record (sits between
the PRD and the PRs)" — that's ADR-shape. Sitting in docs/voice/ next
to capability-matrix.md and happy-path-*.md (which are docs-site
material) was wrong placement.

Header reshaped to match ADR-001/002 convention (title prefixed
ADR-003, Date/Status preamble, companion-doc paths fixed for the new
location). Body framing line removed where the header preamble now
says the same thing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s → outputs

User feedback: "recordings" describes the file format; "outputs"
describes the purpose (these dirs hold what the example tests
produced). The helper that writes here keeps its name
(saveDemoRecording) — it still SAVES a recording, the recording is
just NAMED an output now.

Updates the writing helper's RECORDINGS_ROOT to point at outputs/,
all test-file doc-comment path refs, the recordings README (title,
intro, GitHub blob URL example, section header), .gitignore patterns,
the voice-integration CI workflow's upload path, TESTING.md fixture
paths, and fixes the (pre-existing) broken link in javascript/README.md
that pointed at ./recordings/README.md.

Python's python/recordings/ stays for now; renaming there is a
follow-up issue (filed separately).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t types

User feedback: outputs/ should be a parent for all test-run artifact
types (recordings now, traces/logs/screenshots later). Moves every demo
into outputs/recordings/<demo>/ and adds a new thin outputs/README.md
that documents the artifact-parent shape. The rich audio policy /
per-demo coverage table stays where it belongs at
outputs/recordings/README.md.

Writer (tests/voice/helpers/save-demo-recording.ts) updated:
RECORDINGS_ROOT now resolves to .../outputs/recordings/, so newly
written recordings land in the new shape without further changes.

Other ref updates:
- .gitignore: every committed-demo whitelist + segments re-ignore moved
  under outputs/recordings/, plus a sibling re-include for the new
  outputs/README.md.
- .github/workflows/javascript-voice-integration.yml: upload-artifact
  path → outputs/recordings/**.
- javascript/README.md: doc link → outputs/recordings/README.md.
- TESTING.md: footprint paths + du command.
- All @e2e demo test docstrings (15 files): "Recording lands in
  outputs/recordings/<demo>/".

Sanity: typecheck PASS, build PASS, tests 791/792 PASS (1 pre-existing
skip, unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Feature file `specs/voice-agents.feature:971` (added by main commit 71dd5ed
/ PR #492) lists `interruption` in the adapter-capabilities declaration.
The vitest-cucumber binding at voice-contract-surface.test.ts:177 still
had the pre-71dd5ed step title (missing `interruption`), so StepAble
couldn't find the matching feature step.

Update the step title to match the feature file and add the live-adapter
`typeof caps.interruption === "boolean"` check (the empty-adapter check
on line 192 already exists).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eceive re-engages (#567)

Root cause
----------
`ElevenLabsAgentAdapter.sendAudio()` streamed the user PCM then a fixed
16000-byte (~333ms) silence tail to coax ElevenLabs' server-side VAD into
ending the turn. ElevenLabs ConvAI 2.0 detects end-of-turn with a hybrid
VAD + deep-learning turn-detector (prosody, rhythm, micro-pauses), NOT a
pure silence threshold, so a fixed zero-byte blob does not deterministically
trip it on a scripted, non-mic stream. The greeting -> first user turn
happened to work; a scripted 2nd user turn (and the post-interrupt case)
intermittently never re-engaged and `receiveAudio` timed out. This is why
the hosted demo was capped at one exchange and `elevenlabs_interruption`
was gated off (4 prior live attempts timed out).

Protocol research (verified, not guessed)
-----------------------------------------
EL ConvAI exposes NO audio-flush / end-of-turn / commit client event. The
complete client->server union (official Python SDK + JS SDK events.ts) is:
pong | client_tool_result | conversation_initiation_client_data | feedback |
contextual_update | user_message | user_activity | multimodal_message, plus
the bare user_audio_chunk. `user_activity` only resets the inactivity timeout
(does not commit). `user_message` ({"type":"user_message","text":...}) is the
one event that deterministically forces an agent response without mic-style
VAD, and it is the SDK's own sendUserMessage / send_user_message wire shape
so it is server-accepted (does not 400 the socket).

Fix
---
Add `turnCommitMode: "text" | "silence"` (default "text") and a configurable
`silenceTailBytes`. In "text" mode `sendAudio` sends ONLY an explicit
`user_message` turn-commit carrying the chunk transcript (the voice runtime
threads the `scenario.user(...)` script text through as the AudioChunk
transcript). We deliberately do NOT also stream the raw audio in the same
turn: sending user_audio_chunk + user_message together raced EL's audio
ingestion against the text commit and was live-flaky (1/3 raw pass, only
green via retry). Text-only commit re-engages every turn. Nothing observable
is lost — the runtime records the user audio locally (recorder.recordUser,
independent of this send) and EL echoes the committed text back as a
user_transcript so lastUserTranscript still populates. "silence" preserves
the legacy pure-audio VAD path; "text" with no transcript falls back to the
silence tail. ping/pong, transcript capture, agent_response_correction,
drainPendingWaiters, capability advertisement, and the wsFactory seam are
all unchanged.

Verification
------------
- Unit (offline, injectable wsFactory + fake socket):
  src/voice/__tests__/elevenlabs-turn-commit.test.ts — a scripted 2nd user
  turn after an agent turn drives a 2nd receiveAudio resolution, each user
  turn emits a user_message commit (exact shape), post-interrupt re-engages,
  and both silence fallbacks. 7 tests pass; full voice suite 179 pass.
- Live (real EL socket, >=2 exchanges): examples/.../elevenlabs-hosted.test.ts
  extended to greeting -> user -> agent -> user -> agent -> judge. 3
  consecutive clean runs, no retries, no `receiveAudio timed out`; 5 segments
  / 3 agent turns / success=true (the 2nd-turn "support hours" question now
  answers instead of timing out).

Python parity
-------------
python/scenario/voice/adapters/elevenlabs.py has the IDENTICAL silence-tail
limitation and the SAME bug for scripted turn 2+. NOT fixed here: #567 is
scoped to the TypeScript SDK, and a Python fix cannot be live-verified in
this worktree (EL key + harness are JS-side), so shipping it would be
unverified. Follow-up: port the same user_message turn-commit to the Python
adapter and live-verify against python/examples/voice/elevenlabs_hosted.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@drewdrewthis
Copy link
Copy Markdown
Collaborator Author

Fidelity trade-off worth a reviewer's eye (not a blocker):

In default turnCommitMode: "text", scripted user turns after the greeting are committed to the hosted EL agent-under-test as user_message text, not streamed audio. This is what makes turn 2+ reliable — but it means the EL agent-under-test reads the user's words rather than hears the voice on those turns, so tone/prosody (e.g. audible anger, accent, background noise on the user side) is not exercised against a hosted-EL agent past turn 1.

Scope check (verified): the repo's angry-customer demo uses pipecatAgent as the agent-under-test with EL only as the user-simulator voice, so it is unaffected — its agent still receives real audio. The trade-off only bites a scenario whose agent-under-test is the hosted elevenLabsAgent AND which depends on the agent perceiving user-side audio nuance on multi-turn. For that case, turnCommitMode: "silence" preserves audio (at the known turn-2 reliability cost).

Untested alternative for full fidelity: the protocol note rules out audio commits because user_audio_chunk + user_message in the same turn raced and flaked. But EL's multimodal_message client event (in the union we listed) is the one we did not live-test — it's designed to carry audio+text as a single committed turn and could give deterministic commit with audio. Worth a follow-up spike before treating text-only as the permanent answer.

Net: the fix is correct and the right default for reliability; flagging the boundary so it's a conscious choice, not a silent regression.

…ext-turn (#567 parity)

Mirror the TypeScript fix (javascript/src/voice/adapters/elevenlabs.ts,
commit b2a01a1) in the Python ElevenLabsAgentAdapter.

EL ConvAI exposes NO audio-flush / end-of-turn client event, and ConvAI
2.0 end-of-turn is a hybrid VAD + deep-learning turn-detector, not a pure
silence threshold. The legacy "stream audio + fixed silence tail" path
therefore does NOT reliably commit a scripted, non-mic turn 2+ — the 2nd
recv_audio stalled (issue #567).

Changes (faithful parity with the TS adapter):
- New options turn_commit_mode: Literal["text","silence"] = "text" and
  silence_tail_bytes: int = 16000 (snake_case per Python convention).
- send_audio: in "text" mode when the chunk carries a transcript, send
  ONLY {"type":"user_message","text":<transcript>} (the new
  _send_user_message helper) — the only documented client event that
  deterministically forces an agent response. The raw audio is NOT also
  streamed (audio + text in one turn raced EL's ingestion and was
  live-flaky). Otherwise (silence mode, or text mode with no transcript)
  fall back to the legacy user_audio_chunk + silence-tail path, now using
  silence_tail_bytes.
- Module docstring wire-protocol note updated to list user_message and the
  turn-commit rationale, mirroring the TS docstring.

The scripted user text reaches the adapter unchanged: scenario.user("…")
TTS yields AudioChunk(data=pcm, transcript=text) (voice/tts.py), threaded
through extract_audio -> chunk.transcript into send_audio — exact parity
with the TS chunk.transcript path. No runtime changes beyond the adapter.

Verification:
- Unit: tests/voice/test_elevenlabs_turn_commit.py (7 tests, all pass) —
  proves a scripted 2nd user turn drives a 2nd recv_audio, the exact
  user_message shape (type+text only) is sent, post-interrupt correction
  updates the transcript, and the silence fallback + silence_tail_bytes
  resize still work. Existing EL transport tests stay green (74 passed
  across turn-commit + adapters + script_steps + messages + audio_chunk).
- LIVE: examples/voice/elevenlabs_hosted.py against the real EL socket,
  2 user exchanges — success: True. Both user turns (incl. the turn-2
  follow-up "What information do you need…") received coherent agent audio
  responses; no recv_audio timeout. Recorded segments confirm
  agent/user/agent/user/agent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants