fix(voice/ts): explicit EL ConvAI turn-commit so scripted next-turn receive re-engages #596
drewdrewthis wants to merge 175 commits into
…state)
Engineering Design Record for the TypeScript voice port (#372): the inside-the-box design the PRD (API proposal) never specified. Pairs the module tree + per-module contract catalog (target vs as-built gap analysis across the voice PR series) with ADR-002, which moves STT/TTS provider state off a module-global singleton onto per-run ScenarioConfig.voice (the only per-run carrier that reaches AgentAdapter.call), removes the invented scenario.configure({stt}) surface, and standardizes one in-message audio format (fixing a live WAV-vs-PCM decode mismatch).
Spec only: no runtime change. The clean voice stack is built against this.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports python/scenario/voice/{tts,stt,_transcribe}.py to TypeScript and
exposes scenario.configure({ stt }) for swapping the default STT provider.
- voice/tts.ts: synthesize(text, voice, effectFn?) + LRU(64) keyed on
sha256(text)+voice. Effects apply AFTER cache hit per the locked
decision; raw text never reaches the cache payload.
- voice/stt.ts: STTProvider interface, OpenAISTTProvider default
(gpt-4o-transcribe) with 25-minute chunking, ElevenLabsSTTProvider,
setSttProvider / getSttProvider for swap. Pure-TS pcm16-to-wav
encoder — no transcription-only ffmpeg dep.
- voice/transcribe.ts: transcribeSegments — post-hoc, idempotent
per-segment, degrades gracefully when no provider is configured.
- config/configure.ts: scenario.configure({ stt }) entry point.
Tests in follow-up commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
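The caching rule described in this commit (key on the text digest plus the voice, apply effects only after the lookup) can be sketched as below. This is a minimal illustration, not the code in voice/tts.ts: `synthesizeCached`, `RenderFn`, and the synchronous `render` callback are hypothetical stand-ins, and the real provider call is asynchronous.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes for illustration only; not the module's real types.
type EffectFn = (pcm: Uint8Array) => Uint8Array;
type RenderFn = (text: string, voice: string) => Uint8Array;

const MAX_ENTRIES = 64;
// A Map iterates in insertion order, which is enough for a small LRU.
const pcmCache = new Map<string, Uint8Array>();

function cacheKey(text: string, voice: string): string {
  // Only the digest is stored, so the raw text never sits in the cache.
  const digest = createHash("sha256").update(text, "utf8").digest("hex");
  return `${digest}:${voice}`;
}

function synthesizeCached(
  text: string,
  voice: string,
  render: RenderFn,
  effectFn?: EffectFn,
): Uint8Array {
  const key = cacheKey(text, voice);
  let pcm = pcmCache.get(key);
  if (pcm !== undefined) {
    // Cache hit: delete and re-insert to mark the entry as most recent.
    pcmCache.delete(key);
    pcmCache.set(key, pcm);
  } else {
    pcm = render(text, voice);
    pcmCache.set(key, pcm);
    if (pcmCache.size > MAX_ENTRIES) {
      // Evict the least recently used entry (first key in iteration order).
      const oldest = pcmCache.keys().next().value;
      if (oldest !== undefined) pcmCache.delete(oldest);
    }
  }
  // Effects run on a copy after the lookup, so cached bytes stay unmodified.
  return effectFn ? effectFn(pcm.slice()) : pcm;
}
```

Because the effect receives a copy, a later call with a different effect still starts from the original cached PCM.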
- tts.test.ts: cache key is (sha256(text), voice); effects apply AFTER
cache hit (third call with new effect reads ORIGINAL cached PCM, not
effect-baked bytes).
- stt.test.ts: default model = gpt-4o-transcribe; provider swap via
setSttProvider; STTProvider interface minimal (no OpenAI types leak);
>25-min audio splits into sub-chunks with concatenated transcripts.
- transcribe.test.ts: transcribeSegments fills missing transcripts in
place, skips already-filled segments; missing STT degrades gracefully
with a warning and never raises.
- configure.test.ts: scenario.configure({ stt }) round-trips a custom
provider; null clears.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cucumber
Retrofits PR #513's hand-rolled tests so the 7 scenarios they claim to cover actually load and execute against specs/voice-agents.feature via @amiceli/vitest-cucumber, matching the pattern landed by #517.
Scenarios are tagged @ts-tts, @ts-stt, @ts-transcribe (domain-specific sub-tags alongside @Unit) so each test file's includeTags filter targets exactly the scenarios it owns without disturbing voice-contract-surface.test.ts (which uses @ts-bound for the original 5 scenarios from PR1).
- tts.test.ts: loadFeature + describeFeature({ includeTags: ["ts-tts"] }) binding "TTS cache key is (text, voice) only and effects apply after cache hit"
- stt.test.ts: loadFeature + describeFeature({ includeTags: ["ts-stt"] }) binding 4 STT scenarios: default gpt-4o-transcribe, provider swap, minimal interface, >25-min chunking
- transcribe.test.ts: loadFeature + describeFeature({ includeTags: ["ts-transcribe"] }) binding transcribe_segments fills-in-place + missing STT degrades gracefully
Refs #516 (spec-binding retrofit), #517 (PR-A, merged), #372 (slice plan).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two /review must-fixes:
1. transcribe.test.ts had `void transcribeSegments(...).then(expect...)` inside a synchronous Then callback. The promise resolved after the step completed, so any assertion failure was silently swallowed by vitest. Made the Then async and awaited the call directly.
2. Doc-comment headers in stt/tts/transcribe.test.ts incorrectly cited `@ts-bound`. Updated to cite each file's actual tag (`@ts-stt`, `@ts-tts`, `@ts-transcribe`) so the next reader doesn't get misled. Note: transcribe.test.ts header already said `@ts-transcribe` correctly; only stt.test.ts and tts.test.ts needed updating.
Reviewer convergence (3x on #1, 2x on #2): test + principles + hygiene + principles.
Refs #516, #517, #513.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…VAD fallback (WIP)
PR3 of N for #372. Builds on PR1 (#511) types.
- Port `python/scenario/voice/adapter.py` runtime to `voice/adapter.runtime.ts`:
  * `asyncio.Event` -> `AgentSpeakingEvent` (Promise + resolve ref)
  * `async with` -> explicit `startVoiceAdapters` / `stopVoiceAdapters`
  * Default `call()` body: send -> drain on tail silence -> record -> return
  * Hook fan-out for `onAudioChunk` / `onVoiceEvent`
- Port `python/scenario/voice/vad.py` -> `voice/vad.ts`:
  * `WebRTCVadFallback` with one-shot warning per adapter (matches Python `_warned_adapters` memoisation, no rate-limit regression)
  * Activates only when `adapter.capabilities.nativeVad === false`
  * Pure-TS RMS energy + hysteresis detector ships today; webrtcvad C-library build pipeline is the decision-pending item.
- Patch `execution/scenario-execution.ts`:
  * Implement `VoiceExecutorState` structurally (Decision 1(b) from #372)
  * Pick voice adapters at run start; connect inside try, disconnect in finally so the spec-138-145 "regardless of pass/fail/exception" contract holds.
  * Wire `onAudioChunk` / `onVoiceEvent` from `ScenarioConfig`.
- Add `voice/__tests__/fixtures/fake-adapter.ts`: in-memory adapter, no real transport. Tests use this exclusively.
- Tests (vitest, bound to `specs/voice-agents.feature`):
  * `adapter-lifecycle.test.ts` lines 138-145
  * `hooks.test.ts` lines 449-461
  * `vad-fallback.test.ts` lines 772-791
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
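For readers unfamiliar with the "RMS energy + hysteresis" detector named above, a minimal sketch follows. The class name `EnergyVad`, the option names, and every threshold value are assumptions made for illustration; the commit does not state the tuning or the API of voice/vad.ts.

```typescript
// Hypothetical options; the shipped detector's thresholds are not given in the source.
interface VadOptions {
  onThreshold: number; // RMS level that starts speech
  offThreshold: number; // lower RMS level below which speech may end
  hangoverFrames: number; // consecutive quiet frames needed to end speech
}

// Root-mean-square energy of one PCM16 frame.
function rms(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) sumSquares += frame[i] * frame[i];
  return Math.sqrt(sumSquares / frame.length);
}

class EnergyVad {
  private speaking = false;
  private quietFrames = 0;
  private opts: VadOptions;

  constructor(opts: VadOptions) {
    this.opts = opts;
  }

  // Feed one frame; returns whether speech is currently considered active.
  process(frame: Int16Array): boolean {
    const level = rms(frame);
    if (!this.speaking) {
      // Entering speech requires crossing the higher threshold.
      if (level >= this.opts.onThreshold) {
        this.speaking = true;
        this.quietFrames = 0;
      }
    } else if (level < this.opts.offThreshold) {
      // Leaving speech requires staying under the lower threshold for a while.
      this.quietFrames += 1;
      if (this.quietFrames >= this.opts.hangoverFrames) this.speaking = false;
    } else {
      this.quietFrames = 0;
    }
    return this.speaking;
  }
}
```

The gap between the two thresholds is what keeps the detector from flapping when the level hovers near a single cutoff.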
… fail-on-call fixture
- ScenarioExecution.reset() recreated ScenarioExecutionState, losing the setExecutor linkage from the constructor. Voice adapters reaching input.scenarioState._executor would see null for the rest of the run, so hook fan-out / recorder never wrote into voice state. Re-attach in reset() so the linkage survives.
- FakeVoiceAdapter gains a failOnCall option, which is cleaner than spawning a second AGENT-role agent that would compete with the fake adapter for the agent() step (the executor picks the first role-matching agent).
- All 4 voice test files now green (21/21 voice tests, 381/381 total).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… vitest-cucumber
Retrofits PR #515's hand-rolled tests for adapter lifecycle, hooks, and VAD fallback to actually load and execute specs/voice-agents.feature via @amiceli/vitest-cucumber, matching the pattern landed by #517 and #513.
Tags by test file (per-file tagging needed because vitest-cucumber v6 fails the suite for scenarios that match a file's includeTags but aren't bound in that file):
- @ts-adapter: connect/disconnect fires per-scenario
- @ts-hooks: on_audio_chunk and on_voice_event fire
- @ts-vad: VAD fallback / native-VAD does not trigger / one-shot warning
Key implementation note: vitest-cucumber v6 runs each Given/When/Then step as a separate vitest it(). Module-level beforeEach/afterEach hooks fire around each step, not around the whole scenario. For scenarios that need to assert on console.warn calls across step boundaries, the spy is installed locally within the When step and captured warn messages are carried via closure-scoped variables into Then/And, avoiding the floating-promise and spy-reset antipatterns.
Refs #516 (spec-binding retrofit), #517 (PR-A, merged), #513 (PR-B, ready for review), #372 (slice plan).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three /review must-fixes:
1. vad-fallback.test.ts: replaced the closure-capture spy pattern with
the library's BeforeEachScenario/AfterEachScenario hooks. The
coder's earlier workaround was based on the false belief that
vitest-cucumber lacked scenario-level lifecycle hooks. The hooks
exist (verified at @amiceli/vitest-cucumber 6.5.0
describe-feature.js:311-322). BeforeEachScenario fires via
beforeAll inside the scenario describe block — once per scenario,
not per step. Spy is shared; capturedWarnCalls accumulates across
steps within the same scenario. Removed ~28 lines of SPY STRATEGY
prose comments.
2. hooks.test.ts: extracted the "throwing hook doesn't break scenario"
check from inside the on_voice_event scenario's When step. It was
asserting behavior the bound feature scenario didn't claim. Now a
plain it() block outside describeFeature. Option (a) chosen: no
spec scenario exists for this behavior in voice-agents.feature.
3. adapter-lifecycle.test.ts: split 5 sub-cases out of one packed And
step. Kept only the happy-path disconnect assertion in the bound
And step (disconnect fires once on success). Lifted fail/throw/
multi-adapter/disconnect-swallow to 4 plain it() blocks. Option (b)
chosen: specs/voice-agents.feature line 143 names the And step as a
single AC ("regardless of pass/fail/exception") — the 4 sub-cases
are implementation-level guarantees not individually specced.
Reviewer convergence: principles + test (3x). Refs #516, #517, #513, #515.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…udio messages (PR4 of N)
Ports the python voice path for simulator and judge to TypeScript:
- javascript/src/voice/messages.ts: createAudioMessage/extractAudio/messageHasAudio helpers using the local AudioMessageParam type. No openai package import; uses messages.types.ts (Decision 2(b)).
- javascript/src/agents/user-simulator-agent.ts: voice config triggers audio-message emission; per-step voice + per-step audio_effects + persona composition. stripAudioContent keeps LLM calls text-only.
- javascript/src/agents/judge/judge-agent.ts: JudgeAgent exported as class with static conversationHasAudio; effectiveIncludeAudio/Timeline/Traces helpers; auto-detect multimodal model via model name substrings; include_audio=false escape hatch.
13 scenarios bound to specs/voice-agents.feature via vitest-cucumber:
- 5 simulator scenarios (@ts-simulator)
- 7 judge scenarios (@ts-judge)
- 1 assistant-role scenario (@ts-assistant-role)
Tag convention: per-subject (@ts-simulator / @ts-judge / @ts-assistant-role) instead of @ts-bound to avoid colliding with PR1's voice-contract-surface test (which uses includeTags: ["ts-bound"] and would over-match new scenarios). Per-file tagging is established by #513/#515; tag-convention decision tracked at #523.
Refs #372 (slice plan), #517 (PR1 infra, merged), #513 (PR2, ready),
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… minor cleanups
/review surfaced 4 Must-Fix carry-forwards from prior PRs:
1. "Per-step voice override applies to only that step" scenario asserts no observable behavior: voiceStyle is set/cleared via setOneShotOverride but no TTS provider honors it. Spec retagged @todo (removed @ts-simulator) so future PRs that wire voiceStyle into _synthesize can re-bind. Test block removed. Honest absence beats paraphrase-as-binding. PR4 now binds 12 scenarios (was 13).
2. voice-assistant-role.test.ts doc-comment claimed @integration but feature file tags @Unit. Fixed. Also fixed an internal comment that said "Python SDK" when the context was "TS SDK".
3. judge-voice.test.ts had 4-5 packed Then blocks (multi-model sub-cases stuffed into single bound Thens). Lifted sub-cases to plain it() blocks outside describeFeature; bound Thens now assert only spec-named behavior.
4. Hoisted mid-file zod import to top of judge-agent.ts.
Reviewer convergence: principles, hygiene, test.
Refs #528, #516, #372.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… extensions (PR5 of N)
PR5 of the TS voice parity slice. Pure SDK orchestration — no external
service is touched, no UI runs. Wires the script-step DSL, interruption
config, recording runtime, and the optional ScenarioResult voice fields
behind the same contract surface the Python SDK already ships.
Adds:
* javascript/src/script/voice-steps.ts — sleep, silence, audio, dtmf,
interrupt (after-time + after-words), agent({ wait: false }),
proceed({ interruptions, onTurn, onStep }), backgroundNoise.
Imports from `@langwatch/scenario` script barrel as `voiceAgent` /
`voiceProceed` so the existing positional `agent`/`proceed` stay
untouched for callers.
* javascript/src/voice/interruption.ts — InterruptionConfig class
with shouldInterrupt / sampleDelay / pickRandomPhrase. RNG-pluggable
so callers can pass a seeded PRNG for deterministic tests.
CONTEXTUAL_PROMPT exported as a module-level constant.
* javascript/src/voice/recording.runtime.ts — VoiceRecordingRuntime
with WAV writer (native; canonical PCM16/24kHz/mono RIFF header) and
MP3/OGG/FLAC via system ffmpeg subprocess. saveSegments() writes the
segments dir + full.wav + JSON manifest. computeLatencyMetrics()
aggregates avg/p50/p95 with ceiling-style p95.
* ScenarioResult gains optional `audio`/`timeline`/`latency` fields —
text-only runs leave them undefined (back-compat preserved).
Test files (all bound via vitest-cucumber against specs/voice-agents.feature):
* src/script/__tests__/voice-steps.test.ts (11 scenarios, @ts-script-step)
* src/voice/__tests__/interruption.test.ts (1 bound + 2 unit, @ts-interruption-cfg)
* src/voice/__tests__/recording.runtime.test.ts (7 unit — not feature-bound)
* src/voice/__tests__/result-extensions.test.ts (6 scenarios, @ts-result-ext)
Spec tags: @ts-script-step / @ts-interruption-cfg / @ts-result-ext sub-tags
scope each PR5 file's binding set; voice-contract-surface.test.ts now
uses excludeTags to keep ownership of the PR1 contract-surface set only.
Tsconfig: target=ES2022 so top-level await (vitest-cucumber pattern)
and `Set` iteration land without --downlevelIteration shims.
ffmpeg distribution decision pending — see PR body for options.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
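As a reference for the "canonical PCM16/24kHz/mono RIFF header" mentioned above, the standard 44-byte WAV header can be written in pure TypeScript as sketched here. The function name `pcm16ToWav` and its defaults are illustrative assumptions; the layout itself is the standard RIFF/WAVE PCM layout, not necessarily byte-for-byte what VoiceRecordingRuntime emits.

```typescript
// Wraps raw little-endian PCM16 bytes in a standard 44-byte RIFF/WAVE header.
// Name and defaults are hypothetical; only the header layout is standard.
function pcm16ToWav(pcm: Uint8Array, sampleRate = 24000, channels = 1): Uint8Array {
  const bytesPerSample = 2; // PCM16
  const blockAlign = channels * bytesPerSample;
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const ascii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  ascii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // byte rate
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, 16, true); // bits per sample
  ascii(36, "data");
  view.setUint32(40, pcm.length, true); // payload size in bytes
  out.set(pcm, 44);
  return out;
}
```

At 24 kHz mono PCM16 the byte rate works out to 48000 bytes per second, which is a quick sanity check when inspecting a written file.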
…faces
Addresses /review concerns on PR5:
- Lift voiceInterruptions + voiceBackgroundNoise onto VoiceExecutorState so voiceProceed/backgroundNoise write through the same typed contract the voice subsystem already commits to (Decision 1(b) of #372). Drops three `as unknown as { _voice* }` indirections from voice-steps.ts.
- Expose agentSpeakingEvent + streamingTranscript + sendDtmf on VoiceAgentAdapter as optional/abstractable members. dtmf() now calls adapter.sendDtmf() directly, so adapters that claim capabilities.dtmf while skipping the method get a loud UnsupportedCapabilityError from the base class instead of a silent PCM synthesizer fallback.
- Add bounded timeout to waitForStreamingWords so a wedged adapter that never advances its transcript can't lock the script forever (mirrors waitForAgentSpeaking's pattern).
- audio() URL_LIKE error message no longer suggests "download the asset locally" when the input is already a file:// URI.
- recording.runtime.test.ts skips MP3 transcoding cleanly when ffmpeg is not on PATH (itIfFfmpeg guard).
- Drop the unused DTMF PCM-synth fallback now that capability-method coupling is enforced at the base class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
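The capability-method coupling described above can be illustrated with a small sketch. All class names other than `UnsupportedCapabilityError` are hypothetical, the error message is invented, and `sendDtmf` is kept synchronous for brevity; the real adapter base class may differ.

```typescript
class UnsupportedCapabilityError extends Error {
  constructor(capability: string) {
    super(`adapter declares "${capability}" but does not implement it`);
    this.name = "UnsupportedCapabilityError";
  }
}

interface Capabilities {
  dtmf: boolean;
}

// Hypothetical stand-in for the adapter base class.
class VoiceAdapterBase {
  capabilities: Capabilities = { dtmf: false };

  // Default implementation fails loudly instead of falling back silently.
  sendDtmf(_digits: string): void {
    throw new UnsupportedCapabilityError("dtmf");
  }
}

// Declares the capability and implements the method.
class HonestAdapter extends VoiceAdapterBase {
  capabilities: Capabilities = { dtmf: true };
  sent: string[] = [];

  sendDtmf(digits: string): void {
    this.sent.push(digits);
  }
}

// Declares the capability but forgets to override sendDtmf.
class ForgetfulAdapter extends VoiceAdapterBase {
  capabilities: Capabilities = { dtmf: true };
}

// The script step delegates straight to the adapter, with no synthesizer fallback.
function dtmf(adapter: VoiceAdapterBase, digits: string): void {
  adapter.sendDtmf(digits);
}
```

With this shape, a mismatch between the declared capability and the implemented method surfaces at the first `dtmf()` call rather than as silently synthesized tones.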
…s (PR6 of N)
Ports python/scenario/voice/effects/* to javascript/src/voice/effects/*:
- common.ts (EffectFn type, PCM16 <-> Int16Array helpers)
- noise.ts (backgroundNoise, static_, multipleVoices) + 5 bundled WAVs
- prosody.ts (lowVolume, highVolume, speakingFast, speakingSlow)
- quality.ts (phoneQuality via fft.js, lowQuality, packetLoss, echo, robotic, breakingUp)
- custom.ts (user-fn wrapper with type validation)
- index.ts barrel re-exporting static_ as static
Adds fft.js dep (FFT for phoneQuality bandpass). Updates tsup.config.ts to cpSync src/voice/assets to dist/voice/assets; package.json files includes src/voice/assets/** so WAVs ship in published npm package. Bundle delta ~132KB (5 x 24KB WAVs + LICENSES), under the 1MB budget.
Binds 5 scenarios in specs/voice-agents.feature with tag @ts-effects (per-subject tag, NOT @ts-bound, to avoid collision with PR #517's voice-contract-surface.test.ts that already owns @ts-bound; follows PR #528 convention from issue #523).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review fanout flagged:
- effects unreachable via voice namespace (voice/index.ts had no re-export)
- TS2802 on [...BACKGROUND_PRESETS].sort() (Set iteration)
- require('fft.js') with manual type cast + eslint suppression
- conjugate-symmetry mirror hand-rolled instead of fft.completeSpectrum()
- 3 near-identical linearResample loops across noise/prosody/quality
- double static_/static export (pick one for the public name)
Fixes:
- voice/index.ts: export * as effects from './effects'
- effects.test.ts: regression assertion via voice namespace import
- noise.ts: Array.from() instead of spread; use linearResample helper
- quality.ts: import FFT from 'fft.js'; fft.completeSpectrum(); linearResample x2
- prosody.ts: linearResample helper
- common.ts: new linearResample(arr, newLen): Int16Array
- effects/index.ts: drop bare static_ re-export, keep only static alias
- effects.test.ts: JSDoc note that on_turn Scenario binding is a unit-level
proxy for the runtime hook that lands in PR3 (#515)
pnpm -C javascript build: green
pnpm -C javascript test: 22 files / 392 tests pass
pnpm -C javascript typecheck: pre-existing TS1378 from PR #517 only; no
new errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
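The shared helper has the signature `linearResample(arr, newLen): Int16Array` per the list above. A minimal sketch of what such a helper can look like follows; the endpoint-aligned mapping and the rounding are assumptions, and the body in common.ts may differ.

```typescript
// Linear-interpolation resampler over PCM16 samples.
// Assumption: first and last input samples map onto the first and last outputs.
function linearResample(arr: Int16Array, newLen: number): Int16Array {
  const out = new Int16Array(Math.max(0, newLen));
  if (arr.length === 0 || newLen <= 0) return out;
  if (arr.length === 1 || newLen === 1) {
    out.fill(arr[0]);
    return out;
  }
  const step = (arr.length - 1) / (newLen - 1);
  for (let i = 0; i < newLen; i++) {
    const pos = i * step; // fractional position in the input
    const lo = Math.floor(pos);
    const hi = Math.min(lo + 1, arr.length - 1);
    const frac = pos - lo;
    out[i] = Math.round(arr[lo] * (1 - frac) + arr[hi] * frac);
  }
  return out;
}
```

For example, stretching `[0, 100]` to three samples yields `[0, 50, 100]`, and shrinking `[0, 10, 20, 30, 40]` to three samples yields `[0, 20, 40]`.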
…diom
Review nits from re-review of PR #537:
- public-API surface test asserted only 3 callables; iterate all 14 §4.5 effects so a missing barrel re-export fails fast.
- prosody._resampleFactor wrapped linearResample with int16ToPcm16 while quality.lowQuality used `new Uint8Array(buf.buffer)`. The clip in int16ToPcm16 is a no-op on Int16Array input, so use the zero-copy view in both places.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anded (PR7 of N)
PR7 of issue #372: the first real voice transport. Ports three Python adapters to TS and binds 7 scenarios in `specs/voice-agents.feature`.
What lands:
- `javascript/src/voice/adapters/elevenlabs.ts`: `ElevenLabsAgentAdapter`, the hosted ConvAI adapter. Connects to `wss://api.elevenlabs.io/v1/convai/conversation` via the `ws` package; PCM16/24kHz base64-over-JSON; full event handling (audio, ping, transcript, correction, init-metadata, interruption). Mirrors `python/scenario/voice/adapters/elevenlabs.py`.
- `javascript/src/voice/adapters/composable.ts`: `ComposableVoiceAgent` + `STTProvider` interface + `ElevenLabsSTTProvider` + inline `synthesize` helper (elevenlabs/ provider only; PR2 #513 supplies the rest). LLM is any ai-sdk `LanguageModel`. Mirrors `python/scenario/voice/adapters/composable.py`.
- `javascript/src/voice/adapters/eleven-labs-voice-agent.ts`: `ElevenLabsVoiceAgent`, the branded preset. Provider-typed options; defaults to `ElevenLabsSTTProvider` + `openai("gpt-5.4-mini")` + `elevenlabs/EXAVITQu4vr4xnSDxMaL` (Sarah, a free-tier premade voice); each piece independently overridable. `eleven_v3` TTS model hardcoded for paralinguistic-marker support (per Python tts.py:107 comment).
Tests:
- `javascript/src/voice/adapters/__tests__/elevenlabs.test.ts`: 5 unit scenarios bound via `describeFeature(..., { includeTags: [["unit", "ts-elevenlabs"]] })`.
- `javascript/examples/vitest/tests/voice/elevenlabs-hosted.test.ts`: 2 e2e scenarios env-gated on `ELEVENLABS_API_KEY` (+ `ELEVENLABS_AGENT_ID` for the hosted demo). Without keys, the suite cleanly skips.
Tag convention: `@ts-elevenlabs` (per-subject) rather than `@ts-bound`. Per the precedent from PRs #517 / #528 (`@ts-simulator`, `@ts-judge`, `@ts-assistant-role`), per-subject tags avoid the `checkUncalledScenario` collision with PR1's contract-surface test. See #523 for the tag-convention decision.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rotocol tests
Review pass on PR #536 surfaced four actionable concerns. Addressed:
- **#1 (blocking): `connect()` left WS without `error`/`close` handlers after `onOpen` called `removeAllListeners()`.** An unhandled `error` on a Node EventEmitter crashes the process. Re-attach `message` + `error` + `close` listeners atomically post-open. The new `error` handler nulls `this.ws` so subsequent `sendAudio`/`receiveAudio` fail fast instead of writing to a dead socket. Pending receivers drain to empty `AudioChunk` so the executor unwinds rather than hanging.
- **#2 (blocking): `onMessage` branches were untested.** Added 14 wire-protocol unit tests (plain vitest, not cucumber-bound) covering: base64 PCM16 decode, odd-byte trim invariant, audio queue/waiter FIFO, ping → pong with `event_id`, ping defensive (no `event_id` skip), `user_transcript` capture, `agent_response` capture, `agent_response_correction` override, format-drift warning, interruption + unknown event swallow, non-JSON frames ignored, post-open socket error drain, socket close drain, and `receiveAudio` timeout.
- **#3: Default LLM identifier was inlined in `eleven-labs-voice-agent.ts`, violating `voice-models.ts`'s self-declared single-source-of-truth contract.** Hoisted `COMPOSABLE_VOICE_LLM_MODEL` + `ELEVENLABS_DEFAULT_VOICE_ID` + `ELEVENLABS_TTS_MODEL` + `ELEVENLABS_STT_MODEL` into `voice-models.ts` (Python parity: `python/scenario/config/voice_models.py`). Adapters now import from there.
- **#6: `receiveAudio` referenced `waiter` from inside the timer body before its `const` declaration.** Worked by event-loop ordering; fragile to refactor. Forward-declared `let timer` and put `waiter` ahead of the `setTimeout` so the dependency graph is explicit.
Tests: 411 / 22 files passing (previously 397 / 22; +14 wire-protocol tests). Build: tsup CJS + ESM + DTS clean.
Deferred (intentional, tracked in PR body):
- #4/#5: inline `pcm16ToWavBytes` + `synthesize` helpers. Duplicate-by-design with PR2 (#513); merge-order constraint.
- #7: `turnOutputEmitted` latch contract with PR3 executor. Surface in PR3 review.
- #8: distinguish natural end-of-turn from socket close. Design-level, needs PR3 design conversation.
- #9: `featurePath()` helper. Extract once a 3rd test file would duplicate the climb.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
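Three of the `onMessage` branches listed above (ping → pong with `event_id`, skipping a ping that has no `event_id`, and ignoring non-JSON frames) can be sketched as one pure function. The frame shapes used here (`ping_event.event_id` on the inbound ping, `event_id` on the outbound pong) are assumptions about the ConvAI wire format and should be checked against the provider's documentation; `pongFor` is a hypothetical name.

```typescript
// Assumed outbound frame shape; not confirmed by the commit message.
interface Pong {
  type: "pong";
  event_id: number;
}

// Returns the pong to send for a raw text frame, or null when no reply is due.
function pongFor(rawFrame: string): Pong | null {
  let msg: unknown;
  try {
    msg = JSON.parse(rawFrame);
  } catch {
    return null; // non-JSON frames are ignored
  }
  if (typeof msg !== "object" || msg === null) return null;

  const frame = msg as { type?: unknown; ping_event?: { event_id?: unknown } | null };
  if (frame.type !== "ping") return null;

  // Defensive: a ping without a numeric event_id gets no reply.
  const eventId = frame.ping_event?.event_id;
  if (typeof eventId !== "number") return null;

  return { type: "pong", event_id: eventId };
}
```

Keeping the decision in a pure function like this is one way to unit test the branch without a socket; the adapter would then only serialize and send the returned frame.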
…r roles) (PR8 of N)
Port `python/scenario/voice/adapters/openai_realtime.py` to TypeScript at
`javascript/src/voice/adapters/openai-realtime.ts`. The adapter owns the
OpenAI Realtime wire protocol directly — the model IS the agent under
test (`role=AgentRole.AGENT`) or the voice-enabled user simulator
(`role=AgentRole.USER`, per §7.2 L1164-1171).
User-role critical path: scripted `user("text")` lines call `sendText`,
which emits `conversation.item.create` (`input_text` content) +
`response.create` directly. TTS is bypassed — the realtime model owns
prosody synthesis.
Wire-protocol behavior:
- WSS to `wss://api.openai.com/v1/realtime?model=<model>` via `ws`
- `session.update` post-connect (pcm16/24000 in/out, voice, instructions,
tools, server-side VAD off so we own turn boundaries)
- `sendAudio` → `input_audio_buffer.append` (deferred commit)
- `receiveAudio` → commit + response.create on first call, loops over
events until `response.audio.delta`; transcript deltas update
`lastAgentTranscript`, Whisper user transcripts update
`lastUserTranscript`
- `interrupt()` → `response.cancel` (first-class interrupt per §5.6)
Scenarios bound (`specs/voice-agents.feature`):
- @Unit @ts-openai-realtime — agent connect + user-simulator wiring
- @e2e @ts-openai-realtime-agent-demo — live agent-role round-trip
- @e2e @ts-openai-realtime-user-demo — live user-simulator with sendText
Per-subject tags avoid collision with PR1's `voice-contract-surface.test.ts`
which uses `includeTags: ["ts-bound"]` (single-axis OR). Dual-axis filters
`[["unit", "ts-openai-realtime"]]` keep unit binding tight.
Tests:
- `javascript/src/voice/adapters/__tests__/openai-realtime.test.ts` — 2
@Unit scenarios driven against an in-process `ws` server (asserts
wire-protocol shape, transcript accumulation, response.cancel,
capability matrix). 7 step assertions pass.
- `javascript/examples/vitest/tests/voice/openai-realtime-agent.test.ts`
— agent-role e2e demo, env-gated on `OPENAI_API_KEY` via
`Scenario.skip`.
- `javascript/examples/vitest/tests/voice/openai-realtime-user.test.ts`
— user-role e2e demo proving `sendText` is the TTS-free path.
Dependencies:
- Adds `ws` 8.20.1 + `@types/ws` 8.18.1 to the javascript workspace
(Realtime WSS transport).
/browser-qa-against-prod evidence env-gated: `OPENAI_API_KEY` UNSET in
the grinder's environment so e2e demos report as skipped. CI gate runs
them when the secret is configured.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tools, sync disconnect)
Surfaced by /review skill (PR #535):
- **Sync disconnect:** `disconnect()` now eagerly rejects any in-flight `receiveAudio` waiter and flushes the event queue instead of relying on the async `close` handler. Prevents waiters from blocking past the close and stale-queued events from leaking into the next session.
- **API key validation:** `connect()` throws a named diagnostic when no key is set, instead of letting the request surface as a generic WebSocket 401.
- **`url` init knob:** `OpenAIRealtimeAgentAdapterInit.url` lets tests point at a loopback WS server without subclassing the adapter. The unit test now constructs the adapter directly; the `TestAdapter` subclass is gone.
- **Structural tool type:** `tools: unknown[]` → `RealtimeToolDef[]` (exported), so call-site typos surface at compile time. Sets the template for the four remaining adapter ports.
- **Single timeout site:** dropped the unreachable outer-loop deadline check in `receiveAudio`; `_nextEvent` already arms a per-iteration timer that fires the same error.
- **PCM16 truncate removed:** the AudioChunk constructor already enforces even-byte invariant; adapter-side truncation was belt-and-suspenders that would hide an upstream codec bug.
- **E2E agent demo:** moved the `expect(chunk).toBeInstanceOf(AudioChunk)` assertion from `When` into `Then` where it belongs.
Deferred (out-of-scope or PR3 territory):
- Logger surface for non-JSON frame drops (Python emits `logger.debug`; TS port has no logger yet, so file when the SDK introduces one).
- `responseTimeout` / `responseTailSilence` / `responseMaxDuration` are inherited from `VoiceAgentAdapter` but inert until PR3 wires the executor. PR3 must consume them.
Gates re-validated: build green (CJS + ESM + DTS), 383/383 tests pass, eslint clean on touched files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failure root cause: `AudioChunk`, `OpenAIRealtimeAgentAdapter`, `OPENAI_REALTIME_MODEL`, `silentChunk` are exposed at the package root via `export * as voice from "./voice"`; they're NOT named exports on the root barrel. Direct named imports resolved to `undefined`, so `expect(firstChunk).toBeInstanceOf(AudioChunk)` saw `undefined` and `new OpenAIRealtimeAgentAdapter(...)` was a `TypeError`.
Switched both e2e demos to destructure from the `voice` namespace and narrowed the local type aliases to `voice.AudioChunk` / `voice.OpenAIRealtimeAgentAdapter`. Unit tests are unaffected: they import from the local `../../index` re-export and never see the package root.
CI was running the e2e demos because `OPENAI_API_KEY` IS configured in the CI env. Locally the same path skips (key unset). The skip-path test exit was a false positive; the actual binding consistency check needed the run path to fire.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s it)
CI surfaced the real issue: the OpenAI Realtime endpoint at `wss://api.openai.com/v1/realtime` is now GA and rejects the `OpenAI-Beta: realtime=v1` opt-in with:
The Realtime Beta API is no longer supported. Please use /v1/realtime for the GA API.
We were sending the header per Python parity (`python/scenario/voice/adapters/openai_realtime.py`); the GA migration deprecates it. Dropped the header and updated the file-level docstring to document the choice.
Python parity is intentionally broken here: the Python adapter still sends the Beta header and will hit the same error. Track for back-port to keep the two SDKs aligned.
Local: 383/383 unit tests pass, build green. CI re-run pending; e2e demos should now connect successfully against the GA endpoint.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI surfaced "Missing required parameter: 'session.type'" after the
Beta-header drop — the GA Realtime API restructured the session config
significantly (per RealtimeSessionCreateRequest in openai-node
realtime.ts).
Migrated session.update payload:
- session.type: "realtime" (required discriminator)
- session.model: passes the model id explicitly
- audio formats moved under session.audio.{input,output}.format as
{ type: "audio/pcm", rate: 24000 } objects
- voice moved under session.audio.output.voice
- transcription + turn_detection nested under session.audio.input
Unit test wire-shape assertions updated to match. Old shape fields
(input_audio_format, output_audio_format, top-level voice, top-level
turn_detection) are gone; the assertions now look at
audio.input.format, audio.output.voice, etc.
Python parity is intentionally broken here — the GA migration deprecates
the wire surface Python uses. Track for back-port to keep the SDKs
aligned. The Python adapter will hit the same error against the live
endpoint.
Local: 383/383 unit tests pass, build green (CJS + ESM + DTS).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
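The migrated payload described in this commit can be sketched as a small builder. Only the field placement comes from the commit message; the function name `buildSessionUpdate`, the option names, and the shape of the `transcription` object (a `model` field) are assumptions, and `turn_detection: null` reflects the "server-side VAD off" choice stated for this adapter.

```typescript
// Hypothetical options type; the adapter's real init type is not shown in the source.
interface SessionOptions {
  model: string;
  voice: string;
  instructions?: string;
  transcriptionModel?: string; // assumed shape for the transcription config
}

const PCM_24K = { type: "audio/pcm", rate: 24000 } as const;

// Builds a GA-style session.update frame with audio settings nested under session.audio.
function buildSessionUpdate(opts: SessionOptions) {
  return {
    type: "session.update",
    session: {
      type: "realtime", // required discriminator
      model: opts.model,
      instructions: opts.instructions,
      audio: {
        input: {
          format: PCM_24K,
          transcription: opts.transcriptionModel ? { model: opts.transcriptionModel } : undefined,
          turn_detection: null, // server-side VAD off; the adapter owns turn boundaries
        },
        output: {
          format: PCM_24K,
          voice: opts.voice,
        },
      },
    },
  };
}
```

Fields left `undefined` are dropped by `JSON.stringify`, so optional settings do not appear on the wire unless supplied.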
Two CI issues after the GA wire-shape migration:
1. **Voice 'nova' is Beta-era, GA rejects it.** Supported voices are alloy/ash/ballad/coral/echo/sage/shimmer/verse/marin/cedar. Switched the user-role demo to `marin` (OpenAI's recommended modern voice). The BDD scenario text still names "nova"; that documents Python's parity intent, while the test picks a valid GA voice.
2. **Agent-role demo deadlocks on silentChunk.** Sending 0.5s of silence to a Realtime session with `turn_detection: null` doesn't trigger the model; receiveAudio(20) times out and `chunk` stays null. The unit scenarios already prove the audio round-trip via a mock WS. The e2e demo's job is to prove live-endpoint connectivity, so rewrote it as a smoke test:
- connect (GA handshake + session.update accepted)
- interrupt (response.cancel round-trips against the live wire)
- disconnect
The Then assertion now verifies connectError is null and the capability matrix is published: wire health, not a model response. PR3 will drive real speech audio through the executor.
Local: 383/383 unit tests pass.
CI: receiveAudio timed out after 81s on the user-role e2e demo.
Root cause: GA renamed the streaming output events:
Beta → GA
response.audio.delta → response.output_audio.delta
response.audio.done → response.output_audio.done
response.audio_transcript.delta → response.output_audio_transcript.delta
response.audio_transcript.done → response.output_audio_transcript.done
The Beta names are no longer emitted by the live endpoint, so the receive loop never saw an audio frame. Updated the event matcher to accept both names. The new GA name wins on the live endpoint; the Beta alias keeps the existing unit tests (which push the legacy event names) working without churn, and makes back-port to any Beta-era endpoint trivial.
Local: 383/383 tests pass.
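One way to accept both event names is to normalize every incoming type to its GA spelling before matching. The sketch below uses the four renames listed in this commit; the helper names `canonicalEventType` and `isAudioDelta` are hypothetical, and the adapter's actual matcher may be structured differently.

```typescript
// Beta-era event names mapped to their GA replacements (pairs taken from the commit message).
const BETA_TO_GA = new Map<string, string>([
  ["response.audio.delta", "response.output_audio.delta"],
  ["response.audio.done", "response.output_audio.done"],
  ["response.audio_transcript.delta", "response.output_audio_transcript.delta"],
  ["response.audio_transcript.done", "response.output_audio_transcript.done"],
]);

// Returns the GA name for a Beta alias, or the input unchanged for any other event.
function canonicalEventType(type: string): string {
  return BETA_TO_GA.get(type) ?? type;
}

// True for an audio frame under either naming scheme.
function isAudioDelta(type: string): boolean {
  return canonicalEventType(type) === "response.output_audio.delta";
}
```

Normalizing once at the top of the receive loop keeps the rest of the handler written against a single set of names.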
Ports python/scenario/voice/adapters/gemini_live.py → javascript/src/voice/adapters/gemini-live.ts using @google/genai (the new SDK; @google/generative-ai is the deprecated package).
- GeminiLiveAgentAdapter with capabilities matrix (streaming transcripts, native VAD, interruption, pcm16/16000 in, pcm16/24000 out)
- PCM16 24kHz↔16kHz resampler in pure JS (linear interpolation, no scipy)
- Callback-to-queue bridge mapping the SDK's onmessage callback onto an awaitable receiveAudio(timeout) contract
- @google/genai declared as optional peer dep; lazy-imported on connect() so the SDK ships without a hard Gemini coupling
- 2 @Unit scenarios (connect, capabilities matrix) bound via vitest-cucumber + 1 @e2e demo scenario (env-gated on GEMINI_API_KEY/GOOGLE_API_KEY)
Refs #372.
…f N)
Ports python/scenario/voice/adapters/{pipecat.py,_twilio_shared.py} to
TypeScript so voice scenarios can target a running Pipecat bot over the
Twilio Media Streams WS protocol. WebRTC transport is deferred and
raises PendingTransportError at connect() time.
New files
- src/voice/adapters/twilio-shared.ts — g711 µ-law 8 kHz ↔ PCM16 24 kHz
codec + 24k/8k linear-interpolation resampler + Twilio Media Streams
frame parser/builders. Reused by the upcoming TS Twilio adapter (PR11).
- src/voice/adapters/pipecat.ts — PipecatAgentAdapter speaking the
synthetic connected/start handshake, 20 ms µ-law media frames, clear
for first-class interrupt, mark "utterance_end" as end-of-turn signal.
- src/voice/adapters/pending-transport-error.ts — shared deferred-
transport error class (parity with python _stub.PendingTransportError).
- src/voice/adapters/__tests__/twilio-shared-codec.test.ts — binds the
two @ts-codec scenarios (round-trip fidelity + sample-rate conversion)
plus plain-vitest edge-case tests.
- src/voice/adapters/__tests__/pipecat.test.ts — binds the three
@ts-pipecat scenarios (WS round-trip, WebRTC PendingTransportError,
clear-buffer interrupt) against a synchronous fake WebSocket.
Capabilities advertised
streamingTranscripts=true, nativeVad=true, dtmf=false,
interruption=true, input/outputFormats=[pcm16/24000, mulaw/8000].
Notes for reviewers
- 5 feature-file scenarios are bound (2 retagged, 3 new). Tag axis is
@ts-pipecat / @ts-codec to match the @ts-<adapter> precedent set by
PR #535 (OpenAI Realtime) and PR #536 (ElevenLabs).
- /browser-qa-against-prod is env-gated on SCENARIO_PIPECAT_QA_WS_URL.
CI does not set the var; documented under "/browser-qa note" in the
PR body. No script ships in this PR — adding one would require a
user-owned bot endpoint we don't have.
- `ws` 8.20.1 + @types/ws 8.18.1 added as deps (matches PR #535).
- tsconfig.target=ES2022 added (matches PR #535).
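As a rough illustration of the g711 µ-law leg of the codec in twilio-shared.ts, here is a sketch of the standard G.711 µ-law companding pair (bias 0x84, clip 32635). The function names are hypothetical, and the 24 kHz ↔ 8 kHz resampling step is left out.

```typescript
const MULAW_BIAS = 0x84; // 132, standard G.711 µ-law bias
const MULAW_CLIP = 32635;

/** Encodes one signed 16-bit PCM sample as an 8-bit µ-law byte. */
function linearToMulaw(sample: number): number {
  const sign = (sample >> 8) & 0x80;
  let value = sign !== 0 ? -sample : sample;
  if (value > MULAW_CLIP) value = MULAW_CLIP;
  value += MULAW_BIAS;
  // Find the segment (exponent): position of the highest set bit above bit 7.
  let exponent = 7;
  let mask = 0x4000;
  while (exponent > 0 && (value & mask) === 0) {
    exponent--;
    mask >>= 1;
  }
  const mantissa = (value >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

/** Decodes one µ-law byte back to a signed 16-bit PCM sample. */
function mulawToLinear(byte: number): number {
  const inverted = ~byte & 0xff;
  const sign = inverted & 0x80;
  const exponent = (inverted >> 4) & 0x07;
  const mantissa = inverted & 0x0f;
  const magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS;
  return sign !== 0 ? -magnitude : magnitude;
}
```

At 8 kHz with one byte per sample, a 20 ms media frame is 160 bytes, which is why µ-law round-trips are lossy but compact.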
…cases
Addresses 5 review concerns (review #540 synthesizer pass):
- #1 perf: receive-side mulaw buffer now stores Uint8Array slices, not number[]; bufferMulaw is O(1) per call instead of O(n) per byte.
- #2 docs: coerceFrameToText's 0x7b/0x5b heuristic is now documented as a known rare-collision risk (binary µ-law with first byte == { or [ would mis-route to JSON parser and silently drop).
- #4 test pyramid: round-trip scenario re-tagged @Unit (FakeWebSocket = no network); real-WSS @integration demo deferred behind env-gated bot endpoint per /browser-qa note.
- #5 coverage: 2 new edge-case tests for partial-buffer flush on bot-sent `stop` event and on socket-close.
Not addressed in this PR (filed as follow-up considerations):
- #3 vestigial audioFormat/sampleRate fields (inherited from Python parity)
- #6 DTMF/E.164 validation regex port (pre-requisite for PR11 Twilio)
- #8 extract TwilioMediaStreamsTransport helper (PR11 prep)
- #9 JSON-frame size cap (no regression vs main; same constraint as Python)
- #10 FakeWebSocket vs node:events (cosmetic)
…1 of N)
Ports python/scenario/voice/adapters/{twilio,_twilio_server,_twilio_shared}.py
to TypeScript:
- `twilio-shared.ts` — µ-law/PCM16 codec (8 kHz ↔ 24 kHz resample inline,
no `audioop` in Node), Media Streams JSON frame parser/builders, E.164
+ DTMF validators, minimal Twilio REST client over fetch (no `twilio`
npm SDK), HMAC-SHA1 signature verification.
- `twilio.ts` — `TwilioAgentAdapter` extending `VoiceAgentAdapter`.
Capabilities: `inputFormats: ["mulaw/8000"]`, `outputFormats: ["mulaw/8000"]`,
`interruption: true` (clear-buffer event), `dtmf: true`. Implements
`placeCall`, `waitForCall`, `sendAudio`, `receiveAudio`, `sendDtmf`,
and `interrupt`.
- `twilio-server.ts` — local HTTP + WS server (node `http` + `ws`) that
impersonates Twilio's media-stream endpoint. Binds on an OS-assigned
port (no hard-coded 8765). TwiML route returns `<Connect><Stream>` with
the stream URL XML-escaped; signature gate fails closed.
- `twilio-tunnel.ts` — wraps `@ngrok/ngrok` (preferred) with a
`localtunnel` fallback. Both are dynamic-imported as optional peer
deps so they don't bloat the runtime bundle.
Scenarios bound in `specs/voice-agents.feature` via vitest-cucumber:
- `@integration @ts-bound @ts-twilio-proto` x3 — capabilities, JSON
protocol parser, clear-buffer interrupt (twilio.test.ts).
- `@integration @ts-bound @ts-twilio-server` x2 — TwiML response shape +
XML-escape, signature rejection (twilio-server.test.ts).
- `@e2e @ts-bound @ts-twilio-tunnel` x1 — tunnel exposes local server.
Env-gated on NGROK_AUTHTOKEN (twilio-tunnel.test.ts).
Boy scout fixes in the same commit:
- `tsconfig.json` — added `target: "ES2022"` so `tsc --noEmit` accepts
top-level await + iterators. Without this, `pnpm typecheck` is broken
on `main` post #517 (the @ts-bound retrofit shipped top-level await
but didn't update the target).
- `voice-contract-surface.test.ts` — narrowed `includeTags` from
`["ts-bound"]` to `[["ts-bound", "ts-contract-surface"]]`. The
retrofit's broad filter was destined to over-include any future
`@ts-bound` scenario (PR-B/C/etc.); my Twilio scenarios surfaced the
bug. Re-tagged the five contract-surface scenarios accordingly.
- `package.json` — added `ws@^8.20.1` runtime dep + `@types/ws` devDep.
Hazards documented in PR body:
- PR10 (Pipecat g711) hadn't pushed at branch time, so PR11 owns
`twilio-shared.ts`. When PR10 lands, the two files reconcile (same
module name and surface area).
- `@ngrok/ngrok` is a heavy native dep — kept optional and dynamic-
imported so CI machines without NGROK_AUTHTOKEN don't pull it.
- Tunnel test is env-gated; CI does not exercise it.
Refs #372.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
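To illustrate the XML-escaped `<Connect><Stream>` response mentioned for twilio-server.ts, a minimal sketch follows. The helper names `escapeXml` and `buildConnectStreamTwiml` are hypothetical, and the real route may emit additional attributes or elements.

```typescript
/** Escapes the five XML special characters so a value can sit inside an attribute. */
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

/** Builds a minimal TwiML document that connects the call to a media stream. */
function buildConnectStreamTwiml(streamUrl: string): string {
  return (
    '<?xml version="1.0" encoding="UTF-8"?>' +
    `<Response><Connect><Stream url="${escapeXml(streamUrl)}" /></Connect></Response>`
  );
}
```

Escaping matters here because tunnel URLs often carry query strings, and an unescaped `&` would make the TwiML document malformed.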
…i adapter
The TS Gemini Live adapter previously detected the spurious
'interrupted → turnComplete' pair after a barge-in and 'continue'd the
dequeue loop. Worked in isolation, but lost the race in the live
executor: the cancelled-turn boundary arrived after the bounded
interrupt() drain returned, so the demo compensated with two
scenario.agent() calls.
Port Python's session.receive() iterator-restart pattern (in-place):
on spurious-pair detection, reset detection state AND extend the
receiveAudio() deadline by SPURIOUS_PAIR_RECOVERY_MS (10s) so delayed
recovery audio is captured within the same call.
+1 unit test ('extends the deadline after the spurious pair') locks
in the fix — 600ms delay overflows the original 500ms budget but lands
within the extended 10s window.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…io.agent() The two-scenario.agent() workaround was the demo-level compensation for the timing race fixed in the prior commit. With the iterator-restart deadline-extension in receiveAudio(), one agent() call now captures the recovery within the same call. Mirrors python/examples/voice/gemini_live_interruption.py exactly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e scenario.agent()" This reverts commit 0dcf1fc.
… UserSimulatorAgentWithVoice
- New canonical file: javascript/src/domain/agents/agent-shapes.ts
- Uses generic { tts?: { voice?: string } } instead of VoiceConfig import
so the file stays import-free of the voice layer (domain owns shapes,
voice layer owns transport capabilities)
- Adds UserSimulatorAgentWithVoice derived narrowed type (voice: string,
not optional) so executors holding a isVoiceUserSim()-guarded reference
get type-safe access to voice without a secondary null-check
- domain/agents/index.ts re-exports the new shapes (isRealtimeUserAgent,
isVoiceUserSim, RealtimeUserAgent, VoiceUserSimulator,
UserSimulatorAgentWithVoice)
- voice/agent-shapes.ts becomes a @deprecated re-export shim pointing at
the new canonical location — keeps old import paths working without churn
- scenario-execution.ts updated to import from domain/agents/agent-shapes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
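A simplified sketch of the narrowing idea: a type guard whose success gives callers a non-optional `voice`. The interfaces below are hypothetical stand-ins (the real shapes carry more fields), and the guard body is an assumption about how `isVoiceUserSim()` might decide.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface UserSimulatorLike {
  role: "User";
  voice?: string;
}

// Derived narrowed type: voice is required, mirroring the intent of
// UserSimulatorAgentWithVoice described above.
type UserSimulatorWithVoice = UserSimulatorLike & { voice: string };

/** Type guard: true only when a non-empty voice is configured. */
function isVoiceUserSim(agent: UserSimulatorLike): agent is UserSimulatorWithVoice {
  return typeof agent.voice === "string" && agent.voice.length > 0;
}

function describeVoice(agent: UserSimulatorLike): string {
  if (isVoiceUserSim(agent)) {
    return agent.voice.toUpperCase(); // no secondary null-check needed here
  }
  return "TEXT-ONLY";
}
```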
Items addressed (items 1 and 6 skip no-touch files recording.types.ts, gemini-live.ts):
- Item 2: collapse pipecat interrupted+discardingInboundAudio into single interruptPhase: "idle"|"interrupted". The two flags were always set and cleared in lockstep; unifying eliminates the risk of one advancing while the other lags. Both consumer points (receiveAudio drain gate, bufferMulaw frame-discard gate) now check interruptPhase === "interrupted".
- Item 3: drop redundant bargeInDelayMs && prefix in fireUserInterrupt. TypeScript requires a nullish-coalescing approach (?? 0) since the field is number|undefined; simplified to (bargeInDelayMs ?? 0) > 0.
- Item 4: justify the Math.floor asymmetry between the two delaySeconds call sites. maybeScheduleInterruptedAgentTurn stores the value as integer ms (floor intentional); maybeInjectInterruption passes directly to setTimeout (fractional ms fine, Node clamps). Added a comment explaining the asymmetry so it no longer reads as a forgotten floor.
- Item 5: trim interrupt() JSDoc from multi-paragraph essay to enumerated side-effects list. The method name conveys intent; the side-effects list is the useful reference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…med substeps
Reduces the ~115-line monolith to a ~20-line orchestrator by extracting
three private methods:
- resolveNextAgentForInlineBarge(): walks pendingRolesOnTurn to verify
AGENT is the next runnable role and returns { idx, agent } or null.
Replaces the inline bail/lookup phase (steps 1+guard checks).
- consumePendingRolesUntilAgent(agent): removes AGENT from
pendingAgentsOnTurn and shifts pendingRolesOnTurn up to and including
the AGENT slot so the next _step advances to JUDGE cleanly. Mirrors
Python's while-loop.
- dispatchAgentBackground(idx): constructs the non-blocking task entry
(adds .then(()=>undefined) to coerce callAgent's Promise<ScenarioResult|null>
to Promise<void>), sets pendingAgentTask, and returns the entry for
caller inspection. Mirrors Python's asyncio.create_task().
- prepareAndFireBargeIn(config, voiceUserSim, entry): samples delay,
TTS phrase, fires the inline barge-in, records the user message.
Adds 6 targeted unit tests in proceed-interruptions.test.ts covering the
main branches of each substep (resolveNextAgentForInlineBarge returns
null when USER comes first / when queue is empty;
consumePendingRolesUntilAgent pops up to AGENT+leaves JUDGE;
dispatchAgentBackground sets pendingAgentTask and flips done=true).
Tests: 775 pass / 1 skipped (up from 769).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ent for barge-in reuse
- New file javascript/src/voice/utils.ts with export function sleep(ms): single canonical Promise-based sleep (ms <= 0 resolves immediately, no timer allocation). Issue #576 smell 1.
- pipecat.ts: imports sleep from ../utils; removes the local copy.
- voice-steps.ts: imports sleepMs from ../voice/utils; removes the local delay() function; all 5 delay() call sites delegate to sleepMs().
- adapter.runtime.ts: appendEvent() promoted from module-private to exported, with JSDoc. Single canonical writer for voice timeline events (push to voiceTimeline + voiceRecording.timeline + onVoiceEvent hook). Issue #576 smell 2.
- scenario-execution.ts:
  - imports appendEvent from adapter.runtime
  - imports sleep from voice/utils
  - 3 inline setTimeout-based sleeps replaced with sleep(ms)
  - 2 inline 3-part push/timeline/hook sequences replaced with appendEvent(this, event), which prevents event-shape drift by centralising the write in one place
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
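The shared sleep helper described above could be as small as the following sketch. It follows the stated contract (non-positive durations resolve immediately with no timer), but the exact body in voice/utils.ts is an assumption.

```typescript
/** Promise-based sleep; non-positive durations resolve without allocating a timer. */
function sleep(ms: number): Promise<void> {
  if (ms <= 0) return Promise.resolve();
  return new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });
}
```

Skipping the timer for `ms <= 0` means call sites can pass a computed delay straight through without first guarding against zero or negative values.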
…under interruptOverrides bag
Replaces three formerly scattered @internal public fields with a single named gateway:
  exec.interruptOverrides = { rng: () => 0 }
Fields in the bag:
- `rng?: () => number`: replaces the public interruptRng field (now a private getter that reads interruptOverrides.rng ?? Math.random). Tests no longer need `as unknown as { interruptRng: () => number }`.
- `waitForSpeechMs?: number`: seam for fireUserInterrupt's speech-wait bound (runtime path still uses the per-barge-in interruptWaitForSpeechMs field set by the interrupt() step; override bag is a fallback).
- `bargeInDelayMs?: number`: seam for the post-speech barge-in delay (runtime path still uses the per-barge-in interruptBargeInDelayMs field set by prepareAndFireBargeIn; override bag is a fallback).
Test updates:
- proceed-interruptions.test.ts: 4 interruptRng cast sites → interruptOverrides
- proceed-interrupt.test.ts: 1 interruptRng cast site → interruptOverrides
- proceed-interrupt-errors.test.ts: 2 interruptRng + 1 interruptBargeInDelayMs cast sites → interruptOverrides / direct field access (field is still public)
Zero `as unknown as { interruptRng: ... }` casts remain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
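A reduced sketch of the override-bag seam, assuming a stand-in class. `ExecutorSketch` and `shouldInterrupt` are hypothetical; only the bag's field names and the `rng ?? Math.random` fallback come from the commit message.

```typescript
interface InterruptOverrides {
  rng?: () => number;
  waitForSpeechMs?: number;
  bargeInDelayMs?: number;
}

// Hypothetical stand-in for the executor; the real class is much larger.
class ExecutorSketch {
  interruptOverrides: InterruptOverrides = {};

  /** Private getter: tests inject rng through the bag instead of casting. */
  private get interruptRng(): () => number {
    return this.interruptOverrides.rng ?? Math.random;
  }

  /** Hypothetical consumer: decides whether a barge-in fires at this probability. */
  shouldInterrupt(probability: number): boolean {
    return this.interruptRng() < probability;
  }
}
```

A test can then pin the outcome with `exec.interruptOverrides = { rng: () => 0 }` and no type cast.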
Closes #582 (most sub-items). Deferred: cannedPhrases derivation from InterruptionConfig, owned by parallel #583 work on random-interruptions.test.ts.
- PII-log redaction: gemini-live-interruption.test.ts recoveryTranscript log now reports length-only and is gated to the fired_after_speech branch
- WAV/LFS policy: decision captured in TESTING.md (trigger at 50 MB)
- Timeout tightening: proceed-interrupt.test.ts tests reduced 30 s → 5 s
- Behavioral test name: renamed mechanism-describing test to outcome-describing
- Factory test relocation: "exposes the value" moved to new agents/__tests__/user-simulator-agent.factory.test.ts
- beforeEach reset: gemini-live.test.ts standalone describe gains beforeEach(() => { captured.last = null }) and drops per-test manual resets
- bargeInDelayMs > 0 branch: two new focused tests in proceed-interrupt.test.ts
- Serializer round-trip: two new tests in recording.runtime.test.ts cover transcript_truncated=true and absence branches in saveSegments manifest
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ioPlayback })
Mirrors Python's ffmpeg/PyAudio approach: when audioPlayback is true,
each agent/user audio chunk is fanned out to a per-run AudioPlaybackSink
alongside the recording. Headless CI gracefully no-ops when no audio
device is available.
Re-activates the configure({ audioPlayback }) knob that was stored-but-not-
yet-consumed — it is no longer dead.
Implementation:
- new javascript/src/voice/playback.ts: AudioPlaybackSink class using the
bundled ffmpeg-static binary piping PCM16 to platform audio driver
(ALSA/AudioToolbox/DirectShow). Degrades gracefully: error or non-zero
exit from ffmpeg emits one console.warn and no-ops sendChunk.
- VoiceExecutorState: add audioPlaybackSink optional field.
- adapter.runtime.ts fireAudioChunk: fan-out 2 — sends to sink alongside
the existing onAudioChunk hook.
- scenario-execution.ts: construct sink when audioPlayback === true (per-run
voice config wins over global configure() per ADR-002); close after
stopVoiceAdapters in finally.
- configure.ts: update doc comment — no longer "stored-but-not-yet-consumed".
Tests:
- playback.test.ts: 8 unit tests for AudioPlaybackSink (subprocess mock) +
2 executor-wiring tests (spawn called iff audioPlayback: true).
Device-bound caveat: cannot live-verify the sink without an audio device —
the unit tests use a mocked subprocess. E2E verification requires a host
with ALSA/AudioToolbox/DirectShow.
Closes #585.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…in playback.test
Two regressions from 05f9f4c:
- playback.ts:95,109 accessed proc.stdin/proc.stderr without null guards; child_process.spawn types those streams as nullable (Writable|null for stdin, Readable|null for stderr), and they are null when stdio is not configured as 'pipe'. Added guards with warn-once graceful degrade.
- playback.test.ts hit vi.mock hoisting: the mock factory referenced a module-level mockProc declared AFTER the vi.mock() call (which vitest hoists to the top). Switched to vi.hoisted() so mockProc is available at hoist-time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ttern GeminiLiveAgentAdapter.dequeue() had a single-consumer resolveNext slot. When interrupt() called dequeue() concurrently with an in-flight receiveAudio() (during a barge-in), the second caller overwrote the first caller's resolver → the in-flight receiveAudio() timed out with TimeoutError → drainAgentResponse caught and broke → interrupted segment ended truncated but no recovery captured. Switch interrupt() to a non-competing abort-sentinel pattern: set a _interruptPending flag + wake any in-flight dequeue with the sentinel. receiveAudio()'s loop checks the flag at each iteration and returns the cut-off sentinel promptly. interrupt() no longer competes on the queue. New unit test locks in the fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
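The abort-sentinel idea can be sketched with a small single-consumer queue. This is an illustrative reduction, not the adapter's code: `FrameQueue`, `CUT_OFF`, and the method names are hypothetical, and timeouts are omitted.

```typescript
const CUT_OFF = { cutOff: true } as const; // sentinel returned when a turn is cut short
type Dequeued = Uint8Array | typeof CUT_OFF;

class FrameQueue {
  private frames: Uint8Array[] = [];
  private waiter: ((item: Dequeued) => void) | null = null; // single-consumer slot
  private interruptPending = false;

  push(frame: Uint8Array): void {
    if (this.waiter) {
      const wake = this.waiter;
      this.waiter = null;
      wake(frame);
    } else {
      this.frames.push(frame);
    }
  }

  /** Only receive() parks here, so the waiter slot is never overwritten. */
  private dequeue(): Promise<Dequeued> {
    const next = this.frames.shift();
    if (next !== undefined) return Promise.resolve(next);
    return new Promise<Dequeued>((resolve) => {
      this.waiter = resolve;
    });
  }

  /** Never dequeues: raises the flag and wakes any parked consumer with the sentinel. */
  interrupt(): void {
    this.interruptPending = true;
    if (this.waiter) {
      const wake = this.waiter;
      this.waiter = null;
      wake(CUT_OFF);
    }
  }

  async receive(): Promise<Dequeued> {
    if (this.interruptPending) {
      this.interruptPending = false;
      return CUT_OFF;
    }
    const item = await this.dequeue();
    if (!(item instanceof Uint8Array)) this.interruptPending = false;
    return item;
  }
}
```

Because `interrupt()` only signals, an in-flight `receive()` is woken with the sentinel instead of having its resolver replaced.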
…cut-off Now that the Gemini Live adapter's dequeue() concurrency race is fixed (prior commit), the barge-in properly cuts mid-stream + recovery is captured. Switch random_interruptions from Pipecat (burst-streams TTS, defeating post-arrival cancel) to Gemini Live (realtime-streaming + server-side cancel + now-fixed adapter). Re-adds the median-shorter assertion (ratio < 0.8): the truncated segment is now meaningfully shorter than the median agent reply. Closes #583 — pending live verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or real cut-off" This reverts commit 21376a6.
From the post-grind /review pass:
- CORRECTNESS: playback.ts close() hung if subprocess exited early. Set this._proc = null in exit handler (mirroring error handler) so the if (!this._proc) guard in close() short-circuits. +1 regression test. Also: proc.on → proc.once in close() to stop leaking listeners.
- HYGIENE: twilio.ts was the 5th sleep callsite the #576 consolidation missed; migrated to shared sleep() from voice/utils.
- HYGIENE: 2 test files (proceed-interrupt-errors, proceed-interrupt) used inline new Promise(setTimeout) instead of shared sleep().
- SECURITY: gemini-live.ts error message bound JSON.stringify(goAway) to 300 chars to prevent unbounded log growth from large server struct.
- TEST: proceed-interrupt-errors.test.ts 30s timeouts tightened to 5s (actual runtime ~200ms; matches sister proceed-interrupt.test.ts).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
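The SECURITY item bounds a serialized server payload before it reaches an error message. A hedged sketch of such a helper follows; `boundedJson` is a hypothetical name, and the 300-character default simply mirrors the figure quoted above.

```typescript
/** Serializes a value for a log or error message, truncated to maxChars characters. */
function boundedJson(value: unknown, maxChars = 300): string {
  let text: string;
  try {
    // JSON.stringify returns undefined at runtime for inputs such as undefined.
    text = JSON.stringify(value) ?? String(value);
  } catch {
    text = String(value); // e.g. circular structures
  }
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}
```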
… (TS-only generator) Audit feedback: repo-root scripts/ is for cross-language tooling; this script writes only to javascript/src/voice/assets/noise/ and has zero Python callers. Belongs in javascript/scripts/ alongside other TS-only asset/build tooling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…by description) Audit feedback: the doc reads "Engineering Design Record (sits between the PRD and the PRs)" — that's ADR-shape. Sitting in docs/voice/ next to capability-matrix.md and happy-path-*.md (which are docs-site material) was wrong placement. Header reshaped to match ADR-001/002 convention (title prefixed ADR-003, Date/Status preamble, companion-doc paths fixed for the new location). Body framing line removed where the header preamble now says the same thing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s → outputs User feedback: "recordings" describes the file format; "outputs" describes the purpose (these dirs hold what the example tests produced). The helper that writes here keeps its name (saveDemoRecording) — it still SAVES a recording, the recording is just NAMED an output now. Updates the writing helper's RECORDINGS_ROOT to point at outputs/, all test-file doc-comment path refs, the recordings README (title, intro, GitHub blob URL example, section header), .gitignore patterns, the voice-integration CI workflow's upload path, TESTING.md fixture paths, and fixes the (pre-existing) broken link in javascript/README.md that pointed at ./recordings/README.md. Python's python/recordings/ stays for now; renaming there is a follow-up issue (filed separately). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t types User feedback: outputs/ should be a parent for all test-run artifact types (recordings now, traces/logs/screenshots later). Moves every demo into outputs/recordings/<demo>/ and adds a new thin outputs/README.md that documents the artifact-parent shape. The rich audio policy / per-demo coverage table stays where it belongs at outputs/recordings/README.md. Writer (tests/voice/helpers/save-demo-recording.ts) updated: RECORDINGS_ROOT now resolves to .../outputs/recordings/, so newly written recordings land in the new shape without further changes. Other ref updates: - .gitignore: every committed-demo whitelist + segments re-ignore moved under outputs/recordings/, plus a sibling re-include for the new outputs/README.md. - .github/workflows/javascript-voice-integration.yml: upload-artifact path → outputs/recordings/**. - javascript/README.md: doc link → outputs/recordings/README.md. - TESTING.md: footprint paths + du command. - All @e2e demo test docstrings (15 files): "Recording lands in outputs/recordings/<demo>/". Sanity: typecheck PASS, build PASS, tests 791/792 PASS (1 pre-existing skip, unrelated). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Feature file `specs/voice-agents.feature:971` (added by main commit 71dd5ed / PR #492) lists `interruption` in the adapter-capabilities declaration. The vitest-cucumber binding at voice-contract-surface.test.ts:177 still had the pre-71dd5ed step title (missing `interruption`), so StepAble couldn't find the matching feature step. Update the step title to match the feature file and add the live-adapter `typeof caps.interruption === "boolean"` check (the empty-adapter check on line 192 already exists). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eceive re-engages (#567)
Root cause
----------
`ElevenLabsAgentAdapter.sendAudio()` streamed the user PCM then a fixed 16000-byte (~333ms) silence tail to coax ElevenLabs' server-side VAD into ending the turn. ElevenLabs ConvAI 2.0 detects end-of-turn with a hybrid VAD + deep-learning turn-detector (prosody, rhythm, micro-pauses), NOT a pure silence threshold, so a fixed zero-byte blob does not deterministically trip it on a scripted, non-mic stream. The greeting -> first user turn happened to work; a scripted 2nd user turn (and the post-interrupt case) intermittently never re-engaged and `receiveAudio` timed out. This is why the hosted demo was capped at one exchange and `elevenlabs_interruption` was gated off (4 prior live attempts timed out).
Protocol research (verified, not guessed)
-----------------------------------------
EL ConvAI exposes NO audio-flush / end-of-turn / commit client event. The complete client->server union (official Python SDK + JS SDK events.ts) is: pong | client_tool_result | conversation_initiation_client_data | feedback | contextual_update | user_message | user_activity | multimodal_message, plus the bare user_audio_chunk. `user_activity` only resets the inactivity timeout (does not commit). `user_message` ({"type":"user_message","text":...}) is the one event that deterministically forces an agent response without mic-style VAD, and it is the SDK's own sendUserMessage / send_user_message wire shape so it is server-accepted (does not 400 the socket).
Fix
---
Add `turnCommitMode: "text" | "silence"` (default "text") and a configurable `silenceTailBytes`. In "text" mode `sendAudio` sends ONLY an explicit `user_message` turn-commit carrying the chunk transcript (the voice runtime threads the `scenario.user(...)` script text through as the AudioChunk transcript).
We deliberately do NOT also stream the raw audio in the same turn: sending user_audio_chunk + user_message together raced EL's audio ingestion against the text commit and was live-flaky (1/3 raw pass, only green via retry). Text-only commit re-engages every turn. Nothing observable is lost: the runtime records the user audio locally (recorder.recordUser, independent of this send) and EL echoes the committed text back as a user_transcript so lastUserTranscript still populates. "silence" preserves the legacy pure-audio VAD path; "text" with no transcript falls back to the silence tail. ping/pong, transcript capture, agent_response_correction, drainPendingWaiters, capability advertisement, and the wsFactory seam are all unchanged.
Verification
------------
- Unit (offline, injectable wsFactory + fake socket): src/voice/__tests__/elevenlabs-turn-commit.test.ts: a scripted 2nd user turn after an agent turn drives a 2nd receiveAudio resolution, each user turn emits a user_message commit (exact shape), post-interrupt re-engages, and both silence fallbacks. 7 tests pass; full voice suite 179 pass.
- Live (real EL socket, >=2 exchanges): examples/.../elevenlabs-hosted.test.ts extended to greeting -> user -> agent -> user -> agent -> judge. 3 consecutive clean runs, no retries, no `receiveAudio timed out`; 5 segments / 3 agent turns / success=true (the 2nd-turn "support hours" question now answers instead of timing out).
Python parity
-------------
python/scenario/voice/adapters/elevenlabs.py has the IDENTICAL silence-tail limitation and the SAME bug for scripted turn 2+. NOT fixed here: #567 is scoped to the TypeScript SDK, and a Python fix cannot be live-verified in this worktree (EL key + harness are JS-side), so shipping it would be unverified. Follow-up: port the same user_message turn-commit to the Python adapter and live-verify against python/examples/voice/elevenlabs_hosted.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
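The mode selection described under "Fix" can be sketched as a pure planning function. Everything here is a hedged illustration: `planTurn`, `OutboundChunk`, and the `Outbound` union are hypothetical names, and the `audio` entries stand for bytes the transport would wrap as user_audio_chunk frames. Only the `user_message` shape, the two modes, and the 16000-byte default come from the commit message.

```typescript
type TurnCommitMode = "text" | "silence";

interface TurnCommitOptions {
  turnCommitMode?: TurnCommitMode; // defaults to "text"
  silenceTailBytes?: number; // defaults to 16000
}

interface OutboundChunk {
  data: Uint8Array; // PCM audio for the user turn
  transcript?: string; // script text threaded through by the runtime
}

// Hypothetical outbound plan: either the explicit turn-commit or raw audio bytes.
type Outbound =
  | { type: "user_message"; text: string }
  | { type: "audio"; bytes: Uint8Array };

function planTurn(chunk: OutboundChunk, options: TurnCommitOptions = {}): Outbound[] {
  const mode = options.turnCommitMode ?? "text";
  const tailBytes = options.silenceTailBytes ?? 16000;
  if (mode === "text" && chunk.transcript) {
    // Text mode: send ONLY the commit; the raw audio is not streamed this turn.
    return [{ type: "user_message", text: chunk.transcript }];
  }
  // Silence mode, or text mode with no transcript: audio plus a zero-filled tail.
  return [
    { type: "audio", bytes: chunk.data },
    { type: "audio", bytes: new Uint8Array(tailBytes) },
  ];
}
```

Keeping the decision in a pure function makes both silence fallbacks easy to unit-test without a socket.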
Fidelity trade-off worth a reviewer's eye (not a blocker): In default Scope check (verified): the repo's Untested alternative for full fidelity: the protocol note rules out audio commits because Net: the fix is correct and the right default for reliability; flagging the boundary so it's a conscious choice, not a silent regression.
…ext-turn (#567 parity)
Mirror the TypeScript fix (javascript/src/voice/adapters/elevenlabs.ts, commit b2a01a1) in the Python ElevenLabsAgentAdapter. EL ConvAI exposes NO audio-flush / end-of-turn client event, and ConvAI 2.0 end-of-turn is a hybrid VAD + deep-learning turn-detector, not a pure silence threshold. The legacy "stream audio + fixed silence tail" path therefore does NOT reliably commit a scripted, non-mic turn 2+: the 2nd recv_audio stalled (issue #567).
Changes (faithful parity with the TS adapter):
- New options turn_commit_mode: Literal["text","silence"] = "text" and silence_tail_bytes: int = 16000 (snake_case per Python convention).
- send_audio: in "text" mode when the chunk carries a transcript, send ONLY {"type":"user_message","text":<transcript>} (the new _send_user_message helper), the only documented client event that deterministically forces an agent response. The raw audio is NOT also streamed (audio + text in one turn raced EL's ingestion and was live-flaky). Otherwise (silence mode, or text mode with no transcript) fall back to the legacy user_audio_chunk + silence-tail path, now using silence_tail_bytes.
- Module docstring wire-protocol note updated to list user_message and the turn-commit rationale, mirroring the TS docstring.
The scripted user text reaches the adapter unchanged: scenario.user("…") TTS yields AudioChunk(data=pcm, transcript=text) (voice/tts.py), threaded through extract_audio -> chunk.transcript into send_audio, in exact parity with the TS chunk.transcript path. No runtime changes beyond the adapter.
Verification:
- Unit: tests/voice/test_elevenlabs_turn_commit.py (7 tests, all pass): proves a scripted 2nd user turn drives a 2nd recv_audio, the exact user_message shape (type+text only) is sent, post-interrupt correction updates the transcript, and the silence fallback + silence_tail_bytes resize still work. Existing EL transport tests stay green (74 passed across turn-commit + adapters + script_steps + messages + audio_chunk).
- LIVE: examples/voice/elevenlabs_hosted.py against the real EL socket, 2 user exchanges, success: True. Both user turns (incl. the turn-2 follow-up "What information do you need…") received coherent agent audio responses; no recv_audio timeout. Recorded segments confirm agent/user/agent/user/agent.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Why
Addresses the core of #567 (the next-turn / post-interrupt `receiveAudio` timeout). The ElevenLabs hosted ConvAI adapter timed out on `receiveAudio` for scripted next-turn (and post-interrupt) cases: EL ConvAI 2.0's hybrid VAD + DL turn-detector doesn't treat a fixed ~333ms zero-byte silence tail as a deterministic end-of-turn on a scripted (non-mic) stream, so turn 2+ intermittently never re-engaged. This is the blocker that recently made a TS voice-demo attempt abandon hosted EL ConvAI and fall back to OpenAI Realtime; it's fixable, not an EL-side dead end.
What changed
- `user_message` (`{"type":"user_message","text":<transcript>}`) is the only client→server event that deterministically forces an agent response without mic-VAD (and is the SDK's own wire shape, so it won't 400 the socket).
- `turnCommitMode: "text" | "silence"` (default `"text"`) + configurable `silenceTailBytes`. Text mode sends only the `user_message` commit (audio is still recorded locally; EL echoes `user_transcript`, so `lastUserTranscript`/observability stay intact). `"silence"` preserves the legacy path; `"text"` with no transcript falls back to silence.
- `agent_response_correction`, `drainPendingWaiters`, capabilities, the `wsFactory` test seam.
Test plan
- `src/voice/__tests__/elevenlabs-turn-commit.test.ts`, injectable fake socket): scripted 2nd user turn drives a 2nd `receiveAudio` resolution; exact `user_message` shape asserted; post-interrupt re-engages; both silence fallbacks. 7 pass; full voice suite 179 pass (21 files); examples typecheck clean.
How I can prove I was successful
- `examples/vitest/tests/voice/elevenlabs-hosted.test.ts` extended to greeting→user→agent→user→agent→judge. `success=true`, 5 segments, 3 agent / 2 user turns, 14.24s. The previously-timing-out 2nd turn now answers. (A first revision that streamed audio+text together flaked 2/3; that drove the text-only commit, recorded here as the honest negative.)
Anything surprising?
- a878cc1): `python/scenario/voice/adapters/elevenlabs.py` had the identical bug (16000-byte tail, no commit); ported the same `user_message` turn-commit + `turn_commit_mode`/`silence_tail_bytes` options, mirroring the TS change. Live-verified ≥2 exchanges against the real EL socket (agent answered turn 2); unit 7 pass / 74 no-regression. TS + Python ship together in this PR.
- `elevenlabs_interruption` (still `RUN_EL_INTERRUPTION=1`-gated) needs live post-interrupt verification not yet done; tracked as follow-up, which is why this PR addresses rather than fully closes #567 (fix(typescript-sdk/voice): harden ElevenLabs ConvAI post-interrupt / next-turn receive). (`elevenlabs_hosted` IS extended to ≥2 exchanges here.)
- `elevenlabs_interruption` (the issue's secondary ask): the post-interrupt path is covered by the unit test and the same commit should fix it, but I didn't live-run it. Worth a follow-up.