feat(voice/typescript): random_interruptions demo — transport upgrade for real mid-stream audio cut-off #583

@drewdrewthis

Context

Follow-up from #561. The structural interrupt-scheduling fix landed: voiceProceed({ interruptions }) now schedules an AGENT pre-step and fires the barge-in via inline TTS, honoring delayRange. The user_interrupt event fires with outcome === "fired_after_speech", the cut-off-boundary transcriptTruncated label fires correctly, and the unit test (proceed-interrupt.test.ts) locks this behavior in.

But the bundled Pipecat stub bot in python/examples/voice/_bot/bot.py generates TTS in a burst (~50 ms of wall time for a several-second reply) and streams faster than realtime. By the time adapter.interrupt() runs (even with delayRange: [0.5, 2.0]), the bot has already sent all frames, so there is nothing left to cut off. As a result, the random_interruptions demo recording ends up 10 s long with 4 segments: the structural fix is correct, but the bot can't demonstrate real audio truncation.
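To make the timing argument concrete, here is a rough back-of-the-envelope model. It is only a sketch: `StreamModel` and `unsentAudioSeconds` are hypothetical names that do not exist in the codebase, the 4 s reply length is an assumed stand-in for "several-second", and frames are assumed to leave the bot at a constant rate.

```typescript
// Illustrative model only; StreamModel and unsentAudioSeconds are hypothetical names.
interface StreamModel {
  replySeconds: number; // length of the spoken reply (assumed 4 s below)
  sendSeconds: number; // wall time the bot needs to send every frame
}

// Seconds of reply audio not yet sent when the interrupt lands,
// assuming frames leave the bot at a constant rate.
function unsentAudioSeconds(model: StreamModel, interruptAtSeconds: number): number {
  if (model.sendSeconds <= 0) return 0;
  const sentFraction = Math.min(1, Math.max(0, interruptAtSeconds / model.sendSeconds));
  return model.replySeconds * (1 - sentFraction);
}

// Burst bot: the whole reply is sent in ~50 ms of wall time.
const burst: StreamModel = { replySeconds: 4, sendSeconds: 0.05 };
// Realtime-paced bot: sending takes as long as the audio itself.
const realtime: StreamModel = { replySeconds: 4, sendSeconds: 4 };

console.log(unsentAudioSeconds(burst, 0.5)); // 0
console.log(unsentAudioSeconds(realtime, 0.5)); // 3.5
console.log(unsentAudioSeconds(realtime, 2.0)); // 2
```

Under these assumptions, any delay in delayRange: [0.5, 2.0] leaves the burst bot with no audio to truncate, while a realtime-paced stream would still have between 2 s and 3.5 s in flight.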

Additionally, the real judgeAgent fires inside voiceProceed and concludes success=false after one truncated agent turn (it can't satisfy the criteria at that point), which collapses the demo to 4 segments regardless.

Options for the follow-up

  1. Re-target random_interruptions to a realtime-streaming transport (OpenAI Realtime or Gemini Live). They support real server-side cancel that prevents late-frame delivery. Loses TS-Python parity for this specific demo (Python uses Pipecat).
  2. Modify the bundled Pipecat bot to stream TTS at realtime pace — requires Python edits + changes the reference implementation.
  3. Suppress judgeAgent during voiceProceed — let proceed exhaust all turns before judge fires. Helps both demos.

Recommend (3) first (cheap, helps multiple demos), then (1) for the audio-cut demonstration if still needed.
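A minimal sketch of what option (3) could look like, as a toy model rather than the real implementation. `runProceed`, `Judge`, `deferJudge`, and the transcript strings are all hypothetical; the actual voiceProceed and judgeAgent signatures may differ.

```typescript
// Toy model of option (3). All names here are hypothetical.
type Verdict = { success: boolean };
// A judge returns null while it has no conclusion yet.
type Judge = (transcript: string[]) => Verdict | null;

interface ProceedOptions {
  maxTurns: number;
  deferJudge: boolean; // hypothetical flag: hold the judge until proceed is done
}

interface ProceedResult {
  turnsRun: number;
  verdict: Verdict | null;
}

function runProceed(turns: string[], judge: Judge, opts: ProceedOptions): ProceedResult {
  const transcript: string[] = [];
  for (const turn of turns.slice(0, opts.maxTurns)) {
    transcript.push(turn);
    if (opts.deferJudge) continue;
    // Without deferral, any conclusion from the judge ends the run early.
    const verdict = judge(transcript);
    if (verdict !== null) return { turnsRun: transcript.length, verdict };
  }
  // With deferral (or no early conclusion), judge once over the full transcript.
  return { turnsRun: transcript.length, verdict: judge(transcript) };
}

// Made-up judge: fails early on a truncated turn, passes once three turns exist.
const demoJudge: Judge = (t) => {
  if (t.length < 3 && t.some((x) => x.includes("[truncated]"))) return { success: false };
  return t.length >= 3 ? { success: true } : null;
};

const demoTurns = ["agent: hello [truncated]", "user: wait", "agent: sure, go ahead"];
const eager = runProceed(demoTurns, demoJudge, { maxTurns: 3, deferJudge: false });
const deferred = runProceed(demoTurns, demoJudge, { maxTurns: 3, deferJudge: true });

console.log(eager.turnsRun, deferred.turnsRun); // 1 3
```

In this model the eager run stops after the first truncated turn, which mirrors the collapse described above, while the deferred run lets the recovery turns happen before the judge sees the transcript.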

Current state at #561 merge

  • random-interruptions.test.ts assertions encode what the Pipecat bot CAN prove (interrupt fires + canned-phrase strategy + fired_after_speech outcome + truncation label + recovery + multi-turn). The "median-shorter" assertion (added then dropped) is intentionally omitted — see commit 0b9dd1e.
  • The recording (javascript/recordings/random_interruptions/full.wav, 10 s, 4 segments) is honest about the bot's limit but thin as a demo.
  • gemini_live_interruption already demonstrates real mid-stream audio cut-off on a realtime transport.

Acceptance

A random_interruptions recording that shows: 30 s+ duration, 5+ segments, agent says >= 1.5 s of substantive audio before each barge-in, agent recovers with non-empty audio after barge-ins, multi-turn conversation.
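The criteria above could be checked mechanically against per-segment timings. The sketch below assumes a hypothetical `Segment` shape and `checkAcceptance` helper, reads "multi-turn" as at least two agent segments, and cannot judge whether the audio is "substantive", only how long it is. The sample timings are made up for illustration and are not taken from the actual recording.

```typescript
// Hypothetical segment shape; the real recording metadata may differ.
interface Segment {
  speaker: "agent" | "user";
  startS: number;
  endS: number;
  cutOff: boolean; // true if a barge-in ended this segment
}

// Returns the list of unmet criteria; an empty list means the recording passes.
function checkAcceptance(segments: Segment[]): string[] {
  const failures: string[] = [];
  const totalS = segments.reduce((max, s) => Math.max(max, s.endS), 0);
  if (totalS < 30) failures.push("duration under 30 s");
  if (segments.length < 5) failures.push("fewer than 5 segments");
  // Assumption: "multi-turn" means at least two agent segments.
  const agentCount = segments.filter((s) => s.speaker === "agent").length;
  if (agentCount < 2) failures.push("not multi-turn");
  segments.forEach((seg, i) => {
    if (seg.speaker !== "agent" || !seg.cutOff) return;
    if (seg.endS - seg.startS < 1.5) failures.push(`segment ${i}: under 1.5 s before barge-in`);
    const recovered = segments
      .slice(i + 1)
      .some((later) => later.speaker === "agent" && later.endS > later.startS);
    if (!recovered) failures.push(`segment ${i}: no agent audio after barge-in`);
  });
  return failures;
}

// Made-up timings: a thin 10 s, 4-segment recording and a longer passing one.
const thin: Segment[] = [
  { speaker: "agent", startS: 0, endS: 0.8, cutOff: true },
  { speaker: "user", startS: 0.8, endS: 3, cutOff: false },
  { speaker: "agent", startS: 3, endS: 8, cutOff: false },
  { speaker: "user", startS: 8, endS: 10, cutOff: false },
];
const passing: Segment[] = [
  { speaker: "agent", startS: 0, endS: 3, cutOff: true },
  { speaker: "user", startS: 3, endS: 6, cutOff: false },
  { speaker: "agent", startS: 6, endS: 14, cutOff: true },
  { speaker: "user", startS: 14, endS: 18, cutOff: false },
  { speaker: "agent", startS: 18, endS: 28, cutOff: false },
  { speaker: "user", startS: 28, endS: 32, cutOff: false },
];

console.log(checkAcceptance(thin).length); // 3
console.log(checkAcceptance(passing).length); // 0
```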
