A real-time voice agent running entirely inside a Durable Object. Talk to an AI assistant that can answer questions, set spoken reminders, and check the weather — with streaming responses, interruption support, and conversation memory across sessions.
Uses Workers AI for all models — zero external API keys required:
- STT: Deepgram Flux (@cf/deepgram/flux) by default, with a Nova 3 (@cf/deepgram/nova-3) option in the UI
- TTS: Deepgram Aura (@cf/deepgram/aura-1)
- Turn detection: Flux StartOfTurn/EndOfTurn events
- LLM: Kimi K2.7 Code (@cf/moonshotai/kimi-k2.7-code), GPT OSS 20B, or GLM 4.7 Flash
npm install
npm run start

No API keys needed — all AI models run via the Workers AI binding.
Browser                            Durable Object (VoiceAgent)
┌──────────┐   binary WS frames    ┌──────────────────────────┐
│ Mic PCM  │ ────────────────────► │ Audio Buffer             │
│ (16kHz)  │                       │   ↓                      │
│          │                       │ STT (flux)               │
│          │                       │   ↓                      │
│          │   JSON: transcript    │   ↓                      │
│          │ ◄──────────────────── │ LLM                      │
│          │   binary: MP3 audio   │   ↓ (sentence chunking)  │
│ Speaker  │ ◄──────────────────── │ TTS (aura-1, streaming)  │
└──────────┘                       └──────────────────────────┘
                  single WebSocket connection
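Because JSON messages and MP3 audio share one WebSocket, the client has to tell the two apart on arrival. The sketch below shows one way to do that, assuming text frames carry JSON and binary frames carry audio bytes. The `ServerFrame` type and `classifyFrame` function are hypothetical names for illustration, not part of the example's code, and the real JSON message shapes may differ.

```typescript
// Hypothetical result type: the actual protocol messages are not specified here.
type ServerFrame =
  | { kind: "audio"; bytes: Uint8Array }
  | { kind: "json"; message: unknown }
  | { kind: "invalid"; reason: string };

// Classify one incoming WebSocket payload.
// Assumes the socket was opened with binaryType = "arraybuffer",
// so binary frames arrive as ArrayBuffer and text frames as string.
function classifyFrame(data: string | ArrayBuffer): ServerFrame {
  if (typeof data === "string") {
    try {
      return { kind: "json", message: JSON.parse(data) };
    } catch {
      return { kind: "invalid", reason: "malformed JSON text frame" };
    }
  }
  // Anything that is not a string is treated as audio bytes.
  return { kind: "audio", bytes: new Uint8Array(data) };
}
```

In a browser this would typically be called from the socket's `message` handler, passing `event.data`, with audio frames routed to the playback queue and JSON frames to the UI state.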
- Browser captures mic audio via AudioWorklet, downsamples to 16kHz mono PCM
- PCM streams to the Agent over the existing WebSocket connection (binary frames)
- Flux detects speech start and turn completion server-side
- Agent runs the voice pipeline: STT → LLM (with tools) → streaming TTS
- TTS audio streams back per-sentence as MP3 while the LLM is still generating
- Browser decodes and plays audio through the selected speaker when supported; user can interrupt at any time
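The first step above, downsampling mic audio to 16kHz mono PCM, can be sketched as follows. This is a minimal illustration rather than the example's actual AudioWorklet code: `downsampleToPcm16` is a hypothetical helper, it assumes the input is already mono Float32 samples in the range -1 to 1, and it uses simple block averaging instead of a proper resampling filter.

```typescript
// Convert mono Float32 samples at inputRate into 16-bit PCM at targetRate.
// Hypothetical helper for illustration only.
function downsampleToPcm16(
  input: Float32Array,
  inputRate: number,
  targetRate = 16000
): Int16Array {
  if (targetRate > inputRate) {
    throw new RangeError("targetRate must not exceed inputRate");
  }
  const ratio = inputRate / targetRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Int16Array(outLength);

  for (let i = 0; i < outLength; i++) {
    // Average the source samples that map onto this output sample.
    // This acts as a crude low-pass filter to reduce aliasing.
    const start = Math.floor(i * ratio);
    const end = Math.min(
      input.length,
      Math.max(start + 1, Math.floor((i + 1) * ratio))
    );
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];

    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const sample = Math.max(-1, Math.min(1, sum / (end - start)));
    out[i] = sample < 0 ? Math.round(sample * 0x8000) : Math.round(sample * 0x7fff);
  }
  return out;
}
```

The returned `Int16Array` exposes its bytes through `.buffer`, which is the form a binary WebSocket frame would carry.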
- Streaming TTS — LLM output is split into sentences and synthesized concurrently, so the user hears the first sentence while the rest is still being generated.
- Interruption handling — speak over the agent to cut it off mid-sentence. Flux speech-start events abort the server pipeline and stop queued browser playback; client audio-level detection remains as a fallback.
- Server-side turn detection — Flux handles speech boundaries, so the example does not need client-side end-of-speech signaling to run the voice pipeline.
- Speaker selection — choose an audio output device for assistant playback. Unsupported browsers keep using the system default output.
- Conversation persistence — all messages are stored in SQLite and survive restarts. The agent remembers previous conversations.
- Agent tools — the LLM can call get_current_time, set_reminder, and get_weather during conversation.
- Proactive scheduling — reminders set via voice fire on schedule and are spoken to connected clients (or saved to history if disconnected).
- useVoiceAgent hook — the client uses the agents/voice-react hook, which encapsulates all audio infrastructure in ~10 lines of setup.
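The streaming TTS feature listed above depends on cutting the LLM's token stream into sentences as it arrives. The sketch below shows one plausible way to do that. `SentenceChunker` is a hypothetical class, not the example's implementation, and its boundary rule (terminal punctuation followed by whitespace) is an assumption that will mis-split abbreviations such as "Dr. Smith".

```typescript
// Buffers streamed text deltas and emits complete sentences.
// Hypothetical helper for illustration only.
class SentenceChunker {
  private buffer = "";

  // Feed one streamed text delta; returns any sentences it completed.
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    // Assumed boundary: ., !, or ? followed by at least one whitespace char.
    const boundary = /[.!?]+\s+/g;
    let lastEnd = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(lastEnd, end).trim();
      if (sentence.length > 0) sentences.push(sentence);
      lastEnd = end;
    }
    // Keep the unfinished tail for the next delta.
    this.buffer = this.buffer.slice(lastEnd);
    return sentences;
  }

  // Call when the LLM stream ends to emit whatever text remains.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? rest : null;
  }
}

const chunker = new SentenceChunker();
chunker.push("Hello the");    // returns []
chunker.push("re. How are");  // returns ["Hello there."]
chunker.push(" you? Fine");   // returns ["How are you?"]
chunker.flush();              // returns "Fine"
```

Each returned sentence could be handed to TTS immediately, which is what lets playback begin before the LLM has finished generating.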