
Voice Agent

A real-time voice agent running entirely inside a Durable Object. Talk to an AI assistant that can answer questions, set spoken reminders, and check the weather — with streaming responses, interruption support, and conversation memory across sessions.

Uses Workers AI for all models — zero external API keys required:

STT: Deepgram Flux (@cf/deepgram/flux) by default, with a Nova 3 (@cf/deepgram/nova-3) option in the UI
TTS: Deepgram Aura (@cf/deepgram/aura-1)
Turn detection: Flux StartOfTurn / EndOfTurn events
LLM: Kimi K2.7 Code (@cf/moonshotai/kimi-k2.7-code), GPT OSS 20B, or GLM 4.7 Flash

Run it

npm install
npm run start

No API keys needed — all AI models run via the Workers AI binding.

How it works

Browser                          Durable Object (VoiceAgent)
┌──────────┐   binary WS frames   ┌──────────────────────────┐
│ Mic PCM  │ ────────────────────► │ Audio Buffer             │
│ (16kHz)  │                       │   ↓                      │
│          │                       │ STT (flux)               │
│          │                       │   ↓                      │
│          │   JSON: transcript    │   ↓                      │
│          │ ◄──────────────────── │ LLM                      │
│          │   binary: MP3 audio   │   ↓ (sentence chunking)  │
│ Speaker  │ ◄──────────────────── │ TTS (aura-1, streaming)  │
└──────────┘                       └──────────────────────────┘
              single WebSocket connection
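
Because audio and control messages share one socket, the client has to tell them apart by frame type. A minimal sketch of that routing, assuming text frames carry JSON and binary frames carry MP3 bytes as the diagram shows; `ServerEvent` and `classifyFrame` are hypothetical names for illustration, not part of the example's code:

```typescript
// Hypothetical event shape used only in this sketch.
type ServerEvent =
  | { kind: "json"; message: unknown }
  | { kind: "audio"; bytes: Uint8Array };

// Route one incoming WebSocket payload by its frame type.
// Text frames are parsed as JSON (for example, a transcript update);
// binary frames are treated as raw audio bytes for playback.
function classifyFrame(data: string | ArrayBuffer): ServerEvent {
  if (typeof data === "string") {
    return { kind: "json", message: JSON.parse(data) };
  }
  return { kind: "audio", bytes: new Uint8Array(data) };
}
```

In a browser, setting `ws.binaryType = "arraybuffer"` makes binary frames arrive as `ArrayBuffer`, so `event.data` can be passed to a function like this directly.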

Browser captures mic audio via AudioWorklet and downsamples it to 16 kHz mono PCM
PCM streams to the Agent over the existing WebSocket connection (binary frames)
Flux detects speech start and turn completion server-side
Agent runs the voice pipeline: STT → LLM (with tools) → streaming TTS
TTS audio streams back per-sentence as MP3 while the LLM is still generating
Browser decodes and plays audio through the selected speaker when supported; user can interrupt at any time
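
The capture step can be sketched as a pure function. This is a rough illustration, assuming the worklet hands over mono `Float32Array` samples in the range [-1, 1] and that the target is 16-bit PCM; `downsampleToPcm16` is a hypothetical helper, and the hook may resample differently:

```typescript
// Hypothetical helper: downsample mono Float32 audio to 16 kHz 16-bit PCM.
// Each output sample is the average of the input samples it covers
// (a crude low-pass filter), clamped and scaled to the Int16 range.
function downsampleToPcm16(
  input: Float32Array,
  inputRate: number,
  targetRate = 16000
): Int16Array {
  if (targetRate > inputRate) {
    throw new Error("targetRate must not exceed inputRate");
  }
  const ratio = inputRate / targetRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.floor((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    const avg = end > start ? sum / (end - start) : 0;
    // Clamp to [-1, 1] before scaling so loud input cannot overflow.
    const clamped = Math.max(-1, Math.min(1, avg));
    out[i] = clamped < 0
      ? Math.round(clamped * 0x8000)
      : Math.round(clamped * 0x7fff);
  }
  return out;
}
```

The resulting `Int16Array` can be sent as a binary frame via its underlying `buffer`. Block averaging is chosen here for brevity; a proper resampler would use a better anti-aliasing filter.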

Features

Streaming TTS — LLM output is split into sentences and synthesized concurrently, so the user hears the first sentence while the rest is still being generated.
Interruption handling — speak over the agent to cut it off mid-sentence. Flux speech-start events abort the server pipeline and stop queued browser playback; client audio-level detection remains as a fallback.
Server-side turn detection — Flux handles speech boundaries, so the example does not need client-side end-of-speech signaling to run the voice pipeline.
Speaker selection — choose an audio output device for assistant playback. Unsupported browsers keep using the system default output.
Conversation persistence — all messages are stored in SQLite and survive restarts. The agent remembers previous conversations.
Agent tools — the LLM can call get_current_time, set_reminder, and get_weather during conversation.
Proactive scheduling — reminders set via voice fire on schedule and are spoken to connected clients (or saved to history if disconnected).
useVoiceAgent hook — the client uses the agents/voice-react hook, which encapsulates all audio infrastructure in ~10 lines of setup.
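
The streaming TTS and interruption features combine naturally: text deltas from the LLM are buffered until a sentence boundary appears, each sentence is handed off for synthesis, and an abort signal stops the flow when the user starts speaking. A minimal sketch under those assumptions; `chunkSentences` is a hypothetical function, and the boundary rule (`.`, `!`, or `?` followed by whitespace) is a simplification that the real pipeline may not share:

```typescript
// Hypothetical sketch: turn a stream of LLM text deltas into whole sentences,
// stopping early if the turn is interrupted via the AbortSignal.
async function* chunkSentences(
  deltas: AsyncIterable<string>,
  signal?: AbortSignal
): AsyncGenerator<string> {
  let buffer = "";
  // Simplified rule: a sentence ends at ., ! or ? followed by whitespace.
  const boundary = /[.!?]+\s+/;
  for await (const delta of deltas) {
    if (signal?.aborted) return; // interrupted: drop the remaining text
    buffer += delta;
    let match = boundary.exec(buffer);
    while (match !== null) {
      const end = match.index + match[0].length;
      const sentence = buffer.slice(0, end).trim();
      buffer = buffer.slice(end);
      if (sentence) yield sentence;
      match = boundary.exec(buffer);
    }
  }
  // Flush whatever is left once the model finishes, unless interrupted.
  const rest = buffer.trim();
  if (rest && !signal?.aborted) yield rest;
}
```

A caller would iterate with `for await (const sentence of chunkSentences(stream, controller.signal))` and start synthesis for each sentence as it arrives; calling `controller.abort()` on a speech-start event ends the loop before the next sentence is produced.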
