Pipecat is an open-source, vendor-neutral framework for building real-time voice (and video) AI applications.
This repository contains an example of a voice agent running entirely on local models on macOS. On an M-series Mac, you can achieve voice-to-voice latency of under 800 ms with relatively strong models.
The server/bot.py file uses these models:
- Silero VAD
- smart-turn v2
- MLX Whisper
- Gemma3n 4B
- Kokoro TTS
But you can swap any of them out for other models, or completely reconfigure the pipeline. It's easy to add tool calling or MCP server integrations, use parallel pipelines to run async inference alongside the voice conversation, add custom processing steps, configure interruption handling to work differently, and so on.
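For orientation, a Pipecat pipeline is just an ordered list of frame processors, so swapping a model usually means swapping one entry in that list. Here's a rough sketch of the shape; the import path and processor names are illustrative and may not match this repo's Pipecat version exactly, so see server/bot.py for the real thing.

```python
# Rough sketch of a typical voice pipeline layout; names here are illustrative.
from pipecat.pipeline.pipeline import Pipeline


def build_pipeline(transport, stt, llm, tts, context_aggregator):
    """Assemble processors in the order frames flow through them."""
    return Pipeline(
        [
            transport.input(),               # audio arriving over WebRTC
            stt,                             # speech-to-text (MLX Whisper here)
            context_aggregator.user(),       # add user turns to the LLM context
            llm,                             # chat completion service
            tts,                             # text-to-speech (the MLX-Audio processor)
            transport.output(),              # synthesized audio back to the client
            context_aggregator.assistant(),  # add assistant turns to the context
        ]
    )
```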
The bot and web client here communicate using a low-latency, local, serverless WebRTC connection. For more information on serverless WebRTC, see the Pipecat SmallWebRTCTransport docs and this article. You could switch over to a different Pipecat transport (for example, a WebSocket-based transport), but WebRTC is the best choice for real-time audio.
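If you're curious what that setup looks like in code, here's a minimal sketch of constructing a SmallWebRTCTransport. The import paths and parameter names are assumptions that vary across Pipecat versions, so treat this as orientation rather than copy-paste.

```python
# Sketch only: import paths and TransportParams fields are assumptions and
# may differ in the Pipecat version this repo pins.
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.network.small_webrtc import SmallWebRTCTransport


def make_transport(webrtc_connection):
    return SmallWebRTCTransport(
        webrtc_connection=webrtc_connection,  # negotiated with the browser client
        params=TransportParams(
            audio_in_enabled=True,             # receive microphone audio
            audio_out_enabled=True,            # send synthesized speech back
            vad_analyzer=SileroVADAnalyzer(),  # run Silero VAD on incoming audio
        ),
    )
```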
For a deep dive into voice AI, including network transport, optimizing for latency, and notes on designing tool calling and complex workflows, see the Voice AI & Voice Agents Illustrated Guide.
Silero VAD and MLX Whisper run inside the Pipecat process. When the agent code starts, it will need to download model weights that aren't already cached, so first startup can take some time.
The LLM service in this bot uses the OpenAI-compatible chat completions HTTP API, so you will need to run a local OpenAI-compatible LLM server.
One easy, high-performance way to run a local LLM server on macOS is LM Studio. From inside the LM Studio graphical interface, go to the "Developer" tab on the far left to start an HTTP server.
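Pointing the bot's LLM service at that server is then just a matter of overriding the base URL. A minimal sketch, assuming LM Studio's default endpoint (http://localhost:1234/v1) and a placeholder model name; the exact class and arguments this repo uses are in server/bot.py.

```python
# Sketch only: the import path, model identifier, and port are assumptions.
from pipecat.services.openai.llm import OpenAILLMService

llm = OpenAILLMService(
    api_key="not-needed-locally",         # local servers typically ignore the key
    base_url="http://localhost:1234/v1",  # LM Studio's default OpenAI-compatible endpoint
    model="google/gemma-3n-e4b",          # whichever model you have loaded locally
)
```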
The core voice agent code lives in a single file: server/bot.py. There's one custom service here that's not included in Pipecat core: we implemented a local MLX-Audio frame processor on top of the excellent mlx-audio library.
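The general shape of such a processor is a TTS service that turns text frames into raw audio frames. The skeleton below is a simplified illustration rather than the repo's actual implementation; the base-class and frame names are assumptions for recent Pipecat versions, and the mlx-audio synthesis call is left as a placeholder.

```python
# Simplified illustration, NOT the implementation in server/bot.py.
# Base-class and frame names are assumptions for recent Pipecat versions.
from typing import AsyncGenerator

from pipecat.frames.frames import (
    Frame,
    TTSAudioRawFrame,
    TTSStartedFrame,
    TTSStoppedFrame,
)
from pipecat.services.tts_service import TTSService

SAMPLE_RATE = 24000  # assumed output rate of the voice model


class LocalMLXAudioTTSService(TTSService):
    """Synthesize text with a locally loaded mlx-audio model (placeholder)."""

    async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
        yield TTSStartedFrame()
        # Placeholder: call mlx-audio here to turn `text` into 16-bit mono PCM bytes.
        audio = b""
        yield TTSAudioRawFrame(audio=audio, sample_rate=SAMPLE_RATE, num_channels=1)
        yield TTSStoppedFrame()
```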
Note that the first time you start the bot, it will take some time to initialize the local models. It can be 30 seconds or more before the bot is fully ready to go. Subsequent startups will be much faster.
It's not a bad idea to run a quick `mlx-audio.generate` process from the command line before you run the bot for the first time, so you're not waiting on a relatively big Hugging Face model download for the voice model.

```bash
mlx-audio.generate --model "Marvis-AI/marvis-tts-250m-v0.1" --text "Hello, I'm Pipecat!" --output "output.wav"
# or
mlx-audio.generate --model "mlx-community/Kokoro-82M-bf16" --text "Hello, I'm Pipecat!" --output "output.wav"
```
To run the bot:

```bash
cd server/
```

If you're using uv:

```bash
uv run bot.py
```

If you're using pip:

```bash
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python bot.py
```
After you've run the bot once and all the models are cached, you can set the HF_HUB_OFFLINE environment variable to prevent the Hugging Face libraries from going to the network to check for model updates. This makes initial bot startup and the first conversation turn a lot faster.
```bash
HF_HUB_OFFLINE=1 uv run bot.py
```
The web client is a React app. You can connect to your local macOS agent using any client that can negotiate a serverless WebRTC connection. The client in this repo is based on voice-ui-kit and just uses that library's standard debug console template.
To run the client:

```bash
cd client/
npm i
npm run dev
# Navigate to the URL shown in the terminal in your web browser
```