
feat(voice): end-to-end voice pipeline example, SDK exports, and docs (Task 22)#1251

Open
Nixxx19 wants to merge 7 commits into `mofa-org:main` from `Nixxx19:feat/task-22-voice-integration`

Conversation

Contributor

@Nixxx19 Nixxx19 commented Mar 15, 2026

feat(voice): End-to-End Voice Pipeline — Example, SDK Exports & Docs (Task 22)

fixes #1252


Summary

This PR surfaces MoFA's existing but previously inaccessible voice pipeline. VoicePipeline, SpeechAdapterRegistry, and all three cloud speech adapters (Deepgram ASR, OpenAI Whisper, ElevenLabs TTS) were fully implemented in mofa-foundation and mofa-integrations, but had no runnable example, no mofa-sdk entry point, and no documentation. This PR closes all three gaps:

  • a new examples/voice_agent/ multi-binary crate wires mic → ASR → LLM → TTS → speaker in an interactive loop,
  • crates/mofa-sdk/src/voice.rs re-exports the entire voice API under a single import, and
  • docs/mofa-doc/src/guides/voice.md adds a colourful Mermaid architecture diagram, quick-start instructions, a provider-swap guide, and a full API reference.

Three additional targeted examples cover CI-friendly file transcription, runtime adapter selection, and low-latency streaming TTS.


Pain Points Addressed

Before This PR

  1. No Runnable Entry Point

    • VoicePipeline, DeepgramAsrAdapter, ElevenLabsTtsAdapter, and OpenAiAsrAdapter existed but no code wired them together end-to-end
    • Users had to manually compose all three cloud adapters, audio capture, and playback from scratch
    • No main.rs demonstrated a working conversation loop
  2. No SDK Surface

    • mofa-sdk re-exported agent, tool, and persistence types but nothing from the voice layer
    • Users had to reach into mofa-foundation::voice_pipeline and mofa-kernel::speech directly, bypassing the single-entry-point promise of the SDK
    • No use mofa_sdk::voice::* pattern was possible
  3. No Documentation

    • docs/mofa-doc/src/guides/ had guides for monitoring, multi-agent coordination, secretary-agent, and persistence — but nothing for voice
    • New users had no way to discover that a voice pipeline existed or how to configure it
    • Feature flags (deepgram, elevenlabs, openai-speech) and adapter name keys were undocumented
  4. No CI-Friendly Examples

    • Every voice use case required a live microphone and speakers
    • No smoke-test path existed for CI environments without audio hardware

Why This Was Needed

  1. Feature Discoverability — A complete voice pipeline sitting behind zero documentation and no SDK re-export is effectively invisible. This PR makes it a first-class, documented, runnable feature.
  2. Developer Experience — `cargo run -p voice_agent` should just work, and `use mofa_sdk::voice::*` should give you everything. Both are now true.
  3. CI Safety — The file_transcription example auto-generates a silent WAV when given no input, so the pipeline can be smoke-tested in CI with only API keys.
  4. Architecture Completion — The microkernel's TtsAdapter/AsrAdapter traits in mofa-kernel, the VoicePipeline in mofa-foundation, and the cloud adapters in mofa-integrations form a complete, tested stack. This PR connects the stack to the user-facing surface (mofa-sdk + examples/).
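The CI-safety point can be made concrete: a valid silent WAV is just a 44-byte RIFF header followed by zeroed PCM samples. The example crate builds its file with the hound crate; the hand-rolled sketch below is purely illustrative of the fallback idea (one second of 16 kHz mono 16-bit silence).

```rust
// Illustrative sketch of the silent-WAV fallback (the real example uses `hound`):
// build a 44-byte RIFF/WAVE header followed by zeroed 16-bit mono PCM samples.
fn silent_wav(sample_rate: u32, seconds: u32) -> Vec<u8> {
    let data_len = sample_rate * seconds * 2; // 16-bit mono => 2 bytes/sample
    let mut wav = Vec::with_capacity(44 + data_len as usize);
    wav.extend_from_slice(b"RIFF");
    wav.extend_from_slice(&(36 + data_len).to_le_bytes()); // RIFF chunk size
    wav.extend_from_slice(b"WAVEfmt ");
    wav.extend_from_slice(&16u32.to_le_bytes());              // fmt chunk size
    wav.extend_from_slice(&1u16.to_le_bytes());               // PCM
    wav.extend_from_slice(&1u16.to_le_bytes());               // mono
    wav.extend_from_slice(&sample_rate.to_le_bytes());
    wav.extend_from_slice(&(sample_rate * 2).to_le_bytes());  // byte rate
    wav.extend_from_slice(&2u16.to_le_bytes());               // block align
    wav.extend_from_slice(&16u16.to_le_bytes());              // bits per sample
    wav.extend_from_slice(b"data");
    wav.extend_from_slice(&data_len.to_le_bytes());
    wav.resize(44 + data_len as usize, 0);                    // silence
    wav
}

fn main() {
    let wav = silent_wav(16_000, 1);
    println!("{} bytes", wav.len()); // 44-byte header + 32000 bytes of PCM
}
```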

What Was Added

examples/voice_agent/ (New Multi-Binary Crate)

Four binaries covering the full voice use-case spectrum:

| Binary | Description |
| --- | --- |
| `voice_agent` (`main.rs`) | Interactive mic → Deepgram ASR → OpenAI LLM → ElevenLabs TTS → speaker loop; press Enter to record 5 s, then hear the reply |
| `file_transcription` | WAV file → ASR → LLM → TTS → `reply.mp3`; generates a silent WAV when no file is given — CI-friendly |
| `adapter_registry` | Registers Deepgram and OpenAI Whisper; picks the active adapter from the `ASR_PROVIDER` env var at runtime — no recompile needed |
| `streaming_tts` | Streams LLM tokens via `chat_stream()`; flushes each completed sentence to TTS immediately — lower time-to-first-audio |
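The per-turn composition in the main binary can be sketched with stubs; `transcribe`, `chat`, and `synthesize` below are illustrative placeholders standing in for the real adapter calls, not the crate's API.

```rust
// Hypothetical sketch of one conversation turn in voice_agent.
// The stubs stand in for Deepgram ASR, the OpenAI LLM, and ElevenLabs TTS.
fn transcribe(_wav: &[u8]) -> String { "hello".to_string() }      // ASR stub
fn chat(prompt: &str) -> String { format!("You said: {prompt}") } // LLM stub
fn synthesize(text: &str) -> Vec<u8> { text.as_bytes().to_vec() } // TTS stub

fn run_turn(wav: &[u8]) -> (String, Vec<u8>) {
    let transcript = transcribe(wav); // mic audio -> text
    let reply = chat(&transcript);    // text -> LLM reply
    let audio = synthesize(&reply);   // reply -> audio bytes for playback
    (reply, audio)
}

fn main() {
    let (reply, audio) = run_turn(&[0u8; 16]);
    println!("{reply} ({} audio bytes)", audio.len());
}
```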

Audio I/O (audio_io.rs):

| Function | Library | Platform |
| --- | --- | --- |
| `record_wav()` | cpal | ALSA / CoreAudio / WASAPI |
| `play_mp3()` | rodio | Cross-platform |

crates/mofa-sdk/src/voice.rs (New)

Single-import access to the entire voice API:

```rust
use mofa_sdk::voice::{
    VoicePipeline, VoicePipelineConfig, VoicePipelineResult,
    SpeechAdapterRegistry,
    TtsAdapter, AsrAdapter,
    TtsConfig, AsrConfig,
    AudioFormat, AudioOutput, TranscriptionResult, TranscriptionSegment,
    VoiceDescriptor,
};
```

docs/mofa-doc/src/guides/voice.md (New)

Complete voice guide including:

  • Colourful Mermaid architecture diagram (mic → ASR → LLM → TTS → speaker with per-component colour coding)
  • Environment variable quick-start
  • VoicePipeline API reference
  • ASR provider swap guide (Deepgram ↔ OpenAI Whisper)
  • TTS provider swap guide (ElevenLabs ↔ OpenAI TTS)
  • SpeechAdapterRegistry usage
  • Feature flags table

Architecture Diagrams

Voice Pipeline Data Flow

```mermaid
graph LR
    MIC[Microphone]:::mic -->|WAV bytes| ASR
    subgraph VoicePipeline
        ASR[AsrAdapter]:::asr -->|transcript text| LLM[LLMProvider]:::llm
        LLM -->|reply text| TTS[TtsAdapter]:::tts
    end
    TTS -->|MP3 bytes| SPK[Speaker]:::spk
    classDef mic  fill:#4A90D9,stroke:#2C5F8A,color:#fff
    classDef asr  fill:#7B68EE,stroke:#4B3DB5,color:#fff
    classDef llm  fill:#F5A623,stroke:#C47D0E,color:#fff
    classDef tts  fill:#7ED321,stroke:#4E8A0E,color:#fff
    classDef spk  fill:#E74C3C,stroke:#A93226,color:#fff
```

Streaming TTS Flow (sentence-boundary flushing)

*(screenshot: sentence-boundary flushing flow)*

Runtime Adapter Selection (SpeechAdapterRegistry)

*(screenshot: runtime adapter selection flow)*

Files Changed

| File | Action | Description |
| --- | --- | --- |
| `examples/voice_agent/Cargo.toml` | Created | 4 `[[bin]]` entries; deps: mofa-kernel, mofa-foundation, mofa-integrations (deepgram + elevenlabs + openai-speech), cpal, hound, rodio |
| `examples/voice_agent/src/main.rs` | Created | Interactive mic conversation loop (181 lines) |
| `examples/voice_agent/src/audio_io.rs` | Created | `record_wav()` + `play_mp3()` cross-platform audio I/O (112 lines) |
| `examples/voice_agent/src/file_transcription.rs` | Created | WAV file → ASR → LLM → TTS → `reply.mp3`; silent-WAV fallback for CI |
| `examples/voice_agent/src/adapter_registry.rs` | Created | Runtime ASR provider swap via `SpeechAdapterRegistry` + env var |
| `examples/voice_agent/src/streaming_tts.rs` | Created | Sentence-boundary streaming TTS with `chat_stream()` |
| `examples/Cargo.toml` | Edited | Added voice_agent as workspace member |
| `crates/mofa-sdk/src/lib.rs` | Edited | Added `pub mod voice` |
| `crates/mofa-sdk/src/voice.rs` | Created | Re-exports all voice-related public types from kernel and foundation |
| `docs/mofa-doc/src/guides/voice.md` | Created | Complete voice guide with Mermaid diagrams, quick-start, API reference |

Feature Flags Used

| Flag | Crate | Enables |
| --- | --- | --- |
| `deepgram` | mofa-integrations | `DeepgramAsrAdapter` — Deepgram nova-2 ASR |
| `elevenlabs` | mofa-integrations | `ElevenLabsTtsAdapter` — ElevenLabs v1 TTS |
| `openai-speech` | mofa-integrations | `OpenAiAsrAdapter` (Whisper) + `OpenAiTtsAdapter` |

Testing

Existing Tests — All Passing

| Location | Count | Coverage |
| --- | --- | --- |
| `mofa-kernel/src/speech.rs` | 15 tests | `TtsAdapter`, `AsrAdapter` trait contracts, `TtsConfig`, `AsrConfig`, `AudioFormat` |
| `mofa-foundation/src/voice_pipeline.rs` | 4 tests | End-to-end mock pipeline: ASR → LLM → TTS; error propagation; empty-transcript handling |
| `mofa-foundation/src/speech_registry.rs` | 7 tests | `SpeechAdapterRegistry` register, list, default selection, missing adapter |

New Example Smoke Tests

  • file_transcription with no args: generates 1-second silent WAV, runs full ASR → LLM → TTS pipeline, writes reply.mp3 — requires only API keys, no audio hardware
  • adapter_registry with ASR_PROVIDER=deepgram and ASR_PROVIDER=openai-whisper: verifies both registry paths without recompiling
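The runtime-selection pattern this smoke test exercises can be sketched with a toy registry. The trait and type names below are simplified stand-ins for the real `SpeechAdapterRegistry` API, not its actual signatures.

```rust
use std::collections::HashMap;
use std::env;

// Simplified stand-in for the real AsrAdapter trait: only the name() key matters here.
trait AsrAdapterLike {
    fn name(&self) -> &'static str;
}

struct Deepgram;
struct Whisper;
impl AsrAdapterLike for Deepgram { fn name(&self) -> &'static str { "deepgram" } }
impl AsrAdapterLike for Whisper  { fn name(&self) -> &'static str { "openai-whisper" } }

struct Registry {
    adapters: HashMap<&'static str, Box<dyn AsrAdapterLike>>,
}

impl Registry {
    fn new() -> Self { Registry { adapters: HashMap::new() } }
    fn register(&mut self, a: Box<dyn AsrAdapterLike>) {
        // The adapter is the source of truth for its own key.
        self.adapters.insert(a.name(), a);
    }
    fn select(&self, key: &str) -> Option<&dyn AsrAdapterLike> {
        self.adapters.get(key).map(|b| b.as_ref())
    }
}

fn main() {
    let mut reg = Registry::new();
    reg.register(Box::new(Deepgram));
    reg.register(Box::new(Whisper));
    // Pick the active adapter at runtime; no recompile needed.
    let key = env::var("ASR_PROVIDER").unwrap_or_else(|_| "deepgram".into());
    match reg.select(&key) {
        Some(a) => println!("active ASR: {}", a.name()),
        None => eprintln!("unknown ASR_PROVIDER: {key}"),
    }
}
```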

Key Design Decisions

| Decision | Rationale |
| --- | --- |
| `voice.rs` as a dedicated SDK module (not inline in `lib.rs`) | Keeps the SDK root clean; voice is a large enough surface to warrant its own file |
| `file_transcription` generates the silent WAV internally | CI smoke tests must not require audio hardware; the silent WAV exercises the ASR → LLM → TTS path even when ASR returns an empty transcript (graceful no-op) |
| Sentence-boundary flushing in `streaming_tts` (not word-boundary) | Word-boundary flushing produces too many tiny TTS requests; sentence-boundary gives natural phrasing with acceptable latency reduction |
| Warn-and-continue on streaming TTS sentence failure | A failed TTS sentence must never abort the rest of the response |
| `SpeechAdapterRegistry` keys come from `adapter.name()` | Avoids string duplication; the adapter is the source of truth for its own identifier |
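The sentence-boundary decision can be illustrated with a small buffer-and-drain helper; `drain_sentences` is a hypothetical name, but `streaming_tts` implements the same idea around `chat_stream()`.

```rust
// Buffer streamed LLM tokens and drain a complete sentence whenever
// '.', '!' or '?' arrives, so each sentence can go to TTS immediately.
fn drain_sentences(buf: &mut String) -> Vec<String> {
    let mut out = Vec::new();
    while let Some(i) = buf.find(|c: char| matches!(c, '.' | '!' | '?')) {
        let sentence: String = buf.drain(..=i).collect();
        let sentence = sentence.trim().to_string();
        if !sentence.is_empty() {
            out.push(sentence);
        }
    }
    out
}

fn main() {
    let mut buf = String::new();
    // Simulated token stream from chat_stream()
    for token in ["Rust is ", "fast. It is ", "also safe!", " More soon"] {
        buf.push_str(token);
        for sentence in drain_sentences(&mut buf) {
            println!("flush to TTS: {sentence}"); // one TTS request per sentence
        }
    }
    // The trailing partial sentence (" More soon") stays buffered.
}
```

Flushing per sentence rather than per word keeps each TTS request large enough for natural phrasing while still cutting time-to-first-audio, matching the rationale above.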

Breaking Changes

None. All additions are net-new. No existing APIs were modified. The new pub mod voice in mofa-sdk adds to the public surface without changing anything existing.


Checklist

  • Follows MoFA microkernel architecture patterns (traits in kernel, impls in foundation, examples separate)
  • examples/voice_agent has 4 distinct, well-documented binaries
  • mofa-sdk::voice re-exports complete public voice API
  • Existing 26 voice-related tests still pass
  • CI-friendly smoke test path (silent WAV, no audio hardware)
  • No breaking changes to existing APIs
  • Audio deps (cpal, hound, rodio) scoped to example crate only — no library crate pollution
  • All code comments and doc strings in English (per CLAUDE.md)
  • Documentation updated (docs/mofa-doc/src/guides/voice.md)
  • Feature flags documented in guide and Cargo.toml

Nixxx19 added 5 commits March 15, 2026 04:21
Wire mic → Deepgram ASR → OpenAI LLM → ElevenLabs TTS → speaker in a
complete interactive conversation loop.

New files:
- examples/voice_agent/Cargo.toml — deps: mofa-kernel, mofa-foundation,
  mofa-integrations (deepgram + elevenlabs features), cpal, hound, rodio
- examples/voice_agent/src/audio_io.rs — record_wav() captures PCM from
  default mic via cpal and encodes to WAV bytes with hound; play_mp3()
  decodes and plays audio via rodio
- examples/voice_agent/src/main.rs — validates env vars, builds adapters,
  runs the ASR → LLM → TTS loop, printing each turn's transcript and reply

Updated:
- examples/Cargo.toml — added "voice_agent" member
- examples/Cargo.lock — updated for cpal 0.15, hound 3.5, rodio 0.17

Usage:
  export OPENAI_API_KEY=sk-...
  export DEEPGRAM_API_KEY=...
  export ELEVENLABS_API_KEY=...
  cargo run -p voice_agent
…Task 22)

Add a public `voice` module to mofa-sdk that re-exports the full voice
pipeline API so users have a single import path:

  use mofa_sdk::voice::{VoicePipeline, VoicePipelineConfig, ...};

Re-exported types:
- VoicePipeline, VoicePipelineConfig, VoicePipelineResult (mofa-foundation)
- SpeechAdapterRegistry (mofa-foundation)
- TtsAdapter, AsrAdapter, AudioFormat, AudioOutput, TranscriptionResult,
  TranscriptionSegment, TtsConfig, AsrConfig, VoiceDescriptor (mofa-kernel)

Also add docs/mofa-doc/src/guides/voice.md covering:
- Architecture Mermaid diagram (mic → ASR → LLM → TTS → speaker)
- Quick-start wiring example
- VoicePipelineConfig builder reference
- Adapter selection table (Deepgram, Whisper, ElevenLabs, OpenAI TTS)
- SpeechAdapterRegistry runtime usage
- Feature flags (deepgram, elevenlabs, openai-speech)
- Link to examples/voice_agent runnable demo
file_transcription — CI-friendly: reads a WAV file from disk (or generates
  silent test audio), transcribes with Deepgram ASR, sends to OpenAI LLM,
  synthesises with ElevenLabs TTS, saves reply to reply.mp3. Requires no
  microphone or speakers.

  cargo run --bin file_transcription -- path/to/audio.wav

adapter_registry — demonstrates SpeechAdapterRegistry: registers both
  Deepgram ("deepgram") and OpenAI Whisper ("openai-whisper") ASR adapters
  and selects the active one via ASR_PROVIDER env var at runtime with no
  code changes.

  ASR_PROVIDER=openai-whisper cargo run --bin adapter_registry

streaming_tts — lower time-to-first-audio: streams LLM tokens via
  chat_stream(), buffers until a sentence boundary (. ! ?) is detected, and
  immediately synthesises each completed sentence with ElevenLabs TTS. All
  MP3 chunks are concatenated and saved to streaming_reply.mp3.

  VOICE_PROMPT="Tell me about Rust." cargo run --bin streaming_tts

Also adds openai-speech and futures deps to voice_agent/Cargo.toml.