
feat(voice): end-to-end voice pipeline example, SDK exports, and docs (Task 22)#1251

Open
Nixxx19 wants to merge 7 commits into `mofa-org:main` from `Nixxx19:feat/task-22-voice-integration`

Conversation

Contributor

@Nixxx19 Nixxx19 commented Mar 15, 2026

feat(voice): End-to-End Voice Pipeline — Example, SDK Exports & Docs (Task 22)

fixes #1252


Summary

This PR surfaces MoFA's existing but previously inaccessible voice pipeline. VoicePipeline, SpeechAdapterRegistry, and all three cloud speech adapters (Deepgram ASR, OpenAI Whisper, ElevenLabs TTS) were fully implemented in mofa-foundation and mofa-integrations, but had no runnable example, no mofa-sdk entry point, and no documentation. This PR closes all three gaps:

  • a new examples/voice_agent/ multi-binary crate wires mic → ASR → LLM → TTS → speaker in an interactive loop,
  • crates/mofa-sdk/src/voice.rs re-exports the entire voice API under a single import, and
  • docs/mofa-doc/src/guides/voice.md adds a colourful Mermaid architecture diagram, quick-start instructions, a provider-swap guide, and a full API reference.

Three additional targeted examples cover CI-friendly file transcription, runtime adapter selection, and low-latency streaming TTS.


Pain Points Addressed

Before This PR

  1. No Runnable Entry Point

    • VoicePipeline, DeepgramAsrAdapter, ElevenLabsTtsAdapter, and OpenAiAsrAdapter existed but no code wired them together end-to-end
    • Users had to manually compose all three cloud adapters, audio capture, and playback from scratch
    • No main.rs demonstrated a working conversation loop
  2. No SDK Surface

    • mofa-sdk re-exported agent, tool, and persistence types but nothing from the voice layer
    • Users had to reach into mofa-foundation::voice_pipeline and mofa-kernel::speech directly, bypassing the single-entry-point promise of the SDK
    • No use mofa_sdk::voice::* pattern was possible
  3. No Documentation

    • docs/mofa-doc/src/guides/ had guides for monitoring, multi-agent coordination, secretary-agent, and persistence — but nothing for voice
    • New users had no way to discover that a voice pipeline existed or how to configure it
    • Feature flags (deepgram, elevenlabs, openai-speech) and adapter name keys were undocumented
  4. No CI-Friendly Examples

    • Every voice use case required a live microphone and speakers
    • No smoke-test path existed for CI environments without audio hardware

Why This Was Needed

  1. Feature Discoverability — A complete voice pipeline sitting behind zero documentation and no SDK re-export is effectively invisible. This PR makes it a first-class, documented, runnable feature.
  2. Developer Experience — `cargo run -p voice_agent` should just work, and `use mofa_sdk::voice::*` should give you everything. Both are now true.
  3. CI Safety — The file_transcription example auto-generates a silent WAV when given no input, so the pipeline can be smoke-tested in CI with only API keys.
  4. Architecture Completion — The microkernel's TtsAdapter/AsrAdapter traits in mofa-kernel, the VoicePipeline in mofa-foundation, and the cloud adapters in mofa-integrations form a complete, tested stack. This PR connects the stack to the user-facing surface (mofa-sdk + examples/).
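The CI-safety point can be made concrete: a valid silent WAV is just a 44-byte RIFF header followed by zeroed PCM samples. The example crate builds its file with the hound crate; the hand-rolled sketch below is purely illustrative of the fallback idea (one second of 16 kHz mono 16-bit silence).

```rust
// Illustrative sketch of the silent-WAV fallback (the real example uses `hound`):
// build a 44-byte RIFF/WAVE header followed by zeroed 16-bit mono PCM samples.
fn silent_wav(sample_rate: u32, seconds: u32) -> Vec<u8> {
    let data_len = sample_rate * seconds * 2; // 16-bit mono => 2 bytes/sample
    let mut wav = Vec::with_capacity(44 + data_len as usize);
    wav.extend_from_slice(b"RIFF");
    wav.extend_from_slice(&(36 + data_len).to_le_bytes()); // RIFF chunk size
    wav.extend_from_slice(b"WAVEfmt ");
    wav.extend_from_slice(&16u32.to_le_bytes());              // fmt chunk size
    wav.extend_from_slice(&1u16.to_le_bytes());               // PCM
    wav.extend_from_slice(&1u16.to_le_bytes());               // mono
    wav.extend_from_slice(&sample_rate.to_le_bytes());
    wav.extend_from_slice(&(sample_rate * 2).to_le_bytes());  // byte rate
    wav.extend_from_slice(&2u16.to_le_bytes());               // block align
    wav.extend_from_slice(&16u16.to_le_bytes());              // bits per sample
    wav.extend_from_slice(b"data");
    wav.extend_from_slice(&data_len.to_le_bytes());
    wav.resize(44 + data_len as usize, 0);                    // silence
    wav
}

fn main() {
    let wav = silent_wav(16_000, 1);
    println!("{} bytes", wav.len()); // 44-byte header + 32000 bytes of PCM
}
```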

What Was Added

examples/voice_agent/ (New Multi-Binary Crate)

Four binaries covering the full voice use-case spectrum:

| Binary | Description |
| --- | --- |
| `voice_agent` (`main.rs`) | Interactive mic → Deepgram ASR → OpenAI LLM → ElevenLabs TTS → speaker loop; press Enter to record 5 s, then hear the reply |
| `file_transcription` | WAV file → ASR → LLM → TTS → `reply.mp3`; generates a silent WAV when no file is given — CI-friendly |
| `adapter_registry` | Registers Deepgram and OpenAI Whisper; picks the active adapter from the `ASR_PROVIDER` env var at runtime — no recompile needed |
| `streaming_tts` | Streams LLM tokens via `chat_stream()`; flushes each completed sentence to TTS immediately — lower time-to-first-audio |
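The per-turn composition in the main binary can be sketched with stubs; `transcribe`, `chat`, and `synthesize` below are illustrative placeholders standing in for the real adapter calls, not the crate's API.

```rust
// Hypothetical sketch of one conversation turn in voice_agent.
// The stubs stand in for Deepgram ASR, the OpenAI LLM, and ElevenLabs TTS.
fn transcribe(_wav: &[u8]) -> String { "hello".to_string() }      // ASR stub
fn chat(prompt: &str) -> String { format!("You said: {prompt}") } // LLM stub
fn synthesize(text: &str) -> Vec<u8> { text.as_bytes().to_vec() } // TTS stub

fn run_turn(wav: &[u8]) -> (String, Vec<u8>) {
    let transcript = transcribe(wav); // mic audio -> text
    let reply = chat(&transcript);    // text -> LLM reply
    let audio = synthesize(&reply);   // reply -> audio bytes for playback
    (reply, audio)
}

fn main() {
    let (reply, audio) = run_turn(&[0u8; 16]);
    println!("{reply} ({} audio bytes)", audio.len());
}
```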

Audio I/O (audio_io.rs):

| Function | Library | Platform |
| --- | --- | --- |
| `record_wav()` | cpal | ALSA / CoreAudio / WASAPI |
| `play_mp3()` | rodio | Cross-platform |

crates/mofa-sdk/src/voice.rs (New)

Single-import access to the entire voice API:

```rust
use mofa_sdk::voice::{
    VoicePipeline, VoicePipelineConfig, VoicePipelineResult,
    SpeechAdapterRegistry,
    TtsAdapter, AsrAdapter,
    TtsConfig, AsrConfig,
    AudioFormat, AudioOutput, TranscriptionResult, TranscriptionSegment,
    VoiceDescriptor,
};
```

docs/mofa-doc/src/guides/voice.md (New)

Complete voice guide including:

  • Colourful Mermaid architecture diagram (mic → ASR → LLM → TTS → speaker with per-component colour coding)
  • Environment variable quick-start
  • VoicePipeline API reference
  • ASR provider swap guide (Deepgram ↔ OpenAI Whisper)
  • TTS provider swap guide (ElevenLabs ↔ OpenAI TTS)
  • SpeechAdapterRegistry usage
  • Feature flags table

Architecture Diagrams

Voice Pipeline Data Flow

```mermaid
graph LR
    MIC[Microphone]:::mic -->|WAV bytes| ASR
    subgraph VoicePipeline
        ASR[AsrAdapter]:::asr -->|transcript text| LLM[LLMProvider]:::llm
        LLM -->|reply text| TTS[TtsAdapter]:::tts
    end
    TTS -->|MP3 bytes| SPK[Speaker]:::spk
    classDef mic  fill:#4A90D9,stroke:#2C5F8A,color:#fff
    classDef asr  fill:#7B68EE,stroke:#4B3DB5,color:#fff
    classDef llm  fill:#F5A623,stroke:#C47D0E,color:#fff
    classDef tts  fill:#7ED321,stroke:#4E8A0E,color:#fff
    classDef spk  fill:#E74C3C,stroke:#A93226,color:#fff
```

Streaming TTS Flow (sentence-boundary flushing)

*(screenshot: sentence-boundary flushing flow)*

Runtime Adapter Selection (SpeechAdapterRegistry)

*(screenshot: runtime adapter selection flow)*

Files Changed

| File | Action | Description |
| --- | --- | --- |
| `examples/voice_agent/Cargo.toml` | Created | 4 `[[bin]]` entries; deps: mofa-kernel, mofa-foundation, mofa-integrations (deepgram + elevenlabs + openai-speech), cpal, hound, rodio |
| `examples/voice_agent/src/main.rs` | Created | Interactive mic conversation loop (181 lines) |
| `examples/voice_agent/src/audio_io.rs` | Created | `record_wav()` + `play_mp3()` cross-platform audio I/O (112 lines) |
| `examples/voice_agent/src/file_transcription.rs` | Created | WAV file → ASR → LLM → TTS → `reply.mp3`; silent-WAV fallback for CI |
| `examples/voice_agent/src/adapter_registry.rs` | Created | Runtime ASR provider swap via `SpeechAdapterRegistry` + env var |
| `examples/voice_agent/src/streaming_tts.rs` | Created | Sentence-boundary streaming TTS with `chat_stream()` |
| `examples/Cargo.toml` | Edited | Added voice_agent as workspace member |
| `crates/mofa-sdk/src/lib.rs` | Edited | Added `pub mod voice` |
| `crates/mofa-sdk/src/voice.rs` | Created | Re-exports all voice-related public types from kernel and foundation |
| `docs/mofa-doc/src/guides/voice.md` | Created | Complete voice guide with Mermaid diagrams, quick-start, API reference |

Feature Flags Used

| Flag | Crate | Enables |
| --- | --- | --- |
| `deepgram` | mofa-integrations | `DeepgramAsrAdapter` — Deepgram nova-2 ASR |
| `elevenlabs` | mofa-integrations | `ElevenLabsTtsAdapter` — ElevenLabs v1 TTS |
| `openai-speech` | mofa-integrations | `OpenAiAsrAdapter` (Whisper) + `OpenAiTtsAdapter` |

Testing

Existing Tests — All Passing

| Location | Count | Coverage |
| --- | --- | --- |
| `mofa-kernel/src/speech.rs` | 15 tests | `TtsAdapter`, `AsrAdapter` trait contracts, `TtsConfig`, `AsrConfig`, `AudioFormat` |
| `mofa-foundation/src/voice_pipeline.rs` | 4 tests | End-to-end mock pipeline: ASR → LLM → TTS; error propagation; empty-transcript handling |
| `mofa-foundation/src/speech_registry.rs` | 7 tests | `SpeechAdapterRegistry` register, list, default selection, missing adapter |

New Example Smoke Tests

  • file_transcription with no args: generates 1-second silent WAV, runs full ASR → LLM → TTS pipeline, writes reply.mp3 — requires only API keys, no audio hardware
  • adapter_registry with ASR_PROVIDER=deepgram and ASR_PROVIDER=openai-whisper: verifies both registry paths without recompiling
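The runtime-selection pattern this smoke test exercises can be sketched with a toy registry. The trait and type names below are simplified stand-ins for the real `SpeechAdapterRegistry` API, not its actual signatures.

```rust
use std::collections::HashMap;
use std::env;

// Simplified stand-in for the real AsrAdapter trait: only the name() key matters here.
trait AsrAdapterLike {
    fn name(&self) -> &'static str;
}

struct Deepgram;
struct Whisper;
impl AsrAdapterLike for Deepgram { fn name(&self) -> &'static str { "deepgram" } }
impl AsrAdapterLike for Whisper  { fn name(&self) -> &'static str { "openai-whisper" } }

struct Registry {
    adapters: HashMap<&'static str, Box<dyn AsrAdapterLike>>,
}

impl Registry {
    fn new() -> Self { Registry { adapters: HashMap::new() } }
    fn register(&mut self, a: Box<dyn AsrAdapterLike>) {
        // The adapter is the source of truth for its own key.
        self.adapters.insert(a.name(), a);
    }
    fn select(&self, key: &str) -> Option<&dyn AsrAdapterLike> {
        self.adapters.get(key).map(|b| b.as_ref())
    }
}

fn main() {
    let mut reg = Registry::new();
    reg.register(Box::new(Deepgram));
    reg.register(Box::new(Whisper));
    // Pick the active adapter at runtime; no recompile needed.
    let key = env::var("ASR_PROVIDER").unwrap_or_else(|_| "deepgram".into());
    match reg.select(&key) {
        Some(a) => println!("active ASR: {}", a.name()),
        None => eprintln!("unknown ASR_PROVIDER: {key}"),
    }
}
```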

Key Design Decisions

| Decision | Rationale |
| --- | --- |
| `voice.rs` as a dedicated SDK module (not inline in `lib.rs`) | Keeps the SDK root clean; voice is a large enough surface to warrant its own file |
| `file_transcription` generates the silent WAV internally | CI smoke tests must not require audio hardware; the silent WAV exercises the ASR → LLM → TTS path even when ASR returns an empty transcript (graceful no-op) |
| Sentence-boundary flushing in `streaming_tts` (not word-boundary) | Word-boundary flushing produces too many tiny TTS requests; sentence-boundary gives natural phrasing with acceptable latency reduction |
| Warn-and-continue on streaming TTS sentence failure | A failed TTS sentence must never abort the rest of the response |
| `SpeechAdapterRegistry` keys come from `adapter.name()` | Avoids string duplication; the adapter is the source of truth for its own identifier |
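The sentence-boundary decision can be illustrated with a small buffer-and-drain helper; `drain_sentences` is a hypothetical name, but `streaming_tts` implements the same idea around `chat_stream()`.

```rust
// Buffer streamed LLM tokens and drain a complete sentence whenever
// '.', '!' or '?' arrives, so each sentence can go to TTS immediately.
fn drain_sentences(buf: &mut String) -> Vec<String> {
    let mut out = Vec::new();
    while let Some(i) = buf.find(|c: char| matches!(c, '.' | '!' | '?')) {
        let sentence: String = buf.drain(..=i).collect();
        let sentence = sentence.trim().to_string();
        if !sentence.is_empty() {
            out.push(sentence);
        }
    }
    out
}

fn main() {
    let mut buf = String::new();
    // Simulated token stream from chat_stream()
    for token in ["Rust is ", "fast. It is ", "also safe!", " More soon"] {
        buf.push_str(token);
        for sentence in drain_sentences(&mut buf) {
            println!("flush to TTS: {sentence}"); // one TTS request per sentence
        }
    }
    // The trailing partial sentence (" More soon") stays buffered.
}
```

Flushing per sentence rather than per word keeps each TTS request large enough for natural phrasing while still cutting time-to-first-audio, matching the rationale above.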

Breaking Changes

None. All additions are net-new. No existing APIs were modified. The new pub mod voice in mofa-sdk adds to the public surface without changing anything existing.


Checklist

  • Follows MoFA microkernel architecture patterns (traits in kernel, impls in foundation, examples separate)
  • examples/voice_agent has 4 distinct, well-documented binaries
  • mofa-sdk::voice re-exports complete public voice API
  • Existing 26 voice-related tests still pass
  • CI-friendly smoke test path (silent WAV, no audio hardware)
  • No breaking changes to existing APIs
  • Audio deps (cpal, hound, rodio) scoped to example crate only — no library crate pollution
  • All code comments and doc strings in English (per CLAUDE.md)
  • Documentation updated (docs/mofa-doc/src/guides/voice.md)
  • Feature flags documented in guide and Cargo.toml

Nixxx19 added 5 commits March 15, 2026 04:21
Wire mic → Deepgram ASR → OpenAI LLM → ElevenLabs TTS → speaker in a
complete interactive conversation loop.

New files:
- examples/voice_agent/Cargo.toml — deps: mofa-kernel, mofa-foundation,
  mofa-integrations (deepgram + elevenlabs features), cpal, hound, rodio
- examples/voice_agent/src/audio_io.rs — record_wav() captures PCM from
  default mic via cpal and encodes to WAV bytes with hound; play_mp3()
  decodes and plays audio via rodio
- examples/voice_agent/src/main.rs — validates env vars, builds adapters,
  runs the ASR → LLM → TTS loop, printing each turn's transcript and reply

Updated:
- examples/Cargo.toml — added "voice_agent" member
- examples/Cargo.lock — updated for cpal 0.15, hound 3.5, rodio 0.17

Usage:
  export OPENAI_API_KEY=sk-...
  export DEEPGRAM_API_KEY=...
  export ELEVENLABS_API_KEY=...
  cargo run -p voice_agent
…Task 22)

Add a public `voice` module to mofa-sdk that re-exports the full voice
pipeline API so users have a single import path:

  use mofa_sdk::voice::{VoicePipeline, VoicePipelineConfig, ...};

Re-exported types:
- VoicePipeline, VoicePipelineConfig, VoicePipelineResult (mofa-foundation)
- SpeechAdapterRegistry (mofa-foundation)
- TtsAdapter, AsrAdapter, AudioFormat, AudioOutput, TranscriptionResult,
  TranscriptionSegment, TtsConfig, AsrConfig, VoiceDescriptor (mofa-kernel)

Also add docs/mofa-doc/src/guides/voice.md covering:
- Architecture Mermaid diagram (mic → ASR → LLM → TTS → speaker)
- Quick-start wiring example
- VoicePipelineConfig builder reference
- Adapter selection table (Deepgram, Whisper, ElevenLabs, OpenAI TTS)
- SpeechAdapterRegistry runtime usage
- Feature flags (deepgram, elevenlabs, openai-speech)
- Link to examples/voice_agent runnable demo
file_transcription — CI-friendly: reads a WAV file from disk (or generates
  silent test audio), transcribes with Deepgram ASR, sends to OpenAI LLM,
  synthesises with ElevenLabs TTS, saves reply to reply.mp3. Requires no
  microphone or speakers.

  cargo run --bin file_transcription -- path/to/audio.wav

adapter_registry — demonstrates SpeechAdapterRegistry: registers both
  Deepgram ("deepgram") and OpenAI Whisper ("openai-whisper") ASR adapters
  and selects the active one via ASR_PROVIDER env var at runtime with no
  code changes.

  ASR_PROVIDER=openai-whisper cargo run --bin adapter_registry

streaming_tts — lower time-to-first-audio: streams LLM tokens via
  chat_stream(), buffers until a sentence boundary (. ! ?) is detected, and
  immediately synthesises each completed sentence with ElevenLabs TTS. All
  MP3 chunks are concatenated and saved to streaming_reply.mp3.

  VOICE_PROMPT="Tell me about Rust." cargo run --bin streaming_tts

Also adds openai-speech and futures deps to voice_agent/Cargo.toml.