feat(voice): end-to-end voice pipeline example, SDK exports, and docs (Task 22)#1251
Open

Nixxx19 wants to merge 7 commits into `mofa-org:main`.
Conversation
Wire mic → Deepgram ASR → OpenAI LLM → ElevenLabs TTS → speaker in a complete interactive conversation loop.

New files:
- examples/voice_agent/Cargo.toml — deps: mofa-kernel, mofa-foundation, mofa-integrations (deepgram + elevenlabs features), cpal, hound, rodio
- examples/voice_agent/src/audio_io.rs — record_wav() captures PCM from the default mic via cpal and encodes to WAV bytes with hound; play_mp3() decodes and plays audio via rodio
- examples/voice_agent/src/main.rs — validates env vars, builds adapters, runs the ASR → LLM → TTS loop, printing each turn's transcript and reply

Updated:
- examples/Cargo.toml — added "voice_agent" member
- examples/Cargo.lock — updated for cpal 0.15, hound 3.5, rodio 0.17

Usage:

```
export OPENAI_API_KEY=sk-...
export DEEPGRAM_API_KEY=...
export ELEVENLABS_API_KEY=...
cargo run -p voice_agent
```
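The conversation loop described above can be sketched in miniature. This is a hedged stand-in, not the real MoFA API: the trait names echo the PR's `AsrAdapter`/`TtsAdapter`, but the actual mofa-kernel traits are richer (async, fallible, configurable), and the mock adapters here exist only so the sketch runs without API keys or audio hardware.

```rust
// Simplified, synchronous stand-ins for the real (async) adapter traits.
trait Asr {
    fn transcribe(&self, wav: &[u8]) -> String;
}
trait Llm {
    fn chat(&self, prompt: &str) -> String;
}
trait Tts {
    fn synthesize(&self, text: &str) -> Vec<u8>;
}

/// One turn of the loop: mic WAV -> transcript -> reply text -> MP3 bytes.
fn one_turn(
    asr: &dyn Asr,
    llm: &dyn Llm,
    tts: &dyn Tts,
    wav: &[u8],
) -> (String, String, Vec<u8>) {
    let transcript = asr.transcribe(wav);
    let reply = llm.chat(&transcript);
    let audio = tts.synthesize(&reply);
    (transcript, reply, audio)
}

// Mock adapters so the sketch runs offline.
struct MockAsr;
impl Asr for MockAsr {
    fn transcribe(&self, _wav: &[u8]) -> String { "hello".into() }
}
struct MockLlm;
impl Llm for MockLlm {
    fn chat(&self, prompt: &str) -> String { format!("you said: {prompt}") }
}
struct MockTts;
impl Tts for MockTts {
    fn synthesize(&self, text: &str) -> Vec<u8> { text.as_bytes().to_vec() }
}

fn main() {
    let (transcript, reply, audio) = one_turn(&MockAsr, &MockLlm, &MockTts, &[0u8; 4]);
    println!("transcript: {transcript}");
    println!("reply: {reply} ({} audio bytes)", audio.len());
}
```

The real `main.rs` additionally validates env vars and loops until interrupted; the tuple return here just makes the per-turn data flow explicit.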
Add a public `voice` module to mofa-sdk that re-exports the full voice
pipeline API so users have a single import path:
    use mofa_sdk::voice::{VoicePipeline, VoicePipelineConfig, ...};
Re-exported types:
- VoicePipeline, VoicePipelineConfig, VoicePipelineResult (mofa-foundation)
- SpeechAdapterRegistry (mofa-foundation)
- TtsAdapter, AsrAdapter, AudioFormat, AudioOutput, TranscriptionResult,
TranscriptionSegment, TtsConfig, AsrConfig, VoiceDescriptor (mofa-kernel)
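Given the re-export list above (and the test-file paths named later in this PR: `mofa-foundation/src/voice_pipeline.rs`, `mofa-foundation/src/speech_registry.rs`, `mofa-kernel/src/speech.rs`), the module plausibly looks something like this sketch — the internal module paths are assumptions, not confirmed:

```rust
// crates/mofa-sdk/src/voice.rs (sketch; internal paths assumed)
pub use mofa_foundation::speech_registry::SpeechAdapterRegistry;
pub use mofa_foundation::voice_pipeline::{
    VoicePipeline, VoicePipelineConfig, VoicePipelineResult,
};
pub use mofa_kernel::speech::{
    AsrAdapter, AsrConfig, AudioFormat, AudioOutput, TranscriptionResult,
    TranscriptionSegment, TtsAdapter, TtsConfig, VoiceDescriptor,
};
```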
Also add docs/mofa-doc/src/guides/voice.md covering:
- Architecture Mermaid diagram (mic → ASR → LLM → TTS → speaker)
- Quick-start wiring example
- VoicePipelineConfig builder reference
- Adapter selection table (Deepgram, Whisper, ElevenLabs, OpenAI TTS)
- SpeechAdapterRegistry runtime usage
- Feature flags (deepgram, elevenlabs, openai-speech)
- Link to examples/voice_agent runnable demo
file_transcription — CI-friendly: reads a WAV file from disk (or generates
silent test audio), transcribes with Deepgram ASR, sends to OpenAI LLM,
synthesises with ElevenLabs TTS, saves reply to reply.mp3. Requires no
microphone or speakers.
cargo run --bin file_transcription -- path/to/audio.wav
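The silent-WAV fallback is what makes this binary CI-friendly. The real example generates its WAV with the `hound` crate; purely for illustration, here is a std-only sketch that writes the 44-byte RIFF header by hand for 1-channel 16-bit PCM. `silent_wav` is a hypothetical helper name, not the example's actual function.

```rust
/// Build an in-memory WAV file of silence: 44-byte RIFF header + zeroed
/// 16-bit mono PCM samples.
fn silent_wav(seconds: u32, sample_rate: u32) -> Vec<u8> {
    let num_samples = seconds * sample_rate;
    let data_len = num_samples * 2; // 2 bytes per 16-bit sample
    let mut w = Vec::with_capacity(44 + data_len as usize);
    w.extend_from_slice(b"RIFF");
    w.extend_from_slice(&(36 + data_len).to_le_bytes()); // file size - 8
    w.extend_from_slice(b"WAVEfmt ");
    w.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size
    w.extend_from_slice(&1u16.to_le_bytes()); // audio format: PCM
    w.extend_from_slice(&1u16.to_le_bytes()); // channels: mono
    w.extend_from_slice(&sample_rate.to_le_bytes());
    w.extend_from_slice(&(sample_rate * 2).to_le_bytes()); // byte rate
    w.extend_from_slice(&2u16.to_le_bytes()); // block align
    w.extend_from_slice(&16u16.to_le_bytes()); // bits per sample
    w.extend_from_slice(b"data");
    w.extend_from_slice(&data_len.to_le_bytes());
    w.resize(44 + data_len as usize, 0); // silence
    w
}

fn main() {
    let wav = silent_wav(1, 16_000);
    println!("generated {} bytes of silent WAV", wav.len());
}
```

In CI this means the pipeline can be exercised with nothing but API keys: the generated bytes are handed straight to the ASR adapter in place of a file from disk.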
adapter_registry — demonstrates SpeechAdapterRegistry: registers both
Deepgram ("deepgram") and OpenAI Whisper ("openai-whisper") ASR adapters
and selects the active one via ASR_PROVIDER env var at runtime with no
code changes.
ASR_PROVIDER=openai-whisper cargo run --bin adapter_registry
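The registry pattern behind this binary can be sketched with std only. This is a hedged miniature, not the real `SpeechAdapterRegistry` API in mofa-foundation; it only shows the two ideas the PR calls out — keys come from `adapter.name()`, and `ASR_PROVIDER` selects at runtime with no recompile.

```rust
use std::collections::HashMap;
use std::env;

trait AsrAdapter {
    fn name(&self) -> &'static str;
}

struct Deepgram;
impl AsrAdapter for Deepgram {
    fn name(&self) -> &'static str { "deepgram" }
}
struct Whisper;
impl AsrAdapter for Whisper {
    fn name(&self) -> &'static str { "openai-whisper" }
}

struct Registry {
    adapters: HashMap<&'static str, Box<dyn AsrAdapter>>,
}

impl Registry {
    fn new() -> Self {
        Registry { adapters: HashMap::new() }
    }
    fn register(&mut self, a: Box<dyn AsrAdapter>) {
        // The registry key comes from the adapter itself.
        self.adapters.insert(a.name(), a);
    }
    fn select(&self, key: &str) -> Option<&dyn AsrAdapter> {
        self.adapters.get(key).map(|b| b.as_ref())
    }
}

fn main() {
    let mut reg = Registry::new();
    reg.register(Box::new(Deepgram));
    reg.register(Box::new(Whisper));
    // Pick the active adapter from the environment, defaulting to deepgram.
    let key = env::var("ASR_PROVIDER").unwrap_or_else(|_| "deepgram".into());
    let asr = reg.select(&key).expect("unknown ASR_PROVIDER");
    println!("active ASR adapter: {}", asr.name());
}
```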
streaming_tts — lower time-to-first-audio: streams LLM tokens via
chat_stream(), buffers until a sentence boundary (. ! ?) is detected, and
immediately synthesises each completed sentence with ElevenLabs TTS. All
MP3 chunks are concatenated and saved to streaming_reply.mp3.
VOICE_PROMPT="Tell me about Rust." cargo run --bin streaming_tts
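The sentence-boundary buffering described above can be sketched as follows. `feed_tokens` and the `flush_tts` callback are hypothetical names standing in for the real `chat_stream()` consumer and the ElevenLabs call; the mechanism — buffer tokens, flush on `.` `!` `?` — is what the binary does.

```rust
/// Append streamed LLM tokens to a buffer; each time a sentence
/// terminator (. ! ?) appears, flush the completed sentence to TTS
/// immediately instead of waiting for the full reply.
fn feed_tokens<'a, I, F>(tokens: I, mut flush_tts: F)
where
    I: IntoIterator<Item = &'a str>,
    F: FnMut(&str),
{
    let mut buf = String::new();
    for tok in tokens {
        buf.push_str(tok);
        // A token may complete more than one sentence; drain them all.
        while let Some(i) = buf.find(|c: char| matches!(c, '.' | '!' | '?')) {
            let sentence: String = buf.drain(..=i).collect();
            flush_tts(sentence.trim());
        }
    }
    if !buf.trim().is_empty() {
        flush_tts(buf.trim()); // trailing fragment with no terminator
    }
}

fn main() {
    let mut sentences = Vec::new();
    feed_tokens(
        ["Rust is fast", ". It is also ", "safe! Memory", " bugs are rare."],
        |s| sentences.push(s.to_string()),
    );
    println!("{sentences:?}");
}
```

Flushing per sentence is what buys the lower time-to-first-audio: the first MP3 chunk exists as soon as the first sentence does, while the LLM is still streaming the rest.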
Also adds openai-speech and futures deps to voice_agent/Cargo.toml.
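The design decisions later in this PR note a warn-and-continue policy when a single sentence fails TTS synthesis. A minimal sketch of that policy, with `synthesize_all` and the `synth` callback as hypothetical stand-ins for the real ElevenLabs call:

```rust
/// Synthesise each sentence, concatenating the resulting MP3 chunks.
/// A failed sentence is logged and skipped rather than aborting the turn.
fn synthesize_all(
    sentences: &[&str],
    synth: impl Fn(&str) -> Result<Vec<u8>, String>,
) -> Vec<u8> {
    let mut mp3 = Vec::new();
    for &s in sentences {
        match synth(s) {
            Ok(chunk) => mp3.extend_from_slice(&chunk),
            Err(e) => eprintln!("warn: TTS failed for {s:?}: {e}; continuing"),
        }
    }
    mp3
}

fn main() {
    // Simulate one rate-limited sentence in the middle of a reply.
    let out = synthesize_all(&["ok.", "boom.", "fine."], |s| {
        if s == "boom." {
            Err("rate limited".into())
        } else {
            Ok(s.as_bytes().to_vec())
        }
    });
    println!("kept {} bytes", out.len());
}
```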
feat(voice): End-to-End Voice Pipeline — Example, SDK Exports & Docs (Task 22)
fixes #1252
Summary
This PR surfaces MoFA's existing but previously inaccessible voice pipeline.
`VoicePipeline`, `SpeechAdapterRegistry`, and all three cloud speech adapters (Deepgram ASR, OpenAI Whisper, ElevenLabs TTS) were fully implemented in mofa-foundation and mofa-integrations but had no runnable example, no mofa-sdk entry point, and no documentation. This PR closes all three gaps: a new `examples/voice_agent/` multi-binary crate wires mic → ASR → LLM → TTS → speaker in an interactive loop, `crates/mofa-sdk/src/voice.rs` re-exports the entire voice API under a single import, and `docs/mofa-doc/src/guides/voice.md` adds a colourful Mermaid architecture diagram, quick-start instructions, a provider-swap guide, and a full API reference. Three additional targeted examples cover CI-friendly file transcription, runtime adapter selection, and low-latency streaming TTS.

Pain Points Addressed
Before This PR
No Runnable Entry Point
- `VoicePipeline`, `DeepgramAsrAdapter`, `ElevenLabsTtsAdapter`, and `OpenAiAsrAdapter` existed, but no code wired them together end-to-end.
- No `main.rs` demonstrated a working conversation loop.

No SDK Surface
- `mofa-sdk` re-exported agent, tool, and persistence types but nothing from the voice layer.
- Users had to import `mofa-foundation::voice_pipeline` and `mofa-kernel::speech` directly, bypassing the single-entry-point promise of the SDK.
- No `use mofa_sdk::voice::*` pattern was possible.

No Documentation
- `docs/mofa-doc/src/guides/` had guides for monitoring, multi-agent coordination, secretary-agent, and persistence — but nothing for voice.
- Feature flags (`deepgram`, `elevenlabs`, `openai-speech`) and adapter name keys were undocumented.

No CI-Friendly Examples
Why This Was Needed
- `cargo run -p voice_agent` should just work. `use mofa_sdk::voice::*` should give you everything. Both are now true.
- The `file_transcription` example auto-generates a silent WAV when given no input, so the pipeline can be smoke-tested in CI with only API keys.
- The `TtsAdapter`/`AsrAdapter` traits in mofa-kernel, the `VoicePipeline` in mofa-foundation, and the cloud adapters in mofa-integrations form a complete, tested stack. This PR connects that stack to the user-facing surface (mofa-sdk + examples/).

What Was Added
examples/voice_agent/ (New Multi-Binary Crate)

Four binaries covering the full voice use-case spectrum:

- `voice_agent` (`main.rs`) — interactive mic → ASR → LLM → TTS → speaker conversation loop
- `file_transcription` — transcribes a WAV file and writes `reply.mp3`; generates a silent WAV when no file is given — CI-friendly
- `adapter_registry` — selects the active ASR adapter via the `ASR_PROVIDER` env var at runtime — no recompile needed
- `streaming_tts` — streams LLM tokens via `chat_stream()`; flushes each completed sentence to TTS immediately — lower time-to-first-audio

Audio I/O (`audio_io.rs`): `record_wav()` captures mic audio via cpal; `play_mp3()` plays audio via rodio.

crates/mofa-sdk/src/voice.rs (New)

Single-import access to the entire voice API.

docs/mofa-doc/src/guides/voice.md (New)

Complete voice guide, including the `VoicePipeline` API reference and `SpeechAdapterRegistry` usage.

Architecture Diagrams
Voice Pipeline Data Flow
```mermaid
graph LR
    MIC[Microphone]:::mic -->|WAV bytes| ASR
    subgraph VoicePipeline
        ASR[AsrAdapter]:::asr -->|transcript text| LLM[LLMProvider]:::llm
        LLM -->|reply text| TTS[TtsAdapter]:::tts
    end
    TTS -->|MP3 bytes| SPK[Speaker]:::spk
    classDef mic fill:#4A90D9,stroke:#2C5F8A,color:#fff
    classDef asr fill:#7B68EE,stroke:#4B3DB5,color:#fff
    classDef llm fill:#F5A623,stroke:#C47D0E,color:#fff
    classDef tts fill:#7ED321,stroke:#4E8A0E,color:#fff
    classDef spk fill:#E74C3C,stroke:#A93226,color:#fff
```

Streaming TTS Flow (sentence-boundary flushing)
Runtime Adapter Selection (`SpeechAdapterRegistry`)

Files Changed
- `examples/voice_agent/Cargo.toml` — `[[bin]]` entries; deps: mofa-kernel, mofa-foundation, mofa-integrations (deepgram + elevenlabs + openai-speech), cpal, hound, rodio
- `examples/voice_agent/src/main.rs` — interactive conversation loop
- `examples/voice_agent/src/audio_io.rs` — `record_wav()` + `play_mp3()` cross-platform audio I/O (112 lines)
- `examples/voice_agent/src/file_transcription.rs` — file → `reply.mp3`; silent-WAV fallback for CI
- `examples/voice_agent/src/adapter_registry.rs` — `SpeechAdapterRegistry` + env-var selection
- `examples/voice_agent/src/streaming_tts.rs` — streaming LLM → TTS via `chat_stream()`
- `examples/Cargo.toml` — adds `voice_agent` as workspace member
- `crates/mofa-sdk/src/lib.rs` — adds `pub mod voice`
- `crates/mofa-sdk/src/voice.rs` — new voice re-export module
- `docs/mofa-doc/src/guides/voice.md` — new voice guide

Feature Flags Used
| Flag | Crate | Provides |
| --- | --- | --- |
| `deepgram` | mofa-integrations | `DeepgramAsrAdapter` — Deepgram nova-2 ASR |
| `elevenlabs` | mofa-integrations | `ElevenLabsTtsAdapter` — ElevenLabs v1 TTS |
| `openai-speech` | mofa-integrations | `OpenAiAsrAdapter` (Whisper) + `OpenAiTtsAdapter` |
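Putting the flag and dependency lists together, a hypothetical excerpt of `examples/voice_agent/Cargo.toml` might look like the following — the `path` values and the `futures` version are assumptions; only the feature names and the cpal/hound/rodio versions come from this PR:

```toml
[dependencies]
# Workspace paths are assumed, not confirmed by the PR.
mofa-kernel = { path = "../../crates/mofa-kernel" }
mofa-foundation = { path = "../../crates/mofa-foundation" }
mofa-integrations = { path = "../../crates/mofa-integrations", features = [
    "deepgram",
    "elevenlabs",
    "openai-speech",
] }
cpal = "0.15"   # mic capture
hound = "3.5"   # WAV encoding
rodio = "0.17"  # MP3 playback
futures = "0.3" # version assumed; used by streaming_tts
```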
Existing Tests — All Passing
- `mofa-kernel/src/speech.rs` — `TtsAdapter`/`AsrAdapter` trait contracts, `TtsConfig`, `AsrConfig`, `AudioFormat`
- `mofa-foundation/src/voice_pipeline.rs`
- `mofa-foundation/src/speech_registry.rs` — `SpeechAdapterRegistry` register, list, default selection, missing adapter

New Example Smoke Tests
- `file_transcription` with no args: generates a 1-second silent WAV, runs the full ASR → LLM → TTS pipeline, writes `reply.mp3` — requires only API keys, no audio hardware
- `adapter_registry` with `ASR_PROVIDER=deepgram` and `ASR_PROVIDER=openai-whisper`: verifies both registry paths without recompiling
- `voice.rs` as a dedicated SDK module (not inline in `lib.rs`)
- `file_transcription` generates its silent WAV internally
- `streaming_tts` flushes on sentence boundaries (not word boundaries)
- Warn-and-continue on streaming TTS sentence failure
- `SpeechAdapterRegistry` keys come from `adapter.name()`

Breaking Changes
None. All additions are net-new; no existing APIs were modified. The new `pub mod voice` in mofa-sdk adds to the public surface without changing anything existing.

Checklist
- `examples/voice_agent` has 4 distinct, well-documented binaries
- `mofa-sdk::voice` re-exports the complete public voice API
- Audio deps (`cpal`, `hound`, `rodio`) scoped to the example crate only — no library crate pollution
- Docs guide added (`docs/mofa-doc/src/guides/voice.md`)
- `Cargo.toml`