A curated list of AI audio generation APIs, SDKs, and production-ready tools — covering text-to-speech, music generation, and sound design. Focused on services developers can integrate today.
Maintained by Backblaze.
- Awesome Image Generation
- Awesome Video Generation
- Awesome ML Data Pipelines
- Awesome Multimodal Data
- Awesome Agent Infrastructure
- Awesome Physical AI
- Text-to-Speech APIs
- Music and Sound Generation APIs
- Open Source TTS Models
- Open Source Music and Audio Models
- Audio Processing
- Voice Agent Frameworks
- Real-Time Audio Infrastructure
- GPU Cloud Providers
- Evaluation and Observability
- Templates and Example Projects
Commercial TTS APIs with hosted inference and developer SDKs.
- Amazon Polly – Neural TTS with SSML support, accessible through any AWS SDK. Docs | SDK: Python (boto3), Node, Java, .NET, Go, Ruby, PHP
- Cartesia – Ultra-low latency Sonic model family with WebSocket multiplexing. Docs | SDK: Python
- Deepgram – Aura TTS model with WebSocket streaming. Also offers a Voice Agent API bundling STT + TTS + LLM orchestration. Docs | SDK: Python, JavaScript, .NET, Go
- ElevenLabs – Industry-leading TTS with voice cloning, multilingual support, and streaming. Flagship eleven_v3 model. Docs | SDK: Python, JavaScript, Flutter, Swift, Kotlin
- Fish Audio – Hosted TTS and voice cloning API backed by the open-source Fish Speech model. S2 Pro supports 80+ languages, voice cloning from 15s of audio, WebSocket streaming. Python and JavaScript SDKs. Docs | SDK: Python, JavaScript
- Google Cloud TTS – 380+ voices across 75+ languages. Chirp 3 HD voices available. Docs | SDK: Python, Java, Node, Go, Ruby, PHP, C#
- Hume AI (Octave) – LLM-trained TTS that understands context and emotion. Octave 2 supports 11+ languages. Also includes EVI for voice agents. Docs | SDK: Python, TypeScript, .NET
- Inworld TTS – Commercial TTS API. Docs
- LMNT – Ultra-low latency TTS. Voice cloning with 5 seconds of audio. Docs | SDK: Python, Node
- Microsoft Azure Speech – HD voices GA since March 2025. SSML for fine-grained control. Also includes STT and speaker recognition. Docs | SDK: C#, C++, Python, Java, JavaScript, Go, Swift
- MiniMax Audio – Commercial TTS and voice cloning API. Speech-2.8-HD model supports 40+ languages and 7 emotion modes. Voice cloning from 10s reference clip via REST API. Docs
- Murf AI – 150+ voices, 35+ languages, 20 speaking styles. Murf Falcon for low-latency voice agents. Docs | SDK: Python
- OpenAI TTS – tts-1, tts-1-hd, and gpt-4o-mini-tts models plus a Realtime API (WebSocket/WebRTC) for low-latency streaming speech. Docs | SDK: Python, Node
- Play.ht – Streaming TTS with PlayDialog (multi-voice dialogue) and Play3.0-mini (multilingual). Docs | SDK: Python, Node
- Resemble AI – Custom voice cloning specialist. Also publishes the open-source Chatterbox TTS model. Docs | SDK: Python, Node
- WellSaid Labs – Enterprise-focused TTS. Major API upgrade June 2025 with 50% cost reduction. Docs
- xAI Grok TTS – Commercial TTS API launched April 2026. 5 expressive voices, 20 languages, inline speech tags (laughter, whispers), WebSocket streaming, and voice cloning via custom-voices endpoint. $4.20/1M chars. Docs | SDK: Python
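Several of the APIs above (Polly, Azure, Google) accept SSML for fine-grained prosody control. A minimal sketch of building an SSML document — the `<speak>`, `<prosody>`, and `<break>` tags are standard SSML 1.1, but the helper itself is illustrative and the naive sentence split is a simplification:

```python
from xml.sax.saxutils import escape

def build_ssml(text: str, rate: str = "95%", pause_ms: int = 300) -> str:
    """Wrap plain text in a minimal SSML envelope: a speaking-rate
    adjustment per sentence and a pause between sentences.
    Sentences are split naively on '. '."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    body = f'<break time="{pause_ms}ms"/>'.join(
        f'<prosody rate="{rate}">{escape(s)}</prosody>' for s in sentences
    )
    return f"<speak>{body}</speak>"

ssml = build_ssml("Hello world. This is a test")
# Pass the result wherever the provider expects SSML input, e.g. as the
# Text parameter with TextType="ssml" in Polly's synthesize_speech.
```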
Hosted APIs for music, background audio, and sound-effect generation.
- Beatoven.ai – Royalty-free background music from text/mood descriptions. REST API with Python SDK. Docs | SDK: Python
- Soundraw – Enterprise/B2B royalty-free music generation. Up to 1000 songs/month via API. Docs
- Stability AI (Stable Audio) – Commercial Stable Audio 2.5 model for music and sound generation. Also available on Replicate and fal.ai.
- Suno – Leading music generation (V5 model). No official public API as of early 2026; beta/partner access only.
- Udio – Music generation with v1.5 and v1.5 Allegro models. Developer portal available; limited public docs.
Open-weight speech-synthesis models you can run locally or self-host.
- Bark (Suno) – Generative text-to-audio (speech, music, sound effects). GPT-style + EnCodec architecture. Docs
- OpenVoice – Instant voice cloning by MIT + MyShell. V2 clones tone color across multiple languages. Docs
- Fish Speech / FishAudio S1 – SOTA multilingual TTS with fine-grained emotion control. 13+ languages. Zero-shot voice cloning with 10-30s reference. Docs
- Chatterbox (Resemble AI) – SOTA open-source TTS. Outperformed ElevenLabs in blind tests (63.75% preference). 1M+ HF downloads. Includes Perth neural watermarking. Docs | SDK: Python (pip install chatterbox-tts)
- Tortoise TTS – Multi-voice, high-quality TTS with autoregressive + diffusion decoders. Excellent for voice cloning. SDK: Python (pip install tortoise-tts)
- F5-TTS – Flow-matching zero-shot voice cloning. 335M params, trained on 95k hours. English and Chinese.
- MeloTTS – Fast, CPU-capable multilingual TTS. English, Spanish, French, Chinese, Japanese, Korean. SDK: Python (pip install melotts)
- Kokoro TTS – 82M param model, top-ranked on TTS Arena. ~210x real-time on RTX 4090. 8 languages, 26 voices. Docs | SDK: Python (pip install kokoro)
- StyleTTS2 – Style diffusion + adversarial training. Achieves human-level MOS on LJSpeech. Architecture basis for Kokoro. Docs
- Piper TTS – Fast local TTS that runs on CPU. ~60MB ONNX-based, designed for edge/IoT. Original repo archived; active at OHF-Voice. Docs
- Coqui TTS / XTTS-v2 – 17 languages, <200ms streaming latency, zero-shot voice cloning. Company closed Dec 2023; IDIAP maintains the fork. Docs | SDK: Python (pip install coqui-tts)
- CosyVoice (FunAudioLLM) – Alibaba/FunAudioLLM multilingual TTS. CosyVoice2 streaming at 150ms latency; CosyVoice3 (Dec 2025) adds RL-tuned voice quality. 9+ languages, zero-shot cloning. Docs | SDK: Python (pip install -r requirements.txt)
- Dia (Nari Labs) – 1.6B-param dialogue TTS. Generates multi-speaker audio with nonverbal cues (laughter, sighs) in one pass. Voice cloning via audio prompt. Apache-2.0. Docs | SDK: Python (pip install git+https://github.com/nari-labs/dia.git)
- IndexTTS – Zero-shot TTS with duration control and emotion-timbre disentanglement. IndexTTS-2 adds multilingual support (Chinese, English, Japanese, Spanish). Bilibili model license. Docs
- MOSS-TTS (OpenMOSS) – Open-source TTS model family (Apache-2.0). Five variants covering voice cloning, dialogue, voice design, realtime streaming, and sound effects. 20 languages; MOSS-TTS-Nano runs on 4 CPU cores. Launched Feb 2026. Docs | SDK: Python (pip install -e '.[torch-runtime]')
- Orpheus-TTS – Llama-3B-based TTS with guided emotion tags (laughter, sighs, etc.), zero-shot voice cloning, and 200ms streaming latency. Trained on 100k hours of English speech. Docs | SDK: Python (pip install orpheus-speech)
- OuteTTS – Extends any LLM with TTS via a pure-language-model approach. 0.6B and 1B variants. One-shot voice cloning, multilingual, no phoneme pre-processing required. Backends for CUDA, ROCm, Metal, Vulkan. Docs | SDK: Python (pip install outetts)
- Qwen3-TTS – Alibaba Cloud open-source TTS series (0.6B–1.7B). Voice cloning from 3s reference, free-form voice design, 97ms streaming latency. 10 languages. Docs | SDK: Python (pip install qwen-tts)
- TADA (Hume AI) – Open-source speech-language model (1B/3B) with 1:1 text-acoustic alignment. Zero content hallucinations, RTF 0.09 (5x faster than comparable LLM-TTS), 10 languages. Weights under Llama 3.2 license; code MIT. Docs | SDK: Python (pip install hume-tada)
- VibeVoice (Microsoft) – MIT open-source TTS family (1.5B) for long-form multi-speaker conversational audio up to 90 min. Next-token diffusion at 7.5 Hz. ICLR 2026 Oral. Also ships a Realtime-0.5B streaming variant. Docs
- VoxCPM2 (OpenBMB) – 2B tokenizer-free diffusion-AR TTS. Voice Design (voice from text description), controllable cloning, 30 languages, 48kHz stereo. Streaming API support. Docs | SDK: Python (pip install voxcpm)
- Voxtral TTS (Mistral) – 4B open-weight TTS from Mistral. 9 languages, 20 preset voices, 24 kHz output. Deployable via vLLM-Omni; also available as hosted API at console.mistral.ai. Docs
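Several entries above quote speeds as a real-time factor (RTF) or as an "x real-time" multiple; the two are reciprocals. A quick sketch of the arithmetic (the numbers below are illustrative, not benchmarks):

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: seconds of compute per second of audio produced.
    RTF < 1 means faster than real time; the speed multiple is 1 / RTF."""
    return generation_seconds / audio_seconds

# e.g. generating 10 s of audio in 0.9 s of compute
r = rtf(0.9, 10.0)   # RTF 0.09, i.e. roughly 11x real time
speedup = 1 / r
```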
Open-weight models for music, soundscapes, and general audio generation.
- AudioCraft (Meta) – MusicGen (text-to-music), AudioGen (text-to-SFX), EnCodec (neural codec), AudioSeal (watermarking). Docs | SDK: Python (pip install audiocraft)
- ACE-Step – 3.5B param music foundation model. Up to 4 min of music in 20s on A100. 19 languages, all mainstream styles. Docs
- Riffusion – Fine-tuned Stable Diffusion generating spectrograms from text, converted to audio. Flask server + Streamlit app included. Docs
- Stable Audio Open – Open-weight music and sound generation up to 47s (44.1kHz stereo). Docs
- Mustango – Text-to-music with controllable chords, beats, tempo, and key. LDM + Flan-T5. Docs
- ACE-Step 1.5 – Music generation model combining LM planning + Diffusion Transformer synthesis. Full songs in <2s on A100, <10s on RTX 3090. XL variant (4B DiT) released April 2026. SDK: Python (pip install acestep)
- Amphion – OpenMMLab toolkit for TTS, voice conversion, singing voice conversion, and text-to-audio. Includes Vevo2 (speech + singing) and DualCodec neural codec.
- AudioX – Unified diffusion transformer for anything-to-audio generation. Takes text, video, image, music, or audio as input; outputs general audio or music. Accepted at ICLR 2026. CC-BY-NC. Docs
- DiffRhythm – Latent diffusion model for end-to-end full-song generation (vocals + accompaniment). Generates up to 4m45s of music in ~10s. Trained on 1M songs (60k hours). Apache-2.0. Docs
- DiffSinger (OpenVPI) – Advanced singing voice synthesis system using shallow diffusion conditioned on musical score (lyrics, pitch, MIDI). 44.1kHz output with variance models for pitch, energy, and breathiness control. v2.5.1 released Jan 2026.
- HeartMuLa (heartlib) – 3B open-source music LM conditioned on lyrics, style tags, and reference audio. Includes HeartCodec (12.5Hz music codec), HeartCLAP (audio-text alignment), and HeartTranscriptor (lyrics ASR). Docs | SDK: Python (pip install -e .)
- InspireMusic (FunAudioLLM) – Alibaba toolkit for text-to-music, music continuation, and audio super-resolution. Autoregressive transformer + flow-matching model generates 48kHz music from text and audio prompts. Docs
- SongGeneration (Tencent / LeVo) – 4B-param open-source song generation model producing vocals + accompaniment from lyrics and style prompts. LeVo-v2-large released March 2026. Non-commercial license. Docs
- SoulX-Singer – Zero-shot singing voice synthesis supporting Mandarin, English, and Cantonese. Trained on 42k hours of vocal data. Supports melody-conditioned (F0) and score-conditioned (MIDI) control. Companion SVC model released March 2026. Docs
- Step-Audio-EditX – 3B-param RL-trained model for expressive audio editing. Modifies emotion, speaking style, and paralinguistic cues (laugh, exhale, etc.) on existing audio clips. Also supports zero-shot TTS. Apache-2.0. Docs
- TangoFlux – Flow-matching DiT model for text-to-audio and sound-effects generation. 44.1kHz stereo, up to 30s, generated in ~3s on one A40. Research use only (Stability AI Community License). Docs
- ThinkSound (FunAudioLLM) – Chain-of-Thought guided any-modality-to-audio framework (video, text, audio input). Supports SOTA video-to-audio, object-centric sound editing, and targeted audio refinement. NeurIPS 2025. Docs | SDK: Python (pip install thinksound)
- YuE (M-A-P) – Open-source LLaMA2-based foundation model for full-song generation. Converts lyrics to vocals + accompaniment across diverse genres and languages. Supports LoRA fine-tuning and in-context style transfer. Docs
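Several models above describe their neural codecs by frame rate (e.g. HeartCodec at 12.5 Hz, VibeVoice at 7.5 Hz). The sequence length a codec language model must generate scales with duration × frame rate × codebooks; a sketch of that arithmetic:

```python
def codec_frames(duration_s: float, frame_rate_hz: float,
                 codebooks: int = 1) -> int:
    """Discrete tokens a codec LM must emit for a clip: one frame per
    1/frame_rate seconds, multiplied by the number of parallel codebooks."""
    return round(duration_s * frame_rate_hz) * codebooks

# a 3-minute song through a 12.5 Hz single-codebook codec
n = codec_frames(180, 12.5)   # 2250 frames
```

Lower frame rates are what make long-form generation (e.g. VibeVoice's 90-minute sessions) tractable: halving the frame rate halves the autoregressive sequence length.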
Libraries and CLIs for manipulating, analyzing, and encoding audio.
- PyDub – High-level Python audio manipulation. Slicing, concatenating, normalizing, format conversion. SDK: Python (pip install pydub)
- librosa – Python audio analysis. Spectrograms, MFCCs, beat tracking, pitch analysis. Docs | SDK: Python (pip install librosa)
- Pedalboard (Spotify) – Python library wrapping JUCE for audio effects, VST/AU plugin loading, audio I/O. SDK: Python (pip install pedalboard)
- Torchaudio – PyTorch's official audio library. I/O, feature extraction, transforms, pretrained models. Docs | SDK: Python (pip install torchaudio)
- soundfile – Reads and writes WAV, FLAC, OGG via libsndfile. Pairs with NumPy/librosa. SDK: Python (pip install soundfile)
- BigVGAN (NVIDIA) – Universal neural vocoder converting mel spectrograms to waveforms. 112M-param model with HuggingFace Hub integration. Generalizes zero-shot to speech, music, and instruments at up to 44kHz. Docs
- ClearerVoice-Studio – ModelScope toolkit for speech enhancement, separation, super-resolution, and target speaker extraction. Includes SpeechScore quality metrics (PESQ, STOI, DNSMOS). SDK: Python (pip install clearvoice)
- FFmpeg – Universal audio/video processing CLI. Backend for PyDub and many TTS pipelines. SDK: Python (ffmpeg-python), Node (fluent-ffmpeg)
- LALAL.AI – Commercial API for AI-powered stem separation (vocals, drums, bass, guitar, strings) and voice cloning. API v1 released Feb 2026 with OpenAPI spec and Python code examples. Docs
- RealtimeTTS – Python library for streaming TTS with minimal latency. Wraps OpenAI, ElevenLabs, Azure, Kokoro, Piper, Coqui, and 15+ other engines. Automatic fallback across engines. v0.6.0 released March 2026. SDK: Python (pip install realtimetts[all])
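The libraries above all work on PCM sample data; for orientation, here is a dependency-free sketch that synthesizes a sine tone with only Python's standard library `wave` and `struct` modules (soundfile, PyDub, and torchaudio do this — and far more — at a higher level):

```python
import math
import struct
import wave

def write_sine(path: str, freq_hz: float = 440.0, seconds: float = 1.0,
               rate: int = 16000) -> None:
    """Write a mono 16-bit PCM sine wave at half amplitude."""
    n = int(seconds * rate)
    samples = (int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * t / rate))
               for t in range(n))
    with wave.open(path, "wb") as w:
        w.setnchannels(1)       # mono
        w.setsampwidth(2)       # 16-bit samples
        w.setframerate(rate)
        w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

write_sine("tone.wav")
```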
Frameworks for building real-time STT → LLM → TTS voice agents.
- Pipecat (Daily) – Open-source Python framework for STT → LLM → TTS voice agent pipelines. Modular adapters for ElevenLabs, Cartesia, Deepgram, OpenAI, and more. Docs | SDK: Python (pip install pipecat-ai)
- LiveKit Agents – Open-source framework for real-time voice AI agents with semantic turn detection. Docs | SDK: JavaScript, Python, Swift, Android, Flutter, Go, Rust
- Vapi – Managed voice AI agent platform. Abstracts STT/LLM/TTS orchestration. Integrates ElevenLabs, Deepgram, OpenAI, Anthropic. Docs | SDK: JavaScript, Python, Java, Swift
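The frameworks above all orchestrate the same cascade. A framework-free sketch of the control flow — the three stage functions are stand-in stubs, not any framework's actual API:

```python
from typing import Callable

def voice_agent_turn(audio_in: bytes,
                     stt: Callable[[bytes], str],
                     llm: Callable[[str], str],
                     tts: Callable[[str], bytes]) -> bytes:
    """One turn of a cascaded voice agent: transcribe, think, speak.
    Real frameworks (Pipecat, LiveKit Agents) run these stages as
    streaming pipelines with turn detection and interruption handling,
    not as blocking calls."""
    transcript = stt(audio_in)
    reply_text = llm(transcript)
    return tts(reply_text)

# wire in stubs to show the data flow end to end
out = voice_agent_turn(
    b"\x00\x01",
    stt=lambda audio: "hello",
    llm=lambda text: f"you said: {text}",
    tts=lambda text: text.encode("utf-8"),
)
```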
WebRTC, streaming, and low-latency audio transport platforms.
- LiveKit – Open-source WebRTC SFU (Go). Managed cloud available. Foundation for voice AI agents. Docs
- Daily – WebRTC audio/video API. Creator of Pipecat. Supports OpenAI Audio Models natively. Docs
- OpenAI Realtime API – Native speech-to-speech with GPT-4o. WebSocket and WebRTC. Audio streamed as base64 PCM chunks.
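As the Realtime API entry notes, audio arrives as base64-encoded PCM chunks. Decoding one chunk into signed 16-bit samples is plain stdlib work; the little-endian mono layout below matches common `pcm16` wire formats, and the chunk contents are synthetic:

```python
import array
import base64
import sys

def decode_pcm16_chunk(b64_chunk: str) -> array.array:
    """Decode a base64 audio chunk into signed 16-bit PCM samples."""
    raw = base64.b64decode(b64_chunk)
    samples = array.array("h")   # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()       # wire format is little-endian
    return samples

# synthetic chunk: bytes for the two samples 0 and 256
chunk = base64.b64encode(bytes([0, 0, 0, 1])).decode("ascii")
samples = decode_pcm16_chunk(chunk)   # array('h', [0, 256])
```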
Serverless and on-demand GPU platforms for running audio models.
- fal.ai – Serverless inference for generative media. Docs
- Modal – Serverless Python-first GPU platform. Sub-second cold starts. Docs
- Replicate – Serverless model hosting for open-source audio models. Docs
- RunPod – GPU pods and serverless endpoints. REST, GraphQL, and CLI. Docs
Benchmarks, leaderboards, and perceptual quality metrics for audio and speech.
- WER via Whisper – Standard intelligibility metric. Run synthesized speech through Whisper ASR and measure transcription error rate.
- DNSMOS (Microsoft) – Deep Noise Suppression MOS. Non-intrusive perceptual quality metric for speech enhancement. P.835 framework. Also in torchmetrics.
- UTMOS – MOS prediction system from UTokyo-Sarulab. Predicts Mean Opinion Score without human listeners. Pearson correlation ~0.82. Docs
- Speech Arena (Artificial Analysis) – Broader speech model leaderboard including latency, cost, and quality dimensions.
- TTS Arena (HuggingFace) – Community blind pairwise voting with Elo-style ranking. The most referenced human-preference benchmark for TTS. Docs
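The WER metric above reduces to word-level edit distance between a reference transcript and the ASR hypothesis, divided by the reference length. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance (substitutions,
    insertions, deletions) over the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

score = wer("the cat sat", "the cat sat down")   # one insertion over 3 words
```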
Reference implementations, demos, and starter projects.
- Real-Time Voice Cloning – Classic SV2TTS demo. Clone a voice in 5 seconds. ~55k stars.
- Kokoro Web – Self-hosted, OpenAI API-compatible TTS server using Kokoro-82M. Drop-in replacement for OpenAI TTS endpoint.
- Stable Audio Open Demo – Official Stability AI demo notebook for Stable Audio Open 1.0.
- B2 Whisper Transcriber – Audio transcription app using Whisper and Transformers.js with Backblaze B2 cloud storage. B2 integration
- ACE-Step Demo – Official music generation samples and comparisons.
- HuggingFace Speech-to-Speech – Cascaded VAD → STT → LLM → TTS pipeline for local voice agents. Supports Whisper, MeloTTS, and Hugging Face LLMs via server/client or WebSocket.
- Kokoro-FastAPI – Dockerized FastAPI wrapper for Kokoro-82M. OpenAI-compatible /v1/audio/speech endpoint with CPU ONNX and NVIDIA GPU PyTorch support. Drop-in replacement for OpenAI TTS. 4700+ stars.
- LiveKit Agent Examples – Official examples for building voice/video AI agents.
- Voicebox – Local-first open-source voice synthesis studio with 7 TTS engines, voice cloning, timeline editor, and REST API. Runs on CUDA, Metal, ROCm, and DirectML. MIT license. Docs
Contributions are welcome. See CONTRIBUTING.md. One entry per PR — edit entries.yaml only and let the maintainers regenerate README.md.
Released under CC0 1.0 Universal. You may copy, modify, and redistribute without attribution.
Backblaze B2 Cloud Storage is S3-compatible object storage designed for AI and media workloads. This list is maintained as part of our work making B2 a convenient storage layer for AI workflows.