backblaze-labs/awesome-audio-generation
Awesome Audio Generation

A curated list of AI audio generation APIs, SDKs, and production-ready tools — covering text-to-speech, music generation, and sound design. Focused on services developers can integrate today.

Maintained by Backblaze.

Text-to-Speech APIs

Commercial TTS APIs with hosted inference and developer SDKs.

  • Amazon Polly – Neural TTS with SSML support, accessible through any AWS SDK. Docs | SDK: Python (boto3), Node, Java, .NET, Go, Ruby, PHP
  • Cartesia – Ultra-low latency Sonic model family with WebSocket multiplexing. Docs | SDK: Python
  • Deepgram – Aura TTS model with WebSocket streaming. Also offers a Voice Agent API bundling STT + TTS + LLM orchestration. Docs | SDK: Python, JavaScript, .NET, Go
  • ElevenLabs – Industry-leading TTS with voice cloning, multilingual support, and streaming. Flagship eleven_v3 model. Docs | SDK: Python, JavaScript, Flutter, Swift, Kotlin
  • Fish Audio – Hosted TTS and voice cloning API backed by the open-source Fish Speech model. S2 Pro supports 80+ languages, voice cloning from 15s of audio, and WebSocket streaming. Docs | SDK: Python, JavaScript
  • Google Cloud TTS – 380+ voices across 75+ languages. Chirp 3 HD voices available. Docs | SDK: Python, Java, Node, Go, Ruby, PHP, C#
  • Hume AI (Octave) – LLM-trained TTS that understands context and emotion. Octave 2 supports 11+ languages. Also includes EVI for voice agents. Docs | SDK: Python, TypeScript, .NET
  • Inworld TTS – Commercial TTS API. Docs
  • LMNT – Ultra-low latency TTS. Voice cloning with 5 seconds of audio. Docs | SDK: Python, Node
  • Microsoft Azure Speech – HD voices GA since March 2025. SSML for fine-grained control. Also includes STT and speaker recognition. Docs | SDK: C#, C++, Python, Java, JavaScript, Go, Swift
  • MiniMax Audio – Commercial TTS and voice cloning API. Speech-2.8-HD model supports 40+ languages and 7 emotion modes. Voice cloning from 10s reference clip via REST API. Docs
  • Murf AI – 150+ voices, 35+ languages, 20 speaking styles. Murf Falcon for low-latency voice agents. Docs | SDK: Python
  • OpenAI TTS – tts-1, tts-1-hd, and gpt-4o-mini-tts models plus a Realtime API (WebSocket/WebRTC) for low-latency streaming speech. Docs | SDK: Python, Node
  • Play.ht – Streaming TTS with PlayDialog (multi-voice dialogue) and Play3.0-mini (multilingual). Docs | SDK: Python, Node
  • Resemble AI – Custom voice cloning specialist. Also publishes the open-source Chatterbox TTS model. Docs | SDK: Python, Node
  • WellSaid Labs – Enterprise-focused TTS. Major API upgrade June 2025 with 50% cost reduction. Docs
  • xAI Grok TTS – Commercial TTS API launched April 2026. 5 expressive voices, 20 languages, inline speech tags (laughter, whispers), WebSocket streaming, and voice cloning via custom-voices endpoint. $4.20/1M chars. Docs | SDK: Python
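Most of the hosted APIs above share the same request shape: POST text plus voice parameters, receive audio bytes back. A minimal stdlib-only sketch using OpenAI's speech endpoint as the concrete example (the endpoint and field names follow OpenAI's documentation; the helper names `build_request` and `synthesize` are ours, and other vendors' payloads will differ):

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/audio/speech"  # OpenAI TTS endpoint

def build_request(text, api_key, model="tts-1", voice="alloy", fmt="mp3"):
    """Build the HTTP request for a TTS call without sending it."""
    body = json.dumps({"model": model, "voice": voice,
                       "input": text, "response_format": fmt}).encode()
    return urllib.request.Request(
        API_URL, data=body, method="POST",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"})

def synthesize(text, api_key, out_path="speech.mp3"):
    """Send the request and write the returned audio bytes to disk."""
    req = build_request(text, api_key)
    with urllib.request.urlopen(req) as resp:  # network call: needs a valid key
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

In practice the official SDKs listed per entry handle retries, streaming, and authentication for you; the raw request is shown only to make the common pattern visible.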

Music and Sound Generation APIs

Hosted APIs for music, background audio, and sound-effect generation.

  • Beatoven.ai – Royalty-free background music from text/mood descriptions. REST API with Python SDK. Docs | SDK: Python
  • Soundraw – Enterprise/B2B royalty-free music generation. Up to 1000 songs/month via API. Docs
  • Stability AI (Stable Audio) – Commercial Stable Audio 2.5 model for music and sound generation. Also available on Replicate and fal.ai.
  • Suno – Leading music generation (V5 model). No official public API as of early 2026; beta/partner access only.
  • Udio – Music generation with v1.5 and v1.5 Allegro models. Developer portal available; limited public docs.

Open Source TTS Models

Open-weight speech-synthesis models you can run locally or self-host.

  • Bark (Suno) – Generative text-to-audio (speech, music, sound effects). GPT-style + EnCodec architecture. Docs
  • OpenVoice – Instant voice cloning by MIT + MyShell. V2 clones tone color across multiple languages. Docs
  • Fish Speech / FishAudio S1 – SOTA multilingual TTS with fine-grained emotion control. 13+ languages. Zero-shot voice cloning with 10-30s reference. Docs
  • Chatterbox (Resemble AI) – SOTA open-source TTS. Outperformed ElevenLabs in blind tests (63.75% preference). 1M+ HF downloads. Includes Perth neural watermarking. Docs | SDK: Python (pip install chatterbox-tts)
  • Tortoise TTS – Multi-voice, high-quality TTS with autoregressive + diffusion decoders. Excellent for voice cloning. SDK: Python (pip install tortoise-tts)
  • F5-TTS – Flow-matching zero-shot voice cloning. 335M params, trained on 95k hours. English and Chinese.
  • MeloTTS – Fast, CPU-capable multilingual TTS. English, Spanish, French, Chinese, Japanese, Korean. SDK: Python (pip install melotts)
  • Kokoro TTS – 82M param model, top-ranked on TTS Arena. ~210x real-time on RTX 4090. 8 languages, 26 voices. Docs | SDK: Python (pip install kokoro)
  • StyleTTS2 – Style diffusion + adversarial training. Achieves human-level MOS on LJSpeech. Architecture basis for Kokoro. Docs
  • Piper TTS – Fast local TTS that runs on CPU. ~60MB ONNX-based, designed for edge/IoT. Original repo archived; active at OHF-Voice. Docs
  • Coqui TTS / XTTS-v2 – 17 languages, <200ms streaming latency, zero-shot voice cloning. Company closed Dec 2023; IDIAP maintains the fork. Docs | SDK: Python (pip install coqui-tts)
  • CosyVoice (FunAudioLLM) – Alibaba/FunAudioLLM multilingual TTS. CosyVoice2 streaming at 150ms latency; CosyVoice3 (Dec 2025) adds RL-tuned voice quality. 9+ languages, zero-shot cloning. Docs | SDK: Python (pip install -r requirements.txt)
  • Dia (Nari Labs) – 1.6B-param dialogue TTS. Generates multi-speaker audio with nonverbal cues (laughter, sighs) in one pass. Voice cloning via audio prompt. Apache-2.0. Docs | SDK: Python (pip install git+https://github.com/nari-labs/dia.git)
  • IndexTTS – Zero-shot TTS with duration control and emotion-timbre disentanglement. IndexTTS-2 adds multilingual support (Chinese, English, Japanese, Spanish). Bilibili model license. Docs
  • MOSS-TTS (OpenMOSS) – Open-source TTS model family (Apache-2.0). Five variants covering voice cloning, dialogue, voice design, realtime streaming, and sound effects. 20 languages; MOSS-TTS-Nano runs on 4 CPU cores. Launched Feb 2026. Docs | SDK: Python (pip install -e '.[torch-runtime]')
  • Orpheus-TTS – Llama-3B-based TTS with guided emotion tags (laughter, sighs, etc.), zero-shot voice cloning, and 200ms streaming latency. Trained on 100k hours of English speech. Docs | SDK: Python (pip install orpheus-speech)
  • OuteTTS – Extends any LLM with TTS via a pure-language-model approach. 0.6B and 1B variants. One-shot voice cloning, multilingual, no phoneme pre-processing required. Backends for CUDA, ROCm, Metal, Vulkan. Docs | SDK: Python (pip install outetts)
  • Qwen3-TTS – Alibaba Cloud open-source TTS series (0.6B–1.7B). Voice cloning from 3s reference, free-form voice design, 97ms streaming latency. 10 languages. Docs | SDK: Python (pip install qwen-tts)
  • TADA (Hume AI) – Open-source speech-language model (1B/3B) with 1:1 text-acoustic alignment. Zero content hallucinations, RTF 0.09 (5x faster than comparable LLM-TTS), 10 languages. Weights under Llama 3.2 license; code MIT. Docs | SDK: Python (pip install hume-tada)
  • VibeVoice (Microsoft) – MIT open-source TTS family (1.5B) for long-form multi-speaker conversational audio up to 90 min. Next-token diffusion at 7.5 Hz. ICLR 2026 Oral. Also ships a Realtime-0.5B streaming variant. Docs
  • VoxCPM2 (OpenBMB) – 2B tokenizer-free diffusion-AR TTS. Voice Design (voice from text description), controllable cloning, 30 languages, 48kHz stereo. Streaming API support. Docs | SDK: Python (pip install voxcpm)
  • Voxtral TTS (Mistral) – 4B open-weight TTS from Mistral. 9 languages, 20 preset voices, 24 kHz output. Deployable via vLLM-Omni; also available as hosted API at console.mistral.ai. Docs
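Models in this section typically return raw float PCM at a model-specific sample rate (24 kHz for Kokoro, 48 kHz for VoxCPM2, per the entries above). A stdlib-only sketch for saving such samples as a 16-bit WAV, with a synthetic test tone standing in for model output (`save_wav` is an illustrative helper, not part of any listed SDK):

```python
import math
import struct
import wave

def save_wav(samples, path, sample_rate=24000):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples)         # clamp, then scale to int16 range
        wf.writeframes(frames)

# One second of a 440 Hz tone in place of model output
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 24000) for t in range(24000)]
save_wav(tone, "tone.wav")
```

The `soundfile` library (see Audio Processing below) does the same job with NumPy arrays and more formats; the stdlib version is handy for dependency-free scripts.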

Open Source Music and Audio Models

Open-weight models for music, soundscapes, and general audio generation.

  • AudioCraft (Meta) – MusicGen (text-to-music), AudioGen (text-to-SFX), EnCodec (neural codec), AudioSeal (watermarking). Docs | SDK: Python (pip install audiocraft)
  • ACE-Step – 3.5B-param music foundation model. Up to 4 min of music in 20s on an A100; 19 languages, all mainstream styles. v1.5 combines LM planning with Diffusion Transformer synthesis (full songs in <2s on A100, <10s on RTX 3090); XL variant (4B DiT) released April 2026. Docs | SDK: Python (pip install acestep)
  • Riffusion – Fine-tuned Stable Diffusion generating spectrograms from text, converted to audio. Flask server + Streamlit app included. Docs
  • Stable Audio Open – Open-weight music and sound generation up to 47s (44.1kHz stereo). Docs
  • Mustango – Text-to-music with controllable chords, beats, tempo, and key. LDM + Flan-T5. Docs
  • Amphion – OpenMMLab toolkit for TTS, voice conversion, singing voice conversion, and text-to-audio. Includes Vevo2 (speech + singing) and DualCodec neural codec.
  • AudioX – Unified diffusion transformer for anything-to-audio generation. Takes text, video, image, music, or audio as input; outputs general audio or music. Accepted at ICLR 2026. CC-BY-NC. Docs
  • DiffRhythm – Latent diffusion model for end-to-end full-song generation (vocals + accompaniment). Generates up to 4m45s of music in ~10s. Trained on 1M songs (60k hours). Apache-2.0. Docs
  • DiffSinger (OpenVPI) – Advanced singing voice synthesis system using shallow diffusion conditioned on musical score (lyrics, pitch, MIDI). 44.1kHz output with variance models for pitch, energy, and breathiness control. v2.5.1 released Jan 2026.
  • HeartMuLa (heartlib) – 3B open-source music LM conditioned on lyrics, style tags, and reference audio. Includes HeartCodec (12.5Hz music codec), HeartCLAP (audio-text alignment), and HeartTranscriptor (lyrics ASR). Docs | SDK: Python (pip install -e .)
  • InspireMusic (FunAudioLLM) – Alibaba toolkit for text-to-music, music continuation, and audio super-resolution. Autoregressive transformer + flow-matching model generates 48kHz music from text and audio prompts. Docs
  • SongGeneration (Tencent / LeVo) – 4B-param open-source song generation model producing vocals + accompaniment from lyrics and style prompts. LeVo-v2-large released March 2026. Non-commercial license. Docs
  • SoulX-Singer – Zero-shot singing voice synthesis supporting Mandarin, English, and Cantonese. Trained on 42k hours of vocal data. Supports melody-conditioned (F0) and score-conditioned (MIDI) control. Companion SVC model released March 2026. Docs
  • Step-Audio-EditX – 3B-param RL-trained model for expressive audio editing. Modifies emotion, speaking style, and paralinguistic cues (laugh, exhale, etc.) on existing audio clips. Also supports zero-shot TTS. Apache-2.0. Docs
  • TangoFlux – Flow-matching DiT model for text-to-audio and sound-effects generation. 44.1kHz stereo, up to 30s, generated in ~3s on one A40. Research use only (Stability AI Community License). Docs
  • ThinkSound (FunAudioLLM) – Chain-of-Thought guided any-modality-to-audio framework (video, text, audio input). Supports SOTA video-to-audio, object-centric sound editing, and targeted audio refinement. NeurIPS 2025. Docs | SDK: Python (pip install thinksound)
  • YuE (M-A-P) – Open-source LLaMA2-based foundation model for full-song generation. Converts lyrics to vocals + accompaniment across diverse genres and languages. Supports LoRA fine-tuning and in-context style transfer. Docs

Audio Processing

Libraries and CLIs for manipulating, analysing, and encoding audio.

  • PyDub – High-level Python audio manipulation. Slicing, concatenating, normalizing, format conversion. SDK: Python (pip install pydub)
  • librosa – Python audio analysis. Spectrograms, MFCCs, beat tracking, pitch analysis. Docs | SDK: Python (pip install librosa)
  • Pedalboard (Spotify) – Python library wrapping JUCE for audio effects, VST/AU plugin loading, audio I/O. SDK: Python (pip install pedalboard)
  • Torchaudio – PyTorch's official audio library. I/O, feature extraction, transforms, pretrained models. Docs | SDK: Python (pip install torchaudio)
  • soundfile – Reads and writes WAV, FLAC, OGG via libsndfile. Pairs with NumPy/librosa. SDK: Python (pip install soundfile)
  • BigVGAN (NVIDIA) – Universal neural vocoder converting mel spectrograms to waveforms. 112M-param model with HuggingFace Hub integration. Generalizes zero-shot to speech, music, and instruments at up to 44kHz. Docs
  • ClearerVoice-Studio – ModelScope toolkit for speech enhancement, separation, super-resolution, and target speaker extraction. Includes SpeechScore quality metrics (PESQ, STOI, DNSMOS). SDK: Python (pip install clearvoice)
  • FFmpeg – Universal audio/video processing CLI. Backend for PyDub and many TTS pipelines. SDK: Python (ffmpeg-python), Node (fluent-ffmpeg)
  • LALAL.AI – Commercial API for AI-powered stem separation (vocals, drums, bass, guitar, strings) and voice cloning. API v1 released Feb 2026 with OpenAPI spec and Python code examples. Docs
  • RealtimeTTS – Python library for streaming TTS with minimal latency. Wraps OpenAI, ElevenLabs, Azure, Kokoro, Piper, Coqui, and 15+ other engines. Automatic fallback across engines. v0.6.0 released March 2026. SDK: Python (pip install realtimetts[all])
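Several of these libraries expose normalization as a one-liner (PyDub's `normalize`, for instance, which works in dBFS gain). As a sketch of the underlying math on raw float samples, here is simple peak normalization in pure Python (illustrative only; real implementations handle dB headroom and clipping policy):

```python
def normalize(samples, target_peak=0.99):
    """Scale samples so the loudest absolute value hits target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)          # all silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

quiet = [0.0, 0.1, -0.2, 0.05]
loud = normalize(quiet)               # loudest sample is now 0.99
```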

Voice Agent Frameworks

Frameworks for building real-time STT → LLM → TTS voice agents.

  • Pipecat (Daily) – Open-source Python framework for STT → LLM → TTS voice agent pipelines. Modular adapters for ElevenLabs, Cartesia, Deepgram, OpenAI, and more. Docs | SDK: Python (pip install pipecat-ai)
  • LiveKit Agents – Open-source framework for real-time voice AI agents with semantic turn detection. Docs | SDK: JavaScript, Python, Swift, Android, Flutter, Go, Rust
  • Vapi – Managed voice AI agent platform. Abstracts STT/LLM/TTS orchestration. Integrates ElevenLabs, Deepgram, OpenAI, Anthropic. Docs | SDK: JavaScript, Python, Java, Swift
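Frameworks like these cut perceived latency by flushing text to the TTS stage at sentence boundaries as LLM tokens arrive, rather than waiting for the full response. A minimal incremental chunker showing the idea (pure Python, illustrative; the frameworks above implement far more robust segmentation):

```python
import re

SENTENCE_END = re.compile(r"[.!?]\s")  # naive: splits on "Mr. Smith" too

class SentenceChunker:
    """Accumulate streamed LLM tokens; emit complete sentences for TTS."""

    def __init__(self):
        self.buf = ""

    def feed(self, token):
        """Add one token; return any sentences completed by it."""
        self.buf += token
        out = []
        while True:
            m = SENTENCE_END.search(self.buf)
            if not m:
                return out
            out.append(self.buf[:m.end()].strip())
            self.buf = self.buf[m.end():]

    def flush(self):
        """Return whatever remains when the LLM stream ends."""
        rest, self.buf = self.buf.strip(), ""
        return [rest] if rest else []
```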

Real-Time Audio Infrastructure

WebRTC, streaming, and low-latency audio transport platforms.

  • LiveKit – Open-source WebRTC SFU (Go). Managed cloud available. Foundation for voice AI agents. Docs
  • Daily – WebRTC audio/video API. Creator of Pipecat. Supports OpenAI Audio Models natively. Docs
  • OpenAI Realtime API – Native speech-to-speech with GPT-4o. WebSocket and WebRTC. Audio streamed as base64 PCM chunks.
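As noted above, realtime APIs commonly deliver audio as base64-encoded 16-bit PCM chunks over the socket. Decoding one chunk back to float samples is a few lines of stdlib Python (the helper name is ours; chunk framing and sample rate depend on the specific API's session config):

```python
import base64
import struct

def decode_pcm16_chunk(b64_chunk):
    """Decode a base64 chunk of 16-bit little-endian mono PCM to floats in [-1, 1]."""
    raw = base64.b64decode(b64_chunk)
    n = len(raw) // 2                          # 2 bytes per int16 sample
    ints = struct.unpack(f"<{n}h", raw[:n * 2])
    return [i / 32768.0 for i in ints]
```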

GPU Cloud Providers

Serverless and on-demand GPU platforms for running audio models.

  • fal.ai – Serverless inference for generative media. Docs
  • Modal – Serverless Python-first GPU platform. Sub-second cold starts. Docs
  • Replicate – Serverless model hosting for open-source audio models. Docs
  • RunPod – GPU pods and serverless endpoints. REST, GraphQL, and CLI. Docs

Evaluation and Observability

Benchmarks, leaderboards, and perceptual quality metrics for audio and speech.

  • WER via Whisper – Standard intelligibility metric. Run synthesized speech through Whisper ASR and measure transcription error rate.
  • DNSMOS (Microsoft) – Deep Noise Suppression MOS. Non-intrusive perceptual quality metric for speech enhancement. P.835 framework. Also in torchmetrics.
  • UTMOS – MOS prediction system from UTokyo-Sarulab. Predicts Mean Opinion Score without human listeners. Pearson correlation ~0.82. Docs
  • Speech Arena (Artificial Analysis) – Broader speech model leaderboard including latency, cost, and quality dimensions.
  • TTS Arena (HuggingFace) – Community blind pairwise voting with Elo-style ranking. The most referenced human-preference benchmark for TTS. Docs
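The WER metric in the first entry is just word-level edit distance: transcribe the synthesized audio with an ASR model, then divide the Levenshtein distance between reference and hypothesis word sequences by the reference length. A minimal implementation (production evaluations also normalize casing and punctuation first):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,        # insertion
                         prev[j - 1] + (r != h))  # substitution (or match)
        prev = cur
    return prev[len(hyp)] / max(len(ref), 1)
```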

Templates and Example Projects

Reference implementations, demos, and starter projects.

  • Real-Time Voice Cloning – Classic SV2TTS demo. Clone a voice in 5 seconds. ~55k stars.
  • Kokoro Web – Self-hosted, OpenAI API-compatible TTS server using Kokoro-82M. Drop-in replacement for OpenAI TTS endpoint.
  • Stable Audio Open Demo – Official Stability AI demo notebook for Stable Audio Open 1.0.
  • B2 Whisper Transcriber – Audio transcription app using Whisper and Transformers.js with Backblaze B2 cloud storage. B2 integration
  • ACE-Step Demo – Official music generation samples and comparisons.
  • HuggingFace Speech-to-Speech – Cascaded VAD → STT → LLM → TTS pipeline for local voice agents. Supports Whisper, MeloTTS, and Hugging Face LLMs via server/client or WebSocket.
  • Kokoro-FastAPI – Dockerized FastAPI wrapper for Kokoro-82M. OpenAI-compatible /v1/audio/speech endpoint with CPU ONNX and NVIDIA GPU PyTorch support. Drop-in replacement for OpenAI TTS. 4700+ stars.
  • LiveKit Agent Examples – Official examples for building voice/video AI agents.
  • Voicebox – Local-first open-source voice synthesis studio with 7 TTS engines, voice cloning, timeline editor, and REST API. Runs on CUDA, Metal, ROCm, and DirectML. MIT license. Docs

Contributing

Contributions are welcome. See CONTRIBUTING.md. One entry per PR — edit entries.yaml only and let the maintainers regenerate README.md.

License

Released under CC0 1.0 Universal. You may copy, modify, and redistribute without attribution.

About Backblaze B2

Backblaze B2 Cloud Storage is S3-compatible object storage designed for AI and media workloads. This list is maintained as part of our work making B2 a convenient storage layer for AI workflows.
