| 1 | +# Combined Audio API Documentation |
| 2 | + |
| 3 | +This document describes the new combined audio generation endpoints that allow you to generate a single audio file from long text by automatically splitting the text into chunks and combining the resulting audio. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +TTS services usually cap the length of a single request (typically 4096 characters). Instead of requiring you to split long text manually and manage multiple audio files, these endpoints: |
| 8 | + |
| 9 | +1. **Automatically split** long text into optimal chunks |
| 10 | +2. **Generate speech** for each chunk in parallel |
| 11 | +3. **Combine audio chunks** into a single seamless audio file |
| 12 | +4. **Return the combined audio** as a single download |
| 13 | + |
| 14 | +## Endpoints |
| 15 | + |
| 16 | +### 1. `/api/generate-combined` (POST) |
| 17 | + |
| 18 | +**Description**: Generate combined audio from long text using TTSFM's native API format. |
| 19 | + |
| 20 | +**Request Body**: |
| 21 | +```json |
| 22 | +{ |
| 23 | + "text": "Your long text content here...", |
| 24 | + "voice": "alloy", |
| 25 | + "format": "mp3", |
| 26 | + "instructions": "Optional voice instructions", |
| 27 | + "max_length": 4096, |
| 28 | + "preserve_words": true |
| 29 | +} |
| 30 | +``` |
| 31 | + |
| 32 | +**Parameters**: |
| 33 | +- `text` (string, required): The text to convert to speech |
| 34 | +- `voice` (string, optional): Voice to use. Options: `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `nova`, `onyx`, `sage`, `shimmer`, `verse`. Default: `alloy` |
| 35 | +- `format` (string, optional): Audio format. Options: `mp3`, `wav`, `opus`, `aac`, `flac`, `pcm`. Default: `mp3` |
| 36 | +- `instructions` (string, optional): Custom instructions for voice modulation |
| 37 | +- `max_length` (integer, optional): Maximum characters per chunk. Default: `4096` |
| 38 | +- `preserve_words` (boolean, optional): Whether to preserve word boundaries when splitting. Default: `true` |
| 39 | + |
| 40 | +**Response**: |
| 41 | +- **Success (200)**: Returns the audio file as binary data |
| 42 | +- **Error (400/500/503)**: Returns a JSON object with an `error` message (see Error Handling) |
| 43 | + |
| 44 | +**Response Headers**: |
| 45 | +- `Content-Type`: Audio MIME type (e.g., `audio/mpeg`) |
| 46 | +- `Content-Disposition`: Attachment filename |
| 47 | +- `Content-Length`: File size in bytes |
| 48 | +- `X-Audio-Format`: Audio format used |
| 49 | +- `X-Audio-Size`: Audio file size |
| 50 | +- `X-Chunks-Combined`: Number of chunks that were combined |
| 51 | +- `X-Original-Text-Length`: Original text length in characters |
| 52 | + |
| 53 | +### 2. `/v1/audio/speech-combined` (POST) |
| 54 | + |
| 55 | +**Description**: OpenAI-compatible endpoint for combined audio generation. |
| 56 | + |
| 57 | +**Request Body**: |
| 58 | +```json |
| 59 | +{ |
| 60 | + "model": "gpt-4o-mini-tts", |
| 61 | + "input": "Your long text content here...", |
| 62 | + "voice": "alloy", |
| 63 | + "response_format": "mp3", |
| 64 | + "instructions": "Optional voice instructions", |
| 65 | + "speed": 1.0, |
| 66 | + "max_length": 4096 |
| 67 | +} |
| 68 | +``` |
| 69 | + |
| 70 | +**Parameters**: |
| 71 | +- `model` (string, optional): Model name (accepted but ignored for compatibility). Default: `gpt-4o-mini-tts` |
| 72 | +- `input` (string, required): The text to convert to speech |
| 73 | +- `voice` (string, optional): Voice to use (same options as above). Default: `alloy` |
| 74 | +- `response_format` (string, optional): Audio format (same options as above). Default: `mp3` |
| 75 | +- `instructions` (string, optional): Custom instructions for voice modulation |
| 76 | +- `speed` (float, optional): Speech speed (accepted but ignored for compatibility). Default: `1.0` |
| 77 | +- `max_length` (integer, optional): Maximum characters per chunk. Default: `4096` |
| 78 | + |
| 79 | +**Response**: Same as `/api/generate-combined` |
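
The two endpoints take the same information under different field names. As a rough illustration, the sketch below maps a native request body onto the OpenAI-compatible one. The helper name `to_openai_payload` is hypothetical and not part of TTSFM. Dropping `preserve_words` is an assumption based on its absence from the parameter list above.

```python
def to_openai_payload(native):
    """Translate a /api/generate-combined body into a /v1/audio/speech-combined body.

    Hypothetical helper for illustration only.
    """
    # "model" is accepted but ignored by the server, so the documented default is used.
    payload = {"model": "gpt-4o-mini-tts", "input": native["text"]}
    # "format" is called "response_format" on the OpenAI-compatible endpoint.
    if "format" in native:
        payload["response_format"] = native["format"]
    # These fields carry the same name on both endpoints.
    for key in ("voice", "instructions", "max_length"):
        if key in native:
            payload[key] = native[key]
    # "preserve_words" is not listed for the OpenAI-compatible endpoint, so it is dropped.
    return payload
```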
| 80 | + |
| 81 | +## Text Splitting Algorithm |
| 82 | + |
| 83 | +Text is split at the following boundaries, in order of preference: |
| 84 | + |
| 85 | +1. **Sentence boundaries** (`.`, `!`, `?`) - Preferred for natural speech flow |
| 86 | +2. **Word boundaries** (spaces) - Fallback when sentences are too long |
| 87 | +3. **Character boundaries** - Last resort for extremely long words |
| 88 | + |
| 89 | +### Example Splitting: |
| 90 | + |
| 91 | +**Input text (116 chars, max_length=100)**: |
| 92 | +``` |
| 93 | +"This is sentence one. This is sentence two! This is a very long sentence that exceeds the limit and needs splitting." |
| 94 | +``` |
| 95 | + |
| 96 | +**Output chunks**: |
| 97 | +``` |
| 98 | +Chunk 1: "This is sentence one. This is sentence two!" |
| 99 | +Chunk 2: "This is a very long sentence that exceeds the limit and needs splitting." |
| 100 | +``` |
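
The priority order above can be sketched in a few lines of Python. This is a minimal illustration, not the TTSFM implementation: the names `split_text` and `_split_long` are hypothetical, and details such as joining sentences with a single space are assumptions.

```python
import re


def _split_long(sentence, max_length, preserve_words):
    """Break one over-long sentence at word boundaries, then at characters."""
    if not preserve_words:
        return [sentence[i:i + max_length] for i in range(0, len(sentence), max_length)]
    pieces, current = [], ""
    for word in sentence.split():
        # Last resort: hard-cut a single word that is longer than max_length.
        while len(word) > max_length:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(word[:max_length])
            word = word[max_length:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_length:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def split_text(text, max_length=4096, preserve_words=True):
    """Split text into chunks of at most max_length characters."""
    if max_length < 1:
        raise ValueError("max_length must be positive")
    # Preferred boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) <= max_length:
            pieces = [sentence]
        else:
            pieces = _split_long(sentence, max_length, preserve_words)
        for piece in pieces:
            # Pack as many pieces as fit into the current chunk.
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_length:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Calling `split_text(example, max_length=100)` on the input above reproduces the two chunks shown, because the first two sentences fit together (43 characters) while adding the third would exceed the limit.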
| 101 | + |
| 102 | +## Audio Combination |
| 103 | + |
| 104 | +The system combines audio chunks using the first of these methods that is available: |
| 105 | + |
| 106 | +1. **PyDub library** (preferred): Professional audio processing with format support |
| 107 | +2. **Simple WAV concatenation** (fallback): Basic concatenation for WAV files when PyDub is unavailable |
| 108 | +3. **Raw concatenation** (last resort): Simple byte concatenation for other formats. Each chunk keeps its own headers, so some players may not handle the result cleanly |
| 109 | + |
| 110 | +### Supported Formats for Combination: |
| 111 | +- **MP3**: Full support with PyDub, raw concatenation fallback |
| 112 | +- **WAV**: Full support with PyDub, intelligent concatenation fallback |
| 113 | +- **OPUS/AAC/FLAC/PCM**: Full support with PyDub, raw concatenation fallback |
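
The fallback chain might look roughly like the sketch below. It is an illustration under stated assumptions rather than the server's code: `combine_chunks` and `concat_wav` are hypothetical names, the WAV fallback assumes all chunks share the same sample rate, sample width and channel count, and the PyDub branch assumes the format string is one that PyDub and its ffmpeg backend accept as-is (raw `pcm` and `aac` may need extra handling).

```python
import io
import wave


def concat_wav(chunks):
    """Concatenate WAV byte strings that share the same audio parameters."""
    if not chunks:
        raise ValueError("no audio chunks to combine")
    out = io.BytesIO()
    with wave.open(out, "wb") as writer:
        for index, data in enumerate(chunks):
            with wave.open(io.BytesIO(data), "rb") as reader:
                if index == 0:
                    # Copy channels, sample width and frame rate from the first chunk.
                    writer.setparams(reader.getparams())
                writer.writeframes(reader.readframes(reader.getnframes()))
    # The wave module patches the frame count in the header when the writer closes.
    return out.getvalue()


def combine_chunks(chunks, fmt):
    """Combine audio chunks, preferring PyDub and falling back to simpler methods."""
    try:
        from pydub import AudioSegment  # optional dependency
    except ImportError:
        AudioSegment = None

    if AudioSegment is not None:
        combined = AudioSegment.empty()
        for data in chunks:
            combined += AudioSegment.from_file(io.BytesIO(data), format=fmt)
        out = io.BytesIO()
        combined.export(out, format=fmt)
        return out.getvalue()
    if fmt == "wav":
        return concat_wav(chunks)
    # Last resort: raw byte concatenation.
    return b"".join(chunks)
```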
| 114 | + |
| 115 | +## Usage Examples |
| 116 | + |
| 117 | +### Python with requests: |
| 118 | + |
| 119 | +```python |
| 120 | +import requests |
| 121 | + |
| 122 | +# Long text example (about 7,200 characters, so it is split at max_length=2000) |
| 123 | +long_text = "Your very long text content here... " * 200 |
| 124 | + |
| 125 | +# Using native API |
| 126 | +response = requests.post( |
| 127 | +    "http://localhost:8000/api/generate-combined", |
| 128 | +    json={ |
| 129 | +        "text": long_text, |
| 130 | +        "voice": "nova", |
| 131 | +        "format": "mp3", |
| 132 | +        "max_length": 2000 |
| 133 | +    } |
| 134 | +) |
| 135 | + |
| 136 | +if response.status_code == 200: |
| 137 | +    with open("combined_audio.mp3", "wb") as f: |
| 138 | +        f.write(response.content) |
| 139 | + |
| 140 | +    chunks_combined = response.headers.get('X-Chunks-Combined') |
| 141 | +    print(f"Successfully combined {chunks_combined} chunks") |
| 142 | + |
| 143 | +# Using OpenAI-compatible API |
| 144 | +response = requests.post( |
| 145 | +    "http://localhost:8000/v1/audio/speech-combined", |
| 146 | +    json={ |
| 147 | +        "model": "gpt-4o-mini-tts", |
| 148 | +        "input": long_text, |
| 149 | +        "voice": "alloy", |
| 150 | +        "response_format": "wav" |
| 151 | +    } |
| 152 | +) |
| 153 | +``` |
| 154 | + |
| 155 | +### cURL: |
| 156 | + |
| 157 | +```bash |
| 158 | +# Native API |
| 159 | +curl -X POST http://localhost:8000/api/generate-combined \ |
| 160 | + -H "Content-Type: application/json" \ |
| 161 | + -d '{ |
| 162 | + "text": "Your long text here...", |
| 163 | + "voice": "alloy", |
| 164 | + "format": "mp3", |
| 165 | + "max_length": 4096 |
| 166 | + }' \ |
| 167 | + --output combined_audio.mp3 |
| 168 | + |
| 169 | +# OpenAI-compatible API |
| 170 | +curl -X POST http://localhost:8000/v1/audio/speech-combined \ |
| 171 | + -H "Content-Type: application/json" \ |
| 172 | + -d '{ |
| 173 | + "model": "gpt-4o-mini-tts", |
| 174 | + "input": "Your long text here...", |
| 175 | + "voice": "nova", |
| 176 | + "response_format": "wav" |
| 177 | + }' \ |
| 178 | + --output combined_audio.wav |
| 179 | +``` |
| 180 | + |
| 181 | +## Error Handling |
| 182 | + |
| 183 | +### Common Error Responses: |
| 184 | + |
| 185 | +**400 Bad Request**: |
| 186 | +```json |
| 187 | +{ |
| 188 | + "error": "Text is required" |
| 189 | +} |
| 190 | +``` |
| 191 | + |
| 192 | +**400 Bad Request** (invalid voice): |
| 193 | +```json |
| 194 | +{ |
| 195 | + "error": "Invalid voice: invalid_voice. Must be one of: ['alloy', 'ash', ...]" |
| 196 | +} |
| 197 | +``` |
| 198 | + |
| 199 | +**500 Internal Server Error** (processing failure): |
| 200 | +```json |
| 201 | +{ |
| 202 | + "error": "Failed to combine audio chunks" |
| 203 | +} |
| 204 | +``` |
| 205 | + |
| 206 | +**503 Service Unavailable**: |
| 207 | +```json |
| 208 | +{ |
| 209 | + "error": "TTS service is currently unavailable" |
| 210 | +} |
| 211 | +``` |
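
On the client side, these responses can be turned into readable messages with a small helper. The sketch below is illustrative only. The function name `describe_error` is hypothetical, and treating 503 as the only retryable status is an assumption about client policy, not something the API specifies.

```python
import json


def describe_error(status_code, body):
    """Build a readable message from a non-200 response body (bytes).

    Returns (message, retryable). Hypothetical helper for illustration.
    """
    try:
        payload = json.loads(body)
        if isinstance(payload, dict):
            detail = payload.get("error", "unknown error")
        else:
            detail = str(payload)
    except (ValueError, UnicodeDecodeError):
        # The body was not JSON (for example, a proxy error page); show a short excerpt.
        detail = body.decode("utf-8", errors="replace")[:200]
    # Assumption: only "service unavailable" is worth retrying automatically.
    retryable = status_code == 503
    return f"HTTP {status_code}: {detail}", retryable
```

With `requests`, this could be called as `describe_error(response.status_code, response.content)` whenever `response.status_code != 200`.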
| 212 | + |
| 213 | +## Performance Considerations |
| 214 | + |
| 215 | +- **Chunk processing**: Chunks are processed concurrently for faster generation |
| 216 | +- **Memory usage**: Audio combination is optimized for minimal memory footprint |
| 217 | +- **Timeout**: Set a longer client timeout for large texts (recommended: 60-120 seconds) |
| 218 | +- **File size**: The combined file contains the audio of every chunk, so its size grows with the length of the input text |
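
To illustrate concurrent chunk processing, the sketch below runs one request per chunk in a thread pool while keeping the results in input order, which matters because the audio must be combined in the original sequence. Both function names are hypothetical, `synthesize_chunk` is a stand-in that returns fake bytes instead of calling a TTS service, and the use of a thread pool is an assumption about how such concurrency could be done.

```python
from concurrent.futures import ThreadPoolExecutor


def synthesize_chunk(chunk):
    """Stand-in for a single TTS request; returns placeholder bytes, not real audio."""
    return chunk.encode("utf-8")


def synthesize_all(chunks, max_workers=4):
    """Process chunks concurrently and return results in the original chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, regardless of which call finishes first.
        return list(pool.map(synthesize_chunk, chunks))
```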
| 219 | + |
| 220 | +## Installation Requirements |
| 221 | + |
| 222 | +For optimal audio combination, install PyDub: |
| 223 | + |
| 224 | +```bash |
| 225 | +pip install pydub |
| 226 | +``` |
| 227 | + |
| 228 | +**Note**: PyDub is optional. The system will fall back to simpler concatenation methods if PyDub is not available. PyDub itself needs ffmpeg or libav installed to read and write non-WAV formats such as MP3. |
| 229 | + |
| 230 | +## Limitations |
| 231 | + |
| 232 | +1. **Maximum text length**: While there's no hard limit, very long texts (>50,000 characters) may take significant time to process |
| 233 | +2. **Audio quality**: Quality depends on the underlying TTS service (openai.fm) |
| 234 | +3. **Format support**: Some advanced audio processing features require PyDub |
| 235 | +4. **Processing time**: Longer texts require more processing time due to chunking and combination |
| 236 | + |
| 237 | +## Testing |
| 238 | + |
| 239 | +Use the provided test script to verify functionality: |
| 240 | + |
| 241 | +```bash |
| 242 | +python test_combined_endpoint.py |
| 243 | +``` |
| 244 | + |
| 245 | +This will test both endpoints and generate sample audio files for verification. |