Commit aca83cd

auto combine test
1 parent 3db1d74 commit aca83cd

13 files changed

+3223
-1
lines changed

COMBINED_AUDIO_API.md

Lines changed: 245 additions & 0 deletions
@@ -0,0 +1,245 @@
# Combined Audio API Documentation

This document describes the new combined audio generation endpoints that allow you to generate a single audio file from long text by automatically splitting the text into chunks and combining the resulting audio.

## Overview

The combined audio functionality addresses the character limits of TTS services (typically 4096 characters per request). Instead of requiring you to split long text manually and manage multiple audio files, these endpoints:

1. **Automatically split** long text into chunks no longer than the configured limit
2. **Generate speech** for each chunk in parallel
3. **Combine audio chunks** into a single seamless audio file
4. **Return the combined audio** as a single download

## Endpoints

### 1. `/api/generate-combined` (POST)

**Description**: Generate combined audio from long text using TTSFM's native API format.

**Request Body**:
```json
{
  "text": "Your long text content here...",
  "voice": "alloy",
  "format": "mp3",
  "instructions": "Optional voice instructions",
  "max_length": 4096,
  "preserve_words": true
}
```

**Parameters**:
- `text` (string, required): The text to convert to speech
- `voice` (string, optional): Voice to use. Options: `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `nova`, `onyx`, `sage`, `shimmer`, `verse`. Default: `alloy`
- `format` (string, optional): Audio format. Options: `mp3`, `wav`, `opus`, `aac`, `flac`, `pcm`. Default: `mp3`
- `instructions` (string, optional): Custom instructions for voice modulation
- `max_length` (integer, optional): Maximum characters per chunk. Default: `4096`
- `preserve_words` (boolean, optional): Whether to preserve word boundaries when splitting. Default: `true`

**Response**:
- **Success (200)**: Returns the audio file as binary data
- **Error (400/500/503)**: Returns a JSON error message

**Response Headers**:
- `Content-Type`: Audio MIME type (e.g., `audio/mpeg`)
- `Content-Disposition`: Attachment filename
- `Content-Length`: File size in bytes
- `X-Audio-Format`: Audio format used
- `X-Audio-Size`: Audio file size
- `X-Chunks-Combined`: Number of chunks that were combined
- `X-Original-Text-Length`: Original text length in characters

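
The response headers listed above can be collected into a small structure on the client side. The following is a minimal sketch and not part of TTSFM: `CombinedAudioInfo` and `parse_combined_headers` are hypothetical names, and the code assumes the numeric headers carry plain decimal integers, which this document does not specify.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class CombinedAudioInfo:
    """Metadata read from the documented response headers (hypothetical type)."""
    content_type: Optional[str]
    audio_format: Optional[str]
    audio_size: Optional[int]
    chunks_combined: Optional[int]
    original_text_length: Optional[int]


def _to_int(value: Optional[str]) -> Optional[int]:
    # Assumption: numeric headers are decimal integers; anything else becomes None.
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def parse_combined_headers(headers: Mapping[str, str]) -> CombinedAudioInfo:
    # HTTP header names are case-insensitive, so normalise the keys first.
    lowered = {key.lower(): value for key, value in headers.items()}
    return CombinedAudioInfo(
        content_type=lowered.get("content-type"),
        audio_format=lowered.get("x-audio-format"),
        audio_size=_to_int(lowered.get("x-audio-size")),
        chunks_combined=_to_int(lowered.get("x-chunks-combined")),
        original_text_length=_to_int(lowered.get("x-original-text-length")),
    )
```

With `requests`, `parse_combined_headers(response.headers)` should work as written, since `response.headers` behaves like a mapping.
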
### 2. `/v1/audio/speech-combined` (POST)

**Description**: OpenAI-compatible endpoint for combined audio generation.

**Request Body**:
```json
{
  "model": "gpt-4o-mini-tts",
  "input": "Your long text content here...",
  "voice": "alloy",
  "response_format": "mp3",
  "instructions": "Optional voice instructions",
  "speed": 1.0,
  "max_length": 4096
}
```

**Parameters**:
- `model` (string, optional): Model name (accepted but ignored for compatibility). Default: `gpt-4o-mini-tts`
- `input` (string, required): The text to convert to speech
- `voice` (string, optional): Voice to use (same options as above). Default: `alloy`
- `response_format` (string, optional): Audio format (same options as above). Default: `mp3`
- `instructions` (string, optional): Custom instructions for voice modulation
- `speed` (float, optional): Speech speed (accepted but ignored for compatibility). Default: `1.0`
- `max_length` (integer, optional): Maximum characters per chunk. Default: `4096`

**Response**: Same as `/api/generate-combined`

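
Comparing the two parameter lists suggests how the OpenAI-compatible fields line up with the native ones. The sketch below illustrates that correspondence on the client side. `to_native_payload` is a hypothetical helper, and the mapping (`input` to `text`, `response_format` to `format`) is inferred from the parameter descriptions above rather than taken from the server code.

```python
def to_native_payload(openai_payload: dict) -> dict:
    """Translate an OpenAI-style request body into the native format (illustrative)."""
    if not openai_payload.get("input"):
        raise ValueError("input is required")

    native = {
        "text": openai_payload["input"],
        # Defaults below are the ones documented for both endpoints.
        "voice": openai_payload.get("voice", "alloy"),
        "format": openai_payload.get("response_format", "mp3"),
        "max_length": openai_payload.get("max_length", 4096),
    }
    if "instructions" in openai_payload:
        native["instructions"] = openai_payload["instructions"]
    # "model" and "speed" are documented as accepted but ignored, so they are dropped.
    return native
```
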
## Text Splitting Algorithm

The system uses an intelligent text splitting algorithm with the following priority:

1. **Sentence boundaries** (`.`, `!`, `?`) - Preferred for natural speech flow
2. **Word boundaries** (spaces) - Fallback when sentences are too long
3. **Character boundaries** - Last resort for extremely long words

### Example Splitting:

**Input text (116 characters, max_length=100)**:
```
"This is sentence one. This is sentence two! This is a very long sentence that exceeds the limit and needs splitting."
```

**Output chunks**:
```
Chunk 1: "This is sentence one. This is sentence two!"
Chunk 2: "This is a very long sentence that exceeds the limit and needs splitting."
```

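
The priority order above can be sketched as follows. This is an illustrative implementation, not the project's actual splitter: the name `split_text`, the regular expression used to detect sentence ends, and the reading of `preserve_words=False` as "skip the word-boundary step" are all assumptions. Whitespace between joined pieces is normalised to a single space.

```python
import re
from typing import List


def split_text(text: str, max_length: int = 4096, preserve_words: bool = True) -> List[str]:
    """Split text into chunks of at most max_length characters (illustrative sketch)."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_length:
        return [text]

    chunks: List[str] = []
    current = ""

    def flush() -> None:
        nonlocal current
        if current:
            chunks.append(current)
            current = ""

    def add(piece: str) -> None:
        # Append piece to the current chunk if it fits, otherwise start a new chunk.
        nonlocal current
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_length:
            current = candidate
        else:
            flush()
            current = piece

    # 1. Prefer sentence boundaries: split after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if len(sentence) <= max_length:
            add(sentence)
            continue
        flush()
        # 2. Fall back to word boundaries for sentences that are too long.
        words = sentence.split() if preserve_words else [sentence]
        for word in words:
            if len(word) <= max_length:
                add(word)
                continue
            # 3. Last resort: cut an oversized word at character boundaries.
            flush()
            for start in range(0, len(word), max_length):
                add(word[start:start + max_length])
    flush()
    return chunks
```

Called on the example input with `max_length=100`, this sketch returns the same two chunks shown under "Output chunks".
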
## Audio Combination

The system combines audio chunks using the first of these methods that is available:

1. **PyDub library** (preferred): Professional audio processing with format support
2. **Simple WAV concatenation** (fallback): Basic concatenation for WAV files when PyDub is unavailable
3. **Raw concatenation** (last resort): Simple byte concatenation for other formats

### Supported Formats for Combination:
- **MP3**: Full support with PyDub, raw concatenation fallback
- **WAV**: Full support with PyDub, intelligent concatenation fallback
- **OPUS/AAC/FLAC/PCM**: Full support with PyDub, raw concatenation fallback

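
For the WAV fallback, plain byte concatenation is not enough, because every chunk carries its own header. The standard-library `wave` module can drop the per-chunk headers and write one header for the combined frames. The sketch below shows one way to do that. `concat_wav` is a hypothetical name, and the code assumes every chunk is an uncompressed PCM WAV with the same channel count, sample width, and sample rate. It is not the project's implementation.

```python
import io
import wave
from typing import List


def concat_wav(chunks: List[bytes]) -> bytes:
    """Concatenate WAV byte strings into one WAV file (illustrative sketch)."""
    if not chunks:
        raise ValueError("no audio chunks to combine")

    params = None
    frames = []
    for data in chunks:
        with wave.open(io.BytesIO(data), "rb") as reader:
            fmt = (reader.getnchannels(), reader.getsampwidth(), reader.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                # Mixing formats would produce distorted audio, so refuse.
                raise ValueError("WAV chunks have mismatched formats")
            frames.append(reader.readframes(reader.getnframes()))

    out = io.BytesIO()
    with wave.open(out, "wb") as writer:
        writer.setnchannels(params[0])
        writer.setsampwidth(params[1])
        writer.setframerate(params[2])
        # A single header is written for all frames when the writer closes.
        writer.writeframes(b"".join(frames))
    return out.getvalue()
```

This version holds all frames in memory before writing, which keeps the sketch short but may not suit very long recordings.
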
## Usage Examples

### Python with requests:

```python
import requests

# Long text example (about 7,000 characters, so it is split at max_length=2000)
long_text = "Your very long text content here..." * 200

# Using native API
response = requests.post(
    "http://localhost:8000/api/generate-combined",
    json={
        "text": long_text,
        "voice": "nova",
        "format": "mp3",
        "max_length": 2000
    },
    timeout=120,  # long texts can take a while to process
)

if response.status_code == 200:
    with open("combined_audio.mp3", "wb") as f:
        f.write(response.content)

    chunks_combined = response.headers.get('X-Chunks-Combined')
    print(f"Successfully combined {chunks_combined} chunks")
else:
    print(f"Request failed ({response.status_code}): {response.text}")

# Using OpenAI-compatible API
response = requests.post(
    "http://localhost:8000/v1/audio/speech-combined",
    json={
        "model": "gpt-4o-mini-tts",
        "input": long_text,
        "voice": "alloy",
        "response_format": "wav"
    },
    timeout=120,
)

if response.status_code == 200:
    with open("combined_audio.wav", "wb") as f:
        f.write(response.content)
```

### cURL:

```bash
# Native API
curl -X POST http://localhost:8000/api/generate-combined \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your long text here...",
    "voice": "alloy",
    "format": "mp3",
    "max_length": 4096
  }' \
  --output combined_audio.mp3

# OpenAI-compatible API
curl -X POST http://localhost:8000/v1/audio/speech-combined \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Your long text here...",
    "voice": "nova",
    "response_format": "wav"
  }' \
  --output combined_audio.wav
```

## Error Handling

### Common Error Responses:

**400 Bad Request**:
```json
{
  "error": "Text is required"
}
```

**400 Invalid Voice**:
```json
{
  "error": "Invalid voice: invalid_voice. Must be one of: ['alloy', 'ash', ...]"
}
```

**500 Processing Error**:
```json
{
  "error": "Failed to combine audio chunks"
}
```

**503 Service Unavailable**:
```json
{
  "error": "TTS service is currently unavailable"
}
```

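
A client can turn these responses into readable messages by reading the `error` field and falling back gracefully when the body is not JSON. The following is a minimal sketch. `describe_error` is a hypothetical helper, and the hint text attached to each status class is this example's own wording, not something the API returns.

```python
import json


def describe_error(status_code: int, body: bytes) -> str:
    """Build a readable message from an error response (illustrative sketch)."""
    message = None
    try:
        payload = json.loads(body.decode("utf-8"))
        if isinstance(payload, dict):
            message = payload.get("error")
    except ValueError:
        # Covers both invalid UTF-8 and invalid JSON.
        pass
    if not message:
        message = "no error message in response body"

    if status_code == 503:
        hint = "service unavailable, retrying later may help"
    elif 400 <= status_code < 500:
        hint = "request problem, check the parameters"
    elif status_code >= 500:
        hint = "server-side failure"
    else:
        hint = "unexpected status"
    return f"HTTP {status_code} ({hint}): {message}"
```

With `requests`, this could be called as `describe_error(response.status_code, response.content)` whenever the status code is not 200.
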
## Performance Considerations

- **Chunk processing**: Chunks are processed concurrently for faster generation
- **Memory usage**: Audio combination is optimized for minimal memory footprint
- **Timeout**: Use longer client timeouts when submitting large texts (recommended: 60-120 seconds)
- **File size**: The combined file is roughly as large as all of its chunks together, so long texts produce large downloads

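
Concurrent chunk processing only helps if the results come back in the original text order. One common way to get both is `ThreadPoolExecutor.map`, which runs calls in parallel but yields results in input order. The sketch below assumes a caller-supplied `synthesize` function that turns one chunk of text into audio bytes. `synthesize_all`, `synthesize`, and the default of four workers are hypothetical, and the project may schedule its chunks differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def synthesize_all(
    chunks: List[str],
    synthesize: Callable[[str], bytes],
    max_workers: int = 4,
) -> List[bytes]:
    """Run synthesize on every chunk concurrently, keeping input order (sketch)."""
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the order of `chunks`, regardless of
        # which call finishes first. An exception in any call is re-raised here.
        return list(pool.map(synthesize, chunks))
```
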
## Installation Requirements

For optimal audio combination, install PyDub:

```bash
pip install pydub
```

**Note**: PyDub is optional. The system will fall back to simpler concatenation methods if PyDub is not available. PyDub itself relies on FFmpeg (or libav) to read and write formats other than WAV.

## Limitations

1. **Maximum text length**: While there's no hard limit, very long texts (>50,000 characters) may take significant time to process
2. **Audio quality**: Quality depends on the underlying TTS service (openai.fm)
3. **Format support**: Some advanced audio processing features require PyDub
4. **Processing time**: Longer texts require more processing time due to chunking and combination

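
To get a feel for how much work a long text involves, the number of chunks can be bounded from below by dividing the text length by `max_length` and rounding up. The helper below is a hypothetical sketch. The real chunk count may be higher, because splitting at sentence and word boundaries usually leaves chunks partly full.

```python
def min_chunk_count(text_length: int, max_length: int = 4096) -> int:
    """Lower bound on the number of chunks for a text of the given length."""
    if text_length < 0:
        raise ValueError("text_length must not be negative")
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-text_length // max_length)
```

For a 50,000-character text with the default `max_length`, this gives at least 13 chunks.
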
## Testing

Use the provided test script to verify functionality:

```bash
python test_combined_endpoint.py
```

This will test both endpoints and generate sample audio files for verification.
