Commit f6c8c51

Implement real audio format conversion with ffmpeg
Replaces legacy format mapping with actual conversion for OPUS, AAC, FLAC, and PCM using ffmpeg, ensuring correct MIME types and file extensions. Adds format selector to playground UI, updates client and async client to handle conversion and content-type headers, and removes get_supported_format and maps_to_wav from models. Updates version to 3.4.0-alpha4 and fixes speed display in playground.
1 parent c7952a0 commit f6c8c51
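
The conversion step described in the commit message can be sketched as follows. This is a minimal illustration, not the project's `convert_audio_format()` implementation: the function names, the `locate` parameter, and the muxer/encoder table are assumptions chosen for the example, and the `libopus` encoder is only available in ffmpeg builds that include it. The async wrapper uses `asyncio.to_thread` as one way to keep the blocking ffmpeg call off the event loop.

```python
from __future__ import annotations

import asyncio
import shutil
import subprocess
from typing import Callable, Optional

# Assumed mapping: requested format -> (ffmpeg muxer, audio encoder).
FFMPEG_TARGETS = {
    "opus": ("opus", "libopus"),    # Ogg Opus; needs an ffmpeg build with libopus
    "aac": ("adts", "aac"),         # AAC in an ADTS stream
    "flac": ("flac", "flac"),
    "pcm": ("s16le", "pcm_s16le"),  # headerless 16-bit little-endian samples
}


def build_ffmpeg_command(ffmpeg: str, target: str) -> list[str]:
    """Build a command that reads WAV on stdin and writes `target` to stdout."""
    muxer, encoder = FFMPEG_TARGETS[target]
    return [
        ffmpeg, "-hide_banner", "-loglevel", "error",
        "-i", "pipe:0",
        "-c:a", encoder,
        "-f", muxer,
        "pipe:1",
    ]


def convert_wav(
    wav_bytes: bytes,
    target: str,
    locate: Callable[[str], Optional[str]] = shutil.which,
) -> tuple[bytes, str]:
    """Return (audio, actual_format); fall back to the WAV input on any failure."""
    if target not in FFMPEG_TARGETS:
        return wav_bytes, "wav"  # nothing to convert
    ffmpeg = locate("ffmpeg")
    if ffmpeg is None:
        return wav_bytes, "wav"  # graceful fallback: ffmpeg is not installed
    try:
        proc = subprocess.run(
            build_ffmpeg_command(ffmpeg, target),
            input=wav_bytes,
            capture_output=True,
            check=False,
        )
    except OSError:
        return wav_bytes, "wav"  # ffmpeg could not be started
    if proc.returncode != 0 or not proc.stdout:
        return wav_bytes, "wav"  # conversion failed; keep the original audio
    return proc.stdout, target


async def convert_wav_async(
    wav_bytes: bytes,
    target: str,
    locate: Callable[[str], Optional[str]] = shutil.which,
) -> tuple[bytes, str]:
    """Run the blocking conversion in the default thread pool."""
    return await asyncio.to_thread(convert_wav, wav_bytes, target, locate)
```

Returning the actual format alongside the bytes lets the caller set the content type from what was really produced, which matters when the fallback path returns the unconverted WAV.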

12 files changed, +302 -148 lines changed

CHANGELOG.md

Lines changed: 37 additions & 0 deletions
@@ -5,6 +5,43 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [3.4.0-alpha4] - 2025-10-28
+
+### Added
+- **Format conversion with ffmpeg**: All 6 audio formats now properly converted using ffmpeg
+  - MP3, WAV: Direct from openai.fm (no conversion needed)
+  - OPUS, AAC, FLAC, PCM: Converted from WAV using ffmpeg
+  - Proper MIME type headers for each format (audio/opus, audio/aac, audio/flac, audio/pcm)
+  - Downloads now have correct file extensions (.opus, .aac, .flac, .pcm)
+- **Format selector in playground**: Added dropdown to select audio format in web UI
+  - Clean display showing only format names (mp3, wav, opus, aac, flac, pcm)
+  - Integrated with existing playground functionality
+
+### Fixed
+- **Content-Type headers after format conversion**: Fixed issue where converted formats returned wrong content-type
+  - Added `_get_content_type_for_format()` helper method to both sync and async clients
+  - Content-type now properly updated after ffmpeg conversion
+  - Downloads now use correct file extensions based on actual format
+- **Speed display in playground**: Fixed bug where speed always showed "1.0x" regardless of actual speed
+  - Updated `buildGenerationMeta()` to include speed and speedApplied fields
+  - Speed now correctly displayed in audio stats (0.25x, 0.5x, 1.0x, 1.5x, 2.0x, 4.0x)
+
+### Changed
+- **Removed legacy format mapping**: Eliminated header-based format "faking" in favor of real conversion
+  - Removed `get_supported_format()` and `maps_to_wav()` functions from `ttsfm/models.py`
+  - Simplified client code by ~30 lines
+  - All formats now return actual requested format, not approximations
+- **Migrated playground to OpenAI API**: Removed old `/api/generate` endpoints
+  - Playground now uses `/v1/audio/speech` endpoint exclusively
+  - Consistent API format across all interfaces
+  - Speed parameter now works correctly in playground
+
+### Technical
+- Format conversion uses `convert_audio_format()` from `audio_processing.py`
+- Async client runs ffmpeg conversion in thread pool to avoid blocking
+- Graceful fallback to original format if ffmpeg unavailable
+- All 25 tests passing with new format conversion logic
+
 ## [3.4.0-alpha3] - 2025-10-26

 ### Fixed
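
The changelog names a `_get_content_type_for_format()` helper, but its body is not part of the hunks shown here. A lookup of that kind might look like the sketch below. The function names are hypothetical, the four MIME types for converted formats are the ones listed in the changelog, and the `mp3` and `wav` entries are assumed common values rather than something taken from the diff.

```python
from __future__ import annotations

# MIME types for opus, aac, flac and pcm are the ones listed in the changelog.
# The mp3 and wav entries are assumptions (common values, not taken from the diff).
CONTENT_TYPES = {
    "mp3": "audio/mpeg",
    "wav": "audio/wav",
    "opus": "audio/opus",
    "aac": "audio/aac",
    "flac": "audio/flac",
    "pcm": "audio/pcm",
}


def content_type_for_format(fmt: str) -> str:
    """Return the MIME type for a format name, or a generic binary type."""
    return CONTENT_TYPES.get(fmt.lower(), "application/octet-stream")


def download_headers(fmt: str, stem: str = "speech") -> dict[str, str]:
    """Build response headers whose type and extension match the actual format."""
    fmt = fmt.lower()
    return {
        "Content-Type": content_type_for_format(fmt),
        "Content-Disposition": f'attachment; filename="{stem}.{fmt}"',
    }
```

Calling `download_headers()` with the format that was actually produced, rather than the one requested, keeps the header and file extension honest when conversion falls back to the original audio.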

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ ttsfm = "ttsfm.cli:main"
 version_scheme = "no-guess-dev"
 local_scheme = "no-local-version"

-fallback_version = "3.4.0-alpha3"
+fallback_version = "3.4.0-alpha4"
 [tool.setuptools]
 packages = ["ttsfm"]

tests/test_audio_processing.py

Lines changed: 3 additions & 5 deletions
@@ -48,10 +48,7 @@ def test_adjust_audio_speed_no_change(self):
         result = adjust_audio_speed(dummy_audio, speed=1.0)
         assert result == dummy_audio

-    @pytest.mark.skipif(
-        not shutil.which("ffmpeg"),
-        reason="ffmpeg not available"
-    )
+    @pytest.mark.skipif(not shutil.which("ffmpeg"), reason="ffmpeg not available")
     def test_adjust_audio_speed_requires_ffmpeg(self):
         """Test that speed adjustment requires ffmpeg."""
         # This test only runs if ffmpeg is available
@@ -90,6 +87,7 @@ def test_combine_mp3_without_ffmpeg(self, monkeypatch):
         """Test that MP3 combining fails gracefully without ffmpeg."""
         # Mock both pydub and ffmpeg as unavailable
         import ttsfm.audio
+
         monkeypatch.setattr(ttsfm.audio, "AudioSegment", None)
         monkeypatch.setattr(ttsfm.audio, "FFMPEG_AVAILABLE", False)

@@ -104,6 +102,7 @@ def test_combine_wav_without_ffmpeg(self, monkeypatch):
         """Test that WAV combining works without ffmpeg."""
         # Mock pydub as unavailable but allow WAV concatenation
         import ttsfm.audio
+
         monkeypatch.setattr(ttsfm.audio, "AudioSegment", None)

         from ttsfm.audio import combine_audio_chunks
@@ -115,4 +114,3 @@ def test_combine_wav_without_ffmpeg(self, monkeypatch):
         # Should not raise error for WAV
         result = combine_audio_chunks(chunks, format_type="wav")
         assert isinstance(result, bytes)
-

ttsfm-web/app.py

Lines changed: 13 additions & 20 deletions
@@ -42,25 +42,23 @@

 # Import the TTSFM package
 try:
-    from ttsfm import AudioFormat, TTSClient, TTSException, Voice
+    from ttsfm import AudioFormat, TTSClient, Voice
     from ttsfm.audio import combine_audio_chunks
     from ttsfm.exceptions import (
         APIException,
         AudioProcessingException,
         NetworkException,
         ValidationException,
     )
-    from ttsfm.models import get_supported_format
     from ttsfm.utils import split_text_by_length
 except ImportError:
     # Fallback for development when package is not installed
     import sys

     sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
-    from ttsfm import AudioFormat, TTSClient, TTSException, Voice
+    from ttsfm import AudioFormat, TTSClient, Voice
     from ttsfm.audio import combine_audio_chunks
     from ttsfm.exceptions import APIException, NetworkException, ValidationException
-    from ttsfm.models import get_supported_format
     from ttsfm.utils import split_text_by_length

 # Load environment variables
@@ -486,10 +484,6 @@ def validate_text():
         return jsonify({"error": "Text validation failed"}), 500


-
-
-
-
 @app.route("/api/status", methods=["GET"])
 def get_status():
     """Get service status."""
@@ -503,7 +497,7 @@ def get_status():
             {
                 "status": "online",
                 "tts_service": "openai.fm (free)",
-                "package_version": "3.4.0a3",
+                "package_version": "3.4.0a4",
                 "timestamp": datetime.now().isoformat(),
             }
         )
@@ -527,7 +521,7 @@ def get_status():
 def health_check():
     """Simple health check endpoint."""
     return jsonify(
-        {"status": "healthy", "package_version": "3.4.0a3", "timestamp": datetime.now().isoformat()}
+        {"status": "healthy", "package_version": "3.4.0a4", "timestamp": datetime.now().isoformat()}
     )


@@ -694,15 +688,12 @@ def openai_speech():
                 400,
             )

-        effective_format = get_supported_format(format_enum)
-
         logger.info(
             "OpenAI API: Generating speech: text='%s...', voice=%s, "
-            "requested_format=%s (effective=%s), auto_combine=%s, speed=%s",
+            "requested_format=%s, auto_combine=%s, speed=%s",
             input_text[:50],
             voice,
             response_format,
-            effective_format.value,
             auto_combine,
             speed,
         )
@@ -715,14 +706,14 @@ def openai_speech():
             logger.info(
                 "Long text detected (%s chars); auto-combining with format %s",
                 len(input_text),
-                effective_format.value,
+                format_enum.value,
             )

             # Generate speech chunks
             responses = client.generate_speech_long_text(
                 text=input_text,
                 voice=voice_enum,
-                response_format=effective_format,
+                response_format=format_enum,
                 instructions=instructions,
                 max_length=max_length,
                 preserve_words=True,
@@ -778,13 +769,14 @@ def openai_speech():
                 "X-Auto-Combine": "true",
                 "X-Powered-By": "TTSFM-OpenAI-Compatible",
                 "X-Requested-Format": format_enum.value,
-                "X-Effective-Format": effective_format.value,
             }

             # Add speed metadata if available (from first response)
             if responses and responses[0].metadata and "requested_speed" in responses[0].metadata:
                 headers["X-Requested-Speed"] = str(responses[0].metadata["requested_speed"])
-                headers["X-Speed-Applied"] = str(responses[0].metadata.get("speed_applied", False)).lower()
+                headers["X-Speed-Applied"] = str(
+                    responses[0].metadata.get("speed_applied", False)
+                ).lower()

             return Response(
                 stream_with_context(_chunk_bytes(combined_audio)),
@@ -834,13 +826,14 @@ def openai_speech():
                 "X-Auto-Combine": str(auto_combine).lower(),
                 "X-Powered-By": "TTSFM-OpenAI-Compatible",
                 "X-Requested-Format": format_enum.value,
-                "X-Effective-Format": effective_format.value,
             }

             # Add speed metadata if available
             if response.metadata and "requested_speed" in response.metadata:
                 headers["X-Requested-Speed"] = str(response.metadata["requested_speed"])
-                headers["X-Speed-Applied"] = str(
-                    response.metadata.get("speed_applied", False)
-                ).lower()
+                headers["X-Speed-Applied"] = str(
+                    response.metadata.get("speed_applied", False)
+                ).lower()

             return Response(
                 stream_with_context(_chunk_bytes(response.audio_data)),
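
The header logic left after this change can be read in isolation as the sketch below. The helper name `build_speech_headers` and its signature are hypothetical (the route builds the dict inline); the header names and the `metadata` keys are the ones visible in the hunks above, and `X-Effective-Format` is intentionally absent because the commit removes it.

```python
from __future__ import annotations

from typing import Any, Optional


def build_speech_headers(
    requested_format: str,
    auto_combine: bool,
    metadata: Optional[dict[str, Any]] = None,
) -> dict[str, str]:
    """Assemble the custom response headers for a generated audio response."""
    headers = {
        "X-Auto-Combine": str(auto_combine).lower(),  # "true" or "false"
        "X-Powered-By": "TTSFM-OpenAI-Compatible",
        "X-Requested-Format": requested_format,
    }
    # Speed headers are only added when the client recorded a requested speed.
    if metadata and "requested_speed" in metadata:
        headers["X-Requested-Speed"] = str(metadata["requested_speed"])
        headers["X-Speed-Applied"] = str(metadata.get("speed_applied", False)).lower()
    return headers
```

Lowercasing `str(bool)` yields `"true"` or `"false"`, which is easier for JavaScript clients to compare against than Python's `"True"` and `"False"`.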

ttsfm-web/static/js/playground-enhanced-fixed.js

Lines changed: 8 additions & 1 deletion
@@ -652,10 +652,17 @@ const PlaygroundApp = (() => {

     async function loadFormats({ refresh = false } = {}) {
         try {
-            const data = await fetchFormats({ refresh });
             if (!els.formatSelect) {
                 return;
             }
+
+            // If format select already has options (from HTML), don't reload
+            if (els.formatSelect.options.length > 1 && !refresh) {
+                updateAudioSummary();
+                return;
+            }
+
+            const data = await fetchFormats({ refresh });
             els.formatSelect.innerHTML = '';
             data.formats.forEach((format) => {
                 const option = document.createElement('option');

ttsfm-web/templates/base.html

Lines changed: 2 additions & 2 deletions
@@ -88,7 +88,7 @@
 <a class="navbar-brand" href="{{ url_for('index') }}">
     <i class="fas fa-microphone-alt me-2"></i>
     <span class="fw-bold">TTSFM</span>
-    <span class="badge bg-primary ms-2 small">v3.4.0-alpha3</span>
+    <span class="badge bg-primary ms-2 small">v3.4.0-alpha4</span>
 </a>

 <button class="navbar-toggler border-0" type="button" data-bs-toggle="collapse" data-bs-target="#navbarNav" aria-controls="navbarNav" aria-expanded="false" aria-label="Toggle navigation">
@@ -159,7 +159,7 @@
     <div class="d-flex align-items-center">
         <i class="fas fa-microphone-alt me-2 text-primary"></i>
         <strong class="text-dark">TTSFM</strong>
-        <span class="ms-2 text-muted">v3.4.0-alpha3</span>
+        <span class="ms-2 text-muted">v3.4.0-alpha4</span>
     </div>
 </div>
 <div class="col-md-6 text-md-end">

ttsfm-web/templates/playground.html

Lines changed: 21 additions & 1 deletion
@@ -95,7 +95,7 @@ <h4 class="mb-0 d-flex align-items-center">

 <div class="row">
     <!-- Enhanced Voice Selection -->
-    <div class="col-md-12 mb-4">
+    <div class="col-md-6 mb-4">
         <label for="voice-select" class="form-label fw-bold d-flex align-items-center">
             <i class="fas fa-microphone me-2 text-primary"></i>
             {{ _('playground.voice_label') }}
@@ -107,6 +107,26 @@ <h4 class="mb-0 d-flex align-items-center">
             <span>{{ _('common.choose_voice') }}</span>
         </div>
     </div>
+
+    <!-- Format Selection -->
+    <div class="col-md-6 mb-4">
+        <label for="format-select" class="form-label fw-bold d-flex align-items-center">
+            <i class="fas fa-file-audio me-2 text-primary"></i>
+            {{ _('playground.format_label') if _('playground.format_label') != 'playground.format_label' else 'Audio Format' }}
+        </label>
+        <select class="form-select shadow-sm" id="format-select" required>
+            <option value="mp3" selected>mp3</option>
+            <option value="wav">wav</option>
+            <option value="opus">opus</option>
+            <option value="aac">aac</option>
+            <option value="flac">flac</option>
+            <option value="pcm">pcm</option>
+        </select>
+        <div class="form-text">
+            <i class="fas fa-info-circle me-1"></i>
+            {{ _('playground.format_description') if _('playground.format_description') != 'playground.format_description' else 'Choose audio output format. Converted formats require ffmpeg.' }}
+        </div>
+    </div>
 </div>

 <!-- Advanced Options -->

ttsfm/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@
 )
 from .utils import split_text_by_length, validate_text_length

-__version__ = "3.3.7"
+__version__ = "3.4.0-alpha4"
 __author__ = "dbcccc"
 __email__ = "[email protected]"
 __description__ = "Text-to-Speech API Client with OpenAI compatibility"
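
The commit spells the same release two ways: `3.4.0-alpha4` in `__version__`, `pyproject.toml`, and the templates, and `3.4.0a4` in the `package_version` fields of `app.py`. Under PEP 440 normalization these are the same version. The sketch below shows that equivalence for the simple `release + pre-release` shape only; the function name is hypothetical and this is a deliberately partial illustration, not a full PEP 440 parser (the `packaging` library would be the usual choice for that).

```python
import re

# PEP 440 spellings of pre-release labels and their normalized forms.
_PRE_LABELS = {
    "alpha": "a", "a": "a",
    "beta": "b", "b": "b",
    "preview": "rc", "pre": "rc", "rc": "rc", "c": "rc",
}

_PATTERN = re.compile(
    r"(\d+(?:\.\d+)*)"                        # release segment, e.g. 3.4.0
    r"[-_.]?(alpha|beta|preview|pre|rc|a|b|c)"  # optional separator + label
    r"[-_.]?(\d+)"                            # optional separator + number
)


def normalize_prerelease(version: str) -> str:
    """Normalize 'X.Y.Z-alphaN' style strings; return other inputs unchanged."""
    match = _PATTERN.fullmatch(version.strip().lower())
    if match is None:
        return version
    release, label, number = match.groups()
    return f"{release}{_PRE_LABELS[label]}{int(number)}"
```

Comparing normalized strings is one way a status endpoint and the package metadata could be checked for agreement, for example in a release test.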
