Develop audio transcription and embedding pipeline #31
Description
Summary
Implement audio intelligence pipeline with VAD-driven segmentation, speech/non-speech classification, faster-whisper transcription, and spectrogram embeddings for music/ambient audio, as specified in docs/technical_specification.md § Media Processing Pipeline (Audio sections).
Scope
Audio Intelligence Pipeline (Phase 5 - Future Enhancement)
Implement enhanced audio processing from docs/technical_specification.md § Media Processing Pipeline:
Module: src/audio/processor.py, src/audio/transcriber.py, src/audio/embedder.py
Tasks
1. Audio Classification & VAD (src/audio/processor.py)
Voice Activity Detection:
```python
import librosa
import numpy as np
import webrtcvad
from typing import List

def detect_voice_activity(audio_path: str) -> List[dict]:
    """
    Detect speech segments using VAD.

    Returns:
        [
            {
                'start_time': float,  # seconds
                'end_time': float,
                'duration': float,
                'has_speech': bool,
                'confidence': float
            },
            ...
        ]
    """
    # Load audio at 16 kHz mono (a sample rate webrtcvad supports)
    audio, sr = librosa.load(audio_path, sr=16000, mono=True)

    # Initialize VAD
    vad = webrtcvad.Vad(mode=3)  # Aggressive mode

    # Process in 30 ms frames (webrtcvad accepts 10/20/30 ms)
    frame_duration = 30  # ms
    frame_length = int(sr * frame_duration / 1000)

    segments = []
    is_speech = False
    segment_start = 0.0
    last_frame_end = 0.0

    for i in range(0, len(audio), frame_length):
        frame = audio[i:i + frame_length]
        if len(frame) < frame_length:
            break
        last_frame_end = (i + frame_length) / sr

        # Convert to 16-bit PCM bytes, as webrtcvad requires
        frame_bytes = (frame * 32767).astype(np.int16).tobytes()

        # Detect speech
        has_speech = vad.is_speech(frame_bytes, sr)

        # Track segment boundaries
        if has_speech and not is_speech:
            segment_start = i / sr
            is_speech = True
        elif not has_speech and is_speech:
            segments.append({
                'start_time': segment_start,
                'end_time': i / sr,
                'duration': (i / sr) - segment_start,
                'has_speech': True,
                'confidence': 0.9  # VAD is high confidence
            })
            is_speech = False

    # Flush a segment that is still open when the audio ends
    if is_speech:
        segments.append({
            'start_time': segment_start,
            'end_time': last_frame_end,
            'duration': last_frame_end - segment_start,
            'has_speech': True,
            'confidence': 0.9
        })

    return segments
```

Speech vs Music vs Noise Classification:
```python
import librosa
from typing import List

def classify_audio_segments(audio_path: str, segments: List[dict]) -> List[dict]:
    """
    Classify each segment as speech, music, or noise.

    Uses heuristics:
    - Speech: High zero-crossing rate, speech-like formants, VAD positive
    - Music: Harmonic structure, steady rhythm, tonal content
    - Noise: Random spectrum, low harmonic content
    """
    audio, sr = librosa.load(audio_path, sr=22050)

    for segment in segments:
        start_sample = int(segment['start_time'] * sr)
        end_sample = int(segment['end_time'] * sr)
        segment_audio = audio[start_sample:end_sample]

        # Compute features
        zcr = librosa.feature.zero_crossing_rate(segment_audio)[0].mean()
        spectral_centroid = librosa.feature.spectral_centroid(y=segment_audio, sr=sr)[0].mean()
        spectral_rolloff = librosa.feature.spectral_rolloff(y=segment_audio, sr=sr)[0].mean()
        mfcc = librosa.feature.mfcc(y=segment_audio, sr=sr, n_mfcc=13)

        # Classification heuristic
        if segment['has_speech']:
            kind = 'speech'
        elif spectral_centroid > 2000 and zcr < 0.05:
            kind = 'music'
        else:
            kind = 'noise'

        segment['kind'] = kind
        segment['features'] = {
            'zero_crossing_rate': float(zcr),
            'spectral_centroid': float(spectral_centroid),
            'spectral_rolloff': float(spectral_rolloff),
            'mfcc_mean': mfcc.mean(axis=1).tolist()
        }

    return segments
```

2. Speech Transcription (src/audio/transcriber.py)
Faster-Whisper Integration:
```python
import math
import os
import uuid
from typing import Optional

import librosa
import soundfile as sf
from faster_whisper import WhisperModel


class AudioTranscriber:
    def __init__(self, model_size: str = "base"):
        """
        Initialize Faster-Whisper model.
        Models: tiny, base, small, medium, large-v2
        """
        self.model = WhisperModel(
            model_size,
            device="cpu",        # or "cuda" for GPU
            compute_type="int8"  # Quantization for speed
        )

    def transcribe_segment(
        self,
        audio_path: str,
        start_time: float,
        end_time: float,
        language: Optional[str] = None
    ) -> dict:
        """
        Transcribe an audio segment with word timestamps.
        Speaker labels can be attached afterwards by an optional
        diarization pass (pyannote.audio).

        Returns:
            {
                'text': str,
                'language': str,
                'language_probability': float,
                'segments': [
                    {
                        'start': float,
                        'end': float,
                        'text': str,
                        'confidence': float,
                        'words': [...]
                    },
                    ...
                ]
            }
        """
        # Extract segment
        audio, sr = librosa.load(
            audio_path, sr=16000,
            offset=start_time, duration=end_time - start_time
        )

        # Save temporary segment
        temp_path = f"/tmp/segment_{uuid.uuid4()}.wav"
        sf.write(temp_path, audio, sr)

        try:
            # Transcribe
            segments, info = self.model.transcribe(
                temp_path,
                language=language,
                beam_size=5,
                word_timestamps=True
            )

            # Collect results (the segments generator is consumed here)
            transcript_segments = []
            full_text = []
            for segment in segments:
                transcript_segments.append({
                    'start': segment.start,
                    'end': segment.end,
                    'text': segment.text.strip(),
                    # avg_logprob is negative; exp() maps it into (0, 1]
                    # so it is comparable with TRANSCRIPT_MIN_CONFIDENCE
                    'confidence': math.exp(segment.avg_logprob),
                    'words': [
                        {
                            'word': w.word,
                            'start': w.start,
                            'end': w.end,
                            'confidence': w.probability
                        }
                        for w in segment.words
                    ] if segment.words else []
                })
                full_text.append(segment.text.strip())
        finally:
            # Clean up temp file even if transcription fails
            os.remove(temp_path)

        return {
            'text': ' '.join(full_text),
            'language': info.language,
            'language_probability': info.language_probability,
            'segments': transcript_segments
        }
```

Transcript Embedding Generation:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

def generate_transcript_embeddings(transcript: str) -> np.ndarray:
    """
    Generate 768-d text embeddings for transcript.
    Uses sentence-transformers for consistency with document pipeline.
    """
    # 'all-mpnet-base-v2' produces the 768-d vectors the audio_transcript
    # schema expects; 'all-MiniLM-L6-v2' is a faster 384-d alternative
    # (which would require changing the column to vector(384)).
    model = SentenceTransformer('all-mpnet-base-v2')

    embedding = model.encode(
        transcript,
        normalize_embeddings=True,
        convert_to_tensor=False
    )
    return embedding
```

3. Music & Ambient Audio Processing (src/audio/embedder.py)
Spectrogram Generation:
```python
import librosa
import numpy as np
from PIL import Image

def generate_spectrogram_image(audio_path: str, segment: dict) -> Image.Image:
    """
    Generate mel-spectrogram visualization for non-speech audio.
    Returns a PIL Image that can be encoded with CLIP.
    """
    audio, sr = librosa.load(
        audio_path,
        sr=22050,
        offset=segment['start_time'],
        duration=segment['duration']
    )

    # Compute mel-spectrogram
    mel_spec = librosa.feature.melspectrogram(
        y=audio,
        sr=sr,
        n_mels=128,
        fmax=8000
    )

    # Convert to dB scale
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

    # Normalize to 0-255 for image (guard against a flat spectrogram)
    spec_range = mel_spec_db.max() - mel_spec_db.min()
    mel_spec_norm = ((mel_spec_db - mel_spec_db.min()) / max(spec_range, 1e-8) * 255).astype(np.uint8)

    # Convert to RGB image
    image = Image.fromarray(mel_spec_norm).convert('RGB')
    return image
```

Spectrogram Embedding with CLIP:
```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

def encode_spectrogram(spectrogram_image: Image.Image) -> np.ndarray:
    """
    Encode spectrogram as 512-d CLIP embedding.
    Enables semantic search for music/ambient audio.
    """
    clip_model = SentenceTransformer('clip-ViT-B-32')

    embedding = clip_model.encode(
        spectrogram_image,
        normalize_embeddings=True,
        convert_to_tensor=False
    )
    return embedding
```

Genre & Mood Classification:
```python
import librosa

def classify_music_metadata(audio_path: str, segment: dict) -> dict:
    """
    Extract genre, mood, and tempo for music segments.
    Uses librosa features + heuristics.
    """
    audio, sr = librosa.load(
        audio_path,
        offset=segment['start_time'],
        duration=segment['duration']
    )

    # Tempo detection
    tempo, beat_frames = librosa.beat.beat_track(y=audio, sr=sr)

    # Spectral features
    spectral_centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)[0].mean()
    spectral_rolloff = librosa.feature.spectral_rolloff(y=audio, sr=sr)[0].mean()

    # Chroma features (tonality)
    chroma = librosa.feature.chroma_stft(y=audio, sr=sr)
    chroma_mean = chroma.mean(axis=1)

    # Simple heuristic classification
    if tempo > 140:
        mood = 'energetic'
    elif tempo < 80:
        mood = 'calm'
    else:
        mood = 'moderate'

    if spectral_centroid > 3000:
        genre_hint = 'electronic'
    elif spectral_rolloff < 2000:
        genre_hint = 'acoustic'
    else:
        genre_hint = 'mixed'

    return {
        'tempo': float(tempo),
        'mood': mood,
        'genre_hint': genre_hint,
        'spectral_centroid': float(spectral_centroid),
        'spectral_rolloff': float(spectral_rolloff),
        'tonal_centroid': chroma_mean.tolist()  # 12-d mean chroma vector
    }
```

4. Database Schema
Implement tables from docs/technical_specification.md § Data Models & Database Schema:
```sql
-- Audio segments
CREATE TABLE audio_segment (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    asset_id UUID REFERENCES asset(id) NOT NULL,
    segment_index INTEGER NOT NULL,
    start_time REAL NOT NULL,
    end_time REAL NOT NULL,
    kind TEXT CHECK (kind IN ('speech', 'music', 'noise')) NOT NULL,
    embedding vector(512),  -- Spectrogram embedding for music/noise
    metadata JSONB,         -- VAD features, classification scores
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(asset_id, segment_index)
);

CREATE INDEX idx_audio_segment_asset_id ON audio_segment(asset_id);
CREATE INDEX idx_audio_segment_embedding_hnsw ON audio_segment USING hnsw (embedding vector_cosine_ops);

-- Audio transcripts
CREATE TABLE audio_transcript (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    asset_id UUID REFERENCES asset(id) NOT NULL,
    segment_id UUID REFERENCES audio_segment(id),
    text TEXT NOT NULL,
    text_embedding vector(768) NOT NULL,  -- Text embedding
    language TEXT,
    speaker_label TEXT,                   -- Diarization label
    confidence REAL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_audio_transcript_asset_id ON audio_transcript(asset_id);
CREATE INDEX idx_audio_transcript_text_embedding_hnsw ON audio_transcript USING hnsw (text_embedding vector_cosine_ops);
```

5. Search Integration
Enable multi-modal audio search:
- Text Query → Transcript Search:
  - Encode query with text model (768-d)
  - Search audio_transcript.text_embedding
  - Return matching segments with timestamps
- Text Query → Music Search:
  - Encode query with CLIP text encoder (512-d)
  - Search audio_segment.embedding (spectrograms)
  - Support queries like "upbeat electronic music"
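Both search paths score candidates by cosine similarity over L2-normalized embeddings, which is what the HNSW vector_cosine_ops indexes in the schema above accelerate. A minimal pure-Python sketch of the scoring itself, independent of pgvector (illustrative only):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# For normalized embeddings the cosine similarity reduces to the dot
# product; pgvector's <=> operator returns the cosine *distance*,
# i.e. 1 - cosine_similarity.
query = [1.0, 0.0]
doc = [math.sqrt(0.5), math.sqrt(0.5)]
print(round(cosine_similarity(query, doc), 4))  # 0.7071
```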
Example Search Response:
```json
{
  "results": [
    {
      "id": "transcript-uuid-1",
      "modality": "audio",
      "kind": "speech",
      "similarity": 0.88,
      "text": "In this tutorial, we'll learn about machine learning...",
      "parent_asset": {
        "id": "asset-uuid-1",
        "title": "ML Lecture 5.mp3"
      },
      "segment": {
        "start_time": 42.5,
        "end_time": 67.3,
        "speaker": "Speaker 1"
      }
    }
  ]
}
```

Technology Stack
Add to requirements.txt:
- faster-whisper 0.10.0: Fast speech transcription
- webrtcvad 2.0.10: Voice activity detection
- librosa 0.10.1: Audio analysis and feature extraction
- soundfile 0.12.1: Audio file I/O
- pyannote.audio 3.1.0: Speaker diarization (optional)
Configuration
Add to settings.py:
```python
# Audio Processing
AUDIO_TRANSCRIPTION_ENABLED = False  # Feature flag
WHISPER_MODEL_SIZE = "base"          # tiny, base, small, medium, large-v2
WHISPER_LANGUAGE = None              # Auto-detect if None
VAD_AGGRESSIVENESS = 3               # 0-3, higher = more aggressive
AUDIO_MIN_SEGMENT_DURATION = 1.0     # seconds
AUDIO_MAX_SEGMENT_DURATION = 30.0    # seconds
MUSIC_GENRE_CLASSIFICATION = True
SPECTROGRAM_EMBEDDING_ENABLED = True
TRANSCRIPT_MIN_CONFIDENCE = 0.6
```

Acceptance Criteria
- VAD segments audio into speech/non-speech regions
- Speech segments transcribed with faster-whisper
- Transcripts stored in audio_transcript with 768-d embeddings
- Music/noise segments generate spectrogram embeddings (512-d)
- Genre and mood metadata extracted for music segments
- Search supports both transcript text and spectrogram similarity
- Configuration flags control transcription model and VAD settings
- Speaker diarization labels attached to transcript segments
- E2E test: upload podcast → segments detected → speech transcribed → searchable by keywords
- E2E test: upload music track → spectrogram embedded → searchable by mood/genre
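Several criteria above assume the configured duration bounds (AUDIO_MIN_SEGMENT_DURATION, AUDIO_MAX_SEGMENT_DURATION) are enforced on VAD output. A minimal sketch of that post-processing pass; the helper name is hypothetical:

```python
def enforce_duration_bounds(segments, min_dur=1.0, max_dur=30.0):
    """Drop segments shorter than min_dur; split segments longer than max_dur.

    A trailing chunk shorter than min_dur is kept for simplicity.
    """
    result = []
    for seg in segments:
        if seg['duration'] < min_dur:
            continue  # too short to transcribe or embed usefully
        start = seg['start_time']
        # Emit full-length chunks until the remainder fits in max_dur
        while seg['end_time'] - start > max_dur:
            result.append({**seg, 'start_time': start,
                           'end_time': start + max_dur, 'duration': max_dur})
            start += max_dur
        result.append({**seg, 'start_time': start,
                       'end_time': seg['end_time'],
                       'duration': seg['end_time'] - start})
    return result
```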
Testing Requirements
- Unit test: VAD segment detection
- Unit test: speech/music/noise classification
- Unit test: faster-whisper transcription
- Unit test: spectrogram generation
- Unit test: genre/mood classification
- Integration test: full pipeline from audio upload to transcript search
- Performance test: transcription time for 10-minute audio file
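The tempo-to-mood heuristic in classify_music_metadata is pure and easy to cover in isolation; a sketch of such a unit test, with the extracted helper hypothetical:

```python
def mood_from_tempo(tempo: float) -> str:
    """Same thresholds as the heuristic in classify_music_metadata."""
    if tempo > 140:
        return 'energetic'
    elif tempo < 80:
        return 'calm'
    return 'moderate'

def test_mood_from_tempo():
    assert mood_from_tempo(170.0) == 'energetic'
    assert mood_from_tempo(60.0) == 'calm'
    assert mood_from_tempo(100.0) == 'moderate'
    # Boundary values fall into 'moderate' (thresholds are strict)
    assert mood_from_tempo(140.0) == 'moderate'
    assert mood_from_tempo(80.0) == 'moderate'

test_mood_from_tempo()
```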
References
- docs/technical_specification.md § Media Processing Pipeline (lines 772-776, 798-802)
- docs/technical_specification.md § Data Models & Database Schema (lines 1992-2032)
- docs/technical_specification.md § Implementation Checklist Phase 5 (lines 2346-2352)