This repository was archived by the owner on Nov 15, 2025. It is now read-only.

Develop audio transcription and embedding pipeline #31

@thewildofficial

Description


Summary

Implement audio intelligence pipeline with VAD-driven segmentation, speech/non-speech classification, faster-whisper transcription, and spectrogram embeddings for music/ambient audio, as specified in docs/technical_specification.md § Media Processing Pipeline (Audio sections).

Scope

Audio Intelligence Pipeline (Phase 5 - Future Enhancement)

Implement enhanced audio processing from docs/technical_specification.md § Media Processing Pipeline:

Module: src/audio/processor.py, src/audio/transcriber.py, src/audio/embedder.py

Tasks

1. Audio Classification & VAD (src/audio/processor.py)

Voice Activity Detection:

import webrtcvad
import librosa
import numpy as np
from typing import List

def detect_voice_activity(audio_path: str) -> List[dict]:
    """
    Detect speech segments using WebRTC VAD.

    Returns a list of detected speech segments:
        [
            {
                'start_time': float,  # seconds
                'end_time': float,
                'duration': float,
                'has_speech': bool,
                'confidence': float
            },
            ...
        ]
    """
    # webrtcvad only accepts 8/16/32/48 kHz mono 16-bit PCM,
    # in 10/20/30 ms frames
    audio, sr = librosa.load(audio_path, sr=16000, mono=True)

    # Initialize VAD
    vad = webrtcvad.Vad(mode=3)  # 0-3; 3 = most aggressive

    # Process in 30ms frames
    frame_duration = 30  # ms
    frame_length = int(sr * frame_duration / 1000)

    segments = []
    is_speech = False
    segment_start = 0.0

    for i in range(0, len(audio), frame_length):
        frame = audio[i:i + frame_length]

        # Drop the trailing partial frame; webrtcvad rejects it
        if len(frame) < frame_length:
            break

        # Convert float32 [-1, 1] to 16-bit PCM bytes
        frame_bytes = (frame * 32767).astype(np.int16).tobytes()

        # Detect speech
        has_speech = vad.is_speech(frame_bytes, sr)

        # Track segment boundaries
        if has_speech and not is_speech:
            segment_start = i / sr
            is_speech = True
        elif not has_speech and is_speech:
            segments.append({
                'start_time': segment_start,
                'end_time': i / sr,
                'duration': (i / sr) - segment_start,
                'has_speech': True,
                'confidence': 0.9  # webrtcvad returns no score; fixed estimate
            })
            is_speech = False

    # Close a segment still open at end of file
    if is_speech:
        end = len(audio) / sr
        segments.append({
            'start_time': segment_start,
            'end_time': end,
            'duration': end - segment_start,
            'has_speech': True,
            'confidence': 0.9
        })

    return segments

Speech vs Music vs Noise Classification:

import librosa
from typing import List

def classify_audio_segments(audio_path: str, segments: List[dict]) -> List[dict]:
    """
    Classify each segment as speech, music, or noise.

    Uses heuristics:
    - Speech: VAD positive
    - Music: harmonic, tonal content (approximated here by a bright
      spectral centroid with a low zero-crossing rate)
    - Noise: everything else
    """
    audio, sr = librosa.load(audio_path, sr=22050)

    for segment in segments:
        start_sample = int(segment['start_time'] * sr)
        end_sample = int(segment['end_time'] * sr)
        segment_audio = audio[start_sample:end_sample]

        # Compute features
        zcr = librosa.feature.zero_crossing_rate(segment_audio)[0].mean()
        spectral_centroid = librosa.feature.spectral_centroid(y=segment_audio, sr=sr)[0].mean()
        spectral_rolloff = librosa.feature.spectral_rolloff(y=segment_audio, sr=sr)[0].mean()
        mfcc = librosa.feature.mfcc(y=segment_audio, sr=sr, n_mfcc=13)

        # Rough threshold heuristic; tune against labeled data
        if segment['has_speech']:
            kind = 'speech'
        elif spectral_centroid > 2000 and zcr < 0.05:
            kind = 'music'
        else:
            kind = 'noise'

        segment['kind'] = kind
        segment['features'] = {
            'zero_crossing_rate': float(zcr),
            'spectral_centroid': float(spectral_centroid),
            'spectral_rolloff': float(spectral_rolloff),
            'mfcc_mean': mfcc.mean(axis=1).tolist()
        }

    return segments

2. Speech Transcription (src/audio/transcriber.py)

Faster-Whisper Integration:

import os
import tempfile
from typing import Optional

import librosa
import soundfile as sf
from faster_whisper import WhisperModel

class AudioTranscriber:
    def __init__(self, model_size: str = "base"):
        """
        Initialize Faster-Whisper model.

        Models: tiny, base, small, medium, large-v2
        """
        self.model = WhisperModel(
            model_size,
            device="cpu",  # or "cuda" for GPU
            compute_type="int8"  # Quantization for speed
        )

    def transcribe_segment(
        self,
        audio_path: str,
        start_time: float,
        end_time: float,
        language: Optional[str] = None
    ) -> dict:
        """
        Transcribe an audio segment.

        Speaker labels are not produced here; diarization (e.g. via
        pyannote.audio) can attach them in a later pass.

        Returns:
            {
                'text': str,
                'language': str,
                'language_probability': float,
                'segments': [
                    {
                        'start': float,  # absolute source-file seconds
                        'end': float,
                        'text': str,
                        'confidence': float,
                        'words': [...]
                    },
                    ...
                ]
            }
        """
        # Extract segment at Whisper's expected 16 kHz
        audio, sr = librosa.load(
            audio_path, sr=16000,
            offset=start_time, duration=end_time - start_time
        )

        # Write a temporary WAV for the model; clean up even on failure
        fd, temp_path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
        try:
            sf.write(temp_path, audio, sr)

            # Transcribe
            segments, info = self.model.transcribe(
                temp_path,
                language=language,
                beam_size=5,
                word_timestamps=True
            )

            # Collect results
            transcript_segments = []
            full_text = []

            for segment in segments:
                transcript_segments.append({
                    # Offset back to absolute positions in the source file
                    'start': segment.start + start_time,
                    'end': segment.end + start_time,
                    'text': segment.text.strip(),
                    # avg_logprob is a log-probability, not a 0-1 score
                    'confidence': segment.avg_logprob,
                    'words': [
                        {
                            'word': w.word,
                            'start': w.start + start_time,
                            'end': w.end + start_time,
                            'confidence': w.probability
                        }
                        for w in segment.words
                    ] if segment.words else []
                })
                full_text.append(segment.text.strip())
        finally:
            os.remove(temp_path)

        return {
            'text': ' '.join(full_text),
            'language': info.language,
            'language_probability': info.language_probability,
            'segments': transcript_segments
        }

Transcript Embedding Generation:

import numpy as np
from sentence_transformers import SentenceTransformer

def generate_transcript_embeddings(transcript: str) -> np.ndarray:
    """
    Generate 768-d text embeddings for transcript.

    Uses sentence-transformers for consistency with the document
    pipeline. The audio_transcript.text_embedding column is
    vector(768), so the model must emit 768-d vectors.
    """
    model = SentenceTransformer('all-mpnet-base-v2')  # 768-d
    # 'all-MiniLM-L6-v2' is faster but 384-d; the column width
    # would have to change to match

    embedding = model.encode(
        transcript,
        normalize_embeddings=True,
        convert_to_tensor=False
    )

    return embedding

3. Music & Ambient Audio Processing (src/audio/embedder.py)

Spectrogram Generation:

import librosa
import numpy as np
from PIL import Image

def generate_spectrogram_image(audio_path: str, segment: dict) -> Image.Image:
    """
    Generate mel-spectrogram visualization for non-speech audio.

    Returns PIL Image that can be encoded with CLIP.
    """
    audio, sr = librosa.load(
        audio_path,
        sr=22050,
        offset=segment['start_time'],
        duration=segment['duration']
    )

    # Compute mel-spectrogram
    mel_spec = librosa.feature.melspectrogram(
        y=audio,
        sr=sr,
        n_mels=128,
        fmax=8000
    )

    # Convert to dB scale
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

    # Normalize to 0-255 (epsilon guards against a silent segment)
    spread = mel_spec_db.max() - mel_spec_db.min()
    mel_spec_norm = ((mel_spec_db - mel_spec_db.min()) / (spread + 1e-9) * 255).astype(np.uint8)

    # Flip so low frequencies sit at the bottom, then convert to RGB
    image = Image.fromarray(np.flipud(mel_spec_norm)).convert('RGB')

    return image

Spectrogram Embedding with CLIP:

import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

def encode_spectrogram(spectrogram_image: Image.Image) -> np.ndarray:
    """
    Encode spectrogram as 512-d CLIP embedding.

    Enables semantic search for music/ambient audio.
    """
    # In practice, load once and reuse across segments
    clip_model = SentenceTransformer('clip-ViT-B-32')

    embedding = clip_model.encode(
        spectrogram_image,
        normalize_embeddings=True,
        convert_to_tensor=False
    )

    return embedding

Genre & Mood Classification:

import librosa
import numpy as np

def classify_music_metadata(audio_path: str, segment: dict) -> dict:
    """
    Extract genre, mood, and tempo for music segments.

    Uses librosa features + rough heuristics; thresholds are
    starting points, not calibrated values.
    """
    audio, sr = librosa.load(
        audio_path,
        offset=segment['start_time'],
        duration=segment['duration']
    )

    # Tempo detection (beat frames unused here)
    tempo, _ = librosa.beat.beat_track(y=audio, sr=sr)

    # Spectral features
    spectral_centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)[0].mean()
    spectral_rolloff = librosa.feature.spectral_rolloff(y=audio, sr=sr)[0].mean()

    # Chroma features (tonality)
    chroma = librosa.feature.chroma_stft(y=audio, sr=sr)
    chroma_mean = chroma.mean(axis=1)

    # Simple heuristic classification
    if tempo > 140:
        mood = 'energetic'
    elif tempo < 80:
        mood = 'calm'
    else:
        mood = 'moderate'

    if spectral_centroid > 3000:
        genre_hint = 'electronic'
    elif spectral_rolloff < 2000:
        genre_hint = 'acoustic'
    else:
        genre_hint = 'mixed'

    return {
        'tempo': float(tempo),
        'mood': mood,
        'genre_hint': genre_hint,
        'spectral_centroid': float(spectral_centroid),
        'spectral_rolloff': float(spectral_rolloff),
        'chroma_mean': chroma_mean.tolist()  # 12-bin mean chroma vector
    }
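Taken together, tasks 1-3 chain into one pass per asset: detect, classify, then route each segment by kind. A minimal glue sketch (the `process_audio` name and the callable-injection style are illustrative, not from the spec; injecting the stage functions keeps the router testable without loading any audio models):

```python
from typing import Callable, List


def process_audio(
    audio_path: str,
    detect: Callable[[str], List[dict]],                # e.g. detect_voice_activity
    classify: Callable[[str, List[dict]], List[dict]],  # e.g. classify_audio_segments
    transcribe: Callable[[str, float, float], dict],    # e.g. transcriber.transcribe_segment
    embed: Callable[[str, dict], object],               # e.g. spectrogram embed step
) -> List[dict]:
    """Run VAD, classify segments, then route each one by kind."""
    segments = classify(audio_path, detect(audio_path))
    results = []
    for seg in segments:
        record = dict(seg)
        if seg['kind'] == 'speech':
            record['transcript'] = transcribe(
                audio_path, seg['start_time'], seg['end_time']
            )
        else:  # music / noise get spectrogram embeddings
            record['embedding'] = embed(audio_path, seg)
        results.append(record)
    return results
```

Passing the real stage functions in production and stubs in tests keeps the routing logic covered by fast unit tests.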

4. Database Schema

Implement tables from docs/technical_specification.md § Data Models & Database Schema:

-- Audio segments
CREATE TABLE audio_segment (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    asset_id UUID REFERENCES asset(id) NOT NULL,
    segment_index INTEGER NOT NULL,
    start_time REAL NOT NULL,
    end_time REAL NOT NULL,
    kind TEXT CHECK (kind IN ('speech', 'music', 'noise')) NOT NULL,
    embedding vector(512),  -- Spectrogram embedding for music/noise
    metadata JSONB,  -- VAD features, classification scores
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(asset_id, segment_index)
);

CREATE INDEX idx_audio_segment_asset_id ON audio_segment(asset_id);
CREATE INDEX idx_audio_segment_embedding_hnsw ON audio_segment USING hnsw (embedding vector_cosine_ops);

-- Audio transcripts
CREATE TABLE audio_transcript (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    asset_id UUID REFERENCES asset(id) NOT NULL,
    segment_id UUID REFERENCES audio_segment(id),
    text TEXT NOT NULL,
    text_embedding vector(768) NOT NULL,  -- Text embedding
    language TEXT,
    speaker_label TEXT,  -- Diarization label
    confidence REAL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_audio_transcript_asset_id ON audio_transcript(asset_id);
CREATE INDEX idx_audio_transcript_text_embedding_hnsw ON audio_transcript USING hnsw (text_embedding vector_cosine_ops);

5. Search Integration

Enable multi-modal audio search:

  1. Text Query → Transcript Search:

    • Encode query with text model (768-d)
    • Search audio_transcript.text_embedding
    • Return matching segments with timestamps
  2. Text Query → Music Search:

    • Encode query with CLIP text encoder (512-d)
    • Search audio_segment.embedding (spectrograms)
    • Support queries like "upbeat electronic music"
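The shape of both flows is the same ranking step: embed the query, then order stored vectors by cosine similarity. In production that ranking happens inside Postgres via pgvector's HNSW index; the sketch below shows the equivalent in-memory logic with placeholder vectors standing in for CLIP or text-model encodings (no real model is loaded here):

```python
import numpy as np


def rank_by_cosine(query_vec: np.ndarray, stored: np.ndarray, top_k: int = 3):
    """Rank stored embeddings by cosine similarity to the query.

    Assumes all rows are L2-normalized (normalize_embeddings=True above),
    so a dot product equals cosine similarity.
    """
    sims = stored @ query_vec
    order = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in order]


# Placeholder 512-d vectors standing in for CLIP encodings
rng = np.random.default_rng(0)
stored = rng.normal(size=(10, 512))
stored /= np.linalg.norm(stored, axis=1, keepdims=True)

# A query slightly perturbed from row 4 should rank row 4 first
query = stored[4] + 0.01 * rng.normal(size=512)
query /= np.linalg.norm(query)

hits = rank_by_cosine(query, stored)
```

In SQL, the same ordering is `ORDER BY embedding <=> query_vector LIMIT k` against `audio_segment.embedding` or `audio_transcript.text_embedding`.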

Example Search Response:

{
    "results": [
        {
            "id": "transcript-uuid-1",
            "modality": "audio",
            "kind": "speech",
            "similarity": 0.88,
            "text": "In this tutorial, we'll learn about machine learning...",
            "parent_asset": {
                "id": "asset-uuid-1",
                "title": "ML Lecture 5.mp3"
            },
            "segment": {
                "start_time": 42.5,
                "end_time": 67.3,
                "speaker": "Speaker 1"
            }
        }
    ]
}

Technology Stack

Add to requirements.txt:

  • faster-whisper 0.10.0: Fast speech transcription
  • webrtcvad 2.0.10: Voice activity detection
  • librosa 0.10.1: Audio analysis and feature extraction
  • soundfile 0.12.1: Audio file I/O
  • pyannote.audio 3.1.0: Speaker diarization (optional)
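Pinned, the list above maps directly onto requirements.txt lines (versions taken from the bullets; pyannote.audio commented out since it is optional):

```
faster-whisper==0.10.0
webrtcvad==2.0.10
librosa==0.10.1
soundfile==0.12.1
# optional, for speaker diarization:
# pyannote.audio==3.1.0
```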

Configuration

Add to settings.py:

# Audio Processing
AUDIO_TRANSCRIPTION_ENABLED = False  # Feature flag
WHISPER_MODEL_SIZE = "base"  # tiny, base, small, medium, large-v2
WHISPER_LANGUAGE = None  # Auto-detect if None
VAD_AGGRESSIVENESS = 3  # 0-3, higher = more aggressive
AUDIO_MIN_SEGMENT_DURATION = 1.0  # seconds
AUDIO_MAX_SEGMENT_DURATION = 30.0  # seconds
MUSIC_GENRE_CLASSIFICATION = True
SPECTROGRAM_EMBEDDING_ENABLED = True
TRANSCRIPT_MIN_CONFIDENCE = 0.6

Acceptance Criteria

  • VAD segments audio into speech/non-speech regions
  • Speech segments transcribed with faster-whisper
  • Transcripts stored in audio_transcript with 768-d embeddings
  • Music/noise segments generate spectrogram embeddings (512-d)
  • Genre and mood metadata extracted for music segments
  • Search supports both transcript text and spectrogram similarity
  • Configuration flags control transcription model and VAD settings
  • Speaker diarization labels attached to transcript segments
  • E2E test: upload podcast → segments detected → speech transcribed → searchable by keywords
  • E2E test: upload music track → spectrogram embedded → searchable by mood/genre

Testing Requirements

  • Unit test: VAD segment detection
  • Unit test: speech/music/noise classification
  • Unit test: faster-whisper transcription
  • Unit test: spectrogram generation
  • Unit test: genre/mood classification
  • Integration test: full pipeline from audio upload to transcript search
  • Performance test: transcription time for 10-minute audio file
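As a sketch of the first unit test, the VAD boundary-tracking loop can be exercised without webrtcvad by factoring it over precomputed per-frame flags. The `frames_to_segments` helper is hypothetical, a refactor mirroring the loop in `detect_voice_activity`:

```python
from typing import List


def frames_to_segments(flags: List[bool], frame_sec: float) -> List[dict]:
    """Collapse per-frame speech flags into speech segments
    (hypothetical refactor of the loop in detect_voice_activity)."""
    segments, is_speech, start = [], False, 0.0
    for i, flag in enumerate(flags):
        t = i * frame_sec
        if flag and not is_speech:
            start, is_speech = t, True
        elif not flag and is_speech:
            segments.append({'start_time': start, 'end_time': t,
                             'duration': t - start, 'has_speech': True})
            is_speech = False
    if is_speech:  # close a segment running to the end of the file
        end = len(flags) * frame_sec
        segments.append({'start_time': start, 'end_time': end,
                         'duration': end - start, 'has_speech': True})
    return segments


def test_vad_boundaries():
    # speech on frames 1-2 and frame 4, with 30 ms frames
    segs = frames_to_segments([False, True, True, False, True], 0.03)
    assert len(segs) == 2
    assert abs(segs[0]['start_time'] - 0.03) < 1e-9
    assert abs(segs[0]['end_time'] - 0.09) < 1e-9
    assert abs(segs[1]['end_time'] - 0.15) < 1e-9
```

Splitting the pure boundary logic from the webrtcvad call this way also keeps the classification and transcription tests free of audio fixtures.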

References

  • docs/technical_specification.md § Media Processing Pipeline (lines 772-776, 798-802)
  • docs/technical_specification.md § Data Models & Database Schema (lines 1992-2032)
  • docs/technical_specification.md § Implementation Checklist Phase 5 (lines 2346-2352)
