This is a Runpod serverless worker for audio transcription using NVIDIA's Parakeet ASR model. The service accepts audio files (URLs or local paths) and returns transcribed text with optional timestamps.
- handler.py: Main serverless worker implementation
  - handler(): Entry point that processes transcription requests
  - transcribe_batched(): Smart batching strategy for efficient GPU utilization
  - _fetch_to_wav(): Converts any audio format to 16kHz mono WAV using ffmpeg
  - _maybe_set_local_attention(): Dynamically switches attention mechanism based on audio length
  - _bucketize_by_duration(): Separates short vs long audio files
  - _make_batches(): Packs short audio files into efficient batches
The service implements a hybrid processing approach:
- Short audio (≤10 min by default): Batched together for parallel GPU processing
- Long audio (>10 min): Processed sequentially to manage memory
- Batch constraints: Max 16 items or 20 minutes total duration per batch
- Results are returned in the original input order
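The bucketing and packing described above can be sketched as follows (a minimal illustration using the documented defaults; the actual _bucketize_by_duration/_make_batches bodies in handler.py may differ):

```python
SHORT_MAX_SEC = 600         # ≤10 min counts as "short"
BATCH_MAX_ITEMS = 16        # cap on items per batch
BATCH_MAX_TOTAL_SEC = 1200  # cap on summed duration per batch (20 min)

def bucketize_by_duration(items):
    """Split (index, duration_sec) pairs into short and long buckets."""
    shorts = [it for it in items if it[1] <= SHORT_MAX_SEC]
    longs = [it for it in items if it[1] > SHORT_MAX_SEC]
    return shorts, longs

def make_batches(shorts):
    """Greedily pack short items into batches under both caps."""
    batches, current, total = [], [], 0.0
    for item in shorts:
        _, dur = item
        if current and (len(current) >= BATCH_MAX_ITEMS
                        or total + dur > BATCH_MAX_TOTAL_SEC):
            batches.append(current)   # flush the full batch
            current, total = [], 0.0
        current.append(item)
        total += dur
    if current:
        batches.append(current)
    return batches
```

Each item carries its original index through bucketing, which is what lets results be restored to input order at the end.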
The service uses conditional model loading based on environment:
- Production (SKIP_MODEL_LOAD=0): Loads the NVIDIA Parakeet TDT model via NeMo
- Development (SKIP_MODEL_LOAD=1): Dry-run mode without loading heavy dependencies
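A sketch of how this gate is typically implemented (the NeMo call mirrors the public ASRModel.from_pretrained API; the exact structure in handler.py may differ):

```python
import os

_MODEL = None

def load_model():
    """Return the ASR model, or None in dry-run mode (SKIP_MODEL_LOAD=1)."""
    global _MODEL
    if os.environ.get("SKIP_MODEL_LOAD", "0") == "1":
        return None  # development: skip heavy NeMo/CUDA dependencies
    if _MODEL is None:
        import nemo.collections.asr as nemo_asr  # deferred heavy import
        name = os.environ.get("PARAKEET_MODEL", "nvidia/parakeet-tdt-0.6b-v3")
        _MODEL = nemo_asr.models.ASRModel.from_pretrained(model_name=name)
    return _MODEL
```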
- Accepts inputs via Runpod event format: {"input": {"timestamps": bool, "inputs": [{"source": "audio_url"}]}}
- Pre-converts all audio to WAV format and measures durations
- Bucketizes into short/long based on duration threshold
- Processes shorts in batches, longs sequentially
- Returns transcriptions with duration and optional word/segment timestamps in original order
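The order guarantee in the flow above amounts to tagging each input with its position before bucketing and writing results back by that tag. A toy illustration (the `transcribe` callable here is a hypothetical stand-in for the real model call):

```python
def run_in_order(sources, durations, transcribe, short_max_sec=600):
    """Process shorts first, then longs, but return results in input order."""
    indexed = list(enumerate(durations))
    shorts = [(i, d) for i, d in indexed if d <= short_max_sec]
    longs = [(i, d) for i, d in indexed if d > short_max_sec]
    results = [None] * len(sources)
    # Processing order (shorts then longs) differs from input order;
    # writing by original index i restores it.
    for i, d in shorts + longs:
        results[i] = {"text": transcribe(sources[i]), "duration_sec": d}
    return results
```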
# Test without loading model (macOS/local dev)
export SKIP_MODEL_LOAD=1
python test_local.py
# Or run handler directly
export SKIP_MODEL_LOAD=1
python handler.py

# Build container
docker build -t parakeet-worker .
# Run locally with dry-run mode
docker run -e SKIP_MODEL_LOAD=1 -e RUNPOD_SERVERLESS=0 parakeet-worker
# Run with model loaded (requires NVIDIA GPU)
docker run --gpus all -e RUNPOD_SERVERLESS=0 parakeet-worker

pip install -r requirements.txt
# For full model support (requires CUDA):
pip install "nemo_toolkit[asr]>=2.4.0" soundfile librosa

- PARAKEET_MODEL: ASR model name (default: "nvidia/parakeet-tdt-0.6b-v3")
- SKIP_MODEL_LOAD: Set to "1" to skip model loading for local development
- RUNPOD_SERVERLESS: Set to "1" to start the Runpod serverless worker
- SHORT_MAX_SEC: Duration threshold for batching (default: 600 seconds / 10 minutes)
- BATCH_MAX_ITEMS: Maximum items per batch (default: 16)
- BATCH_MAX_TOTAL_SEC: Maximum total duration per batch (default: 1200 seconds / 20 minutes)
- LOCAL_ATTENTION_AFTER_SEC: Switch to local attention after this duration (default: 1440 seconds / 24 minutes)
- HF_HOME/TRANSFORMERS_CACHE: Hugging Face cache directories
- The service dynamically switches between global and local attention based on audio duration
- Short audio files are batched for efficient GPU utilization while long files are processed sequentially
- All audio is converted to 16kHz mono WAV format for consistent processing
- Temporary WAV files are cleaned up after processing
- ffmpeg and ffprobe are required system dependencies
- Results maintain original input order regardless of batching
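The 16kHz mono conversion noted above maps onto a standard ffmpeg invocation. A sketch of building and running that command (the real _fetch_to_wav also handles URL downloads and measures duration with ffprobe):

```python
import subprocess

def ffmpeg_to_wav_cmd(src, dst):
    """Build an ffmpeg command that converts any input to 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",   # overwrite output without prompting
        "-i", src,        # input: any format/container ffmpeg supports
        "-ac", "1",       # downmix to mono
        "-ar", "16000",   # resample to 16 kHz
        "-f", "wav", dst,
    ]

def fetch_to_wav(src, dst):
    """Run the conversion; raises CalledProcessError on ffmpeg failure."""
    subprocess.run(ffmpeg_to_wav_cmd(src, dst), check=True, capture_output=True)
```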