Whisper Transcriber

A Python tool that transcribes audio files with OpenAI's Whisper large-v3 model. It supports multiple audio formats and can add timestamps to the output.

Features

Uses OpenAI's Whisper large-v3 model for high-quality transcription
Supports multiple audio formats (MP3, WAV, FLAC, AAC, OGG, M4A, WMA)
Multilingual transcription support
Optional timestamp generation
Batch processing with wildcard support
Configurable per-file processing timeout
Outputs transcriptions in Markdown format

Setup

Clone or download this repository
Run the setup script:
```
chmod +x setup.sh
./setup.sh
```

Or manually:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Usage

Basic Usage

source venv/bin/activate
python3 convert.py -i input_audio.mp3 -o output_folder/

Advanced Usage

# With timestamps and verbose output
python3 convert.py -i "audio file.mp3" -o "./output folder/" -ts -v

# Multiple files with wildcards
python3 convert.py -i *.mp3 -o ./output/ -ts

# Multiple specific files
python3 convert.py -i file1.wav file2.mp3 file3.flac -o ./output/

# With timeout (300 seconds per file)
python3 convert.py -i *.mp3 -o ./output/ --timeout 300

# With chunked processing (60-second chunks) and Flash Attention
python3 convert.py -i audio.mp3 -o ./output/ -ch 60 --flash-attn

# Default chunked processing (30-second chunks)
python3 convert.py -i *.wav -o ./output/ -ch -ts -v

# With translation to English
python3 convert.py -i foreign_audio.mp3 -o ./output/ -tr en -ts -v

# Translate Spanish audio to English with Flash Attention
python3 convert.py -i spanish_lecture.mp3 -o ./output/ -tr en --flash-attn

# With automatic enhancement using Gemini API
python3 convert.py -i audio.mp3 -o ./output/ -ts -e

# With custom enhancement prompt
python3 convert.py -i lecture.mp3 -o ./output/ -e "Focus on technical terms and improve structure"

Parameters

Required Parameters

-i, --input: Input audio file path(s). Supports wildcards and multiple files
-o, --output: Output directory path for markdown files
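
On shells that do not expand wildcards themselves (for example, a quoted pattern or the Windows command prompt), the patterns passed to -i have to be expanded by the program. The following is a minimal sketch of one way to do that; `expand_inputs` and `SUPPORTED_EXTENSIONS` are hypothetical names and are not taken from convert.py.

```python
import glob
from pathlib import Path

# Extensions taken from the Features list; the set name itself is an assumption.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".aac", ".ogg", ".m4a", ".wma"}


def expand_inputs(patterns):
    """Expand wildcard patterns into a sorted, de-duplicated list of audio paths."""
    matches = set()
    for pattern in patterns:
        # glob.glob returns an empty list when nothing matches; a plain
        # existing path without glob metacharacters matches itself.
        for name in glob.glob(pattern):
            path = Path(name)
            if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
                matches.add(path)
    return sorted(matches)
```

Filtering by extension keeps a broad pattern such as `*` from picking up non-audio files in the same folder.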

Optional Parameters

-ts, --timestamp: Include timestamps in the transcription
-v, --verbose: Print processing details
-to, --timeout: Timeout in seconds for processing each file
-ch, --chunked: Enable chunked long-form processing; takes an optional chunk length in seconds (default: 30)
--flash-attn: Enable Flash Attention 2 for faster processing on compatible GPUs
-tr, --translate: Target language for translation, given as an ISO 639-1 two-letter code (e.g., "en", "es", "fr")
-e, --enhance: Enhance the transcript with the Gemini API; takes an optional custom prompt
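
For readers who want to see how flags with optional values (-ch, -e) can behave, here is a sketch of an argparse declaration that matches the parameters listed above. It is an illustration only: `build_parser` is a hypothetical name, and the actual parser in convert.py may be declared differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        prog="convert.py", description="Transcribe audio files to Markdown."
    )
    # One or more input paths; the shell usually expands wildcards first.
    parser.add_argument("-i", "--input", nargs="+", required=True)
    parser.add_argument("-o", "--output", required=True)
    parser.add_argument("-ts", "--timestamp", action="store_true")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("-to", "--timeout", type=int, default=None)
    # nargs="?" with const=30: a bare "-ch" means 30-second chunks.
    parser.add_argument("-ch", "--chunked", nargs="?", type=int, const=30, default=None)
    parser.add_argument("--flash-attn", action="store_true")
    parser.add_argument("-tr", "--translate", default=None)
    # A bare "-e" yields "" (use a default prompt); omitted yields None.
    parser.add_argument("-e", "--enhance", nargs="?", const="", default=None)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this layout, `args.chunked is None` means chunking is off, and `args.enhance is None` means enhancement was not requested.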

Output

The tool generates Markdown files (.md) in the specified output directory. Each output file has the same name as the input audio file but with a .md extension.
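
The naming rule above can be sketched in a few lines. `output_path_for` is a hypothetical helper name used only for this illustration.

```python
from pathlib import Path


def output_path_for(input_file, output_dir):
    """Map an input audio path to <output_dir>/<input stem>.md."""
    return Path(output_dir) / (Path(input_file).stem + ".md")


if __name__ == "__main__":
    print(output_path_for("recordings/example_audio.mp3", "output"))
```

Under this rule, two inputs that differ only by extension (for example `talk.mp3` and `talk.wav`) would map to the same `talk.md`, so it is worth keeping such files in separate output folders.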

Sample Output Format

# Transcript: example_audio

**Source:** example_audio.mp3

## Content

[Without timestamps]
This is the transcribed content of the audio file...

[With timestamps (-ts flag)]
[00:00:00 - 00:00:05] This is the transcribed content
[00:00:05 - 00:00:10] of the audio file with timestamps...
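
As a rough sketch of how the timestamped layout above could be produced, the snippet below formats `(start, end, text)` segments in seconds into `[HH:MM:SS - HH:MM:SS]` lines. The function names and the segment tuple shape are assumptions made for the example, not the tool's internal format.

```python
def format_timestamp(seconds):
    """Format a time offset in seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_transcript(name, source, segments):
    """Render segments as Markdown in the sample layout shown above."""
    lines = [f"# Transcript: {name}", "", f"**Source:** {source}", "", "## Content", ""]
    for start, end, text in segments:
        stamp = f"[{format_timestamp(start)} - {format_timestamp(end)}]"
        lines.append(f"{stamp} {text.strip()}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [(0, 5, "This is the transcribed content"), (5, 10, "of the audio file")]
    print(render_transcript("example_audio", "example_audio.mp3", demo))
```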

System Requirements

Python 3.8 or higher
CUDA-compatible GPU (optional, for faster processing)
Sufficient disk space for model downloads (~3GB for Whisper large-v3)

For M4A/AAC Support (Optional)

If you need to process M4A/AAC files, install system ffmpeg:

Ubuntu/Debian:

sudo apt update
sudo apt install ffmpeg

macOS:

brew install ffmpeg

Windows: Download from https://ffmpeg.org/download.html and add to PATH
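
To confirm that ffmpeg is reachable before processing M4A/AAC input, a check along these lines can help. This is a sketch with hypothetical helper names; it only tests whether an `ffmpeg` executable is on PATH, not whether it can decode a given file.

```python
import shutil
from pathlib import Path

# Formats that this README says need a system ffmpeg.
FFMPEG_FORMATS = {".m4a", ".aac"}


def needs_ffmpeg(path):
    """Return True if the file extension is one that requires ffmpeg."""
    return Path(path).suffix.lower() in FFMPEG_FORMATS


def ffmpeg_available():
    """Return True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


def check_input(path):
    """Return an error message if the file cannot be handled, else None."""
    if needs_ffmpeg(path) and not ffmpeg_available():
        return f"{path}: M4A/AAC input requires ffmpeg on PATH"
    return None
```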

Dependencies

Core Dependencies

transformers>=4.52.0
torch>=2.2.0
librosa>=0.10.0
pydub>=0.25.0
ffmpeg-python>=0.2.0
soundfile>=0.12.0

Optional Performance Dependencies

ninja>=1.11.0 (build system for Flash Attention)
psutil>=7.0.0 (system utilities for Flash Attention)
flash-attn>=2.7.4 (Flash Attention 2 for GPU acceleration)

Optional Enhancement Dependencies

google-generativeai>=0.8.0 (Google Gemini API for transcript enhancement)

Flash Attention 2 Requirements (Optional)

Flash Attention 2 can significantly speed up processing on supported GPUs.

GPU Requirements:

NVIDIA GPUs: Ampere, Ada, or Hopper architecture (RTX 3090, RTX 4070/4090, A100, H100, etc.)
CUDA >= 12.3 (recommended: CUDA 12.8)
AMD GPUs: MI200 or MI300 with ROCm 6.0+

Installation:

# Install core dependencies first
pip install -r requirements.txt

# Install Flash Attention 2 (requires --no-build-isolation)
pip install flash-attn --no-build-isolation

Note: Flash Attention 2 installation may take 5-10 minutes as it compiles from source. If installation fails, the tool will automatically fall back to standard attention.
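
The fallback described in the note can be pictured with the sketch below. It checks whether the `flash_attn` package is importable and returns a value suitable for the `attn_implementation` argument of Hugging Face Transformers' `from_pretrained`. The helper name is hypothetical, and using `"sdpa"` as the "standard attention" fallback is an assumption; the tool may choose a different implementation and may also check the GPU itself.

```python
import importlib.util


def pick_attn_implementation(use_flash_attn):
    """Choose an attention backend name, falling back when flash-attn is missing."""
    if use_flash_attn and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumed fallback: PyTorch scaled dot-product attention.
    return "sdpa"


if __name__ == "__main__":
    print(pick_attn_implementation(use_flash_attn=True))
```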

Transcript Enhancement with Gemini API (Optional)

The enhancement feature uses Google's Gemini 2.0 Flash Experimental model to improve transcript quality:

Setup:

Get a Google AI API key from Google AI Studio

Set the environment variable:

export GEMINI_API_KEY='your-api-key-here'

Install the dependency:
```
pip install "google-generativeai>=0.8.0"
```
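
A quick way to verify that the key from the setup steps is visible to Python is sketched below. `load_gemini_api_key` is a hypothetical helper, not a function from enhance.py; only the `GEMINI_API_KEY` variable name comes from this README.

```python
import os


def load_gemini_api_key(environ=os.environ):
    """Return the Gemini API key from the environment, or raise if it is missing."""
    key = environ.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it before using -e/--enhance"
        )
    return key
```

Failing early with a clear message avoids running a long transcription only to have the enhancement step fail at the end.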

Enhancement Features:

Grammar and punctuation correction
Improved sentence structure and readability
Technical term correction based on context
Removal of excessive filler words
Better formatting with headings and structure
Optional translation to target language

Usage:

# Basic enhancement
python3 convert.py -i audio.mp3 -o ./output/ -e

# Enhancement with custom prompt
python3 convert.py -i lecture.mp3 -o ./output/ -e "Focus on technical accuracy"

# Enhancement with translation
python3 convert.py -i spanish_audio.mp3 -o ./output/ -tr en -e

Standalone Enhancement: You can also enhance existing transcripts using enhance.py directly:

python3 enhance.py -i transcript.md -o enhanced.md -v
python3 enhance.py -i transcript.md -o enhanced.md -tr es

Rate Limits: The free tier allows 5 requests per minute. The tool handles rate limiting automatically.
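
How the tool handles the limit internally is not documented here. As one possible approach, the sketch below spaces calls evenly so that no more than 5 start per minute; the class name and the injectable `clock` and `sleep` parameters are assumptions made for the example.

```python
import time


class MinIntervalLimiter:
    """Space calls at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute=5, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / max_per_minute
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

Calling `limiter.wait()` before each API request keeps a batch job under the limit; a production version would likely also retry when the API reports that the quota was exceeded.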

Notes

First run will download the Whisper large-v3 model (~3GB)
GPU acceleration is automatically used if available
Processing time depends on audio length and hardware capabilities
The tool handles various audio sample rates automatically
MP3, WAV, and FLAC files work out of the box
M4A/AAC files require system ffmpeg installation
