Usage Guide

Complete usage documentation for speak.

Basic Usage

# Text input
speak "Hello, world!"

# File input
speak article.txt
speak document.md

# Clipboard input
speak --clipboard
speak -c

# Play audio after generation
speak "Hello!" --play

# Stream audio as it generates (for long text)
speak article.md --stream

Long Documents

For documents that might time out (generation longer than 5 minutes):

# Auto-chunk long documents for reliable generation
speak book-chapter.md --auto-chunk --output chapter.wav

# Get duration estimate before generating
speak --estimate document.md

# Resume interrupted generation
speak --resume ~/.chatter/manifest.json

# Preview without generating
speak --dry-run document.md --auto-chunk

# Keep intermediate chunks for inspection
speak document.md --auto-chunk --output doc.wav --keep-chunks

How Auto-Chunking Works

1. Text is split at sentence boundaries (~6000 characters per chunk by default).
2. Each chunk is generated and saved to disk immediately.
3. A manifest file tracks progress.
4. On completion, chunks are concatenated with sox.
5. If interrupted, use --resume to continue.
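
The first step above can be sketched in a few lines of Python. This is only an illustration of sentence-boundary chunking under a character budget, not speak's actual implementation: the function name `split_into_chunks`, the `max_chars` parameter, and the naive punctuation-based sentence splitter are all assumptions.

```python
import re


def split_into_chunks(text: str, max_chars: int = 6000) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    # Naive sentence split (assumption): break after ., ! or ? plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for i, chunk in enumerate(split_into_chunks("One. Two! Three? Four.", max_chars=10), 1):
        print(f"chunk_{i:04d}: {chunk}")
```

Because chunks never break mid-sentence, each one can be generated and saved independently, which is what makes resuming after an interruption possible.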

Batch Processing

Process multiple files at once:

# Process multiple files
speak chapter1.md chapter2.md chapter3.md --output-dir ~/Audio/book/

# Skip already-generated files
speak *.md --output-dir ~/Audio/ --skip-existing

# Stop on first error (default: continue)
speak *.md --output-dir ~/Audio/ --stop-on-error

Output files are named after input files (e.g., chapter1.md → chapter1.wav).
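
As a rough sketch of the naming rule and of what --skip-existing implies, the mapping could look like the Python below. The helper names `output_path_for` and `plan_batch` are hypothetical and are not part of speak; the sketch only assumes that the output name is the input file's stem plus `.wav`, as in the example above.

```python
from pathlib import Path


def output_path_for(input_file: str, output_dir: str) -> Path:
    """Map an input file to <output_dir>/<input stem>.wav."""
    return Path(output_dir) / f"{Path(input_file).stem}.wav"


def plan_batch(inputs: list[str], output_dir: str,
               skip_existing: bool = False) -> list[tuple[Path, Path]]:
    """Return (input, output) pairs, optionally dropping outputs that already exist."""
    plan = []
    for name in inputs:
        target = output_path_for(name, output_dir)
        if skip_existing and target.exists():
            continue  # already generated on an earlier run
        plan.append((Path(name), target))
    return plan


if __name__ == "__main__":
    for source, target in plan_batch(["chapter1.md", "chapter2.md"], "Audio/book"):
        print(source, "->", target)
```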

Concatenating Audio Files

Combine multiple audio files into one:

speak concat part1.wav part2.wav part3.wav --out combined.wav

Files are sorted naturally, so chunk_0001.wav, chunk_0002.wav, etc. are handled correctly.

Requires sox: brew install sox
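
To show what natural ordering means for the file names above, here is a minimal Python sketch of a natural sort key. It is one common way to get this ordering and is not necessarily how speak implements it; `natural_key` is a hypothetical name.

```python
import re


def natural_key(name: str) -> list:
    """Split a name into text and number parts so digit runs compare numerically."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd indices always hold the digit runs.
    return [int(part) if i % 2 else part.lower() for i, part in enumerate(parts)]


if __name__ == "__main__":
    files = ["part10.wav", "part2.wav", "part1.wav"]
    print(sorted(files))                   # plain string sort puts part10 before part2
    print(sorted(files, key=natural_key))  # natural sort: part1, part2, part10
```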

Markdown Processing

# Strip markdown syntax (default)
speak document.md --markdown plain

# Smart mode: adds [clear throat] before headers for emphasis
speak document.md --markdown smart

Code Block Handling

speak document.md --code-blocks read        # Read code verbatim (default)
speak document.md --code-blocks skip        # Skip code blocks entirely
speak document.md --code-blocks placeholder # Replace with "[code block omitted]"
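
The three modes can be pictured as a text transform applied before synthesis. The Python sketch below is an assumption about the general idea, not speak's real preprocessing: `handle_code_blocks` is a hypothetical helper, and it only recognizes simple triple-backtick fences.

```python
import re

TICKS = "`" * 3  # a markdown code fence marker
FENCE = re.compile(TICKS + r"[^\n]*\n(.*?)" + TICKS, re.DOTALL)


def handle_code_blocks(markdown: str, mode: str = "read") -> str:
    """Rewrite fenced code blocks according to the chosen mode."""
    if mode == "read":
        # Keep the code itself, drop only the fence lines.
        return FENCE.sub(lambda match: match.group(1), markdown)
    if mode == "skip":
        return FENCE.sub("", markdown)
    if mode == "placeholder":
        return FENCE.sub("[code block omitted]", markdown)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    doc = "Intro.\n" + TICKS + "python\nprint(1)\n" + TICKS + "\nOutro."
    for mode in ("read", "skip", "placeholder"):
        print(mode, "->", repr(handle_code_blocks(doc, mode)))
```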

Voice & Model Options

# List available models
speak models

# Use a specific model
speak "Hello" --model mlx-community/chatterbox-turbo-fp16

# Adjust temperature (0-1, default 0.5)
speak "Hello" --temp 0.7

# Adjust speed (0-2, default 1.0)
speak "Hello" --speed 1.2

# Voice cloning with reference audio
speak "Hello" --voice ~/voices/sample.wav

Available Models

| Model | Description |
|-------|-------------|
| `mlx-community/chatterbox-turbo-8bit` | 8-bit quantized, fastest (default) |
| `mlx-community/chatterbox-turbo-fp16` | 16-bit float (unquantized), highest quality |
| `mlx-community/chatterbox-turbo-4bit` | 4-bit quantized, smallest memory |
| `mlx-community/chatterbox-turbo-5bit` | 5-bit quantized |
| `mlx-community/chatterbox-turbo-6bit` | 6-bit quantized |

Output Options

# Output to specific file
speak "Hello" --output ~/Desktop/greeting.wav

# Output to directory (auto-generates filename)
speak "Hello" --output ~/Desktop/

# Preview mode: generate first sentence only
speak article.md --preview --play

Streaming Mode

For long text, streaming plays audio while it is still being generated:

speak article.md --stream

Features:

- Buffers 3 seconds before starting playback
- Maintains minimum 1-second buffer
- Auto-rebuffers if generation falls behind
- Press Ctrl+C to stop cleanly

Best for content longer than a few sentences.
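
The buffering behavior listed above can be approximated with a small simulation. This Python sketch is one interpretation, not speak's actual player: it assumes playback pauses whenever less than 1 second is buffered and resumes once 3 seconds are buffered again, and it does not model draining the buffer at end of stream. The function name `simulate_stream` and its parameters are hypothetical.

```python
def simulate_stream(generated_per_tick: list[float],
                    start_buffer: float = 3.0,
                    min_buffer: float = 1.0,
                    tick: float = 1.0) -> list[str]:
    """Return "play" or "buffer" for each wall-clock tick.

    generated_per_tick[i] is how many seconds of audio arrive during tick i.
    """
    buffered = 0.0
    playing = False
    states = []
    for produced in generated_per_tick:
        buffered += produced
        if not playing and buffered >= start_buffer:
            playing = True   # enough audio queued to start or resume
        elif playing and buffered < min_buffer:
            playing = False  # generation fell behind: rebuffer
        if playing:
            buffered -= min(tick, buffered)  # one tick of audio is played
        states.append("play" if playing else "buffer")
    return states


if __name__ == "__main__":
    # Generation stalls for three ticks, then catches up.
    print(simulate_stream([1.5, 1.5, 1.5, 0, 0, 0, 3.0]))
```

The gap between the start threshold and the minimum is what keeps playback from stuttering on and off when generation hovers near real time.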

Daemon Mode

Keep the TTS server running between calls for faster subsequent generations:

# Keep server running after generation
speak "Hello" --daemon --play
speak "Another phrase" --daemon --play  # Much faster!

# Check server status
speak health

# Stop the daemon when done
speak daemon kill

The server also shuts down automatically after 1 hour of inactivity.

Emotion Tags

Add expressive sounds inline with text:

speak "[sigh] I can't believe it's Monday again." --play
speak "[laugh] That's hilarious!" --play
speak "[clear throat] Welcome to the presentation." --play

Supported Tags

| Tag | Effect |
|-----|--------|
| `[laugh]` | Laughing |
| `[chuckle]` | Light chuckle |
| `[sigh]` | Sighing |
| `[gasp]` | Gasping |
| `[groan]` | Groaning |
| `[clear throat]` | Throat clearing |
| `[cough]` | Coughing |
| `[crying]` | Crying/emotional |
| `[singing]` | Sung speech |

Note: [pause] and [whisper] are NOT reliably supported. Use punctuation (periods, ellipses) for pauses.
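
Since unsupported tags are easy to slip into a script, a quick pre-flight check can help. The Python sketch below is a hypothetical helper, not a speak feature; it assumes tags must match the table exactly and treats every bracketed span as a tag, so markdown link text such as `[docs](url)` would also be reported.

```python
import re

# Tags taken from the Supported Tags table.
SUPPORTED_TAGS = {
    "laugh", "chuckle", "sigh", "gasp", "groan",
    "clear throat", "cough", "crying", "singing",
}


def unsupported_tags(text: str) -> list[str]:
    """Return bracketed tags in text that are not in the supported list."""
    found = re.findall(r"\[([^\[\]]+)\]", text)
    return [tag for tag in found if tag not in SUPPORTED_TAGS]


if __name__ == "__main__":
    line = "[sigh] Fine. [pause] Really? [whisper] Okay."
    print(unsupported_tags(line))  # the tags to replace with punctuation
```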

Performance

Benchmarked on MacBook Pro M1 Max:

| Mode | RTF | Speed | Best For |
|------|-----|-------|----------|
| Non-streaming | 0.3-0.5x | 2-3x real-time | Short text |
| Streaming | 0.5-0.8x | 1.2-2x real-time | Long text |

RTF = Real-Time Factor: generation time divided by audio duration (lower is faster)

Cold start: ~4-8s to first audio (model loading)
Warm start: ~2-4s to first audio (model cached)
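
The RTF and Speed columns above are two views of the same measurement. A minimal Python sketch, assuming the usual definition of RTF as generation time divided by audio duration (the function names are illustrative only):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF: seconds of compute spent per second of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def speed_multiple(rtf: float) -> float:
    """Speed relative to real time; the reciprocal of RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


if __name__ == "__main__":
    rtf = real_time_factor(generation_seconds=5.0, audio_seconds=10.0)
    print(f"RTF {rtf:.2f} = {speed_multiple(rtf):.1f}x real-time")
```

An RTF below 1.0 means audio is produced faster than it plays back, which is the condition streaming needs to avoid rebuffering.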
