Complete usage documentation for speak.
# Text input
speak "Hello, world!"
# File input
speak article.txt
speak document.md
# Clipboard input
speak --clipboard
speak -c
# Play audio after generation
speak "Hello!" --play
# Stream audio as it generates (for long text)
speak article.md --streamFor documents that might timeout (>5 min generation):
# Auto-chunk long documents for reliable generation
speak book-chapter.md --auto-chunk --output chapter.wav
# Get duration estimate before generating
speak --estimate document.md
# Resume interrupted generation
speak --resume ~/.chatter/manifest.json
# Preview without generating
speak --dry-run document.md --auto-chunk
# Keep intermediate chunks for inspection
speak document.md --auto-chunk --output doc.wav --keep-chunks- Text is split at sentence boundaries (~6000 chars per chunk by default)
- Each chunk is generated and saved to disk immediately
- A manifest file tracks progress
- On completion, chunks are concatenated with sox
- If interrupted, use
--resumeto continue
Process multiple files at once:
# Process multiple files
speak chapter1.md chapter2.md chapter3.md --output-dir ~/Audio/book/
# Skip already-generated files
speak *.md --output-dir ~/Audio/ --skip-existing
# Stop on first error (default: continue)
speak *.md --output-dir ~/Audio/ --stop-on-errorOutput files are named after input files (e.g., chapter1.md → chapter1.wav).
Combine multiple audio files into one:
speak concat part1.wav part2.wav part3.wav --out combined.wavFiles are sorted naturally, so chunk_0001.wav, chunk_0002.wav, etc. are handled correctly.
Requires sox: brew install sox
# Strip markdown syntax (default)
speak document.md --markdown plain
# Smart mode: adds [clear throat] before headers for emphasis
speak document.md --markdown smartspeak document.md --code-blocks read # Read code verbatim (default)
speak document.md --code-blocks skip # Skip code blocks entirely
speak document.md --code-blocks placeholder # Replace with "[code block omitted]"# List available models
speak models
# Use a specific model
speak "Hello" --model mlx-community/chatterbox-turbo-fp16
# Adjust temperature (0-1, default 0.5)
speak "Hello" --temp 0.7
# Adjust speed (0-2, default 1.0)
speak "Hello" --speed 1.2
# Voice cloning with reference audio
speak "Hello" --voice ~/voices/sample.wav| Model | Description |
|---|---|
mlx-community/chatterbox-turbo-8bit |
8-bit quantized, fastest (default) |
mlx-community/chatterbox-turbo-fp16 |
Full precision, highest quality |
mlx-community/chatterbox-turbo-4bit |
4-bit quantized, smallest memory |
mlx-community/chatterbox-turbo-5bit |
5-bit quantized |
mlx-community/chatterbox-turbo-6bit |
6-bit quantized |
# Output to specific file
speak "Hello" --output ~/Desktop/greeting.wav
# Output to directory (auto-generates filename)
speak "Hello" --output ~/Desktop/
# Preview mode: generate first sentence only
speak article.md --preview --playFor long text, streaming plays audio as it generates:
speak article.md --streamFeatures:
- Buffers 3 seconds before starting playback
- Maintains minimum 1-second buffer
- Auto-rebuffers if generation falls behind
- Press Ctrl+C to stop cleanly
Best for content longer than a few sentences.
Keep the TTS server running between calls for faster subsequent generations:
# Keep server running after generation
speak "Hello" --daemon --play
speak "Another phrase" --daemon --play # Much faster!
# Check server status
speak health
# Stop the daemon when done
speak daemon killThe server also auto-shuts down after 1 hour of idle.
Add expressive sounds inline with text:
speak "[sigh] I can't believe it's Monday again." --play
speak "[laugh] That's hilarious!" --play
speak "[clear throat] Welcome to the presentation." --play| Tag | Effect |
|---|---|
[laugh] |
Laughing |
[chuckle] |
Light chuckle |
[sigh] |
Sighing |
[gasp] |
Gasping |
[groan] |
Groaning |
[clear throat] |
Throat clearing |
[cough] |
Coughing |
[crying] |
Crying/emotional |
[singing] |
Sung speech |
Note: [pause] and [whisper] are NOT reliably supported. Use punctuation (periods, ellipses) for pauses.
Benchmarked on MacBook Pro M1 Max:
| Mode | RTF | Speed | Best For |
|---|---|---|---|
| Non-streaming | 0.3-0.5x | 2-3x real-time | Short text |
| Streaming | 0.5-0.8x | 1.2-2x real-time | Long text |
RTF = Real-Time Factor (lower is faster)
- Cold start: ~4-8s to first audio (model loading)
- Warm start: ~2-4s to first audio (model cached)