Merged
65 commits
87c582a
feat(capture): dictation, personalities, 0.5.0
jamiepine Apr 23, 2026
0cef2c9
feat(mcp): local MCP server exposes voicebox.* tools to AI agents
jamiepine Apr 23, 2026
6b75e09
feat(mcp): Rust-owned speaking pill with self-contained audio playback
jamiepine Apr 23, 2026
868a40f
readme and dev script
jamiepine Apr 23, 2026
85a3e13
feat(capture): gate global hotkey on dictation readiness checklist
jamiepine Apr 23, 2026
ef00570
color
jamiepine Apr 23, 2026
be73a33
model download status
jamiepine Apr 23, 2026
7c50e18
progress
jamiepine Apr 23, 2026
abf5dfd
personality: bool API, i18n across the app
jamiepine Apr 24, 2026
3f1ab75
i18n: GenerationPage sidebar copy
jamiepine Apr 24, 2026
0b2a3cd
fix: BOOL import for windows crate 0.62
jamiepine Apr 24, 2026
53a7693
fix(capture): layout-aware V keycode for synthetic paste on macOS
jamiepine Apr 24, 2026
c65531b
fix(capture): cooperative app activation for synthetic paste on macOS…
jamiepine Apr 24, 2026
ea7a4e9
fix(capture): conditional clipboard restore + always-attempt on paste…
jamiepine Apr 24, 2026
27798dd
chore(deps): pin rdev to jamiepine/rdev fork
jamiepine Apr 24, 2026
239523d
fix(mcp): idle timeout + escalating backoff for speak-SSE monitor
jamiepine Apr 24, 2026
0081e97
fix(refinement): character-level loop collapse + pytest coverage
jamiepine Apr 24, 2026
c0eba9c
fix(db): graceful fallback when SQLite < 3.35 on MCP bindings migration
jamiepine Apr 24, 2026
30c2cf1
fix(mcp): correct lifespan shutdown order — drain MCP before unloadin…
jamiepine Apr 24, 2026
c9103a2
fix(mcp): stamp last_seen_at on /speak too + tighten path predicate
jamiepine Apr 24, 2026
67a9e30
feat(captures): scrubbable WaveSurfer player for capture detail view
jamiepine Apr 24, 2026
687ab2a
feat(ui): persist selectedProfileId across sessions
jamiepine Apr 24, 2026
f21778f
feat(captures): mirror readiness checklist into the settings sidebar
jamiepine Apr 24, 2026
66ca56b
feat(captures): move sidebar checklist below differences + hide when …
jamiepine Apr 24, 2026
ef63faf
fix(captures): refetch readiness immediately after STT/LLM model swap
jamiepine Apr 24, 2026
1ca7895
fix(captures): hide macOS-only copy when running on Windows / Linux
jamiepine Apr 24, 2026
c6114b6
feat(capture): swap the rdev fork for keytap 0.2, delete local chord …
jamiepine Apr 24, 2026
c7cd7fd
chore(deps): bump keytap 0.2 → 0.4 for macOS modifier-events fix
jamiepine Apr 24, 2026
271ecd9
perf(captures): stop polling readiness once both models are green
jamiepine Apr 24, 2026
2483324
feat(ui): theme settings, stories polish, track editor restructure
jamiepine Apr 24, 2026
736a661
fix mlx llm bundling
jamiepine Apr 24, 2026
b97c565
feat(ui): shared ListPane primitive + misc polish
jamiepine Apr 25, 2026
7ad91f5
fix(captures): Play As autoplay + default voice + orphan recovery
jamiepine Apr 25, 2026
5f62a0e
fix(captures+chord): Stop button stops, ChordPicker accepts shorter c…
jamiepine Apr 25, 2026
800e390
fix(settings): honor explicit null on nullable fields, ignore on the …
jamiepine Apr 25, 2026
9525bff
fix(captures): clean up audio files when create_capture fails
jamiepine Apr 25, 2026
c5b7760
fix(mcp): restrict voicebox.transcribe(audio_path=...) to loopback
jamiepine Apr 25, 2026
0aa3a8d
fix: PR review nits — response shape, landing copy, form reset
jamiepine Apr 25, 2026
51c46cd
perf(mcp): move last_seen_at stamp off the request path
jamiepine Apr 25, 2026
c113faf
fix(dictate): force-dismiss the speaking pill when SSE never comes back
jamiepine Apr 25, 2026
70ec8d9
fix: i18n cleanup + readiness checklist effect cadence + ChordPicker …
jamiepine Apr 25, 2026
6cf8da2
perf(settings): persist generation sliders on release, not per pointe…
jamiepine Apr 25, 2026
fe14df5
chore(backend): Ruff lint pass — deprecated APIs, exception leaks, de…
jamiepine Apr 25, 2026
2a93798
feat(stories): regenerate action on clips and the chat list dropdown
jamiepine Apr 25, 2026
c43f2d4
feat(stories): import external audio into the timeline (drag-drop + p…
jamiepine Apr 25, 2026
3d4d0a9
fix(audio): serve real Content-Type so imports decode in WaveSurfer
jamiepine Apr 25, 2026
4e7f8f9
feat(stories): zoom bar bounds tracked to project length, default 60s…
jamiepine Apr 25, 2026
d526a9e
fix(stories): show the source filename on imported clips
jamiepine Apr 25, 2026
935efed
fix(stories): round split_time_ms before posting
jamiepine Apr 25, 2026
9f183e5
feat(stories): per-clip volume control on the timeline
jamiepine Apr 25, 2026
e7846ea
fix(stories): mute the clip waveform's media element so it can't blee…
jamiepine Apr 25, 2026
9b3fa17
fix(stories): hard-cut the audio graph on stop so long imports actual…
jamiepine Apr 25, 2026
c7f50d5
feat(stories): add empty tracks above/below the timeline
jamiepine Apr 25, 2026
2e5b8d2
fix(mcp): bundle stdio shim sidecar
jamiepine Apr 25, 2026
1662508
fix(captures): allow dictation without paste permission
jamiepine Apr 25, 2026
29f99a8
fix(mcp): preserve speak engine defaults
jamiepine Apr 25, 2026
2a1bb6f
fix(captures): use platform hotkey defaults
jamiepine Apr 25, 2026
3f2c22b
fix(mcp): preload speak pill window
jamiepine Apr 25, 2026
2c3df38
fix(captures): hide unwired storage settings
jamiepine Apr 25, 2026
4427af9
feat(sponsors): add /sponsors page, homepage promo, and in-app strip
jamiepine Apr 25, 2026
a6a5717
style(landing): drop pill chrome from /download maintainer kicker
jamiepine Apr 25, 2026
0dbdec4
changelog
jamiepine Apr 25, 2026
2c499fe
Merge remote-tracking branch 'origin/main' into feat/capture
jamiepine Apr 25, 2026
95b0c41
better naming for sponsors
jamiepine Apr 25, 2026
2bcb98d
windows keybind note
jamiepine Apr 25, 2026
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.4.5
current_version = 0.5.0
commit = True
tag = True
tag_name = v{new_version}
3 changes: 3 additions & 0 deletions .github/workflows/build-windows.yml
@@ -29,11 +29,14 @@ jobs:
run: |
cd backend
python build_binary.py
python build_binary.py --shim

PLATFORM=$(rustc --print host-tuple)
mkdir -p ../tauri/src-tauri/binaries
cp dist/voicebox-server.exe ../tauri/src-tauri/binaries/voicebox-server-${PLATFORM}.exe
cp dist/voicebox-mcp.exe ../tauri/src-tauri/binaries/voicebox-mcp-${PLATFORM}.exe
echo "Built voicebox-server-${PLATFORM}.exe"
echo "Built voicebox-mcp-${PLATFORM}.exe"

- name: Setup Bun
uses: oven-sh/setup-bun@v2
3 changes: 3 additions & 0 deletions .github/workflows/release.yml
@@ -114,6 +114,7 @@ jobs:
run: |
cd backend
python build_binary.py
python build_binary.py --shim

# Get platform tuple
PLATFORM=$(rustc --print host-tuple)
@@ -123,7 +124,9 @@ jobs:

# Copy with platform suffix
cp dist/voicebox-server.exe ../tauri/src-tauri/binaries/voicebox-server-${PLATFORM}.exe
cp dist/voicebox-mcp.exe ../tauri/src-tauri/binaries/voicebox-mcp-${PLATFORM}.exe
echo "Built voicebox-server-${PLATFORM}.exe"
echo "Built voicebox-mcp-${PLATFORM}.exe"

- name: Setup Bun
uses: oven-sh/setup-bun@v2
11 changes: 11 additions & 0 deletions .mcp.json
@@ -0,0 +1,11 @@
{
"mcpServers": {
"voicebox": {
"type": "http",
"url": "http://127.0.0.1:17493/mcp",
"headers": {
"X-Voicebox-Client-Id": "claude-code"
}
}
}
}
87 changes: 85 additions & 2 deletions CHANGELOG.md
@@ -5,7 +5,90 @@

# Changelog

## [Unreleased]
## [0.5.0] - 2026-04-22

**The Capture release.** Voicebox stops being just a voice-cloning studio and becomes a full AI voice studio. Hold a key anywhere on your machine, speak, release — the transcript lands in the focused text field. Flip the primitive around and any MCP-aware agent — Claude Code, Cursor, Spacebot — speaks back through an on-screen pill in one of your cloned voices. A local LLM sits between the two, so transcripts come out clean and voice profiles can carry a personality that reshapes what the agent says before it gets spoken.

### Dictation — speak anywhere, paste anywhere

- **Global hotkey capture.** Hold a customizable chord anywhere on your machine (defaults: right-Cmd + right-Option on macOS, right-Ctrl + right-Shift on Windows), speak, release. A floating on-screen pill walks through recording → transcribing → refining → done with a live elapsed timer. The transcript lands as clean text.
- **Push-to-talk and toggle modes, each with its own chord.** The default toggle chord adds Space to the push-to-talk chord. Holding PTT and tapping Space mid-hold upgrades a hold into a hands-free session without a gap in the recording.
- **Auto-paste into the focused app.** Once transcription finishes, Voicebox synthesizes a paste into whatever text field had focus when you started the chord — not wherever focus drifted while you were talking. Works across Dvorak / AZERTY layouts. Your clipboard is saved before and restored after.
- **Chord picker UI.** Customize either chord from Settings → Captures by holding the keys you want. Left/right modifier badges show whether a key is the left or right variant.
- **Defaults stay out of your way.** macOS defaults avoid left-hand Cmd+Option chords so the system shortcuts they collide with stay yours. Windows defaults route around AltGr collisions on German / French / Spanish layouts.
- **Accessibility permission is scoped.** If macOS Accessibility isn't granted, dictation still runs and transcripts still land in the Captures tab — only synthetic paste is disabled. The permission prompt lives inline next to the auto-paste toggle, not as a global banner.

### Personality — voice profiles that speak for themselves

Voice profiles now carry an optional **personality** — a free-form description of who this voice is, up to 2000 characters. When set, two new controls appear next to the generate button, each powered by a new Qwen3 LLM running entirely locally:

- **Compose** — the shuffle button drops a fresh in-character line into the textarea. Click again for variety, edit before speaking.
- **Speak in character** — the wand toggle runs your input through the personality LLM before TTS, preserving every idea but delivering it in the character's voice.

The same LLM doubles as the refinement model, so there's one local LLM in the app, not two.

**API surface.** `POST /generate`, `POST /speak`, and the MCP `voicebox.speak` tool accept `personality: bool`. `POST /profiles/{id}/compose` powers the shuffle button. MCP client bindings carry a `default_personality: bool` that applies when `personality` isn't passed explicitly.
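
The fallback from an explicit `personality` argument to a binding's `default_personality` could be sketched as below. This is a minimal illustration, not the backend's actual code; the function name `effective_personality` is hypothetical, and it assumes an omitted argument arrives as `None`.

```python
from typing import Optional


def effective_personality(explicit: Optional[bool], default_personality: bool) -> bool:
    """Pick the personality flag for one speak call (hypothetical helper).

    `explicit` is None when the caller did not pass `personality` at all;
    an explicit False must still win over a binding default of True.
    """
    if explicit is None:
        return default_personality
    return explicit
```

Testing `explicit is None` rather than truthiness keeps an explicit `personality: false` from being overridden by the binding default.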

### Agents — any MCP-aware agent gets a voice

Voicebox ships a built-in **Model Context Protocol** server at `http://127.0.0.1:17493/mcp` so Claude Code, Cursor, Windsurf, Cline, VS Code MCP extensions — any MCP-aware agent — can call into your local Voicebox install. Four tools ship with dotted names:

- **`voicebox.speak`** — speak text in any voice profile, with optional `personality: true` to run through the profile's personality LLM first
- **`voicebox.transcribe`** — Whisper transcription of a base64 blob or an absolute local path. Path mode is restricted to loopback callers so a Voicebox bound on `0.0.0.0` doesn't double as an unauthenticated arbitrary-local-file read primitive.
- **`voicebox.list_captures`** — recent captures with their transcripts
- **`voicebox.list_profiles`** — available voice profiles (cloned + preset)
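
The loopback restriction on path mode described for `voicebox.transcribe` might look roughly like the following. This is a sketch under assumptions: `is_loopback_host` and `check_transcribe_args` are hypothetical names, and the real server may obtain the caller address differently.

```python
import ipaddress
from typing import Optional


def is_loopback_host(host: str) -> bool:
    """Return True when `host` is a loopback IP literal (127.0.0.0/8 or ::1)."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a hostname): treat as non-loopback.
        return False


def check_transcribe_args(client_host: str, audio_path: Optional[str]) -> None:
    """Hypothetical guard: reject path mode for callers that are not on loopback."""
    if audio_path is not None and not is_loopback_host(client_host):
        raise PermissionError("audio_path is only accepted from loopback callers")
```

Base64 mode passes `audio_path=None` and is therefore unaffected by the guard.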

- **Streamable HTTP as primary transport.** Cursor / Windsurf / VS Code / Claude Code all support it out of the box — drop a `mcpServers` block with the URL and an `X-Voicebox-Client-Id` header.
- **Stdio shim for clients that don't speak HTTP MCP.** A `voicebox-mcp` binary ships inside the app bundle as a Tauri sidecar. The Settings page renders the install snippet with the right absolute path pre-filled.
- **Per-client voice binding.** Pin Claude Code to Morgan, Cursor to Scarlett, Cline to its own voice — the `X-Voicebox-Client-Id` header resolves to a bound voice whenever `speak` is called without an explicit `profile`. Managed in **Settings → MCP**.
- **Profile resolution precedence.** Explicit `profile` arg (name or id, case-insensitive) → per-client binding → global default from `capture_settings.default_playback_voice_id` → error with a pointer to Settings.
- **Speaking pill.** Agent-initiated speech surfaces the same on-screen pill as dictation, in a `speaking` state with the profile name and an elapsed timer. Silent background TTS is a trust hazard — the pill always shows what's coming out of your machine.
- **`POST /speak` REST wrapper.** Same code path and voice resolution for shell scripts, ACP, A2A, GitHub Actions, or anything else that isn't MCP-native.
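
The profile resolution precedence listed above can be illustrated with a short sketch. All names here (`resolve_profile`, `NoVoiceConfigured`, the mapping shapes) are hypothetical. Treating an unknown explicit profile as an error rather than falling through to the binding is also an assumption.

```python
from typing import Mapping, Optional


class NoVoiceConfigured(LookupError):
    """Raised when no step of the chain yields a profile (hypothetical)."""


def resolve_profile(
    explicit: Optional[str],
    client_id: Optional[str],
    profiles: Mapping[str, str],         # assumed shape: profile id -> display name
    client_bindings: Mapping[str, str],  # assumed shape: client id -> profile id
    default_voice_id: Optional[str],
) -> str:
    # 1. Explicit `profile` argument, matched by id or name, case-insensitively.
    if explicit:
        wanted = explicit.casefold()
        for pid, name in profiles.items():
            if wanted in (pid.casefold(), name.casefold()):
                return pid
        raise NoVoiceConfigured(f"unknown profile: {explicit!r}")
    # 2. Per-client binding keyed by the X-Voicebox-Client-Id header value.
    if client_id and client_id in client_bindings:
        return client_bindings[client_id]
    # 3. Global default playback voice.
    if default_voice_id:
        return default_voice_id
    # 4. Nothing resolved: point the caller at Settings.
    raise NoVoiceConfigured("no voice resolved; set a default voice in Settings")
```

For example, with `client_bindings = {"claude-code": "p1"}`, a call from `claude-code` without a `profile` argument resolves to `p1`, while passing `profile="Scarlett"` overrides the binding.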

**Claude Code one-liner:**

```
claude mcp add voicebox --transport http --url http://127.0.0.1:17493/mcp --header "X-Voicebox-Client-Id: claude-code"
```

### Refinement

A clean transcript needs more than Whisper. Each capture flows through a small Qwen3 LLM that strips fillers, fixes punctuation, and optionally rewrites self-corrections — all on-device.

- **Loop-stripping before the LLM sees the transcript.** Whisper's "thanks for watching thanks for watching thanks for watching…" hallucination loops are collapsed at a six-identical-tokens threshold (case-insensitive) so a small refinement model can't echo them back. Coverage spans single-word runs, multi-word phrases, CJK character runs, and Japanese emphasis patterns; legitimate repetition ("no, no, no, no, no") doesn't cross the threshold.
- **Per-capture flag snapshot.** `smart_cleanup`, `self_correction`, and `preserve_technical` are stored on each capture, so refinement can be re-run later with different flags without losing the raw transcript.
- **Model picker** — Qwen3 0.6B (400 MB, very fast), 1.7B (1.1 GB, fast), 4B (2.5 GB, full quality). 0.6B is the default; 1.7B is the sweet spot for transcripts with code identifiers.
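
A token-level sketch of the loop-stripping idea follows. It is not the project's implementation: `collapse_loops` and `max_phrase` are hypothetical, trailing punctuation is ignored when comparing tokens, the threshold is read as six consecutive repetitions of the same unit, and the character-level CJK handling mentioned above is not covered.

```python
import string


def collapse_loops(text: str, threshold: int = 6, max_phrase: int = 4) -> str:
    """Collapse runs of a repeated word or short phrase down to one occurrence."""
    tokens = text.split()
    # Compare case-insensitively and without surrounding punctuation (assumption).
    keys = [t.strip(string.punctuation).casefold() for t in tokens]
    out = []
    i = 0
    while i < len(tokens):
        collapsed = False
        for n in range(1, max_phrase + 1):
            phrase = keys[i:i + n]
            if len(phrase) < n:
                break
            # Count how many times the n-token phrase repeats back to back.
            reps = 1
            while keys[i + reps * n:i + (reps + 1) * n] == phrase:
                reps += 1
            if reps >= threshold:
                out.extend(tokens[i:i + n])  # keep a single copy
                i += reps * n
                collapsed = True
                break
        if not collapsed:
            out.append(tokens[i])
            i += 1
    return " ".join(out)
```

Runs below the threshold, such as five repetitions of "no", pass through unchanged.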

### Captures tab + settings

Settings → Captures is now the home for the whole dictation flow:

- **Dictation**: global shortcut toggle, push-to-talk chord picker, toggle chord picker, live pill preview, auto-paste into focused field (with inline accessibility prompt).
- **Transcription**: model picker (Whisper Base / Small / Medium / Large / Turbo), language lock.
- **Refinement**: auto-refine toggle, model picker, smart cleanup, remove self-corrections, preserve technical terms.
- **Playback**: default voice for the Captures tab's "Play as" action — picking a voice from the split-button persists the choice across tab switches and restarts.
- **Storage**: captures folder quick-open.

### Stories — timeline editor

The Stories tab graduates from a TTS sequencer into a real timeline editor. Same generation-row backing, but clips now compose with imported audio, per-clip levels, and a flexible track stack.

- **Import external audio.** Drag a music file onto the story content area or pick one from the new "Import audio" entry in the add-clip popover. Accepted formats: wav / mp3 / flac / ogg / m4a / aac / webm, capped at 200 MB. Imported clips show their filename instead of a profile name and skip the regenerate / version-picker controls — there's nothing to regenerate.
- **Per-clip volume.** A `Volume2` icon in the clip-edit toolbar opens a 0–200% slider. Adjustments apply live and to exports. Split and duplicate carry the volume forward into the new clips.
- **Regenerate** from both the clip's chat-list dropdown and the track-editor toolbar. Re-runs the underlying generation through the same path the History tab uses, with completion tracked in the global pending set.
- **Add empty tracks above or below the timeline** via tiny `+` strips at the top of the topmost label cell and the bottom of the bottommost. Sticky in the label column so they follow horizontal scroll.
- **Zoom bar tracks the project.** Min scope is 10 seconds visible (zoomed in cap), max is the entire project (zoomed out cap), default lands on 60 s. Both the +/− buttons and the scrollbar edge-drag handles clamp to those dynamic bounds.
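
The zoom bounds can be expressed as a simple clamp. This is a sketch, assuming the visible span is measured in seconds; the names are hypothetical, and the choice to keep a 10 s floor for projects shorter than 10 s is an assumption the changelog does not state.

```python
MIN_VISIBLE_S = 10.0      # zoomed-in cap from the changelog
DEFAULT_VISIBLE_S = 60.0  # default visible span from the changelog


def clamp_visible_seconds(requested_s: float, project_length_s: float) -> float:
    """Clamp the visible timeline span between 10 s and the project length."""
    # Assumption: projects shorter than the minimum still show MIN_VISIBLE_S.
    upper = max(project_length_s, MIN_VISIBLE_S)
    return min(max(requested_s, MIN_VISIBLE_S), upper)
```

With these bounds, the default of 60 s on a 30 s project resolves to 30 s, which matches the "entire project" zoomed-out cap.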

### Interface

- **Theme selector.** Light / dark / system in **Settings → General**, persisted across sessions. System mode listens for OS-level appearance changes and flips live without a restart.
- **Scrubbable waveform player on captures.** The capture detail card now embeds a WaveSurfer waveform with click-to-seek and a current / total timestamp pair, replacing the static duration label.
- **Capture pill light mode.** The on-screen pill gets a dedicated light palette so it stays legible against bright windows.
- **Readiness checklist in the Captures settings sidebar.** The same six-gate checklist the Captures empty state uses mirrors into Settings → Captures so a red gate can't hide behind a green toggle. Hidden once every gate is green. macOS-only rows (Input Monitoring, Accessibility) hide entirely on Windows and Linux.

### Windows parity

Same dictation flow on Windows. Right-hand default chord (Ctrl+Shift) avoids AltGr collisions on layouts where Ctrl+Alt is the compose key. Focus is captured at chord-start so paste lands in the original field even if focus drifts during transcribe/refine.

## [0.4.5] - 2026-04-22

@@ -657,7 +740,7 @@ The first public release of Voicebox — an open-source voice synthesis studio p

Tauri v2, React, TypeScript, Tailwind CSS, FastAPI, Qwen3-TTS, Whisper, SQLite

[Unreleased]: https://github.com/jamiepine/voicebox/compare/v0.4.5...HEAD
[0.5.0]: https://github.com/jamiepine/voicebox/compare/v0.4.5...v0.5.0
[0.4.5]: https://github.com/jamiepine/voicebox/compare/v0.4.4...v0.4.5
[0.4.4]: https://github.com/jamiepine/voicebox/compare/v0.4.3...v0.4.4
[0.4.3]: https://github.com/jamiepine/voicebox/compare/v0.4.2...v0.4.3