A voice-driven assistant wrapped in an Iron-Man holographic HUD. It talks back in a British butler voice, sees through your webcam, generates images and 3D models on your own GPU, remembers everything across sessions, and commands your PC through 80+ tools — all running on your machine. No cloud required.
Your own Iron-Man AI — voice, vision, image + 3D gen, 80+ tools, 100% on your hardware.
⭐ Star it if you'd run this on your machine.
J.A.R.V.I.S. is a desktop AI command center for Windows. Point it at any OpenAI-compatible model (local or cloud), and you get a cinematic, full-screen holographic interface that you talk to like Tony Stark talks to his lab.
It is not a chat box. It is a window manager built out of tools — every action happens inside the UI. Ask it to generate an image and the picture materializes in the Media Bay. Ask it to look at you and the webcam opens. Ask what it remembers and a 3D galaxy of your data unfolds. Say "Hey JARVIS" and it wakes; say "cancel" and it stops talking.
You: "Hey JARVIS, generate a chrome helmet on a black background, then make it 3D."
JARVIS: "Right away, sir." → FLUX renders the image → TRELLIS lifts it into a
rotatable GLB hologram in the Media Bay. ~Seconds, fully local, on your GPU.
Cloud assistants rent you intelligence and read your data. JARVIS doesn't.
- 🔒 100% local-first — your voice, webcam, files, and memory never leave the machine. No telemetry, no account, no "your chats may be reviewed to improve our models."
- 💸 No subscription — bring a model you already run (Ollama · LM Studio · vLLM) or a key you control. No monthly seat, no per-token meter.
- 🧰 It actually does things — not a chat window. It runs code, drives a real browser, sees through your camera, generates images and 3D models, and remembers — across 80+ tools.
- 🎬 It feels like the films — wake word, British-butler voice, holographic HUD, arc-reactor core. Built to be lived in, not just demoed once.
- 🪟 Yours to hack — open, plugin-based, fully inspectable. Add a new tool in a single file.
The whole point: an assistant as capable as the cloud ones — that you actually own.
| 🎙️ Full voice loop | Wake-word ("Hey JARVIS"), server-side Whisper STT, British-butler TTS, barge-in cancel, time-aware greetings. |
| 🧠 Real memory | Semantic cross-session recall via embeddings, notes, fixes, paths, prefs — visualized as a 3D Neural Galaxy. |
| 🖼️ Local image gen | FLUX.2-klein-4B (Q6_K GGUF, 4 steps) renders in seconds on a 16 GB GPU — no API, no watermark. |
| 🧊 Local image→3D | TRELLIS.2-4B FP8 / Hunyuan3D-2 mini turn any image into a textured GLB, rendered inline. |
| 👁️ Vision | One-shot webcam analysis, live preview, screen capture, OCR, drag-and-drop image Q&A. |
| 🌐 Web + browser | Web/image/video search as cards, readable page fetch, and a visible Playwright-driven Chromium. |
| 🛠️ Owns your PC | Files, shell, PowerShell, Python/Node sandboxes, package installs, live CPU/GPU/RAM/disk/net telemetry. |
| 🤖 Agent swarm | A 9-role agent matrix with an autonomous plan→execute→reflect loop, plus delegation to Claude Code / Codex / OpenCode. |
| 🪟 The HUD | Arc-reactor core, voice-reactive orb, parallax depth, boot sequence, live terminal — React + Three.js. |
| 🔒 Safe by design | SAFE / CAUTION / DANGEROUS tool tiers, a hard denylist, action logs, and automatic edit backups. |
|
Cold boot — kernel init, natural-language core, British voice profile, secure local uplink. |
Full cockpit — arc-reactor core, live subsystem telemetry, terminal + mission log, dialogue. |
Requirements: Windows 10/11 · Python 3.10+ · Node 18+ · an NVIDIA GPU (recommended, for image/3D/Whisper) · and an LLM endpoint (e.g. Ollama or LM Studio running locally — or an API key).
:: 1. Clone
git clone https://github.com/Dix01/JARVIS.git
cd JARVIS
:: 2. Install backend venv + frontend deps + Electron
setup.bat
:: 3. Configure
copy .env.example .env
notepad .env :: put your API key(s) here (optional for local models)
notepad config.yaml :: point `model.endpoint` at your LLM
:: 4. Launch
run.batThen open the desktop app (Electron) — or visit:
http://127.0.0.1:7341
Developing? run-dev.bat gives you hot-reload on the frontend.
💡 No GPU? It still runs. Image/3D/Whisper degrade gracefully and heavy models load lazily only when first used. Chat, web, memory, vision-via-API, and the full HUD work on any machine.
J.A.R.V.I.S. speaks to any OpenAI-compatible /v1 endpoint or the Anthropic
Messages API. Edit config.yaml:
model:
provider: openai_compatible # or: anthropic
endpoint: http://localhost:11434/v1 # Ollama, LM Studio, vLLM, OpenRouter, OpenAI…
model: your-model-name
api_key_env: OLLAMA_API_KEY # the .env variable to read the key from
native_tool_calls: true
temperature: 0.3| Provider | endpoint example |
Notes |
|---|---|---|
| Ollama | http://localhost:11434/v1 |
Free, local. Pull a tool-capable model. |
| LM Studio | http://localhost:1234/v1 |
Free, local, GUI. |
| vLLM | http://localhost:8000/v1 |
Self-hosted, fast. |
| OpenRouter | https://openrouter.ai/api/v1 |
Hundreds of models, one key. |
| OpenAI | https://api.openai.com/v1 |
GPT-4o etc. |
| Anthropic | (set provider: anthropic) |
Claude via Messages API. |
For the best experience, pick a model with strong native tool-calling.
A complete, hands-free loop — engineered to feel like the films:
- Wake word — say "Hey JARVIS" (fuzzy-matched, survives Whisper mishears like "jarvis / jervis / charvis"). A follow-up window keeps the mic armed so you don't repeat it every sentence.
- Server-side STT — faster-whisper with VAD, hallucination filtering, and clip coalescing (a mid-sentence pause won't split your command into two).
- "Mute mic except cancel" — while JARVIS speaks, the mic is muted to commands; say "cancel" (or "stop talking", "nevermind") and the speech cuts instantly.
- Time-aware greeting — boots up with "Good morning/afternoon/evening, sir."
- Cinematic TTS, in priority order:
| Tier | Engine | Voice |
|---|---|---|
| 1 | Piper (local) | jgkawell/jarvis ONNX — closest to the Paul-Bettany MCU voice |
| 2 | NVIDIA Riva (cloud) | needs NVIDIA_API_KEY |
| 3 | Edge TTS (fallback) | en-GB-ThomasNeural / RyanNeural — deep British male |
The persona is a concise British butler-AI: status → diagnosis → recommended action. Addresses you as "sir", anticipates the next step, never rambles.
Everything is stored locally in SQLite — nothing leaves your machine.
- Semantic recall — embeddings (e.g.
nomic-embed-text) give true cross-session memory; gracefully falls back to keyword search if embeddings are unavailable. - Stores preferences, free-form notes, known fixes for recurring errors, labeled paths, installed tools, and your most-used commands.
- Neural Galaxy — say "show me what you remember" and your entire memory store blooms as an interactive 3D point cloud, clustered by category.
- Context compaction keeps long sessions coherent without blowing the token budget.
- Autonomous planning — for complex goals, JARVIS runs a plan → execute → reflect loop, surfaced live in the Plan panel.
- Proactive suggestions — after each action it offers the next likely step.
All of the above is toggleable under ultimate: in config.yaml.
Black Forest Labs' fastest distilled model, run from a Q6_K GGUF (~3.3 GB on disk):
- 4-step rectified-flow → seconds per image on a 16 GB GPU
- Smart
model_cpu_offloadkeeps the ~8 GB text encoder from over-committing VRAM (no silent PCIe sysmem spill = no 100× slowdowns) - Apache 2.0 — yours to use
- Just say "generate / draw / render an image of …" — the picture lands in the Media Bay
Turn any image (generated, webcam, or uploaded) into a textured GLB:
- TRELLIS.2-4B FP8 in an isolated Python 3.12 worker (quantized, high quality, ungated)
- Hunyuan3D-2 mini turbo (0.6 B) as a compact local shape path
text_to_3d(prompt → image → model in one shot) orimage_to_3d- Renders inline in the Media Bay via
<model-viewer>— rotate, zoom, inspect
The VRAM manager automatically evicts FLUX and Whisper from the GPU before a 3D job takes over, so everything coexists on a single card.
~80 tools across 12 plugin groups. JARVIS picks the right one automatically.
📁 Files & Code
list_dir · read_file · write_file (auto-backup) · append_file · search_files ·
mkdir · copy · move · delete · stat · run_python · run_python_file ·
run_node · pip_install · npm_install · scan_project · code_debug_loop
💻 Shell & System
run_shell · run_powershell · which · env_var · system_status ·
list_processes · network_info · battery · gpu_info · disk_usage · kill_process
🌐 Web & Browser
web_search · image_search · video_search · web_fetch · open_inline ·
media_show · browser_open · browser_search · browser_page_text ·
browser_click_text · browser_type · browser_press · browser_screenshot ·
browser_close (visible Playwright Chromium)
👁️ Vision
webcam_see (snapshot + multimodal analysis) · webcam_show (live preview) ·
webcam_snapshot · screenshot · screen_ocr · ocr_image · analyze_image ·
webcam_status
🧠 Memory
remember · recall · forget · list_prefs · add_note · list_notes ·
search_memory · set_path · get_path · list_paths · record_tool ·
list_tools · add_fix · lookup_fix · top_commands · memory_galaxy
🎨 Media & Agents
image_generate · image_to_3d · text_to_3d · agent_backend_run
(delegate to Claude Code / Codex / OpenCode) · plus optional speak · listen ·
send_email · add_event · Home-Assistant smart-home control
Behind the butler sits a 9-role agent matrix, visible in the Swarm panel:
Planner → Executor → Research → Code → File → System → Memory → Vision → Voice
For long-running goals ("run until you finish X"), JARVIS delegates to installed CLI backends — Claude Code, Codex, or OpenCode — and streams their progress straight into the HUD.
A genuinely cinematic interface, not a reskin of a chat window:
- Arc-reactor core + concentric reactive rings that pulse with JARVIS's voice
- Voice-reactive orb driven by live mic amplitude
- Boot sequence on every launch (kernel init → voice profile → uplink → ready)
- Live panels — subsystems telemetry, terminal feed, system log, mission log, ambient telemetry, plan + swarm status
- Media Bay — image cards, embedded video, inline web pages, article reader, 3D model viewer, webcam feed, lightbox
- Mouse-parallax depth so the whole HUD floats in 3D space
Built with React 18 · TypeScript · Three.js (@react-three/fiber + drei) ·
Framer Motion · Zustand · TailwindCSS · Vite · Electron.
flowchart LR
subgraph Desktop["🖥️ Electron Desktop App"]
UI["React + Three.js HUD<br/>Orb · Arc Reactor · Media Bay"]
end
subgraph Backend["⚙️ FastAPI Backend :7341"]
WS["WebSocket chat + events"]
ORCH["Orchestrator<br/>agents · plan/reflect · permissions"]
TOOLS["80+ Tool Plugins"]
MEM["SQLite Memory<br/>+ embeddings"]
VRAM["VRAM Manager<br/>GPU arbitration"]
end
subgraph Models["🧬 Local Models (lazy)"]
LLM["Your LLM<br/>OpenAI-compat / Anthropic"]
FLUX["FLUX.2-klein<br/>image"]
T3D["TRELLIS.2 / Hunyuan3D<br/>3D"]
WHISP["faster-whisper<br/>STT"]
TTS["Piper / Riva / Edge<br/>TTS"]
end
UI <-->|WebSocket| WS --> ORCH
ORCH --> TOOLS --> MEM
ORCH --> LLM
TOOLS --> FLUX & T3D & WHISP & TTS
VRAM -. arbitrates .-> FLUX & T3D & WHISP
Every tool is tiered and gated:
| Tier | Behavior |
|---|---|
| 🟢 SAFE | Read-only / low-risk — runs automatically. |
| 🟡 CAUTION | Needs confirmation. |
| 🔴 DANGEROUS | Explicit approval, or refused outright. |
- A hard denylist blocks catastrophic commands (
rm -rf /,format c:, fork bombs…) regardless of mode. - Every action is logged to
data/logs/actions.jsonl. - Every file edit is backed up under
data/backups/(keeps the last 50). - Permission posture is configurable:
safe·confirm·auto·bypass.
⚠️ This is a powerful local agent with real access to your machine. Review the permission mode inconfig.yamlbefore pointing it at sensitive files or systems.
/help Show commands /memory ... Search local memory
/tools List all tools /mission ... Pin a project as focus
/agents List agent personas /reset Clear the in-memory chat
Large weights are not committed — they download lazily on first use. See the Model Asset Guide.
- FLUX GGUF →
data/models/ - Piper voice →
data/voices/(jgkawell/jarvis) - TRELLIS.2 / Hunyuan3D → auto-fetched into
data/models/by their workers
jarvis/ Python backend — server routes, orchestrator, agents, tool plugins
web/ React + Three.js HUD frontend
electron/ Desktop shell
docs/ Model download guide, research notes, screenshots
data/ Local runtime state — memory.db, logs, backups, generated media (git-ignored)
config.yaml All configuration (persona, model, voice, permissions, plugins)
.env Secrets (git-ignored)
- Linux / macOS launch scripts
- More image models in the command-bar switcher
- Voice-cloning for a fully custom TTS profile
- Plugin marketplace / hot-reload
JARVIS thrives on community tools. Adding a new capability is ~30 lines in a single
file — drop a module in jarvis/plugins/ and the registry auto-discovers it on boot.
See CONTRIBUTING.md for the 5-minute tool tutorial, dev setup, and
the safety rules.
Good first contributions: a new tool, a HUD panel, another model/voice backend, or docs.
If JARVIS made you smile, a star helps the next person building their own lab find it.
Built for the lab. Powered by your own hardware.
"Subsystems nominal. Standing by, sir."


