Skip to content

Dix01/JARVIS

J.A.R.V.I.S.

Just A Rather Very Intelligent System — your own local-first AI mission control.

A voice-driven assistant wrapped in an Iron-Man holographic HUD. It talks back in a British butler voice, sees through your webcam, generates images and 3D models on your own GPU, remembers everything across sessions, and commands your PC through 80+ tools — all running on your machine. No cloud required.


Platform Python React FastAPI Three.js Electron

Local First LLM Agnostic Image 3D Voice PRs Welcome


Star on GitHub    Fork on GitHub

Your own Iron-Man AI — voice, vision, image + 3D gen, 80+ tools, 100% on your hardware.
⭐ Star it if you'd run this on your machine.


J.A.R.V.I.S. HUD


⚡ What is this?

J.A.R.V.I.S. is a desktop AI command center for Windows. Point it at any OpenAI-compatible model (local or cloud), and you get a cinematic, full-screen holographic interface that you talk to like Tony Stark talks to his lab.

It is not a chat box. It is a window manager built out of tools — every action happens inside the UI. Ask it to generate an image and the picture materializes in the Media Bay. Ask it to look at you and the webcam opens. Ask what it remembers and a 3D galaxy of your data unfolds. Say "Hey JARVIS" and it wakes; say "cancel" and it stops talking.

You:    "Hey JARVIS, generate a chrome helmet on a black background, then make it 3D."
JARVIS: "Right away, sir."  →  FLUX renders the image  →  TRELLIS lifts it into a
        rotatable GLB hologram in the Media Bay. ~Seconds, fully local, on your GPU.

🌌 Why JARVIS?

Cloud assistants rent you intelligence and read your data. JARVIS doesn't.

  • 🔒 100% local-first — your voice, webcam, files, and memory never leave the machine. No telemetry, no account, no "your chats may be reviewed to improve our models."
  • 💸 No subscription — bring a model you already run (Ollama · LM Studio · vLLM) or a key you control. No monthly seat, no per-token meter.
  • 🧰 It actually does things — not a chat window. It runs code, drives a real browser, sees through your camera, generates images and 3D models, and remembers — across 80+ tools.
  • 🎬 It feels like the films — wake word, British-butler voice, holographic HUD, arc-reactor core. Built to be lived in, not just demoed once.
  • 🪟 Yours to hack — open, plugin-based, fully inspectable. Add a new tool in a single file.

The whole point: an assistant as capable as the cloud ones — that you actually own.


✨ Highlights

🎙️ Full voice loop Wake-word ("Hey JARVIS"), server-side Whisper STT, British-butler TTS, barge-in cancel, time-aware greetings.
🧠 Real memory Semantic cross-session recall via embeddings, notes, fixes, paths, prefs — visualized as a 3D Neural Galaxy.
🖼️ Local image gen FLUX.2-klein-4B (Q6_K GGUF, 4 steps) renders in seconds on a 16 GB GPU — no API, no watermark.
🧊 Local image→3D TRELLIS.2-4B FP8 / Hunyuan3D-2 mini turn any image into a textured GLB, rendered inline.
👁️ Vision One-shot webcam analysis, live preview, screen capture, OCR, drag-and-drop image Q&A.
🌐 Web + browser Web/image/video search as cards, readable page fetch, and a visible Playwright-driven Chromium.
🛠️ Owns your PC Files, shell, PowerShell, Python/Node sandboxes, package installs, live CPU/GPU/RAM/disk/net telemetry.
🤖 Agent swarm A 9-role agent matrix with an autonomous plan→execute→reflect loop, plus delegation to Claude Code / Codex / OpenCode.
🪟 The HUD Arc-reactor core, voice-reactive orb, parallax depth, boot sequence, live terminal — React + Three.js.
🔒 Safe by design SAFE / CAUTION / DANGEROUS tool tiers, a hard denylist, action logs, and automatic edit backups.

📸 Gallery

Boot sequence

Cold boot — kernel init, natural-language core, British voice profile, secure local uplink.

Holographic cockpit

Full cockpit — arc-reactor core, live subsystem telemetry, terminal + mission log, dialogue.


🚀 Quick Start

Requirements: Windows 10/11 · Python 3.10+ · Node 18+ · an NVIDIA GPU (recommended, for image/3D/Whisper) · and an LLM endpoint (e.g. Ollama or LM Studio running locally — or an API key).

:: 1. Clone
git clone https://github.com/Dix01/JARVIS.git
cd JARVIS

:: 2. Install backend venv + frontend deps + Electron
setup.bat

:: 3. Configure
copy .env.example .env
notepad .env          :: put your API key(s) here (optional for local models)
notepad config.yaml   :: point `model.endpoint` at your LLM

:: 4. Launch
run.bat

Then open the desktop app (Electron) — or visit:

http://127.0.0.1:7341

Developing? run-dev.bat gives you hot-reload on the frontend.

💡 No GPU? It still runs. Image/3D/Whisper degrade gracefully and heavy models load lazily only when first used. Chat, web, memory, vision-via-API, and the full HUD work on any machine.


🧩 Configure the model

J.A.R.V.I.S. speaks to any OpenAI-compatible /v1 endpoint or the Anthropic Messages API. Edit config.yaml:

model:
  provider: openai_compatible          # or: anthropic
  endpoint: http://localhost:11434/v1  # Ollama, LM Studio, vLLM, OpenRouter, OpenAI…
  model: your-model-name
  api_key_env: OLLAMA_API_KEY          # the .env variable to read the key from
  native_tool_calls: true
  temperature: 0.3
Provider endpoint example Notes
Ollama http://localhost:11434/v1 Free, local. Pull a tool-capable model.
LM Studio http://localhost:1234/v1 Free, local, GUI.
vLLM http://localhost:8000/v1 Self-hosted, fast.
OpenRouter https://openrouter.ai/api/v1 Hundreds of models, one key.
OpenAI https://api.openai.com/v1 GPT-4o etc.
Anthropic (set provider: anthropic) Claude via Messages API.

For the best experience, pick a model with strong native tool-calling.


🎙️ Voice

A complete, hands-free loop — engineered to feel like the films:

  • Wake word — say "Hey JARVIS" (fuzzy-matched, survives Whisper mishears like "jarvis / jervis / charvis"). A follow-up window keeps the mic armed so you don't repeat it every sentence.
  • Server-side STTfaster-whisper with VAD, hallucination filtering, and clip coalescing (a mid-sentence pause won't split your command into two).
  • "Mute mic except cancel" — while JARVIS speaks, the mic is muted to commands; say "cancel" (or "stop talking", "nevermind") and the speech cuts instantly.
  • Time-aware greeting — boots up with "Good morning/afternoon/evening, sir."
  • Cinematic TTS, in priority order:
Tier Engine Voice
1 Piper (local) jgkawell/jarvis ONNX — closest to the Paul-Bettany MCU voice
2 NVIDIA Riva (cloud) needs NVIDIA_API_KEY
3 Edge TTS (fallback) en-GB-ThomasNeural / RyanNeural — deep British male

The persona is a concise British butler-AI: status → diagnosis → recommended action. Addresses you as "sir", anticipates the next step, never rambles.


🧠 Memory & Intelligence

Everything is stored locally in SQLite — nothing leaves your machine.

  • Semantic recall — embeddings (e.g. nomic-embed-text) give true cross-session memory; gracefully falls back to keyword search if embeddings are unavailable.
  • Stores preferences, free-form notes, known fixes for recurring errors, labeled paths, installed tools, and your most-used commands.
  • Neural Galaxy — say "show me what you remember" and your entire memory store blooms as an interactive 3D point cloud, clustered by category.
  • Context compaction keeps long sessions coherent without blowing the token budget.
  • Autonomous planning — for complex goals, JARVIS runs a plan → execute → reflect loop, surfaced live in the Plan panel.
  • Proactive suggestions — after each action it offers the next likely step.

All of the above is toggleable under ultimate: in config.yaml.


🖼️ Local Media Generation

Images — FLUX.2-klein-4B

Black Forest Labs' fastest distilled model, run from a Q6_K GGUF (~3.3 GB on disk):

  • 4-step rectified-flow → seconds per image on a 16 GB GPU
  • Smart model_cpu_offload keeps the ~8 GB text encoder from over-committing VRAM (no silent PCIe sysmem spill = no 100× slowdowns)
  • Apache 2.0 — yours to use
  • Just say "generate / draw / render an image of …" — the picture lands in the Media Bay

3D — TRELLIS.2 / Hunyuan3D

Turn any image (generated, webcam, or uploaded) into a textured GLB:

  • TRELLIS.2-4B FP8 in an isolated Python 3.12 worker (quantized, high quality, ungated)
  • Hunyuan3D-2 mini turbo (0.6 B) as a compact local shape path
  • text_to_3d (prompt → image → model in one shot) or image_to_3d
  • Renders inline in the Media Bay via <model-viewer> — rotate, zoom, inspect

The VRAM manager automatically evicts FLUX and Whisper from the GPU before a 3D job takes over, so everything coexists on a single card.


🛠️ The Tool Belt

~80 tools across 12 plugin groups. JARVIS picks the right one automatically.

📁 Files & Code

list_dir · read_file · write_file (auto-backup) · append_file · search_files · mkdir · copy · move · delete · stat · run_python · run_python_file · run_node · pip_install · npm_install · scan_project · code_debug_loop

💻 Shell & System

run_shell · run_powershell · which · env_var · system_status · list_processes · network_info · battery · gpu_info · disk_usage · kill_process

🌐 Web & Browser

web_search · image_search · video_search · web_fetch · open_inline · media_show · browser_open · browser_search · browser_page_text · browser_click_text · browser_type · browser_press · browser_screenshot · browser_close (visible Playwright Chromium)

👁️ Vision

webcam_see (snapshot + multimodal analysis) · webcam_show (live preview) · webcam_snapshot · screenshot · screen_ocr · ocr_image · analyze_image · webcam_status

🧠 Memory

remember · recall · forget · list_prefs · add_note · list_notes · search_memory · set_path · get_path · list_paths · record_tool · list_tools · add_fix · lookup_fix · top_commands · memory_galaxy

🎨 Media & Agents

image_generate · image_to_3d · text_to_3d · agent_backend_run (delegate to Claude Code / Codex / OpenCode) · plus optional speak · listen · send_email · add_event · Home-Assistant smart-home control


🤖 Agent Swarm

Behind the butler sits a 9-role agent matrix, visible in the Swarm panel:

Planner → Executor → Research → Code → File → System → Memory → Vision → Voice

For long-running goals ("run until you finish X"), JARVIS delegates to installed CLI backends — Claude Code, Codex, or OpenCode — and streams their progress straight into the HUD.


🪟 The HUD

A genuinely cinematic interface, not a reskin of a chat window:

  • Arc-reactor core + concentric reactive rings that pulse with JARVIS's voice
  • Voice-reactive orb driven by live mic amplitude
  • Boot sequence on every launch (kernel init → voice profile → uplink → ready)
  • Live panels — subsystems telemetry, terminal feed, system log, mission log, ambient telemetry, plan + swarm status
  • Media Bay — image cards, embedded video, inline web pages, article reader, 3D model viewer, webcam feed, lightbox
  • Mouse-parallax depth so the whole HUD floats in 3D space

Built with React 18 · TypeScript · Three.js (@react-three/fiber + drei) · Framer Motion · Zustand · TailwindCSS · Vite · Electron.


🏗️ Architecture

flowchart LR
    subgraph Desktop["🖥️ Electron Desktop App"]
        UI["React + Three.js HUD<br/>Orb · Arc Reactor · Media Bay"]
    end
    subgraph Backend["⚙️ FastAPI Backend :7341"]
        WS["WebSocket chat + events"]
        ORCH["Orchestrator<br/>agents · plan/reflect · permissions"]
        TOOLS["80+ Tool Plugins"]
        MEM["SQLite Memory<br/>+ embeddings"]
        VRAM["VRAM Manager<br/>GPU arbitration"]
    end
    subgraph Models["🧬 Local Models (lazy)"]
        LLM["Your LLM<br/>OpenAI-compat / Anthropic"]
        FLUX["FLUX.2-klein<br/>image"]
        T3D["TRELLIS.2 / Hunyuan3D<br/>3D"]
        WHISP["faster-whisper<br/>STT"]
        TTS["Piper / Riva / Edge<br/>TTS"]
    end

    UI <-->|WebSocket| WS --> ORCH
    ORCH --> TOOLS --> MEM
    ORCH --> LLM
    TOOLS --> FLUX & T3D & WHISP & TTS
    VRAM -. arbitrates .-> FLUX & T3D & WHISP
Loading

🔒 Safety Model

Every tool is tiered and gated:

Tier Behavior
🟢 SAFE Read-only / low-risk — runs automatically.
🟡 CAUTION Needs confirmation.
🔴 DANGEROUS Explicit approval, or refused outright.
  • A hard denylist blocks catastrophic commands (rm -rf /, format c:, fork bombs…) regardless of mode.
  • Every action is logged to data/logs/actions.jsonl.
  • Every file edit is backed up under data/backups/ (keeps the last 50).
  • Permission posture is configurable: safe · confirm · auto · bypass.

⚠️ This is a powerful local agent with real access to your machine. Review the permission mode in config.yaml before pointing it at sensitive files or systems.


⌨️ Command Bar

/help        Show commands              /memory ...  Search local memory
/tools       List all tools            /mission ... Pin a project as focus
/agents      List agent personas       /reset       Clear the in-memory chat

📦 Model Assets

Large weights are not committed — they download lazily on first use. See the Model Asset Guide.

  • FLUX GGUF → data/models/
  • Piper voice → data/voices/ (jgkawell/jarvis)
  • TRELLIS.2 / Hunyuan3D → auto-fetched into data/models/ by their workers

🗂️ Project Layout

jarvis/        Python backend — server routes, orchestrator, agents, tool plugins
web/           React + Three.js HUD frontend
electron/      Desktop shell
docs/          Model download guide, research notes, screenshots
data/          Local runtime state — memory.db, logs, backups, generated media (git-ignored)
config.yaml    All configuration (persona, model, voice, permissions, plugins)
.env           Secrets (git-ignored)

🛣️ Roadmap / Ideas

  • Linux / macOS launch scripts
  • More image models in the command-bar switcher
  • Voice-cloning for a fully custom TTS profile
  • Plugin marketplace / hot-reload

🤝 Contributing

JARVIS thrives on community tools. Adding a new capability is ~30 lines in a single file — drop a module in jarvis/plugins/ and the registry auto-discovers it on boot. See CONTRIBUTING.md for the 5-minute tool tutorial, dev setup, and the safety rules.

Good first contributions: a new tool, a HUD panel, another model/voice backend, or docs.

PRs Welcome Good First Issues


⭐ Star History

If JARVIS made you smile, a star helps the next person building their own lab find it.

Star History Chart



Spread the word → Share on X · Reddit · Hacker News


Built for the lab. Powered by your own hardware.

"Subsystems nominal. Standing by, sir."

About

Local-first AI assistant for Windows: voice (Whisper+Piper), webcam vision, FLUX image gen, image-to-3D, memory, 80+ tools. OpenAI-compatible / Ollama / LM Studio. React + Three.js HUD, FastAPI.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors