A fine-tuned desktop companion model for OpenClaw β control your machine like magic.
PsiClaw is the training grounds, evaluation harness, and operator console for a specialized VLM (vision-language model) built on top of qwen3-vl-8b. It ships as the default desktop companion option inside OpenClaw for users who want a model that deeply understands and operates their macOS environment.
- Train a desktop-native agent β fine-tune qwen3-vl-8b to operate across the full macOS surface (browser, native apps, file system, terminal) with state-grounded reasoning.
- Build the operator console β a human-in-the-loop approval interface where every proposed action is visible, auditable, and controllable before execution.
- Establish evaluation infrastructure β benchmark the model across six dimensions (browser automation, API routing, native app navigation, terminal safety, confirmation discipline, memory/personalization).
- Ship a training harness β capture traces, scenarios, and recovery episodes as structured training data for continuous fine-tuning via LoRA on Apple Silicon.
PsiClaw operates across your entire computing environment β not just the browser:
- Browser β navigate, search, fill forms, verify evidence, multi-hop research
- Native apps β VS Code, Terminal, Slack, Finder, and more via system accessibility + Peekaboo
- File system β read, organize, and act on files with appropriate confirmation gates
- Terminal β execute commands with safety checks; detect and block risky or irreversible operations
- Memory β persistent user identity via OpenTrust (prediction-based reasoning, not static retrieval)
It prefers direct API calls over browser DOM automation when an API skill is available (Unbrowse pattern), falling back to visual automation only when needed.
| Browser agent | PsiClaw (desktop companion) |
|---|---|
| Web-only | Web + native apps + file system + terminal |
| Stateless per session | Persistent identity across sessions |
| Task executor | Partner with context and memory |
| No OS awareness | Full OS and workflow awareness |
| No personalization | Adapts to user patterns over time |
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Operator Console (UI) β
β Next.js 16 Β· React 19 Β· Tailwind 4 Β· shadcn/ui (base-nova) β
ββββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββββββββββββ€
β Overview β Console β Gym β Traces β Evals β
β (landing)β (approve β (train β (replay β (benchmark β
β β / deny) β tasks) β / audit)β dashboard) β
ββββββββ¬ββββ΄βββββββ¬ββββ΄βββββββ¬ββββ΄βββββββ¬ββββ΄βββββββββββββββββββββ
β β β β
βΌ βΌ βΌ βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β PsiClaw Agent Core β
β β
β βββββββββββββ ββββββββββββββ ββββββββββββ βββββββββββββ β
β β Qwen-Agentβ β Unbrowse β β OpenTrustβ β Peekaboo β β
β β (native β β (API-first β β (memory β β (native β β
β β tool β β routing) β β as β β app β β
β β calling) β β β β reason- β β access) β β
β β β β β β ing) β β β β
β βββββββββββββ ββββββββββββββ ββββββββββββ βββββββββββββ β
ββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β qwen3-vl-8b (Base Model) β
β MLX 4BIT (5.78 GB) Β· MLX 8BIT (9.87 GB) β
β Interleaved-MRoPE Β· DeepStack Β· 256K context window β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
| Component | Role |
|---|---|
| qwen3-vl-8b | Base vision-language model β Interleaved-MRoPE for temporal reasoning, DeepStack for fine-grained UI element identification, 256K context window |
| Qwen-Agent | First-party agent framework with native tool calling for the Qwen model family |
| Unbrowse | API-first routing layer β calls web service APIs directly when available, falls back to DOM automation otherwise (100x faster, 80% cheaper) |
| OpenTrust | Memory-as-reasoning layer β predicts user needs from interaction history rather than static retrieval; implements predict β check β update cycle |
| Peekaboo | Native macOS app observation via system accessibility APIs β reads menus, windows, dialogs, and focused elements |
- Observe β capture full desktop state: active windows, browser state, running processes, clipboard, and recent interaction history
- Route β determine whether an API skill is available; prefer direct API over DOM automation
- Propose β present the next action with confidence score, rationale, and risk level; request approval for irreversible or high-impact steps
- Execute β run in a scoped, observable way; re-read state after each action before deciding the next step
- Trace β capture the full chain (observation β reasoning β action β outcome) for replay, evals, and fine-tuning
| Base model | qwen3-vl-8b |
| Format (Mac mini 16GB) | MLX 4BIT (5.78 GB) |
| Format (M3 Max / M4 Pro) | MLX 8BIT (9.87 GB) |
| Min hardware for users | 8GB RAM (runs at ~5.8 GB) |
| Context window | 256K tokens |
| Agent framework | Qwen-Agent (first-party, native tool calling) |
| Personalization | OpenTrust (memory as reasoning) |
| Efficiency layer | Unbrowse (API-first routing) |
| Fine-tuning | LoRA / QLoRA via mlx-lm on Apple Silicon |
| Distribution | Default model option in OpenClaw desktop companion mode |
Head-to-head benchmarks on identical hardware (M4 Mac, local inference) against qwen2.5-vl-7b:
| Task | qwen2.5-vl-7B | qwen3-vl-8B |
|---|---|---|
| Visual perception | 5/10 | 8/10 |
| Visual captioning | 6.5/10 | 9/10 |
| Visual reasoning | 8/10 | 9/10 |
| Multimodal fusion | 7/10 | 9/10 |
| Instruction following | 8/10 | 8.5/10 |
Key architectural upgrades (arxiv:2511.21631):
- Interleaved-MRoPE β reasons about what changed between screenshots, not just individual frames
- DeepStack β multi-level ViT features for tighter vision-language alignment and more accurate UI element targeting
- Text-based time alignment β explicit textual timestamp alignment for trace replay and session history reasoning
Six benchmark suites measured across model checkpoints:
| Suite | Success rate | Interventions | Notes |
|---|---|---|---|
| Browser form fill | 88% | 1.4 / run | Needs better handling for auth flows and dynamic modals |
| API-first routing | 96% | 0.2 / run | Strong skill matching; correct DOM fallback when API unavailable |
| Native app navigation | 81% | 1.9 / run | Main frontier area β cross-app workflows need more training data |
| Terminal safety | 99% | 0.1 / run | Excellent detection of risky commands and irreversible writes |
| Confirmation discipline | 100% | β / run | No irreversible action executed without operator approval in any run |
| Memory + personalization | 72% | 0.8 / run | OpenTrust layer reduces redundant questions over time; still improving |
Training is informed by seven research lineages:
| Source | Core lesson applied |
|---|---|
| OpenAI Harness Engineering | Environment legibility determines agent capability |
| OpenAI BrowseComp | Persistence + search reformulation beat one-shot retrieval |
| Karpathy / Paper Lantern | Literature-grounded configs outperform intuition by 3.2% |
| Microsoft LIDA | Generate β evaluate β repair loops for verifiable code |
| Plastic Labs Honcho | Memory quality is benchmarkable; personalization needs dedicated evals |
| Plastic Labs β Memory as Reasoning | Memory is prediction, not storage; reasoning > retrieval |
| Unbrowse | Skip DOM when API is discoverable β 100x faster, 80% cheaper |
- Harness first, model second β the bottleneck is environment legibility, not model capability alone
- State grounding over pattern matching β decisions must be justified by actual current state, not assumptions
- Recovery as a first-class skill β training data over-indexes on recovery episodes, not just successful forward paths
- Generate β evaluate β repair β include full repair trajectories in training data, not just correct-on-first-try examples
- Paper Lantern loop β before each major training iteration, scan recent literature for applicable improvements
# LoRA fine-tuning on Apple Silicon
pip install mlx-lm
mlx_lm.lora \
--model qwen3-vl-8b \
--train \
--data ./data/desktop-agent-train.jsonl \
--iters 1000 \
--batch-size 4 \
--lora-layers 16PsiClaw operates under a confirmation-first safety policy:
- Always confirm before: purchases, payments, sending messages, deleting data, submitting sensitive forms, changing account settings, publishing content, accepting permissions
- No irreversible action without operator approval β 100% confirmation discipline across all eval runs
- Authentication boundaries respected β login, 2FA, CAPTCHA, and biometric steps are escalated to the user, never bypassed
- Suspicious content surfaced β deceptive pages, inconsistent behavior, or unexpected state changes are flagged for review
- Reversible actions preferred β move over delete, save over overwrite, branch over force-push
psi-claw/
βββ docs/
β βββ psiclaw-mvp-plan.md # MVP definition, scope, phases, and exit criteria
β βββ psiclaw-training-plan.md # Full training design + research sources
β βββ psiclaw-system-prompt-v2.md # System prompt used for fine-tuning
βββ src/
β βββ app/
β β βββ page.tsx # Overview / landing page
β β βββ layout.tsx # Root layout (Geist font, dark mode)
β β βββ globals.css # Theme (OKLch colors, radial gradients)
β β βββ console/page.tsx # Operator console β observe, approve, deny
β β βββ gym/page.tsx # Desktop Gym β task scenarios for training
β β βββ traces/page.tsx # Trace explorer β replay + audit
β β βββ evals/page.tsx # Eval dashboard β success rate, interventions
β βββ components/
β β βββ app-shell.tsx # AppShell, Panel, StatCard layout components
β β βββ ui/ # shadcn/ui components (base-nova style)
β βββ lib/
β βββ demo-data.ts # Scenarios, task data, eval rows, mock state
β βββ utils.ts # Utility helpers (cn class merger)
βββ components.json # shadcn/ui configuration
βββ next.config.ts # Next.js 16 configuration
βββ tsconfig.json # TypeScript config (strict, path aliases)
βββ package.json # Dependencies and scripts
βββ pnpm-workspace.yaml # pnpm workspace config
| Layer | Technology |
|---|---|
| Framework | Next.js 16.2.1 |
| UI library | React 19.2.4 |
| Language | TypeScript 5 |
| Styling | Tailwind CSS 4 + shadcn/ui (base-nova) |
| Icons | lucide-react |
| Component utilities | class-variance-authority, clsx, tailwind-merge |
| Package manager | pnpm |
| Color system | OKLch (dark-first, violet primary, cyan accent) |
| Font | Geist Sans / Geist Mono |
pnpm install
pnpm devOpen http://localhost:3000.
pnpm build # Production build
pnpm start # Start production server
pnpm lint # Run ESLintPsiClaw connects to several components in the broader OpenKnots ecosystem:
- OpenClaw β the parent agent platform; PsiClaw ships as the default desktop companion model option
- OpenTrust β the memory-as-reasoning layer; implements predict β check β update cycle with surprisal-weighted memory formation
- Unbrowse β API-first routing plugin (
lekt9/unbrowse-openclaw); integrated as the efficiency layer beneath the desktop companion - Nova β the personalization agent; its memory layer implements the same reasoning-based identity model informed by Honcho/Plastic Labs research
v0.1.0 β The UI harness is built with demo data. No live ML backend is wired up yet.
The immediate goal is not full desktop coverage. The next milestone is a narrow MVP:
- one supported task surface
- live operator approvals
- persisted traces
- basic eval reporting from real runs
See docs/psiclaw-mvp-plan.md for the detailed roadmap from the current prototype to MVP.
- Lock the MVP wedge defined in
docs/psiclaw-mvp-plan.md - Replace demo-only console state with a live operator loop
- Persist traces and derive evals from real runs
- Create
data/directory with training data collection structure - Collect 100 desktop interaction traces (success + failure + recovery)
- Run BrowseComp hard-find eval on base qwen3-vl-8b
- First LoRA fine-tuning experiment (local MLX on Apple Silicon)
- Paper Lantern scan: get literature recommendations before first training run
- Compare base vs fine-tuned on state-grounding eval set
- Prototype memory-as-reasoning layer for OpenTrust using Honcho as reference
- Wire live model inference into the operator console
MIT