A desktop screen annotation assistant powered by AI vision models. Draw on your screen and talk to an AI about what you see.
- Screen Annotation: Draw directly on your screen with mouse/trackpad
- Voice Input: Speak your questions while drawing
- AI Vision: Powered by OpenAI-compatible vision models (local or cloud)
- Text-to-Speech: Natural voice responses via MOSS-TTS-Nano
- UI Understanding: Optional OmniParser integration for enhanced screen element detection
- Web Search: Live web search capability via Tavily API
- Computer Control: AI can interact with your computer (click, type, scroll, etc.)
- Press Alt+D to enter drawing mode
- Draw on screen with your mouse (red ink) and speak to describe your question
- Release Alt to send — the app captures a screenshot with your annotations and voice recording and sends them to the AI
- The AI's response appears in the bottom-right corner and is spoken aloud
The app runs as a transparent always-on-top overlay. When not in drawing mode, all mouse events pass through to underlying windows.
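In Electron, this kind of click-through overlay is typically driven by `BrowserWindow.setIgnoreMouseEvents`. A minimal sketch of how the toggle might work (the `overlayMouseState` helper and the wiring comment are illustrative, not the app's actual code):

```javascript
// Decide how the overlay window should treat mouse events.
// Outside drawing mode, events are ignored and forwarded to the
// windows below; in drawing mode, the overlay captures them so
// strokes can be recorded.
function overlayMouseState(isDrawing) {
  return isDrawing
    ? { ignore: false, options: undefined }
    : { ignore: true, options: { forward: true } };
}

// Hypothetical wiring in the Electron main process:
//   const state = overlayMouseState(drawingMode);
//   overlayWindow.setIgnoreMouseEvents(state.ignore, state.options);
```

The `{ forward: true }` option matters: without it, ignored events are swallowed instead of reaching the underlying windows.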
- Node.js (v16 or later)
- Python 3.10+ (for TTS and OmniParser)
- uv (recommended, for faster Python dependency installation)
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
- LLM Server: A running OpenAI-compatible API endpoint (e.g., LM Studio, Ollama, or OpenAI)
```bash
git clone --recursive https://github.com/HuangHan96/focus.git
cd focus
```
If you already cloned without `--recursive`:
```bash
git submodule update --init --recursive
```
Install Node.js dependencies:
```bash
npm install
```
Create a `.env` file in the project root:
```
# Required: LLM Configuration
LLM_BASE=http://localhost:1234/v1
LLM_MODEL=qwen2.5-vl-7b-instruct
LLM_API_KEY=lm_studio

# Optional: Web Search (required for the web search tool)
TAVILY_API_KEY=your_tavily_api_key_here

# Optional: Vision and OmniParser
VISION_ENABLE=true
LLM_INCLUDE_OMNIPARSER_IMAGE=true
LLM_OMNIPARSER_SUMMARY_LIMIT=36
LLM_OMNIPARSER_CONTENT_TRUNCATE=64
```
Note: TTS and OmniParser dependencies are installed automatically on first run. No manual setup is required.
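The app reads these values from `process.env`; a hedged sketch of how startup validation and defaults might look (the `validateConfig` helper is illustrative, not part of the app):

```javascript
// Check that required LLM settings are present and apply defaults
// for the optional OmniParser tuning knobs.
function validateConfig(env) {
  const required = ['LLM_BASE', 'LLM_MODEL', 'LLM_API_KEY'];
  const missing = required.filter((k) => !env[k]);
  if (missing.length > 0) {
    throw new Error(`Missing required .env keys: ${missing.join(', ')}`);
  }
  return {
    llmBase: env.LLM_BASE,
    llmModel: env.LLM_MODEL,
    llmApiKey: env.LLM_API_KEY,
    visionEnable: env.VISION_ENABLE !== 'false',          // defaults on
    omniparserSummaryLimit: Number(env.LLM_OMNIPARSER_SUMMARY_LIMIT || 36),
    omniparserContentTruncate: Number(env.LLM_OMNIPARSER_CONTENT_TRUNCATE || 64),
  };
}
```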
```bash
npm start
```
On first launch, the app will:
- Create Python virtual environments for TTS and OmniParser
- Install all required dependencies (this may take 5-10 minutes)
- Download model weights from HuggingFace
- Start the TTS and OmniParser services
Subsequent launches will be much faster as dependencies are already installed.
The app works with any OpenAI-compatible API. Popular options:
LM Studio (recommended for local models):
- Download from lmstudio.ai
- Load a vision model (e.g., `qwen2.5-vl-7b-instruct`)
- Start the local server (default: `http://localhost:1234/v1`)
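Every backend above accepts the same chat-completions request shape; a hedged sketch of how a vision request pairing the question with the annotated screenshot might be assembled (the `buildVisionRequest` helper and field values are illustrative):

```javascript
// Build an OpenAI-compatible /v1/chat/completions payload that pairs
// the user's transcribed question with the annotated screenshot.
function buildVisionRequest(model, question, screenshotBase64) {
  return {
    model,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: question },
          {
            type: 'image_url',
            image_url: { url: `data:image/jpeg;base64,${screenshotBase64}` },
          },
        ],
      },
    ],
  };
}

// Typically POSTed to `${LLM_BASE}/chat/completions` with the header
// `Authorization: Bearer ${LLM_API_KEY}`.
```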
Ollama:
```bash
ollama serve
# Set LLM_BASE=http://localhost:11434/v1
```
OpenAI:
```
LLM_BASE=https://api.openai.com/v1
LLM_MODEL=gpt-4o
LLM_API_KEY=sk-your-api-key
```
Web Search (via Tavily):
- Get API key from tavily.com
- Add `TAVILY_API_KEY` to `.env`
Computer Control:
- Enabled by default
- AI can click, type, scroll, and interact with your screen
- Uses OmniParser for UI element detection
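OmniParser reports detected UI elements as bounding boxes; turning one into a click target amounts to scaling its center to screen pixels. A sketch, assuming normalized `[x1, y1, x2, y2]` boxes (the exact box format is an assumption about OmniParser's output):

```javascript
// Convert a normalized [x1, y1, x2, y2] bounding box (0..1 coordinates)
// into the pixel coordinates of its center on a given screen.
function bboxToClickPoint(bbox, screenWidth, screenHeight) {
  const [x1, y1, x2, y2] = bbox;
  return {
    x: Math.round(((x1 + x2) / 2) * screenWidth),
    y: Math.round(((y1 + y2) / 2) * screenHeight),
  };
}
```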
- Alt+D: Enter drawing mode
- Alt (hold): Draw on screen
- Alt (release): Send to AI
- Cmd+Q (Mac) / Alt+F4 (Windows): Quit app
The app automatically installs dependencies on first run. If setup fails:
- Check Python version: `python3 --version` (must be 3.10+)
- Install uv: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Manual cleanup and retry:
```bash
rm -rf MOSS-TTS-Nano/.venv OmniParser/.venv OmniParser/weights
npm start  # Retry setup
```
If ports 18083 (TTS) or 18084 (OmniParser) are in use:
```bash
# Kill existing processes
lsof -ti:18083 | xargs kill
lsof -ti:18084 | xargs kill
```
Or set custom ports in `.env`:
```
MOSS_TTS_PORT=18085
OMNIPARSER_PORT=18086
```
If the AI does not respond:
- Verify your LLM server is running
- Check that the `LLM_BASE` URL is correct
- Test with: `curl http://localhost:1234/v1/models`
focus/
├── main.js # Electron main process
├── mask.html # Overlay UI
├── package.json # Node.js dependencies
├── .env # Configuration (create this)
├── MOSS-TTS-Nano/ # TTS submodule (auto-setup)
├── OmniParser/ # UI parser submodule (auto-setup)
└── tmp/ # Debug output (screenshots, audio)
Each interaction is saved to `tmp/{timestamp}/`:
- `frame_000.jpg` — Screenshot with annotations
- `audio.webm` — Voice recording
- `debug.json` — Full request/response data
- Electron: Transparent overlay, screen capture, global shortcuts
- MOSS-TTS-Nano: Fast multilingual text-to-speech
- OmniParser: UI element detection and screen understanding
- Tavily: Web search API
- OpenAI API: Compatible with local and cloud LLMs
MIT
- MOSS-TTS-Nano by OpenMOSS
- OmniParser by Microsoft