A desktop screen annotation assistant powered by AI vision models. Draw on your screen and talk to an AI about what you see.
- Screen Annotation: Draw directly on your screen with mouse/trackpad
- Voice Input: Speak your questions while drawing
- AI Vision: Powered by OpenAI-compatible vision models (local or cloud)
- Text-to-Speech: Natural voice responses via MOSS-TTS-Nano
- UI Understanding: Optional OmniParser integration for enhanced screen element detection
- Web Search: Live web search capability via Tavily API
- Computer Control: AI can interact with your computer (click, type, scroll, etc.)
- Press Alt+D to enter drawing mode
- Draw on screen with your mouse (red ink) and speak to describe your question
- Release Alt to send — the app captures a screenshot with your annotations and voice recording and sends them to the AI
- The AI's response appears in the bottom-right corner and is spoken aloud
The app runs as a transparent always-on-top overlay. When not in drawing mode, all mouse events pass through to underlying windows.
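In Electron, this kind of click-through overlay is typically driven by `BrowserWindow.setIgnoreMouseEvents`. A minimal sketch of how the toggle might work (the `overlayMouseState` helper and the wiring comment are illustrative, not the app's actual code):

```javascript
// Decide how the overlay window should treat mouse events.
// Outside drawing mode, events are ignored and forwarded to the
// windows below; in drawing mode, the overlay captures them so
// strokes can be recorded.
function overlayMouseState(isDrawing) {
  return isDrawing
    ? { ignore: false, options: undefined }
    : { ignore: true, options: { forward: true } };
}

// Hypothetical wiring in the Electron main process:
//   const state = overlayMouseState(drawingMode);
//   overlayWindow.setIgnoreMouseEvents(state.ignore, state.options);
```

The `{ forward: true }` option matters: without it, ignored events are swallowed instead of reaching the underlying windows.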
- Node.js (v16 or later)
- Python 3.10+ (for TTS and OmniParser)
- uv (recommended, for faster Python dependency installation)
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
- LLM Server: A running OpenAI-compatible API endpoint (e.g., LM Studio, Ollama, or OpenAI)
```bash
git clone --recursive https://github.com/HuangHan96/focus.git
cd focus
```
If you already cloned without `--recursive`:
```bash
git submodule update --init --recursive
```
Install Node.js dependencies:
```bash
npm install
```
Create a `.env` file in the project root:
```
# Required: LLM Configuration
LLM_BASE=http://localhost:1234/v1
LLM_MODEL=qwen2.5-vl-7b-instruct
LLM_API_KEY=lm_studio

# Optional: Web Search (required for the web search tool)
TAVILY_API_KEY=your_tavily_api_key_here

# Optional: Vision and OmniParser
VISION_ENABLE=true
LLM_INCLUDE_OMNIPARSER_IMAGE=true
LLM_OMNIPARSER_SUMMARY_LIMIT=36
LLM_OMNIPARSER_CONTENT_TRUNCATE=64
```
Note: TTS and OmniParser dependencies are installed automatically on first run. No manual setup is required.
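The app reads these values from `process.env`; a hedged sketch of how startup validation and defaults might look (the `validateConfig` helper is illustrative, not part of the app):

```javascript
// Check that required LLM settings are present and apply defaults
// for the optional OmniParser tuning knobs.
function validateConfig(env) {
  const required = ['LLM_BASE', 'LLM_MODEL', 'LLM_API_KEY'];
  const missing = required.filter((k) => !env[k]);
  if (missing.length > 0) {
    throw new Error(`Missing required .env keys: ${missing.join(', ')}`);
  }
  return {
    llmBase: env.LLM_BASE,
    llmModel: env.LLM_MODEL,
    llmApiKey: env.LLM_API_KEY,
    visionEnable: env.VISION_ENABLE !== 'false',          // defaults on
    omniparserSummaryLimit: Number(env.LLM_OMNIPARSER_SUMMARY_LIMIT || 36),
    omniparserContentTruncate: Number(env.LLM_OMNIPARSER_CONTENT_TRUNCATE || 64),
  };
}
```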
```bash
npm start
```
On first launch, the app will:
- Create Python virtual environments for TTS and OmniParser
- Install all required dependencies (this may take 5-10 minutes)
- Download model weights from HuggingFace
- Start the TTS and OmniParser services
Subsequent launches will be much faster as dependencies are already installed.
The app works with any OpenAI-compatible API. Popular options:
LM Studio (recommended for local models):
- Download from lmstudio.ai
- Load a vision model (e.g., `qwen2.5-vl-7b-instruct`)
- Start the local server (default: `http://localhost:1234/v1`)
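Every backend above accepts the same chat-completions request shape; a hedged sketch of how a vision request pairing the question with the annotated screenshot might be assembled (the `buildVisionRequest` helper and field values are illustrative):

```javascript
// Build an OpenAI-compatible /v1/chat/completions payload that pairs
// the user's transcribed question with the annotated screenshot.
function buildVisionRequest(model, question, screenshotBase64) {
  return {
    model,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: question },
          {
            type: 'image_url',
            image_url: { url: `data:image/jpeg;base64,${screenshotBase64}` },
          },
        ],
      },
    ],
  };
}

// Typically POSTed to `${LLM_BASE}/chat/completions` with the header
// `Authorization: Bearer ${LLM_API_KEY}`.
```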
Ollama:
```bash
ollama serve
# Set LLM_BASE=http://localhost:11434/v1
```
OpenAI:
```
LLM_BASE=https://api.openai.com/v1
LLM_MODEL=gpt-4o
LLM_API_KEY=sk-your-api-key
```
Web Search (via Tavily):
- Get API key from tavily.com
- Add `TAVILY_API_KEY` to `.env`
Computer Control:
- Enabled by default
- AI can click, type, scroll, and interact with your screen
- Uses OmniParser for UI element detection
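OmniParser reports detected UI elements as bounding boxes; turning one into a click target amounts to scaling its center to screen pixels. A sketch, assuming normalized `[x1, y1, x2, y2]` boxes (the exact box format is an assumption about OmniParser's output):

```javascript
// Convert a normalized [x1, y1, x2, y2] bounding box (0..1 coordinates)
// into the pixel coordinates of its center on a given screen.
function bboxToClickPoint(bbox, screenWidth, screenHeight) {
  const [x1, y1, x2, y2] = bbox;
  return {
    x: Math.round(((x1 + x2) / 2) * screenWidth),
    y: Math.round(((y1 + y2) / 2) * screenHeight),
  };
}
```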
- Alt+D: Enter drawing mode
- Alt (hold): Draw on screen
- Alt (release): Send to AI
- Cmd+Q (Mac) / Alt+F4 (Windows): Quit app
The app automatically installs dependencies on first run. If setup fails:
- Check Python version: `python3 --version` (must be 3.10+)
- Install uv: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Manual cleanup and retry:
```bash
rm -rf MOSS-TTS-Nano/.venv OmniParser/.venv OmniParser/weights
npm start  # Retry setup
```
If ports 18083 (TTS) or 18084 (OmniParser) are in use:
```bash
# Kill existing processes
lsof -ti:18083 | xargs kill
lsof -ti:18084 | xargs kill
```
Or set custom ports in `.env`:
```
MOSS_TTS_PORT=18085
OMNIPARSER_PORT=18086
```
If the AI does not respond:
- Verify your LLM server is running
- Check that the `LLM_BASE` URL is correct
- Test with: `curl http://localhost:1234/v1/models`
focus/
├── main.js # Electron main process
├── mask.html # Overlay UI
├── package.json # Node.js dependencies
├── .env # Configuration (create this)
├── MOSS-TTS-Nano/ # TTS submodule (auto-setup)
├── OmniParser/ # UI parser submodule (auto-setup)
└── tmp/ # Debug output (screenshots, audio)
Each interaction is saved to `tmp/{timestamp}/`:
- `frame_000.jpg` — Screenshot with annotations
- `audio.webm` — Voice recording
- `debug.json` — Full request/response data
- Electron: Transparent overlay, screen capture, global shortcuts
- MOSS-TTS-Nano: Fast multilingual text-to-speech
- OmniParser: UI element detection and screen understanding
- Tavily: Web search API
- OpenAI API: Compatible with local and cloud LLMs
MIT
- MOSS-TTS-Nano by OpenMOSS
- OmniParser by Microsoft