Image/vision support for chat attachments #56

@BeinerChes

Description

Context

Chat attachments (shipped in rc/1.0.10) save dropped files to sessions/<runId>/attachments/ and expose them to every agent via three auto-injected tools: list_attachments, read_attachment, grep_attachment. Text and code work; images do not — the tools are text-only, so when an agent calls read_attachment on a PNG it either gets garbled bytes or declines to try.
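A minimal sketch (not the shipped tool code) of why a text-only read_attachment can't serve a PNG: image bytes aren't valid UTF-8, so any text decode turns them into U+FFFD replacement characters — exactly the "garbled bytes" an agent sees today.

```typescript
// The PNG magic header, as read_attachment would pull it off disk.
const pngHeader = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// A text-only tool has no choice but to decode as UTF-8. 0x89 is an
// invalid UTF-8 lead byte, so the decoder emits U+FFFD instead.
const asText = new TextDecoder("utf-8").decode(pngHeader);
console.log(asText.includes("\uFFFD")); // true — garbage, not something a model can "see"
```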

Gap

Vision-capable models (Claude, most OpenAI, Ollama vision variants like llama3.2-vision / llava) can see images — but only when the image arrives as a proper image content block in the model's input, not as base64 text in a tool result. Today we have no path for that.

What we discussed

Three things we want to avoid:

  1. Magic-byte sniffing server-side — same maintenance trap as the MIME extension table we already tore out. Every new format (AVIF, HEIC, …) would force a sniffer update.
  2. Extension tables — same smell.
  3. Inline base64 in tool results — LLMs can't "see" a base64 string. Useless for vision.

Proposal on the table (deferred)

Option A — Browser-provided MIME + two explicit tools.

  • Client sends {filename, contentBase64, mimeType: file.type} on upload. Browser has already set file.type (e.g. "image/png"); we just pass it through.
  • Server stores the mime alongside the file (per-file sidecar or run-level manifest).
  • list_attachments includes the mime hint so the agent knows what it's looking at:
    - screenshot.png (480103 bytes, image/png)
    - foo.py (1234 bytes, text/x-python)
    
  • Two tools, agent picks:
    • read_attachment(filename, offset?, limit?) → UTF-8 text content block.
    • view_attachment(filename) → image content block for vision-capable engines (Claude, OpenAI Responses), clean error for non-vision paths.

Zero format tables on our side, zero magic bytes. Agent dispatches based on filename/mime hint.
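To make Option A concrete, here is a hypothetical sketch of the payload/manifest shapes and the agent-side dispatch. The interface and function names are assumptions for illustration, not shipped API:

```typescript
// What the client would send on upload — mimeType is just the browser's
// file.type, passed through untouched.
interface AttachmentUpload {
  filename: string;
  contentBase64: string;
  mimeType: string; // e.g. "image/png", set by the browser
}

// What the server would persist and surface via list_attachments.
interface AttachmentEntry {
  filename: string;
  bytes: number;
  mimeType: string;
}

// Agent-side dispatch: no magic bytes, no extension table — the
// browser-provided mime hint alone decides which tool to call.
function pickTool(entry: AttachmentEntry): "view_attachment" | "read_attachment" {
  return entry.mimeType.startsWith("image/") ? "view_attachment" : "read_attachment";
}

console.log(pickTool({ filename: "screenshot.png", bytes: 480103, mimeType: "image/png" }));
// → "view_attachment"
```

The dispatch stays trivial because all format knowledge lives in the browser, which already had to classify the file to produce file.type.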

Option B — Inject images directly into the first user message, bypassing tools for images entirely. Closer to how ChatGPT/Claude.ai handle uploads. Requires per-engine message-building changes (each engine has a different image block shape).

A hybrid of A+B is also viable: images go through the first-message injection (for immediate vision); text/code stays tool-gated.

Scope when we pick this up

  • Per-engine image content block formats:
    • Claude: {type: "image", source: {type: "base64", media_type, data}} — native.
    • OpenAI Responses: {type: "input_image", image_url: "data:…"} in a user message's content array.
    • OpenAI Chat Completions: {type: "image_url", image_url: {url: "data:…"}}, model-dependent support.
    • Ollama: images: [base64] on the message — only vision models.
  • Figure out how @anthropic-ai/claude-agent-sdk's query() accepts image-bearing messages (may only take strings today).
  • Client: capture file.type at drop/upload.
  • Server: persist mime per attachment, make available to list_attachments.
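The per-engine shapes above could funnel through a single adapter. A sketch, assuming the shapes listed (the Claude and Chat Completions formats match their public APIs; treat the Responses and Ollama shapes as assumptions to verify when this is picked up):

```typescript
type Engine = "claude" | "openai-responses" | "openai-chat" | "ollama";

// Build one engine's image content block from a stored attachment's
// mime type and base64 payload.
function imageBlock(engine: Engine, mediaType: string, base64: string): unknown {
  switch (engine) {
    case "claude":
      return { type: "image", source: { type: "base64", media_type: mediaType, data: base64 } };
    case "openai-responses":
      return { type: "input_image", image_url: `data:${mediaType};base64,${base64}` };
    case "openai-chat":
      return { type: "image_url", image_url: { url: `data:${mediaType};base64,${base64}` } };
    case "ollama":
      // Ollama wants the raw base64 string pushed into the message's
      // `images` array, so the adapter returns just the string.
      return base64;
  }
}
```

Whether this lives behind view_attachment (Option A) or in first-message construction (Option B), the engine-specific knowledge stays in one place.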

Out of scope for this issue

  • PDF/DOCX text extraction (separate reader pipeline, different issue).
  • Audio/video support (separate issue if we ever need it).

Links

  • Discussed in session 2026-04-15 during chat drag-drop work on rc/1.0.10.

Metadata

    Labels: enhancement (New feature or request)