Context
Chat attachments (shipped in rc/1.0.10) save dropped files to sessions/<runId>/attachments/ and expose them to every agent via three auto-injected tools: list_attachments, read_attachment, grep_attachment. Text and code work; images do not — the tools are text-only, so when an agent calls read_attachment on a PNG it either gets garbled bytes or declines to try.
Gap
Vision-capable models (Claude, most OpenAI, Ollama vision variants like llama3.2-vision / llava) can see images — but only when the image arrives as a proper image content block in the model's input, not as base64 text in a tool result. Today we have no path for that.
What we discussed
Three things we want to avoid:
- Magic-byte sniffing server-side — same maintenance trap as the MIME extension table we already tore out. Every new format (AVIF, HEIC, …) would force a sniffer update.
- Extension tables — same smell.
- Inline base64 in tool results — LLMs can't "see" a base64 string. Useless for vision.
Proposal on the table (deferred)
Option A — Browser-provided MIME + two explicit tools.
- Client sends {filename, contentBase64, mimeType: file.type} on upload. The browser has already set file.type (e.g. "image/png"); we just pass it through.
- Server stores the MIME alongside the file (per-file sidecar or run-level manifest).
- list_attachments includes the MIME hint so the agent knows what it's looking at:
  - screenshot.png (480103 bytes, image/png)
  - foo.py (1234 bytes, text/x-python)
- Two tools, agent picks:
  - read_attachment(filename, offset?, limit?) → UTF-8 text content block.
  - view_attachment(filename) → image content block for vision-capable engines (Claude, OpenAI Responses); clean error for non-vision paths.
Zero format tables on our side, zero magic bytes. Agent dispatches based on filename/mime hint.
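A minimal sketch of Option A's pass-through path. All names here (AttachmentUpload, recordAttachment, formatListing) are illustrative, not the real API; the point is that the only MIME knowledge in the system is whatever the browser put in file.type, persisted in a run-level manifest and echoed back by list_attachments:

```typescript
// Assumed upload payload: the browser's file.type is forwarded untouched.
interface AttachmentUpload {
  filename: string;
  contentBase64: string;
  mimeType: string; // taken verbatim from file.type; may be "" if the browser doesn't know
}

// Hypothetical run-level manifest: filename -> { bytes, mimeType }.
type Manifest = Record<string, { bytes: number; mimeType: string }>;

function recordAttachment(manifest: Manifest, upload: AttachmentUpload): Manifest {
  const bytes = Buffer.from(upload.contentBase64, "base64").length;
  return {
    ...manifest,
    [upload.filename]: {
      bytes,
      // No sniffing, no extension table: an unknown type stays generic.
      mimeType: upload.mimeType || "application/octet-stream",
    },
  };
}

// What list_attachments would render from the manifest.
function formatListing(manifest: Manifest): string[] {
  return Object.entries(manifest).map(
    ([name, meta]) => `${name} (${meta.bytes} bytes, ${meta.mimeType})`
  );
}
```

Note the fallback: if the browser leaves file.type empty, we store application/octet-stream rather than guessing, which keeps the "no format tables" property intact.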
Option B — Inject images directly into the first user message, bypassing tools for images entirely. Closer to how ChatGPT/Claude.ai handle uploads. Requires per-engine message-building changes (each engine has a different image block shape).
A hybrid of A+B is also viable: images go through the first-message injection (for immediate vision); text/code stays tool-gated.
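The hybrid could be sketched as a single routing step at session start (function and type names are ours; the image block shape shown is Claude's, and other engines would need their own builder): anything whose browser-reported MIME starts with "image/" goes into the first user message as an image block, everything else stays on disk behind the attachment tools.

```typescript
interface Upload { filename: string; contentBase64: string; mimeType: string }

type ContentBlock =
  | { type: "text"; text: string }
  | { type: "image"; source: { type: "base64"; media_type: string; data: string } };

// Hypothetical hybrid router: images become vision-visible content blocks,
// text/code uploads remain tool-gated (written to sessions/<runId>/attachments/).
function routeUploads(prompt: string, uploads: Upload[]) {
  const blocks: ContentBlock[] = [{ type: "text", text: prompt }];
  const toolGated: Upload[] = [];
  for (const u of uploads) {
    if (u.mimeType.startsWith("image/")) {
      blocks.push({
        type: "image",
        source: { type: "base64", media_type: u.mimeType, data: u.contentBase64 },
      });
    } else {
      toolGated.push(u);
    }
  }
  return { firstMessage: blocks, toolGated };
}
```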
Scope when we pick this up
- Per-engine image content block formats:
  - Claude: {type: "image", source: {type: "base64", media_type, data}} — native.
  - OpenAI Responses: {type: "image_url", image_url: {url: "data:…"}}.
  - OpenAI Chat Completions: same image_url shape, model-dependent support.
  - Ollama: images: [base64] on the message — only vision models.
- Figure out how @anthropic-ai/claude-agent-sdk's query() accepts image-bearing messages (it may only take strings today).
- Client: capture file.type at drop/upload.
- Server: persist the MIME type per attachment and make it available to list_attachments.
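The per-engine dispatch above could collapse into one builder. The Claude and OpenAI block shapes below follow their documented APIs; the Engine union and function name are ours. Ollama is the odd one out: images ride on the message object itself rather than inside a content block, so its return value would be merged into the message instead of appended to a block array.

```typescript
type Engine = "claude" | "openai" | "ollama";

// Hypothetical builder producing the engine-specific image payload.
function imagePayload(engine: Engine, mediaType: string, base64: string): unknown {
  switch (engine) {
    case "claude":
      return { type: "image", source: { type: "base64", media_type: mediaType, data: base64 } };
    case "openai":
      // Same shape for Responses and Chat Completions; support is model-dependent.
      return { type: "image_url", image_url: { url: `data:${mediaType};base64,${base64}` } };
    case "ollama":
      // Merged into the message: { role, content, images: [base64] }. Vision models only.
      return { images: [base64] };
  }
}
```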
Out of scope for this issue
- PDF/DOCX text extraction (separate reader pipeline, different issue).
- Audio/video support (separate issue if we ever need it).
Links
- Discussed in session 2026-04-15 during chat drag-drop work on rc/1.0.10.