Image/vision support for chat attachments #56

@BeinerChes

Description

Context

Chat attachments (shipped in rc/1.0.10) save dropped files to sessions/<runId>/attachments/ and expose them to every agent via three auto-injected tools: list_attachments, read_attachment, grep_attachment. Text and code work; images do not — the tools are text-only, so when an agent calls read_attachment on a PNG it either gets garbled bytes or declines to try.
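A minimal sketch (not the shipped tool code) of why a text-only read_attachment can't serve a PNG: image bytes aren't valid UTF-8, so any text decode turns them into U+FFFD replacement characters — exactly the "garbled bytes" an agent sees today.

```typescript
// The PNG magic header, as read_attachment would pull it off disk.
const pngHeader = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// A text-only tool has no choice but to decode as UTF-8. 0x89 is an
// invalid UTF-8 lead byte, so the decoder emits U+FFFD instead.
const asText = new TextDecoder("utf-8").decode(pngHeader);
console.log(asText.includes("\uFFFD")); // true — garbage, not something a model can "see"
```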

Gap

Vision-capable models (Claude, most OpenAI, Ollama vision variants like llama3.2-vision / llava) can see images — but only when the image arrives as a proper image content block in the model's input, not as base64 text in a tool result. Today we have no path for that.

What we discussed

Three things we want to avoid:

  1. Magic-byte sniffing server-side — same maintenance trap as the MIME extension table we already tore out. Every new format (AVIF, HEIC, …) would force a sniffer update.
  2. Extension tables — same smell.
  3. Inline base64 in tool results — LLMs can't "see" a base64 string. Useless for vision.

Proposal on the table (deferred)

Option A — Browser-provided MIME + two explicit tools.

  • Client sends {filename, contentBase64, mimeType: file.type} on upload. Browser has already set file.type (e.g. "image/png"); we just pass it through.
  • Server stores the mime alongside the file (per-file sidecar or run-level manifest).
  • list_attachments includes the mime hint so the agent knows what it's looking at:
    - screenshot.png (480103 bytes, image/png)
    - foo.py (1234 bytes, text/x-python)
    
  • Two tools, agent picks:
    • read_attachment(filename, offset?, limit?) → UTF-8 text content block.
    • view_attachment(filename) → image content block for vision-capable engines (Claude, OpenAI Responses), clean error for non-vision paths.

Zero format tables on our side, zero magic bytes. Agent dispatches based on filename/mime hint.
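To make Option A concrete, here is a hypothetical sketch of the payload/manifest shapes and the agent-side dispatch. The interface and function names are assumptions for illustration, not shipped API:

```typescript
// What the client would send on upload — mimeType is just the browser's
// file.type, passed through untouched.
interface AttachmentUpload {
  filename: string;
  contentBase64: string;
  mimeType: string; // e.g. "image/png", set by the browser
}

// What the server would persist and surface via list_attachments.
interface AttachmentEntry {
  filename: string;
  bytes: number;
  mimeType: string;
}

// Agent-side dispatch: no magic bytes, no extension table — the
// browser-provided mime hint alone decides which tool to call.
function pickTool(entry: AttachmentEntry): "view_attachment" | "read_attachment" {
  return entry.mimeType.startsWith("image/") ? "view_attachment" : "read_attachment";
}

console.log(pickTool({ filename: "screenshot.png", bytes: 480103, mimeType: "image/png" }));
// → "view_attachment"
```

The dispatch stays trivial because all format knowledge lives in the browser, which already had to classify the file to produce file.type.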

Option B — Inject images directly into the first user message, bypassing tools for images entirely. Closer to how ChatGPT/Claude.ai handle uploads. Requires per-engine message-building changes (each engine has a different image block shape).

A hybrid of A+B is also viable: images go through the first-message injection (for immediate vision); text/code stays tool-gated.

Scope when we pick this up

  • Per-engine image content block formats:
    • Claude: {type: "image", source: {type: "base64", media_type, data}} — native.
    • OpenAI Responses: {type: "input_image", image_url: "data:…"} in a user message's content array.
    • OpenAI Chat Completions: {type: "image_url", image_url: {url: "data:…"}}, model-dependent support.
    • Ollama: images: [base64] on the message — only vision models.
  • Figure out how @anthropic-ai/claude-agent-sdk's query() accepts image-bearing messages (may only take strings today).
  • Client: capture file.type at drop/upload.
  • Server: persist mime per attachment, make available to list_attachments.
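The per-engine shapes above could funnel through a single adapter. A sketch, assuming the shapes listed (the Claude and Chat Completions formats match their public APIs; treat the Responses and Ollama shapes as assumptions to verify when this is picked up):

```typescript
type Engine = "claude" | "openai-responses" | "openai-chat" | "ollama";

// Build one engine's image content block from a stored attachment's
// mime type and base64 payload.
function imageBlock(engine: Engine, mediaType: string, base64: string): unknown {
  switch (engine) {
    case "claude":
      return { type: "image", source: { type: "base64", media_type: mediaType, data: base64 } };
    case "openai-responses":
      return { type: "input_image", image_url: `data:${mediaType};base64,${base64}` };
    case "openai-chat":
      return { type: "image_url", image_url: { url: `data:${mediaType};base64,${base64}` } };
    case "ollama":
      // Ollama wants the raw base64 string pushed into the message's
      // `images` array, so the adapter returns just the string.
      return base64;
  }
}
```

Whether this lives behind view_attachment (Option A) or in first-message construction (Option B), the engine-specific knowledge stays in one place.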

Out of scope for this issue

  • PDF/DOCX text extraction (separate reader pipeline, different issue).
  • Audio/video support (separate issue if we ever need it).

Links

  • Discussed in session 2026-04-15 during chat drag-drop work on rc/1.0.10.

Metadata

    Labels: enhancement (New feature or request)