Agentic Automation on iOS — Winning Team @ Claude Builder Hackathon 2026

A voice-controlled AI agent that operates a real iPhone — built on Claude.

MIT License   3rd Place @ Claude Builder Hackathon   Research Platform — click to read RESEARCH.md   Cite this work

🏆 Winning team at the Claude Builder Hackathon — UCLA, April 2026. 📄 Also an open research platform for agentic automation on physical iOS (and the iOS Simulator) — built for other researchers to extend. See RESEARCH.md for contributions, findings, and citation info.

Hold the Action Button, speak a task, and watch Claude take over your phone — narrating every step out loud so blind and low-vision users can use any iOS app.

Agentic Automation on iOS — built on Claude


The Problem

iOS accessibility hasn't caught up with the AI era.

  • 2.2 billion people globally have a vision impairment
  • 43 million people are blind worldwide
  • Only ~5% of mobile apps fully meet WCAG accessibility guidelines

Sources: World Health Organization · Accessibility Works (2025)

VoiceOver helps, but it's a screen reader — it doesn't do things for you. Siri can do things, but only inside Apple's own apps and only as single-step commands. There's no agent on iOS that can look at any app, reason about it, and complete a multi-step task on your behalf.

We built one.

Why this matters even post–Siri 2.0

Apple's automation story is tied to apps opting in. Siri today works reliably inside Apple-native apps (Messages, Calendar, Maps); support for third-party apps depends on developers shipping AppIntents / the rumored App Intents MCP surface. That's a multi-year ramp — and for every small or long-tail app that never ships one, Siri can't help at all.

A vision-based agent that can look at any screen and act closes that gap immediately. It doesn't wait for developers to adopt anything. It works today on LinkedIn, CVS, your bank's app, and whatever else is already on the user's phone. When Siri 2.0 ships, this same capability becomes the fallback for the long tail Siri won't cover on day one.


What We Built

Press and hold the iPhone Action Button, speak a command — "send my grandson a message," "refill my prescription on the CVS app," "check my calendar for tomorrow" — and Claude takes over.

Holding the iPhone Action Button to speak a task

The interaction: press and hold, speak the task, release.

  • Claude sees your screen every step (vision + UI hierarchy)
  • Claude picks the next action from 20+ tools (tap, type, scroll, swipe, drag, openApp, askUser, …)
  • AVSpeechSynthesizer narrates what's happening so blind users hear exactly what the agent is doing
  • Dynamic Island shows live progress with a stop button
  • Human-in-the-loop: when Claude is uncertain (e.g., two contacts named Kenny), it pauses and asks via the Dynamic Island

It works on any installed app — not just Apple's. That's the unlock.

Dynamic Island — live agent progress with step count and stop button     Human-in-the-loop — Claude asks before sending

Left: Dynamic Island shows real-time progress — current step, phase, elapsed time, and a live stop button. Right: Human-in-the-loop — Claude pauses and asks before any irreversible action (send a message, place a call, delete data).


Demo

Demo 1 — "Open my latest recruiter message on LinkedIn"

4 steps, 21.6s end-to-end. Claude opens LinkedIn, navigates to Messages, identifies the most recent recruiter thread (a Capital One InMail from Justin Dryer), and opens it.

Watch the demo (opens in browser)

Every task is logged on-device. The companion app shows the run as a card — duration, step count, model, and the full agent logs.

Companion app showing completed task card with agent logs

Task card in the companion app: 4 steps · 21.6s · claude-sonnet-4-6 · 47 lines of agent logs. Tap a run to replay Claude's reasoning step by step.

Agent log — click to expand
--- Step 1/25 (0.0s total) ---
[AI] Sending to claude-sonnet-4-6...
[AI] Responded in 2.3s | Tokens: 1987 in / 86 out | Cache: 0 read / 4576 write
[AI] I can see the Mobile Agent app is already processing the task "Open my latest recruiter message on LinkedIn". Let me open LinkedIn directly to complete this task.
[Tool] openApp({"appName":"LinkedIn"})
[Tool] → Opened LinkedIn (1.5s)
[Auto-capture] Screenshot queued after openApp
[Auto-UI] 13 elements bundled with screenshot
[Screenshot] OK (553 KB, 0.6s)
[Step 1] 4.4s (AI: 2.3s | Tools: 1.5s | Screenshot: 0.6s)

--- Step 2/25 (4.9s total) ---
[AI] Sending to claude-sonnet-4-6...
[AI] Responded in 3.2s | Tokens: 2321 in / 117 out | Cache: 4576 read / 0 write
[AI] LinkedIn is open and loading. I can see the Messaging icon in the top right. Let me tap on it to access messages.
[Tool] tap({"x":90,"y":9,"description":"Tap Messaging icon"})
[Tool] → Tapped (90%, 9%) - Tap Messaging icon (0.5s)

--- Step 3/25 (10.2s total) ---
[AI] Sending to claude-sonnet-4-6...
[AI] Responded in 5.0s | Tokens: 4043 in / 218 out | Cache: 4576 read / 0 write
[AI] I can see the LinkedIn messaging inbox. I need to find the latest recruiter message. Looking at the messages, the most recent recruiter/InMail messages are:
1. Justin D. (Apr 6) - InMail from Capital One ...
[Tool] tap({"x":50,"y":31,"description":"Tap on Justin D. recruiter message from Capital One"})

--- Step 4/25 (17.1s total) ---
[AI] Responded in 4.5s | Tokens: 4765 in / 203 out | Cache: 4576 read / 0 write
[AI] The task is complete! ... conversation with Justin D. (Justin Dryer), a Students & Grads Tech Recruiter
[Tool] taskComplete({"summary":"Opened the latest recruiter message on LinkedIn — an InMail from Justin Dryer ..."})

✅ TASK COMPLETED
  Total time:    21.6s
  Steps:         4
  AI time:       15.0s (69%)
  Tool time:     2.5s (12%)
  Screenshot:    1.9s (9%)

Notice Cache: 4576 read / 0 write on steps 2-4 — prompt caching means the 5k-token system prompt + tool schemas don't re-bill per step. That's why a 4-step task costs ~2k uncached input tokens instead of ~20k.
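To make the caching setup concrete, here is a minimal sketch of how a cacheable per-step request body could be assembled (illustrative, not the repo's actual agent.mjs code; function and variable names are ours):

```javascript
// Sketch: assemble an Anthropic Messages API request body where the static
// prefix (system prompt + tool schemas) is marked cacheable. After step 1,
// steps read that prefix from cache; only the fresh screenshot and short
// delta are billed as uncached input tokens.
function buildStepRequest(systemPrompt, tools, history, screenshotBase64) {
  return {
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    tools, // tool schemas sit before the cache breakpoint, so they cache too
    system: [
      {
        type: "text",
        text: systemPrompt,
        cache_control: { type: "ephemeral" }, // cache breakpoint after the static block
      },
    ],
    messages: [
      ...history,
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: "image/jpeg", data: screenshotBase64 },
          },
          { type: "text", text: "Current screen. Pick the next tool." },
        ],
      },
    ],
  };
}
```

Only the trailing messages change between steps, which is why the log above shows `Cache: 4576 write` on step 1 and `Cache: 4576 read` afterwards.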

Demo 2 — "Tell popcorn man I'm running late"

Same agent, different unlock: memory. Claude has never explicitly been told who "popcorn man" is — but a prior conversation stored "popcorn guy" = Zachary Chung in memories/user.md, which loads into the system prompt at launch. So the agent resolves the nickname on step 1 without asking.

Agent log — memory in action
--- Step 1/25 (0.0s total) ---
[AI] Sending to claude-sonnet-4-6...
[AI] Responded in 3.6s | Tokens: 1995 in / 105 out | Cache: 0 read / 4576 write
[AI] I can see the task "Tell popcorn man I'm running late" is already being processed by the Mobile Agent.
    From memory, "popcorn guy" = Zachary Chung. Let me open Messages to send this directly.
[Tool] openApp({"appName":"Messages"})

--- Step 2/25 (6.2s total) ---
[AI] I can see Zachary's conversation is already open. Let me type and send the message.
[Tool] typeAndSubmit({"elementText":"iMessage","text":"Hey, I'm running late!","submitKey":"send"})
[Tool] → Message sent: "Hey, I'm running late!" — the send button was tapped automatically.

--- Step 6/25 (35.0s total) ---
[AI] The message "I'm running late" has been successfully sent to Zachary Chung (popcorn guy) and shows as Delivered.
[Tool] taskComplete(...)

✅ TASK COMPLETED
  Total time:    38.2s · 6 steps

No askUser prompt, no clarification round-trip. Memory is what turns "text my friend" into a command instead of a conversation.
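The injection itself is simple. A sketch of the idea (illustrative; `buildSystemPrompt` is our name, not agent.mjs's): the memory file rides along inside the XML-tagged system prompt, and the model does the fuzzy matching ("popcorn man" ≈ "popcorn guy") at inference time.

```javascript
// Sketch: inject the contents of memories/user.md into the system prompt as
// an XML-tagged section, matching the repo's XML-tagged prompt style. No
// deterministic nickname matching is needed — the model resolves aliases.
function buildSystemPrompt(basePrompt, memoryMarkdown) {
  return `${basePrompt}\n\n<MEMORY>\n${memoryMarkdown}\n</MEMORY>`;
}
```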

Demo 3 — Drag-and-drop

Most mobile agents support tap, type, and scroll. Drag-and-drop is the long tail — games, reordering cards, Wordle-style puzzles, Notion-style interfaces. All of those need sustained touch + controlled movement + release, not a flick. We exposed drag as a first-class tool with configurable hold and drag durations so Claude can handle these interactions the way a finger would.

Watch the drag demo — Claude picks up a card, moves it across the screen, and releases at a target slot.

The tool:

drag({
  startX, startY,    // pick-up point (percentages 0-100)
  endX,   endY,      // drop point
  holdDuration: 0.5, // seconds to long-press before moving
  dragDuration: 1.0  // seconds for the movement itself
})

This isn't a flick (swipe) or a tap — it's a touch-down → wait → translate → touch-up sequence, which unlocks any iOS interaction that needs drag gestures.
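A sketch of how such a call could be planned (illustrative; `planDrag` and the phase names are ours, not the repo's): percentage coordinates map to device points, and the hold/drag durations define a touch-down → wait → translate → touch-up timeline.

```javascript
// Sketch: turn a drag tool call (percentage coordinates, 0-100) into absolute
// device points and a four-phase touch timeline.
function planDrag({ startX, startY, endX, endY, holdDuration = 0.5, dragDuration = 1.0 }, screen) {
  const pt = (xPct, yPct) => ({
    x: Math.round((xPct / 100) * screen.width),
    y: Math.round((yPct / 100) * screen.height),
  });
  return [
    { phase: "down", at: pt(startX, startY), t: 0 },
    { phase: "hold", t: holdDuration },                      // long-press so iOS registers pick-up
    { phase: "move", to: pt(endX, endY), t: holdDuration + dragDuration },
    { phase: "up", t: holdDuration + dragDuration },         // release at the drop point
  ];
}
```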

Demo 4 — Drawing a triangle in Freeform (first-time app launch)

The previous drag demo shows object manipulation. This one shows gestural composition on a blank canvas — and handles the kind of friction a real user hits on day one: a brand-new app they've never opened.

Watch the triangle demo — autonomous drawing in Freeform on a physical iPhone.

What happens: Claude gets the task "Draw a triangle in Freeform," discovers Freeform isn't on the home screen, falls back to Spotlight search, opens the app, and then — because this is the user's first time opening Freeform — walks through three Apple onboarding screens on its own:

  1. Welcome screen → dismissed with Continue
  2. "What's New in Freeform" → dismissed with Continue
  3. iCloud sync prompt → dismissed with Not Now

Then it selects the Pen tool, composes three sequential swipes into a triangle, and Freeform's shape recognition straightens the strokes into a clean shape.

  • Navigate (steps 1-3): Spotlight search → tap Freeform top hit
  • Clear onboarding (steps 4-6): 3 Apple modals dismissed autonomously
  • Select tool (steps 7-9): Pen tool + dismiss weight/opacity popup
  • Compose triangle (steps 10-12): 3 swipe strokes; explicit 3-stroke plan announced at step 10
  • Verify + finish (step 13): tap Done

13 steps, 88 s. Full reasoning trace (every step's verbatim thought + HITL decision analysis + drag-primitive improvement ideas): benchmarks/demos/triangle-freeform-2026-04-21.md.

Honest caveat: the agent's final taskComplete call didn't fire due to an API quota interruption — the triangle was drawn on the canvas, but the structured log entry was lost. The demonstration of the gestural primitive itself is unaffected.

Why this demo matters beyond the hackathon: existing mobile GUI agents (Mobile-Agent, AppAgent, Mobile-Agent-v2) operate on discrete actions and target mostly-warmed-up Android emulators. A continuous-trajectory drag/swipe primitive on a physical iPhone — navigating first-launch onboarding friction — is genuinely new territory. See RESEARCH.md for the full framing.


How We Built It

Architecture — iPhone (Action Button, XCTest Runner, Companion App, Dynamic Island) ↔ MacBook (server.mjs, agent.mjs, Maestro Bridge)

Two processes on the Mac, one wired iPhone. Action Button POSTs a task; agent.mjs drives Claude; Maestro Bridge pipes taps and screenshots over USB; Companion App polls status and paints the Dynamic Island.

Stack

  • Node.js agent loop (agent.mjs)
  • Claude Sonnet 4.6 — vision + tool calling (the brain)
  • XCTest HTTP — taps, types, scrolls on the physical device
  • Maestro + JVM — USB bridge to XCTest runner
  • Swift companion app — Dynamic Island, AVSpeechSynthesizer narration
  • Action Button + iOS Shortcuts — voice trigger

How We're Using Claude

Claude isn't generating text here — it's driving the phone. The agent exposes ~22 tools to Claude; each step, Claude picks one.

Tool inventory (grouped by purpose)

  • 👆 Touch / Input: tap · tapText · inputText · pressKey · typeAndSubmit · scroll · swipe · drag · hideKeyboard
  • 👁 Vision: takeScreenshot · getUIElements · zoomAndTap¹
  • 🧠 Memory: recallMemory · recallHistory · saveMemory
  • 🌐 Web: webSearch
  • ⚡ iOS Shortcuts: openApp² · searchMaps · getDirections · googleSearch · openURL · composeMessage · setAppearance · setLocation
  • 🎛 Control: askUser · taskComplete · taskFailed

¹ Only loaded when --grounding zoomclick is set — zooms into a region for precise tapping.
² Works on any installed app, not a fixed list. The agent builds the app map at launch.

The tool-calling loop

Every step is one Claude API call. Screenshots are auto-captured after every action tool (tap, type, scroll, drag) — Claude never has to explicitly call takeScreenshot mid-task. After step 1, the system prompt + tool schemas come from cache — only the fresh screenshot + short delta are billed.

Tool-calling loop — screenshot + getUIElements → Claude picks a tool → auto-capture → prompt cache → repeat

No hard-coded flows. Every task is solved live by Claude reasoning about what's on screen.
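The loop above can be sketched as follows (a minimal skeleton, not the real agent.mjs; `callModel`, `runTool`, and `screenshot` are stand-ins). The key shape: one model call per step, auto-screenshot after action tools, terminal tools end the loop.

```javascript
// Tools whose effects change the screen, so a fresh screenshot is auto-captured
// after them (the demo logs show openApp triggers auto-capture too).
const ACTION_TOOLS = new Set(["tap", "typeAndSubmit", "inputText", "scroll", "swipe", "drag", "openApp"]);
const TERMINAL_TOOLS = new Set(["taskComplete", "taskFailed"]);

async function runAgent(task, { callModel, runTool, screenshot, maxSteps = 25 }) {
  const context = { task, history: [], lastScreenshot: null };
  for (let step = 1; step <= maxSteps; step++) {
    const { tool, args } = await callModel(context);   // one Claude API call per step
    const result = await runTool(tool, args);
    context.history.push({ step, tool, args, result });
    if (TERMINAL_TOOLS.has(tool)) return { tool, result, steps: step };
    if (ACTION_TOOLS.has(tool)) {
      context.lastScreenshot = await screenshot();     // auto-capture; model never calls takeScreenshot mid-task
    }
  }
  return { tool: "taskFailed", result: "max steps reached", steps: maxSteps };
}
```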

The cognition layer (how Claude thinks)

Beyond the per-step loop, there's a longer-lived architecture feeding context into every call. Three memory types (semantic, episodic, procedural — the CoALA taxonomy), an XML-tagged system prompt, a rolling text summary that compresses history, and a prompt cache that keeps the static prefix hot.

Cognition layer — Memory (semantic · episodic · procedural) → System Prompt → Context Window (cached) → Claude → reasoning → next tool call, with stuck detection feedback
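The rolling-summary compression can be sketched like this (illustrative; the real agent.mjs implementation may differ): older image blocks in the message history are swapped for a one-line text placeholder, so only the latest screenshot stays in the context window.

```javascript
// Sketch: keep only the most recent screenshot; replace earlier image blocks
// with short text placeholders so token count stays flat as the task grows.
function compressHistory(messages) {
  const lastImageIdx = messages.findLastIndex(
    (m) => Array.isArray(m.content) && m.content.some((b) => b.type === "image")
  );
  return messages.map((m, i) => {
    if (i === lastImageIdx || !Array.isArray(m.content)) return m;
    const content = m.content.map((b) =>
      b.type === "image"
        ? { type: "text", text: "[older screenshot removed; replaced by rolling summary]" }
        : b
    );
    return { ...m, content };
  });
}
```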

Research backing (so the design isn't ad-hoc):

  • CoALA (arXiv:2309.02427) — the semantic / episodic / procedural split above
  • MemGPT (arXiv:2310.08560) — tiered memory: always-in-prompt vs retrieved-on-demand
  • SecAgent (arXiv:2603.08533) — the rolling text summary + single-image compression that gave us the 3.40× speedup

Why Claude specifically

  1. Prompt caching: cache_control: ephemeral on the system prompt + 20+ tool schemas. After step 1 the static block reads from cache — ~90% cheaper per step, and cache reads don't count toward ITPM.
  2. Most reliable tool calling — across 15+ step agent loops Claude stays consistent at picking the right tool and formatting arguments correctly.
  3. XML-tagged prompts are the documented idiom — our <RULES>, <IMPORTANT>, <STRATEGY> sections (in agent.mjs:1208) follow Anthropic's recommended pattern.
  4. Vision + tool calling in a single round-trip — screenshot in, tool call out, one API call.
  5. Long context that doesn't degrade — UI hierarchies + action history pile up; Claude stays sharp mid-conversation.

Optimizations

  • Prompt caching (system + tool schemas): ~90% cheaper per step after step 1
  • Rolling text summaries (single-image mode): older screenshots stripped and replaced with a 1-line recap, so token count stays flat as the task grows
  • Research-backed system prompt: XML-tagged sections, a forced screen-understanding chain (describe → read → identify → act), explicit parallel tool calling, iOS UI priors
  • Persistent Maestro bridge: the JVM boots once on port 6001; subsequent commands skip the ~8 s cold start
  • Stuck detection: the last 3 actions are scanned each turn; on a repeat, a recovery hint is injected so Claude doesn't burn turns retrying a failing tap
  • Memory injection: resolved ambiguities ("Kenny = Kenny Frias") persist in memories/user.md and load into the system prompt, so the same askUser question is never asked twice
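The stuck-detection pass can be sketched as follows (illustrative; agent.mjs's actual implementation may differ, and `stuckHint` is our name):

```javascript
// Sketch: if the last three actions are identical (same tool, same args),
// return a recovery hint to inject into the next turn instead of letting the
// model keep retrying a failing tap.
function stuckHint(history, window = 3) {
  if (history.length < window) return null;
  const recent = history.slice(-window).map((h) => `${h.tool}:${JSON.stringify(h.args)}`);
  const allRepeats = recent.every((sig) => sig === recent[0]);
  return allRepeats
    ? `You've repeated ${recent[0]} ${window} times with no visible progress. Try a different element, scroll, or ask the user.`
    : null;
}
```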

Measured findings (summary)

Six optimizations compose to make physical-iOS agent research practical:

  • Rolling summaries vs. full history: 3.40× speedup, ~30% fewer tokens per task
  • Direct HTTP to Maestro's XCTest runner vs. the Maestro CLI: 10–100× action-latency reduction on launch / press-key / tap. An AI agent issues per-step dynamic commands (it can't pre-write a YAML flow), so we bypass the CLI (which spawns a fresh JVM per call) and talk to Maestro's runner HTTP directly. The runner is Maestro's open-source work; the bypass-for-agent-workflow is ours.
  • Screenshot compression: ~50× payload reduction (8 MB Retina PNG → 150 KB JPEG) with an accuracy gain (higher resolution hurts VLM grounding, per ScreenSpot-Pro)
  • Iterative agent-loop optimization: 83 s / failed → 23.9 s / success on a representative task (3.5× speedup, 3× fewer steps)
  • Prompt caching: ~90% per-step cost reduction after step 1
  • No single bottleneck remains — time distribution of the optimized run: 31% AI inference · 31% tool execution · 30% screenshot · 8% overhead

➡ Full figures, per-optimization methodology, and reproducibility notes in RESEARCH.md.


Ethical Alignment

Who this is built for:

  • Blind & low-vision users: full iPhone control, eyes-free; AVSpeechSynthesizer narrates every step
  • Elderly users: no more wrestling with complex app navigation; just speak the task
  • Users with motor impairments: speak instead of tap
  • Cognitive accessibility: step-by-step narration reduces overwhelm

Potential Harms & Future Solutions

We've thought hard about what could go wrong:

  • Privacy (screenshots of your life go to a third-party AI): app blocklist · per-action consent · on-device redaction of sensitive UI
  • Identity (no voice ownership check): biometrics · passphrase · enrolled-voice verification
  • Reliability (the agent can act wrongly, and a blind user may not notice): pre- and post-action confirmation · undo window · narration of every destructive step

The human-in-the-loop askUser flow is the foundation: Claude pauses on ambiguous or destructive actions and surfaces options through the Dynamic Island.


Getting Started

Prerequisites

  • macOS with Xcode
  • Node.js 18+
  • Maestro CLI + OpenJDK
  • Anthropic API key
  • Physical iPhone 15+ (for Action Button + Dynamic Island), Developer Mode enabled

Install

git clone https://github.com/bryanrg22/ios-agent_automation.git
cd ios-agent_automation
npm install

# .env
echo "ANTHROPIC_API_KEY=your_key_here" > .env

# Maestro
curl -Ls "https://get.maestro.mobile.dev" | bash
brew install openjdk

Run on a physical iPhone (2 terminals)

# Terminal 1 — Maestro bridge over USB
export PATH="/opt/homebrew/opt/openjdk/bin:$PATH:$HOME/.maestro/bin"
maestro-ios-device --team-id YOUR_TEAM_ID --device YOUR_UDID

# Terminal 2 — Frontend server (Action Button + Dynamic Island)
node frontend/server.mjs --provider anthropic

Then hold the Action Button and speak. The companion app's Dynamic Island shows live progress and narrates each step.

Run on iOS Simulator (no iPhone required)

Two supported simulator paths — pick based on what you need:

Fast path (recommended) — direct HTTP to Maestro's XCTest runner inside the simulator. Matches the phone-path latency (~50–400 ms per action).

# One-time: clone Maestro's runner source (1.8 MB, sparse checkout — runner subdir only)
mkdir -p vendor && cd vendor
git clone --depth 1 --filter=blob:none --sparse https://github.com/mobile-dev-inc/Maestro.git
cd Maestro && git sparse-checkout set maestro-ios-xctest-runner
cd ../..

# Terminal 1 — launch the runner (auto-boots a simulator, builds on first run)
./scripts/start-sim-runner.sh

# Terminal 2 — run a task
node agent.mjs "open Settings and tap Wi-Fi" --backend direct-http-sim

Baseline path (Maestro CLI, slow — kept for the speedup comparison):

# Just run the agent — no separate runner needed; Maestro MCP spins up on demand
node agent.mjs "open Settings and tap Wi-Fi"

The fast and baseline simulator paths produce comparable results for step counts, tokens, and success rates. Wall-clock time differs by ~10–100× per action (that's the whole point — see RESEARCH.md §2).

App coverage on the simulator vs. the phone. The iOS Simulator only includes whatever apps ship pre-installed in its runtime image — a curated Apple subset (~16 apps on iOS 26: Settings, Maps, Calendar, Messages, Safari, Photos, Reminders, Shortcuts, etc.). There is no App Store on simulators (Apple doesn't bundle it), so third-party apps (Reddit, Instagram, your bank) cannot be installed via the App Store on a sim — they have to be added via xcrun simctl install from a local .app build or .ipa. Some first-party apps (Freeform, Notes, Stocks, Voice Memos, Translate, etc.) may also be missing from the runtime depending on the iOS version, and they're not redistributable.

The physical iPhone path has no such limitation: every app already installed on your phone is available to the agent immediately, with no extra setup. If your research or task targets real-world third-party apps — or any first-party app missing from the simulator runtime — run on the phone. Use the simulator for the pre-installed app subset, gesture primitives that don't depend on a specific app, and fast iteration during development.

Action Button shortcut setup

The iPhone Action Button needs a one-time Shortcut to capture your voice and POST it to the server.

iPhone Settings — Action Button mapped to Shortcut     Shortcut — Dictate Text + POST to agent server

Left: In Settings → Action Button, assign the Shortcut. Right: The Shortcut runs Dictate Text → Get Contents of URL (POST) to http://<mac-ip>:8000/task.
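For reference, this is the shape of the request the Shortcut sends (a sketch: the /task route comes from frontend/server.mjs's route list, but the exact body field name is an assumption, and `buildTaskRequest` is our helper, not the repo's):

```javascript
// Sketch: the POST the Action Button Shortcut fires at the Mac. The dictated
// text travels as the task payload; the server's agent loop takes it from there.
function buildTaskRequest(macIp, spokenText) {
  return {
    url: `http://${macIp}:8000/task`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ task: spokenText }), // field name assumed for illustration
    },
  };
}
```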

Run a one-shot task from the terminal

node agent.mjs "open Spotify and play lofi beats" --phone --provider anthropic

Run the benchmark harness (matrix runs across tasks × models × seeds)

benchmarks/run.mjs spawns agent.mjs once per cell, pairs the agent's self-reported success with an independent evaluator (CLIP similarity / hierarchy_contains / game_specific / manual), and appends one row per run to benchmarks/results/<timestamp>.csv. Tasks live in benchmarks/tasks.json.

On the iOS Simulator (fast path):

# Terminal 1 — start the XCTest runner (same as for agent.mjs --backend direct-http-sim)
./scripts/start-sim-runner.sh

# Terminal 2 — run a single task
node benchmarks/run.mjs \
  --tasks nav_maps_search_usc \
  --seeds 1 \
  --models anthropic:claude-sonnet-4-6 \
  --backend direct-http-sim

# Or a full category × multiple models × multiple seeds
node benchmarks/run.mjs \
  --tasks navigation \
  --seeds 3 \
  --models anthropic:claude-sonnet-4-6,openai:gpt-5.4,gemini:gemini-2.5-flash-lite \
  --backend direct-http-sim

On a physical iPhone:

# Terminal 1 — Maestro bridge over USB (same as for agent.mjs --phone)
export PATH="/opt/homebrew/opt/openjdk/bin:$PATH:$HOME/.maestro/bin"
maestro-ios-device --team-id YOUR_TEAM_ID --device YOUR_UDID

# Terminal 2 — run a drawing task (CLIP-scored)
node benchmarks/run.mjs \
  --tasks draw_triangle_freeform \
  --seeds 1 \
  --models anthropic:claude-sonnet-4-6 \
  --phone

# Or the full manifest × 3 models × 3 seeds (matrix mode)
node benchmarks/run.mjs \
  --tasks all \
  --seeds 3 \
  --models anthropic:claude-sonnet-4-6,openai:gpt-5.4,gemini:gemini-2.5-flash-lite \
  --phone

Useful flags:

  • --tasks: task IDs (comma-separated), a category name, or all. Required.
  • --models: provider:model pairs, comma-separated. Default: anthropic:claude-sonnet-4-6.
  • --seeds: independent runs per cell (for variance). Default: 1.
  • --phone: run on a physical iPhone (else simulator). Default: off.
  • --backend: maestro-cli (sim baseline, slow) · direct-http-sim (sim fast) · phone-direct (= --phone). Default: auto.
  • --grounding: baseline · grid · zoomclick. Default: baseline.
  • --agent-mode: baseline · single-image · vision-gated. Default: single-image.
  • --max-steps-override: override the per-task max_steps from the manifest. Default: task-defined.
  • --dry-run: print the matrix without spawning runs. Default: off.
  • --stop-on-failure: halt after the first failure (debugging). Default: off.

Output: every run appends one row to benchmarks/results/run-<ISO_TIMESTAMP>.csv. Each row contains the agent's self-reported success / steps / tokens / time, the evaluator's verdict and score, the post-run screenshot path, and any error string.

Notes:

  • Drawing tasks need CLIP: pip install open-clip-torch torch pillow (one-time).
  • App availability differs between sim and phone: drawing tasks (Freeform), gesture-game tasks (Clash Royale, etc.), and any third-party app tasks are phone-only — the iOS Simulator only includes pre-installed Apple apps. See the "App coverage" note in the simulator section above.
  • For --backend direct-http-sim, the start-sim-runner.sh terminal must stay open for the full run.
  • Use --dry-run to size the matrix before long runs — --tasks all --seeds 5 --models a,b,c is (tasks × 5 × 3) cells and can run for hours.

Repository Layout

agent.mjs                     Tool-based agent loop (Claude tool calling)
frontend/server.mjs           HTTP server: /task, /status, /respond, /stop
ios/MobileAgentCompanion/     Swift companion app + Dynamic Island Live Activity
memories/                     Persistent semantic memory (CoALA-style)
logs/tasks.jsonl              Episodic memory — every task run
docs/                         Deep dives on individual subsystems
benchmarks/demos/             Per-demo run records with reasoning traces
benchmarks/plots/             Deterministic plot-regeneration scripts
public/                       Images, videos, generated figures
CLAUDE.md                     Instructions for Claude Code in this repo
RESEARCH.md                   Research deep-dive — paper framing, citation, reproducibility
CITATION.cff                  GitHub citation metadata
LICENSE                       MIT

For Researchers

This project is an open research platform for agentic automation on physical iOS and iOS Simulator — built so other researchers can extend and advance the field. Early results from our own work already show a 3.40× speedup from rolling-summary context compression and a 10–100× action-latency reduction for agent-driven dynamic commands. We're just getting started.

If you're here for the research angle, not the hackathon, jump to:

  • RESEARCH.md — contributions, reproducibility steps, related work, research directions we're exploring
  • benchmarks/demos/ — per-demo run records with full reasoning traces (e.g., the triangle demo's verbatim agent thoughts + HITL decision analysis)
  • benchmarks/plots/plot_optimizations.py — deterministic regeneration of every figure in this README
  • docs/ — subsystem deep-dives (memory architecture, loop optimization, screen understanding, procedural memory, etc.)
  • CITATION.cff — machine-readable citation metadata; GitHub auto-renders a "Cite this repository" button
  • CONTRIBUTING.md — how to add benchmark tasks, evaluators, and open PRs

How to cite: see RESEARCH.md#how-to-cite-this-work. Short form — cite the repo via CITATION.cff.

Platform support (v0.1)

We're honest about the scope:

  • iPhone (physical device): supported. Primary target; full Dynamic Island HITL UI via MobileAgentCompanion.
  • iOS Simulator: supported. Fast path via --backend direct-http-sim; Apple-bundled apps only (no App Store).
  • iPad: ⚠️ partial. The agent core works, but the companion app's HITL UI is iPhone-only (iPads don't have a Dynamic Island). Full iPad HITL (modal / alert-based fallback) is planned for v0.2.
  • Mac Catalyst: not a target.
  • visionOS / watchOS: not a target.

If you need iPad today: the agent loop (agent.mjs + Maestro bridge) runs against an iPad fine — it's the same XCTest surface. You just won't get the Dynamic Island companion HITL UI; you'd need to use the terminal-driven askUser path in agent.mjs instead. If you're interested in contributing an iPad-native HITL UI, see CONTRIBUTING.md — this is an explicit good-first-contribution area.

Customizing which tools the agent has access to

Not every researcher has API keys for every optional tool. Disable specific tools at startup:

# Run without web search (no BRAVE_API_KEY needed):
node agent.mjs "search for USC on Maps" --backend direct-http-sim --disable-tools webSearch

# Disable multiple tools (e.g., for ablation studies):
node agent.mjs "draw a circle in Freeform" --phone --disable-tools webSearch,googleSearch,recallMemory

Do NOT disable the terminal flow-control tools (taskComplete, taskFailed) — the agent loop needs them to terminate.
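A sketch of how such a filter could be applied while protecting the flow-control tools (illustrative; `applyDisabledTools` is our name, not agent.mjs's actual code):

```javascript
// Sketch: apply a --disable-tools style comma-separated list to the tool
// inventory, refusing to drop the terminal tools the loop needs to stop.
const PROTECTED = new Set(["taskComplete", "taskFailed"]);

function applyDisabledTools(tools, disabledCsv) {
  const disabled = new Set(
    (disabledCsv ?? "").split(",").map((s) => s.trim()).filter(Boolean)
  );
  for (const name of disabled) {
    if (PROTECTED.has(name)) throw new Error(`Cannot disable flow-control tool: ${name}`);
  }
  return tools.filter((t) => !disabled.has(t.name));
}
```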


Built On (Open-Source Acknowledgments)

This platform is built on top of the work of others. To be explicit about what we use vs. what we contribute:

  • Maestro (mobile.dev) — open-source mobile UI testing framework. The XCTest runner (maestro-ios-xctest-runner) that exposes HTTP endpoints on the device/simulator is Maestro's. We talk to it directly, bypassing the Maestro CLI, for the 10–100× action-latency win on agent-driven dynamic workflows.
  • DeviceLab maestro-ios-device — community binary that builds Maestro's XCTest runner for a physical iOS device and port-forwards localhost:6001 → device :22087 over USB (usbmuxd). This is the bridge our --phone mode relies on. Built from Maestro PR #2856.
  • Anthropic Claude — the reasoning brain (prompt caching, tool use, vision).
  • Apple XCUITest / XCTest — the underlying iOS automation APIs the Maestro runner wraps.

What's ours: the AI agent loop (agent.mjs), the on-device Swift agent, the drag primitive used as an autonomous agent tool, the gestural task benchmark, the Dynamic Island agent UX, the CoALA-style memory stack applied to mobile agents, the prompt-caching + screenshot-compression + rolling-summary optimization pipeline, and the cross-model LLM comparison harness.


The Team

  • Bryan Ramirez-Gonzalez: AI Researcher @ USC, NVIDIA
  • Kenny Frias: AI Researcher @ Columbia University
  • Zachary Chung: Network / Systems Researcher @ USC
  • Victoria Rojas: Hardware & Systems Researcher @ USC

The team on stage accepting 3rd place at the Claude Builder Hackathon

🥉 3rd Place — Claude Builder Hackathon, UCLA, April 2026. Presenting Agentic Automation on iOS.

Built at UCLA · April 2026 · for the Claude Builder Hackathon.
