Agentic Automation on iOS — Winning Team @ Claude Builder Hackathon 2026

A voice-controlled AI agent that operates a real iPhone — built on Claude.

MIT License   3rd Place @ Claude Builder Hackathon   Research Platform — click to read RESEARCH.md   Cite this work

🏆 Winning team at the Claude Builder Hackathon — UCLA, April 2026. 📄 Also an open research platform for agentic automation on physical iOS (and the iOS Simulator) — built for other researchers to extend. See RESEARCH.md for contributions, findings, and citation info.

Hold the Action Button, speak a task, and watch Claude take over your phone — narrating every step out loud so blind and low-vision users can use any iOS app.

Agentic Automation on iOS — built on Claude


The Problem

iOS accessibility hasn't caught up with the AI era.

  • 2.2 billion people globally have a vision impairment
  • 43 million people are blind worldwide
  • Only ~5% of mobile apps fully meet WCAG accessibility guidelines

Sources: World Health Organization · Accessibility Works (2025)

VoiceOver helps, but it's a screen reader — it doesn't do things for you. Siri can do things, but only inside Apple's own apps and only as single-step commands. There's no agent on iOS that can look at any app, reason about it, and complete a multi-step task on your behalf.

We built one.

Why this matters even post–Siri 2.0

Apple's automation story is tied to apps opting in. Siri today works reliably inside Apple-native apps (Messages, Calendar, Maps); support for third-party apps depends on developers shipping AppIntents / the rumored App Intents MCP surface. That's a multi-year ramp — and for every small or long-tail app that never ships one, Siri can't help at all.

A vision-based agent that can look at any screen and act closes that gap immediately. It doesn't wait for developers to adopt anything. It works today on LinkedIn, CVS, your bank's app, and whatever else is already on the user's phone. When Siri 2.0 ships, this same capability becomes the fallback for the long tail Siri won't cover on day one.


What We Built

Press and hold the iPhone Action Button, speak a command — "send my grandson a message," "refill my prescription on the CVS app," "check my calendar for tomorrow" — and Claude takes over.

Holding the iPhone Action Button to speak a task

The interaction: press and hold, speak the task, release.

  • Claude sees your screen every step (vision + UI hierarchy)
  • Claude picks the next action from 20+ tools (tap, type, scroll, swipe, drag, openApp, askUser, …)
  • AVSpeechSynthesizer narrates what's happening so blind users hear exactly what the agent is doing
  • Dynamic Island shows live progress with a stop button
  • Human-in-the-loop: when Claude is uncertain (e.g., two contacts named Kenny), it pauses and asks via the Dynamic Island

It works on any installed app — not just Apple's. That's the unlock.

Dynamic Island — live agent progress with step count and stop button     Human-in-the-loop — Claude asks before sending

Left: Dynamic Island shows real-time progress — current step, phase, elapsed time, and a live stop button. Right: Human-in-the-loop — Claude pauses and asks before any irreversible action (send a message, place a call, delete data).


Demo

Demo 1 — "Open my latest recruiter message on LinkedIn"

4 steps, 21.6s end-to-end. Claude opens LinkedIn, navigates to Messages, identifies the most recent recruiter thread (a Capital One InMail from Justin Dryer), and opens it.

Watch the demo (opens in browser)

Every task is logged on-device. The companion app shows the run as a card — duration, step count, model, and the full agent logs.

Companion app showing completed task card with agent logs

Task card in the companion app: 4 steps · 21.6s · claude-sonnet-4-6 · 47 lines of agent logs. Tap a run to replay Claude's reasoning step by step.

Agent log — click to expand
--- Step 1/25 (0.0s total) ---
[AI] Sending to claude-sonnet-4-6...
[AI] Responded in 2.3s | Tokens: 1987 in / 86 out | Cache: 0 read / 4576 write
[AI] I can see the Mobile Agent app is already processing the task "Open my latest recruiter message on LinkedIn". Let me open LinkedIn directly to complete this task.
[Tool] openApp({"appName":"LinkedIn"})
[Tool] → Opened LinkedIn (1.5s)
[Auto-capture] Screenshot queued after openApp
[Auto-UI] 13 elements bundled with screenshot
[Screenshot] OK (553 KB, 0.6s)
[Step 1] 4.4s (AI: 2.3s | Tools: 1.5s | Screenshot: 0.6s)

--- Step 2/25 (4.9s total) ---
[AI] Sending to claude-sonnet-4-6...
[AI] Responded in 3.2s | Tokens: 2321 in / 117 out | Cache: 4576 read / 0 write
[AI] LinkedIn is open and loading. I can see the Messaging icon in the top right. Let me tap on it to access messages.
[Tool] tap({"x":90,"y":9,"description":"Tap Messaging icon"})
[Tool] → Tapped (90%, 9%) - Tap Messaging icon (0.5s)

--- Step 3/25 (10.2s total) ---
[AI] Sending to claude-sonnet-4-6...
[AI] Responded in 5.0s | Tokens: 4043 in / 218 out | Cache: 4576 read / 0 write
[AI] I can see the LinkedIn messaging inbox. I need to find the latest recruiter message. Looking at the messages, the most recent recruiter/InMail messages are:
1. Justin D. (Apr 6) - InMail from Capital One ...
[Tool] tap({"x":50,"y":31,"description":"Tap on Justin D. recruiter message from Capital One"})

--- Step 4/25 (17.1s total) ---
[AI] Responded in 4.5s | Tokens: 4765 in / 203 out | Cache: 4576 read / 0 write
[AI] The task is complete! ... conversation with Justin D. (Justin Dryer), a Students & Grads Tech Recruiter
[Tool] taskComplete({"summary":"Opened the latest recruiter message on LinkedIn — an InMail from Justin Dryer ..."})

✅ TASK COMPLETED
  Total time:    21.6s
  Steps:         4
  AI time:       15.0s (69%)
  Tool time:     2.5s (12%)
  Screenshot:    1.9s (9%)

Notice Cache: 4576 read / 0 write on steps 2-4 — prompt caching means the 5k-token system prompt + tool schemas don't re-bill per step. That's why a 4-step task costs ~2k uncached input tokens instead of ~20k.
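To make the caching setup concrete, here is a minimal sketch of how a cacheable per-step request body could be assembled (illustrative, not the repo's actual agent.mjs code; function and variable names are ours):

```javascript
// Sketch: assemble an Anthropic Messages API request body where the static
// prefix (system prompt + tool schemas) is marked cacheable. After step 1,
// steps read that prefix from cache; only the fresh screenshot and short
// delta are billed as uncached input tokens.
function buildStepRequest(systemPrompt, tools, history, screenshotBase64) {
  return {
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    tools, // tool schemas sit before the cache breakpoint, so they cache too
    system: [
      {
        type: "text",
        text: systemPrompt,
        cache_control: { type: "ephemeral" }, // cache breakpoint after the static block
      },
    ],
    messages: [
      ...history,
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: "image/jpeg", data: screenshotBase64 },
          },
          { type: "text", text: "Current screen. Pick the next tool." },
        ],
      },
    ],
  };
}
```

Only the trailing messages change between steps, which is why the log above shows `Cache: 4576 write` on step 1 and `Cache: 4576 read` afterwards.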

Demo 2 — "Tell popcorn man I'm running late"

Same agent, different unlock: memory. Claude has never explicitly been told who "popcorn man" is — but a prior conversation stored "popcorn guy" = Zachary Chung in memories/user.md, which loads into the system prompt at launch. So the agent resolves the nickname on step 1 without asking.

Agent log — memory in action
--- Step 1/25 (0.0s total) ---
[AI] Sending to claude-sonnet-4-6...
[AI] Responded in 3.6s | Tokens: 1995 in / 105 out | Cache: 0 read / 4576 write
[AI] I can see the task "Tell popcorn man I'm running late" is already being processed by the Mobile Agent.
    From memory, "popcorn guy" = Zachary Chung. Let me open Messages to send this directly.
[Tool] openApp({"appName":"Messages"})

--- Step 2/25 (6.2s total) ---
[AI] I can see Zachary's conversation is already open. Let me type and send the message.
[Tool] typeAndSubmit({"elementText":"iMessage","text":"Hey, I'm running late!","submitKey":"send"})
[Tool] → Message sent: "Hey, I'm running late!" — the send button was tapped automatically.

--- Step 6/25 (35.0s total) ---
[AI] The message "I'm running late" has been successfully sent to Zachary Chung (popcorn guy) and shows as Delivered.
[Tool] taskComplete(...)

✅ TASK COMPLETED
  Total time:    38.2s · 6 steps

No askUser prompt, no clarification round-trip. Memory is what turns "text my friend" into a command instead of a conversation.
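The injection itself is simple. A sketch of the idea (illustrative; `buildSystemPrompt` is our name, not agent.mjs's): the memory file rides along inside the XML-tagged system prompt, and the model does the fuzzy matching ("popcorn man" ≈ "popcorn guy") at inference time.

```javascript
// Sketch: inject the contents of memories/user.md into the system prompt as
// an XML-tagged section, matching the repo's XML-tagged prompt style. No
// deterministic nickname matching is needed — the model resolves aliases.
function buildSystemPrompt(basePrompt, memoryMarkdown) {
  return `${basePrompt}\n\n<MEMORY>\n${memoryMarkdown}\n</MEMORY>`;
}
```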

Demo 3 — Drag-and-drop

Most mobile agents support tap, type, and scroll. Drag-and-drop is the long tail — games, reordering cards, Wordle-style puzzles, Notion-style interfaces. All of those need sustained touch + controlled movement + release, not a flick. We exposed drag as a first-class tool with configurable hold and drag durations so Claude can handle these interactions the way a finger would.

Watch the drag demo — Claude picks up a card, moves it across the screen, and releases at a target slot.

The tool:

drag({
  startX, startY,    // pick-up point (percentages 0-100)
  endX,   endY,      // drop point
  holdDuration: 0.5, // seconds to long-press before moving
  dragDuration: 1.0  // seconds for the movement itself
})

This isn't a flick (swipe) or a tap — it's a touch-down → wait → translate → touch-up sequence, which unlocks any iOS interaction that needs drag gestures.
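A sketch of how such a call could be planned (illustrative; `planDrag` and the phase names are ours, not the repo's): percentage coordinates map to device points, and the hold/drag durations define a touch-down → wait → translate → touch-up timeline.

```javascript
// Sketch: turn a drag tool call (percentage coordinates, 0-100) into absolute
// device points and a four-phase touch timeline.
function planDrag({ startX, startY, endX, endY, holdDuration = 0.5, dragDuration = 1.0 }, screen) {
  const pt = (xPct, yPct) => ({
    x: Math.round((xPct / 100) * screen.width),
    y: Math.round((yPct / 100) * screen.height),
  });
  return [
    { phase: "down", at: pt(startX, startY), t: 0 },
    { phase: "hold", t: holdDuration },                      // long-press so iOS registers pick-up
    { phase: "move", to: pt(endX, endY), t: holdDuration + dragDuration },
    { phase: "up", t: holdDuration + dragDuration },         // release at the drop point
  ];
}
```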

Demo 4 — Drawing a triangle in Freeform (first-time app launch)

The previous drag demo shows object manipulation. This one shows gestural composition on a blank canvas — and handles the kind of friction a real user hits on day one: a brand-new app they've never opened.

Watch the triangle demo — autonomous drawing in Freeform on a physical iPhone.

What happens: Claude gets the task "Draw a triangle in Freeform," discovers Freeform isn't on the home screen, falls back to Spotlight search, opens the app, and then — because this is the user's first time opening Freeform — walks through three Apple onboarding screens on its own:

  1. Welcome screen → dismissed with Continue
  2. "What's New in Freeform" → dismissed with Continue
  3. iCloud sync prompt → dismissed with Not Now

Then it selects the Pen tool, composes three sequential swipes into a triangle, and Freeform's shape recognition straightens the strokes into a clean shape.

  • Navigate (steps 1-3): Spotlight search → tap Freeform top hit
  • Clear onboarding (steps 4-6): 3 Apple modals dismissed autonomously
  • Select tool (steps 7-9): Pen tool + dismiss weight/opacity popup
  • Compose triangle (steps 10-12): 3 swipe strokes; explicit 3-stroke plan announced at step 10
  • Verify + finish (step 13): tap Done

13 steps, 88 s. Full reasoning trace (every step's verbatim thought + HITL decision analysis + drag-primitive improvement ideas): benchmarks/demos/triangle-freeform-2026-04-21.md.

Honest caveat: the agent's final taskComplete call didn't fire due to an API quota interruption — the triangle was drawn on the canvas, but the structured log entry was lost. The demonstration of the gestural primitive itself is unaffected.

Why this demo matters beyond the hackathon: existing mobile GUI agents (Mobile-Agent, AppAgent, Mobile-Agent-v2) operate on discrete actions and target mostly-warmed-up Android emulators. A continuous-trajectory drag/swipe primitive on a physical iPhone — navigating first-launch onboarding friction — is genuinely new territory. See RESEARCH.md for the full framing.


How We Built It

Architecture — iPhone (Action Button, XCTest Runner, Companion App, Dynamic Island) ↔ MacBook (server.mjs, agent.mjs, Maestro Bridge)

Two processes on the Mac, one wired iPhone. Action Button POSTs a task; agent.mjs drives Claude; Maestro Bridge pipes taps and screenshots over USB; Companion App polls status and paints the Dynamic Island.

Stack

  • Node.js agent loop (agent.mjs)
  • Claude Sonnet 4.6 — vision + tool calling (the brain)
  • XCTest HTTP — taps, types, scrolls on the physical device
  • Maestro + JVM — USB bridge to XCTest runner
  • Swift companion app — Dynamic Island, AVSpeechSynthesizer narration
  • Action Button + iOS Shortcuts — voice trigger

How We're Using Claude

Claude isn't generating text here — it's driving the phone. The agent exposes ~22 tools to Claude; each step, Claude picks one.

Tool inventory (grouped by purpose)

  • 👆 Touch / Input: tap · tapText · inputText · pressKey · typeAndSubmit · scroll · swipe · drag · hideKeyboard
  • 👁 Vision: takeScreenshot · getUIElements · zoomAndTap¹
  • 🧠 Memory: recallMemory · recallHistory · saveMemory
  • 🌐 Web: webSearch
  • ⚡ iOS Shortcuts: openApp² · searchMaps · getDirections · googleSearch · openURL · composeMessage · setAppearance · setLocation
  • 🎛 Control: askUser · taskComplete · taskFailed

¹ Only loaded when --grounding zoomclick is set — zooms into a region for precise tapping.
² Works on any installed app, not a fixed list. The agent builds the app map at launch.

The tool-calling loop

Every step is one Claude API call. Screenshots are auto-captured after every action tool (tap, type, scroll, drag) — Claude never has to explicitly call takeScreenshot mid-task. After step 1, the system prompt + tool schemas come from cache — only the fresh screenshot + short delta are billed.

Tool-calling loop — screenshot + getUIElements → Claude picks a tool → auto-capture → prompt cache → repeat

No hard-coded flows. Every task is solved live by Claude reasoning about what's on screen.
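The loop above can be sketched as follows (a minimal skeleton, not the real agent.mjs; `callModel`, `runTool`, and `screenshot` are stand-ins). The key shape: one model call per step, auto-screenshot after action tools, terminal tools end the loop.

```javascript
// Tools whose effects change the screen, so a fresh screenshot is auto-captured
// after them (the demo logs show openApp triggers auto-capture too).
const ACTION_TOOLS = new Set(["tap", "typeAndSubmit", "inputText", "scroll", "swipe", "drag", "openApp"]);
const TERMINAL_TOOLS = new Set(["taskComplete", "taskFailed"]);

async function runAgent(task, { callModel, runTool, screenshot, maxSteps = 25 }) {
  const context = { task, history: [], lastScreenshot: null };
  for (let step = 1; step <= maxSteps; step++) {
    const { tool, args } = await callModel(context);   // one Claude API call per step
    const result = await runTool(tool, args);
    context.history.push({ step, tool, args, result });
    if (TERMINAL_TOOLS.has(tool)) return { tool, result, steps: step };
    if (ACTION_TOOLS.has(tool)) {
      context.lastScreenshot = await screenshot();     // auto-capture; model never calls takeScreenshot mid-task
    }
  }
  return { tool: "taskFailed", result: "max steps reached", steps: maxSteps };
}
```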

The cognition layer (how Claude thinks)

Beyond the per-step loop, there's a longer-lived architecture feeding context into every call. Three memory types (semantic, episodic, procedural — the CoALA taxonomy), an XML-tagged system prompt, a rolling text summary that compresses history, and a prompt cache that keeps the static prefix hot.

Cognition layer — Memory (semantic · episodic · procedural) → System Prompt → Context Window (cached) → Claude → reasoning → next tool call, with stuck detection feedback
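The rolling-summary compression can be sketched like this (illustrative; the real agent.mjs implementation may differ): older image blocks in the message history are swapped for a one-line text placeholder, so only the latest screenshot stays in the context window.

```javascript
// Sketch: keep only the most recent screenshot; replace earlier image blocks
// with short text placeholders so token count stays flat as the task grows.
function compressHistory(messages) {
  const lastImageIdx = messages.findLastIndex(
    (m) => Array.isArray(m.content) && m.content.some((b) => b.type === "image")
  );
  return messages.map((m, i) => {
    if (i === lastImageIdx || !Array.isArray(m.content)) return m;
    const content = m.content.map((b) =>
      b.type === "image"
        ? { type: "text", text: "[older screenshot removed; replaced by rolling summary]" }
        : b
    );
    return { ...m, content };
  });
}
```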

Research backing (so the design isn't ad-hoc):

  • CoALA (arXiv:2309.02427) — the semantic / episodic / procedural split above
  • MemGPT (arXiv:2310.08560) — tiered memory: always-in-prompt vs retrieved-on-demand
  • SecAgent (arXiv:2603.08533) — the rolling text summary + single-image compression that gave us the 3.40× speedup

Why Claude specifically

  1. Prompt caching: cache_control: ephemeral on the system prompt + 20+ tool schemas. After step 1 the static block reads from cache — ~90% cheaper per step, and cache reads don't count toward ITPM.
  2. Most reliable tool calling — across 15+ step agent loops Claude stays consistent at picking the right tool and formatting arguments correctly.
  3. XML-tagged prompts are the documented idiom — our <RULES>, <IMPORTANT>, <STRATEGY> sections (in agent.mjs:1208) follow Anthropic's recommended pattern.
  4. Vision + tool calling in a single round-trip — screenshot in, tool call out, one API call.
  5. Long context that doesn't degrade — UI hierarchies + action history pile up; Claude stays sharp mid-conversation.

Optimizations

  • Prompt caching (system + tool schemas): ~90% cheaper per step after step 1
  • Rolling text summaries (single-image mode): older screenshots stripped and replaced with a 1-line recap, so token count stays flat as the task grows
  • Research-backed system prompt: XML-tagged sections, a forced screen-understanding chain (describe → read → identify → act), explicit parallel tool calling, iOS UI priors
  • Persistent Maestro bridge: the JVM boots once on port 6001; subsequent commands skip the ~8 s cold start
  • Stuck detection: the last 3 actions are scanned each turn; on a repeat, a recovery hint is injected so Claude doesn't burn turns retrying a failing tap
  • Memory injection: resolved ambiguities ("Kenny = Kenny Frias") persist in memories/user.md and load into the system prompt, so the same askUser question is never asked twice
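The stuck-detection pass can be sketched as follows (illustrative; agent.mjs's actual implementation may differ, and `stuckHint` is our name):

```javascript
// Sketch: if the last three actions are identical (same tool, same args),
// return a recovery hint to inject into the next turn instead of letting the
// model keep retrying a failing tap.
function stuckHint(history, window = 3) {
  if (history.length < window) return null;
  const recent = history.slice(-window).map((h) => `${h.tool}:${JSON.stringify(h.args)}`);
  const allRepeats = recent.every((sig) => sig === recent[0]);
  return allRepeats
    ? `You've repeated ${recent[0]} ${window} times with no visible progress. Try a different element, scroll, or ask the user.`
    : null;
}
```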

Measured findings (summary)

Six optimizations compose to make physical-iOS agent research practical:

  • Rolling summaries vs. full history: 3.40× speedup, ~30% fewer tokens per task
  • Direct HTTP to Maestro's XCTest runner vs. the Maestro CLI: 10–100× action-latency reduction on launch / press-key / tap. An AI agent issues per-step dynamic commands (it can't pre-write a YAML flow), so we bypass the CLI (which spawns a fresh JVM per call) and talk to Maestro's runner HTTP directly. The runner is Maestro's open-source work; the bypass-for-agent-workflow is ours.
  • Screenshot compression: ~50× payload reduction (8 MB Retina PNG → 150 KB JPEG) with an accuracy gain (higher resolution hurts VLM grounding, per ScreenSpot-Pro)
  • Iterative agent-loop optimization: 83 s / failed → 23.9 s / success on a representative task (3.5× speedup, 3× fewer steps)
  • Prompt caching: ~90% per-step cost reduction after step 1
  • No single bottleneck remains — time distribution of the optimized run: 31% AI inference · 31% tool execution · 30% screenshot · 8% overhead

➡ Full figures, per-optimization methodology, and reproducibility notes in RESEARCH.md.


Ethical Alignment

Who this is built for:

  • Blind & low-vision users: full iPhone control, eyes-free; AVSpeechSynthesizer narrates every step
  • Elderly users: no more wrestling with complex app navigation; just speak the task
  • Users with motor impairments: speak instead of tap
  • Cognitive accessibility: step-by-step narration reduces overwhelm

Potential Harms & Future Solutions

We've thought hard about what could go wrong:

  • Privacy (screenshots of your life go to a third-party AI): app blocklist · per-action consent · on-device redaction of sensitive UI
  • Identity (no voice ownership check): biometrics · passphrase · enrolled-voice verification
  • Reliability (the agent can act wrongly, and a blind user may not notice): pre- and post-action confirmation · undo window · narration of every destructive step

The human-in-the-loop askUser flow is the foundation: Claude pauses on ambiguous or destructive actions and surfaces options through the Dynamic Island.


Getting Started

Prerequisites

  • macOS with Xcode
  • Node.js 18+
  • Maestro CLI + OpenJDK
  • Anthropic API key
  • Physical iPhone 15+ (for Action Button + Dynamic Island), Developer Mode enabled

Install

git clone https://github.com/bryanrg22/ios-agent_automation.git
cd ios-agent_automation
npm install

# .env
echo "ANTHROPIC_API_KEY=your_key_here" > .env

# Maestro
curl -Ls "https://get.maestro.mobile.dev" | bash
brew install openjdk

Run on a physical iPhone (2 terminals)

# Terminal 1 — Maestro bridge over USB
export PATH="/opt/homebrew/opt/openjdk/bin:$PATH:$HOME/.maestro/bin"
maestro-ios-device --team-id YOUR_TEAM_ID --device YOUR_UDID

# Terminal 2 — Frontend server (Action Button + Dynamic Island)
node frontend/server.mjs --provider anthropic

Then hold the Action Button and speak. The companion app's Dynamic Island shows live progress and narrates each step.

Run on iOS Simulator (no iPhone required)

Two supported simulator paths — pick based on what you need:

Fast path (recommended) — direct HTTP to Maestro's XCTest runner inside the simulator. Matches the phone-path latency (~50–400 ms per action).

# One-time: clone Maestro's runner source (1.8 MB, sparse checkout — runner subdir only)
mkdir -p vendor && cd vendor
git clone --depth 1 --filter=blob:none --sparse https://github.com/mobile-dev-inc/Maestro.git
cd Maestro && git sparse-checkout set maestro-ios-xctest-runner
cd ../..

# Terminal 1 — launch the runner (auto-boots a simulator, builds on first run)
./scripts/start-sim-runner.sh

# Terminal 2 — run a task
node agent.mjs "open Settings and tap Wi-Fi" --backend direct-http-sim

Baseline path (Maestro CLI, slow — kept for the speedup comparison):

# Just run the agent — no separate runner needed; Maestro MCP spins up on demand
node agent.mjs "open Settings and tap Wi-Fi"

The fast and baseline simulator paths produce comparable results for step counts, tokens, and success rates. Wall-clock time differs by ~10–100× per action (that's the whole point — see RESEARCH.md §2).

App coverage on the simulator vs. the phone. The iOS Simulator only includes whatever apps ship pre-installed in its runtime image — a curated Apple subset (~16 apps on iOS 26: Settings, Maps, Calendar, Messages, Safari, Photos, Reminders, Shortcuts, etc.). There is no App Store on simulators (Apple doesn't bundle it), so third-party apps (Reddit, Instagram, your bank) cannot be installed via the App Store on a sim — they have to be added via xcrun simctl install from a local .app build or .ipa. Some first-party apps (Freeform, Notes, Stocks, Voice Memos, Translate, etc.) may also be missing from the runtime depending on the iOS version, and they're not redistributable.

The physical iPhone path has no such limitation: every app already installed on your phone is available to the agent immediately, with no extra setup. If your research or task targets real-world third-party apps — or any first-party app missing from the simulator runtime — run on the phone. Use the simulator for the pre-installed app subset, gesture primitives that don't depend on a specific app, and fast iteration during development.

Action Button shortcut setup

The iPhone Action Button needs a one-time Shortcut to capture your voice and POST it to the server.

iPhone Settings — Action Button mapped to Shortcut     Shortcut — Dictate Text + POST to agent server

Left: In Settings → Action Button, assign the Shortcut. Right: The Shortcut runs Dictate Text → Get Contents of URL (POST) to http://<mac-ip>:8000/task.
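For reference, this is the shape of the request the Shortcut sends (a sketch: the /task route comes from frontend/server.mjs's route list, but the exact body field name is an assumption, and `buildTaskRequest` is our helper, not the repo's):

```javascript
// Sketch: the POST the Action Button Shortcut fires at the Mac. The dictated
// text travels as the task payload; the server's agent loop takes it from there.
function buildTaskRequest(macIp, spokenText) {
  return {
    url: `http://${macIp}:8000/task`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ task: spokenText }), // field name assumed for illustration
    },
  };
}
```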

Run a one-shot task from the terminal

node agent.mjs "open Spotify and play lofi beats" --phone --provider anthropic

Run the benchmark harness (matrix runs across tasks × models × seeds)

benchmarks/run.mjs spawns agent.mjs once per cell, pairs the agent's self-reported success with an independent evaluator (CLIP similarity / hierarchy_contains / game_specific / manual), and appends one row per run to benchmarks/results/<timestamp>.csv. Tasks live in benchmarks/tasks.json.

On the iOS Simulator (fast path):

# Terminal 1 — start the XCTest runner (same as for agent.mjs --backend direct-http-sim)
./scripts/start-sim-runner.sh

# Terminal 2 — run a single task
node benchmarks/run.mjs \
  --tasks nav_maps_search_usc \
  --seeds 1 \
  --models anthropic:claude-sonnet-4-6 \
  --backend direct-http-sim

# Or a full category × multiple models × multiple seeds
node benchmarks/run.mjs \
  --tasks navigation \
  --seeds 3 \
  --models anthropic:claude-sonnet-4-6,openai:gpt-5.4,gemini:gemini-2.5-flash-lite \
  --backend direct-http-sim

On a physical iPhone:

# Terminal 1 — Maestro bridge over USB (same as for agent.mjs --phone)
export PATH="/opt/homebrew/opt/openjdk/bin:$PATH:$HOME/.maestro/bin"
maestro-ios-device --team-id YOUR_TEAM_ID --device YOUR_UDID

# Terminal 2 — run a drawing task (CLIP-scored)
node benchmarks/run.mjs \
  --tasks draw_triangle_freeform \
  --seeds 1 \
  --models anthropic:claude-sonnet-4-6 \
  --phone

# Or the full manifest × 3 models × 3 seeds (matrix mode)
node benchmarks/run.mjs \
  --tasks all \
  --seeds 3 \
  --models anthropic:claude-sonnet-4-6,openai:gpt-5.4,gemini:gemini-2.5-flash-lite \
  --phone

Useful flags:

  • --tasks: task IDs (comma-separated), a category name, or all. Required.
  • --models: provider:model pairs, comma-separated. Default: anthropic:claude-sonnet-4-6.
  • --seeds: independent runs per cell (for variance). Default: 1.
  • --phone: run on a physical iPhone (else simulator). Default: off.
  • --backend: maestro-cli (sim baseline, slow) · direct-http-sim (sim fast) · phone-direct (= --phone). Default: auto.
  • --grounding: baseline · grid · zoomclick. Default: baseline.
  • --agent-mode: baseline · single-image · vision-gated. Default: single-image.
  • --max-steps-override: override the per-task max_steps from the manifest. Default: task-defined.
  • --dry-run: print the matrix without spawning runs. Default: off.
  • --stop-on-failure: halt after the first failure (debugging). Default: off.

Output: every run appends one row to benchmarks/results/run-<ISO_TIMESTAMP>.csv. Each row contains the agent's self-reported success / steps / tokens / time, the evaluator's verdict and score, the post-run screenshot path, and any error string.

Notes:

  • Drawing tasks need CLIP: pip install open-clip-torch torch pillow (one-time).
  • App availability differs between sim and phone: drawing tasks (Freeform), gesture-game tasks (Clash Royale, etc.), and any third-party app tasks are phone-only — the iOS Simulator only includes pre-installed Apple apps. See the "App coverage" note in the simulator section above.
  • For --backend direct-http-sim, the start-sim-runner.sh terminal must stay open for the full run.
  • Use --dry-run to size the matrix before long runs — --tasks all --seeds 5 --models a,b,c is (tasks × 5 × 3) cells and can run for hours.

Repository Layout

agent.mjs                     Tool-based agent loop (Claude tool calling)
frontend/server.mjs           HTTP server: /task, /status, /respond, /stop
ios/MobileAgentCompanion/     Swift companion app + Dynamic Island Live Activity
memories/                     Persistent semantic memory (CoALA-style)
logs/tasks.jsonl              Episodic memory — every task run
docs/                         Deep dives on individual subsystems
benchmarks/demos/             Per-demo run records with reasoning traces
benchmarks/plots/             Deterministic plot-regeneration scripts
public/                       Images, videos, generated figures
CLAUDE.md                     Instructions for Claude Code in this repo
RESEARCH.md                   Research deep-dive — paper framing, citation, reproducibility
CITATION.cff                  GitHub citation metadata
LICENSE                       MIT

For Researchers

This project is an open research platform for agentic automation on physical iOS and iOS Simulator — built so other researchers can extend and advance the field. Early results from our own work already show a 3.40× speedup from rolling-summary context compression and a 10–100× action-latency reduction for agent-driven dynamic commands. We're just getting started.

If you're here for the research angle, not the hackathon, jump to:

  • RESEARCH.md — contributions, reproducibility steps, related work, research directions we're exploring
  • benchmarks/demos/ — per-demo run records with full reasoning traces (e.g., the triangle demo's verbatim agent thoughts + HITL decision analysis)
  • benchmarks/plots/plot_optimizations.py — deterministic regeneration of every figure in this README
  • docs/ — subsystem deep-dives (memory architecture, loop optimization, screen understanding, procedural memory, etc.)
  • CITATION.cff — machine-readable citation metadata; GitHub auto-renders a "Cite this repository" button
  • CONTRIBUTING.md — how to add benchmark tasks, evaluators, and open PRs

How to cite: see RESEARCH.md#how-to-cite-this-work. Short form — cite the repo via CITATION.cff.

Platform support (v0.1)

We're honest about the scope:

  • iPhone (physical device): supported. Primary target; full Dynamic Island HITL UI via MobileAgentCompanion.
  • iOS Simulator: supported. Fast path via --backend direct-http-sim; Apple-bundled apps only (no App Store).
  • iPad: ⚠️ partial. The agent core works, but the companion app's HITL UI is iPhone-only (iPads don't have a Dynamic Island). Full iPad HITL (modal / alert-based fallback) is planned for v0.2.
  • Mac Catalyst: not a target.
  • visionOS / watchOS: not a target.

If you need iPad today: the agent loop (agent.mjs + Maestro bridge) runs against an iPad fine — it's the same XCTest surface. You just won't get the Dynamic Island companion HITL UI; you'd need to use the terminal-driven askUser path in agent.mjs instead. If you're interested in contributing an iPad-native HITL UI, see CONTRIBUTING.md — this is an explicit good-first-contribution area.

Customizing which tools the agent has access to

Not every researcher has API keys for every optional tool. Disable specific tools at startup:

# Run without web search (no BRAVE_API_KEY needed):
node agent.mjs "search for USC on Maps" --backend direct-http-sim --disable-tools webSearch

# Disable multiple tools (e.g., for ablation studies):
node agent.mjs "draw a circle in Freeform" --phone --disable-tools webSearch,googleSearch,recallMemory

Do NOT disable the terminal flow-control tools (taskComplete, taskFailed) — the agent loop needs them to terminate.
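A sketch of how such a filter could be applied while protecting the flow-control tools (illustrative; `applyDisabledTools` is our name, not agent.mjs's actual code):

```javascript
// Sketch: apply a --disable-tools style comma-separated list to the tool
// inventory, refusing to drop the terminal tools the loop needs to stop.
const PROTECTED = new Set(["taskComplete", "taskFailed"]);

function applyDisabledTools(tools, disabledCsv) {
  const disabled = new Set(
    (disabledCsv ?? "").split(",").map((s) => s.trim()).filter(Boolean)
  );
  for (const name of disabled) {
    if (PROTECTED.has(name)) throw new Error(`Cannot disable flow-control tool: ${name}`);
  }
  return tools.filter((t) => !disabled.has(t.name));
}
```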


Built On (Open-Source Acknowledgments)

This platform is built on top of the work of others. To be explicit about what we use vs. what we contribute:

  • Maestro (mobile.dev) — open-source mobile UI testing framework. The XCTest runner (maestro-ios-xctest-runner) that exposes HTTP endpoints on the device/simulator is Maestro's. We talk to it directly, bypassing the Maestro CLI, for the 10–100× action-latency win on agent-driven dynamic workflows.
  • DeviceLab maestro-ios-device — community binary that builds Maestro's XCTest runner for a physical iOS device and port-forwards localhost:6001 → device :22087 over USB (usbmuxd). This is the bridge our --phone mode relies on. Built from Maestro PR #2856.
  • Anthropic Claude — the reasoning brain (prompt caching, tool use, vision).
  • Apple XCUITest / XCTest — the underlying iOS automation APIs the Maestro runner wraps.

What's ours: the AI agent loop (agent.mjs), the on-device Swift agent, the drag primitive used as an autonomous agent tool, the gestural task benchmark, the Dynamic Island agent UX, the CoALA-style memory stack applied to mobile agents, the prompt-caching + screenshot-compression + rolling-summary optimization pipeline, and the cross-model LLM comparison harness.


The Team

  • Bryan Ramirez-Gonzalez: AI Researcher @ USC, NVIDIA
  • Kenny Frias: AI Researcher @ Columbia University
  • Zachary Chung: Network / Systems Researcher @ USC
  • Victoria Rojas: Hardware & Systems Researcher @ USC

The team on stage accepting 3rd place at the Claude Builder Hackathon

🥉 3rd Place — Claude Builder Hackathon, UCLA, April 2026. Presenting Agentic Automation on iOS.

Built at UCLA · April 2026 · for the Claude Builder Hackathon.
