Computer-Use Agents as Judges for Generative User Interface
Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou
Torr Vision Group @ Oxford University, Show Lab @ National University of Singapore, Microsoft
📄 Paper | 🤗 HF Daily Paper | 🤗 HF Demo | 🌐 Project
What does an agent-friendly UI look like? Check out the demo below:
aui_demo_video_h200.mp4
The left UI is designed for 🧑🏻‍💻 humans, prioritizing aesthetics.
The right UI is redesigned for 🤖 agents, focusing on clarity and functionality.
- [2025.11.20] Hugging Face demo is released.
- [2025.11.19] arXiv paper is released.
- [2025.10.30] Code is released.
Can Computer-Use Agents offer feedback to help Coders generate UIs?
- Use Python 3.10+ in an isolated environment:

```bash
conda create -n aui python=3.10
conda activate aui
```
- Install dependencies and Playwright browsers:

```bash
pip install -r requirements.txt
python -m playwright install
```
- Local model servers are recommended (e.g., vLLM's OpenAI-compatible server).
- Edit `configs/models.yaml` to point to your endpoints (a smoke-test sketch follows these steps):
  - Coder: `Qwen3-Coder-30B` (`http://localhost:8001/v1`)
  - CUA: `UI-TARS-1.5-7B` (`http://localhost:8002/v1`)
  - Verifier: `GPT-5` or `Qwen2.5-VL-72B`
- Export API keys (if using proprietary models):

```bash
export AZURE_OPENAI_API_KEY="YOUR_KEY"
```
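Before a full run, you can sanity-check the local endpoints with any OpenAI-compatible client. A minimal sketch, assuming the example ports and models above (they are not requirements):

```python
# Hypothetical smoke test for the endpoints configured in configs/models.yaml.
from openai import OpenAI

ENDPOINTS = {
    "coder": "http://localhost:8001/v1",  # e.g., Qwen3-Coder-30B
    "cua": "http://localhost:8002/v1",    # e.g., UI-TARS-1.5-7B
}

for role, base_url in ENDPOINTS.items():
    client = OpenAI(base_url=base_url, api_key="EMPTY")  # vLLM accepts any key
    served = [m.id for m in client.models.list().data]   # list the served models
    print(f"{role}: {served}")
```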
Our AUI pipeline consists of the following stages:
- 0️⃣ Preparation: Generate initial websites and tasks per app.
- 1️⃣ Task Solvability Check: Judge extracts task-state rules on initial websites to determine task validity.
- 2️⃣ CUA Navigation Test: CUA executes supported tasks; oracle evaluation is rule-based.
- 3️⃣ Iterative Refinement (a conceptual sketch follows this list):
- Revise: Update websites based on unsupported tasks (Task Solvability Feedback) and CUA failures (Navigation Feedback via Dashboard).
- ReJudge: Re-evaluate task solvability on revised websites.
- ReTest: CUA executes tasks on revised websites.
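For intuition, the refinement loop can be sketched as below; every function name here is a hypothetical stand-in for a stage script, not the repo's actual API:

```python
# Conceptual sketch of Stage 3 (Revise -> ReJudge -> ReTest); illustrative only.
def refine(website, tasks, judge, cua, coder, rounds=1):
    for _ in range(rounds):
        rules = judge(website, tasks)  # ReJudge: rules exist only for solvable tasks
        unsupported = [t for t in tasks if t not in rules]
        failures = [t for t, r in rules.items() if not cua(website, t, r)]  # ReTest
        if not unsupported and not failures:
            break  # nothing left to fix
        website = coder(website, unsupported, failures)  # Revise with both signals
    return website
```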
For normal usage, you only need the single entrypoint `run.py` from the repo root:

```bash
cd betterui_release
python run.py \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --v0-dir full_52_apps \
    --experiment exp_integrated \
    --revision-type integrated \
    --cua-models uitars
```

This command sequentially runs Stage 0 → Stage 3. Advanced users who want full control over each stage can expand the section below to see what each stage does and the corresponding commands.
Show Stage 0–3 details
Stage 0 – Preparation
- `src/stage0_generate_websites.py`: generate initial v0 websites for all apps and coder models.
- `src/stage0_generate_tasks.py`: generate 30 tasks per app (via GPT-5) based on app labels.
Stage 1 – Judge v0 (Task Solvability)
- `src/stage1_judge_v0.py`: the Judge extracts state and completion rules for each task on the v0 websites.
Stage 2 – CUA Test v0 (Navigation)
- `src/stage2_cua_test_v0.py`: the CUA runs only supported tasks (those with rules) on the v0 websites; success is evaluated by the rules (oracle).
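As a rough illustration of rule-based (oracle) evaluation, a check might compare the app's final state against the Judge's expected state. This is a sketch with an assumed rule schema; the actual format in `rules.json` may differ:

```python
# Hypothetical rule check: a task succeeds if every expected key/value
# holds in the final app state. The schema here is assumed for illustration.
def check_task(final_state: dict, rule: dict) -> bool:
    return all(final_state.get(k) == v for k, v in rule["expected_state"].items())

rule = {"task": "Add a todo item", "expected_state": {"todo_count": 1}}
print(check_task({"todo_count": 1, "page": "home"}, rule))  # True
```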
Stage 3 – Revision + Re-eval
- `src/stage3_0_revise.py`: revise websites using unsupported-task feedback, CUA failures, or integrated signals.
- `src/stage3_1_judge_v1.py`: re-run the Judge on v1 websites to update task support.
- `src/stage3_2_cua_test_v1.py`: re-run the CUA on v1 websites with oracle evaluation.
1) Generate Initial Websites (3 coder models × 52 apps)
```bash
python src/stage0_generate_websites.py \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --v0-dir full_52_apps
```

2) Generate Tasks (30 tasks per app via GPT-5)
```bash
python src/stage0_generate_tasks.py \
    --apps all \
    --v0-dir full_52_apps
```

3) Metric 1: Judge Initial Websites (Task Solvability)
```bash
python src/stage1_judge_v0.py \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --v0-dir full_52_apps
```

4) Metric 2: CUA Navigation Test (Initial)
```bash
python src/stage2_cua_test_v0.py \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --v0-dir full_52_apps \
    --cua-models uitars
```

5) Stage 3: Iterative Refinement (choose a revision strategy)
- Option A: CUA Revision (fix based on navigation failures)

```bash
python src/stage3_0_revise.py \
    --experiment exp_cua_fix \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --revision-type cua \
    --v0-dir full_52_apps
```
- Option B: Unsupported Task Revision (fix based on missing features)

```bash
python src/stage3_0_revise.py \
    --experiment exp_func_fix \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --revision-type unsupported \
    --v0-dir full_52_apps
```
- Option C: Integrated Revision (combine both; recommended)

```bash
python src/stage3_0_revise.py \
    --experiment exp_integrated \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --revision-type integrated \
    --v0-dir full_52_apps
```
6) Re-evaluate Revised Websites
```bash
# Re-Judge Task Solvability
python src/stage3_1_judge_v1.py \
    --experiment exp_integrated \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --revision-type integrated \
    --v0-dir full_52_apps

# Re-Run CUA Navigation Test
python src/stage3_2_cua_test_v1.py \
    --experiment exp_integrated \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --revision-type integrated \
    --cua-models uitars \
    --v0-dir full_52_apps
```

Initial Data (Stage 0-2)
```
v0/{v0_dir}/
    websites/{app}/{model}/index.html          # Initial Generated Websites
    tasks/{app}/
        tasks.json                             # Generated Tasks
        states/{model}/rules.json              # Stage 1: Validation Rules
    v0_cua_results/{model}/{cua_model}/        # Stage 2: CUA Results
        results.json
        trajectories/task_{i}/step_*.png|json  # Trajectories
```
Experiments (Stage 3)
```
experiments/{experiment}/
    runs/{run_key}/
        stage3_0/{app}/{model}/v1_website/index.html  # Revised Websites
        stage3_1/{app}/{model}/rules.json             # Revised Rules
        stage3_2/{cua_model}/{app}/{model}/
            trajectories/{task_id}/                   # New Trajectories
        run_summary.json
```
- Function Completeness (FC): the percentage of tasks that are functionally supported by the UI, as determined by the Judge.
- CUA Success Rate (SR): the percentage of valid (Judge-supported) tasks that the CUA completes successfully. A sketch of both metrics follows this list.
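Concretely, with per-task flags from Stage 1 (`supported`) and Stage 2 (`success`), the two metrics reduce to the following sketch (one boolean per task is an assumption for illustration):

```python
# Sketch of both metrics; supported[i]/success[i] are per-task flags.
def function_completeness(supported: list[bool]) -> float:
    return 100 * sum(supported) / len(supported)

def cua_success_rate(supported: list[bool], success: list[bool]) -> float:
    valid = [ok for ok, s in zip(success, supported) if s]  # Judge-validated tasks only
    return 100 * sum(valid) / len(valid) if valid else 0.0
```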
Key Components:
- Verifier: a GPT-5-based judge that extracts rule-based checks to validate task solvability.
- CUA Dashboard: a visual summary tool that compresses long interaction trajectories into a single image, highlighting key failure points for the Coder (see the sketch after this list).
- Revision Strategies:
  - `unsupported`: adds missing features for unsolvable tasks.
  - `cua`: fixes usability issues preventing agent navigation (destylization, simplification).
  - `integrated`: combines both for maximum performance.
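To make the Dashboard idea concrete, here is a minimal sketch that tiles per-step screenshots from one trajectory into a single grid image; the grid layout and helper name are assumptions, not the repo's implementation:

```python
# Hypothetical dashboard: tile step_*.png screenshots from a trajectory into
# one grid image so the Coder can scan the whole run at a glance.
from pathlib import Path
from PIL import Image

def make_dashboard(traj_dir: str, cols: int = 4, thumb=(320, 200)) -> Image.Image:
    steps = sorted(Path(traj_dir).glob("step_*.png"))
    rows = (len(steps) + cols - 1) // cols
    board = Image.new("RGB", (cols * thumb[0], max(rows, 1) * thumb[1]), "white")
    for i, path in enumerate(steps):
        tile = Image.open(path).resize(thumb)
        board.paste(tile, ((i % cols) * thumb[0], (i // cols) * thumb[1]))
    return board

# Example path following the output layout above:
# make_dashboard("v0/full_52_apps/v0_cua_results/gpt5/uitars/trajectories/task_0").save("dashboard.png")
```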
If you find this project helpful, please consider citing our paper:
```bibtex
@misc{lin2025aui,
      title={Computer-Use Agents as Judges for Generative User Interface},
      author={Kevin Qinghong Lin and Siyuan Hu and Linjie Li and Zhengyuan Yang and Lijuan Wang and Philip Torr and Mike Zheng Shou},
      year={2025},
      eprint={2511.15567},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15567},
}
```

- Apps are adapted from OpenAI's coding examples.
- Thanks to the open-source community for browser automation (Playwright) and agent tooling.
