Computer-Use Agents as Judges for Generative User Interface
Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou
Torr Vision Group @ Oxford University, Show Lab @ National University of Singapore, Microsoft
📄 Paper | 🤗 HF Daily Paper | 🤗 HF Demo | 🌐 Project
What does an agent-friendly UI look like? Check out the demo below:
aui_demo_video_h200.mp4
The left UI is designed for 🧑🏻‍💻 humans, prioritizing aesthetics.
The right UI is redesigned for 🤖 agents, focusing on clarity and functionality.
- [2025.11.20] Hugging Face demo is released.
- [2025.11.19] arXiv paper is released.
- [2025.10.30] Code is released.
Can Computer-Use Agents offer feedback to help Coders generate UIs?
- Use Python 3.10+ in an isolated environment:

```bash
conda create -n aui python=3.10
conda activate aui
```
- Install dependencies and Playwright browsers:

```bash
pip install -r requirements.txt
python -m playwright install
```
- Local model servers are recommended (e.g., vLLM's OpenAI-compatible server).
- Edit `configs/models.yaml` to point to your endpoints (a smoke-test sketch follows these steps):
  - Coder: `Qwen3-Coder-30B` (`http://localhost:8001/v1`)
  - CUA: `UI-TARS-1.5-7B` (`http://localhost:8002/v1`)
  - Verifier: `GPT-5` or `Qwen2.5-VL-72B`
- Export API keys (if using proprietary models):

```bash
export AZURE_OPENAI_API_KEY="YOUR_KEY"
```
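Before a full run, you can sanity-check the local endpoints with any OpenAI-compatible client. A minimal sketch, assuming the example ports and models above (they are not requirements):

```python
# Hypothetical smoke test for the endpoints configured in configs/models.yaml.
from openai import OpenAI

ENDPOINTS = {
    "coder": "http://localhost:8001/v1",  # e.g., Qwen3-Coder-30B
    "cua": "http://localhost:8002/v1",    # e.g., UI-TARS-1.5-7B
}

for role, base_url in ENDPOINTS.items():
    client = OpenAI(base_url=base_url, api_key="EMPTY")  # vLLM accepts any key
    served = [m.id for m in client.models.list().data]   # list the served models
    print(f"{role}: {served}")
```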
Our AUI pipeline consists of the following stages:
- 0️⃣ Preparation: Generate initial websites and tasks per app.
- 1️⃣ Task Solvability Check: Judge extracts task-state rules on initial websites to determine task validity.
- 2️⃣ CUA Navigation Test: CUA executes supported tasks; oracle evaluation is rule-based.
- 3️⃣ Iterative Refinement (a conceptual sketch follows this list):
- Revise: Update websites based on unsupported tasks (Task Solvability Feedback) and CUA failures (Navigation Feedback via Dashboard).
- ReJudge: Re-evaluate task solvability on revised websites.
- ReTest: CUA executes tasks on revised websites.
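For intuition, the refinement loop can be sketched as below; every function name here is a hypothetical stand-in for a stage script, not the repo's actual API:

```python
# Conceptual sketch of Stage 3 (Revise -> ReJudge -> ReTest); illustrative only.
def refine(website, tasks, judge, cua, coder, rounds=1):
    for _ in range(rounds):
        rules = judge(website, tasks)  # ReJudge: rules exist only for solvable tasks
        unsupported = [t for t in tasks if t not in rules]
        failures = [t for t, r in rules.items() if not cua(website, t, r)]  # ReTest
        if not unsupported and not failures:
            break  # nothing left to fix
        website = coder(website, unsupported, failures)  # Revise with both signals
    return website
```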
For normal usage, you only need the single entrypoint `run.py` from the repo root:

```bash
cd betterui_release
python run.py \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --v0-dir full_52_apps \
    --experiment exp_integrated \
    --revision-type integrated \
    --cua-models uitars
```

This command sequentially runs Stage 0 → Stage 3. Advanced users who want full control over each stage can expand the section below to see what each stage does and the corresponding commands.
Show Stage 0–3 details
Stage 0 – Preparation
- `src/stage0_generate_websites.py`: generate initial v0 websites for all apps and coder models.
- `src/stage0_generate_tasks.py`: generate 30 tasks per app (via GPT-5) based on app labels.
Stage 1 – Judge v0 (Task Solvability)
- `src/stage1_judge_v0.py`: the Judge extracts state and completion rules for each task on the v0 websites.
Stage 2 – CUA Test v0 (Navigation)
- `src/stage2_cua_test_v0.py`: the CUA runs only supported tasks (those with rules) on the v0 websites; success is evaluated by the rules (oracle).
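As a rough illustration of rule-based (oracle) evaluation, a check might compare the app's final state against the Judge's expected state. This is a sketch with an assumed rule schema; the actual format in `rules.json` may differ:

```python
# Hypothetical rule check: a task succeeds if every expected key/value
# holds in the final app state. The schema here is assumed for illustration.
def check_task(final_state: dict, rule: dict) -> bool:
    return all(final_state.get(k) == v for k, v in rule["expected_state"].items())

rule = {"task": "Add a todo item", "expected_state": {"todo_count": 1}}
print(check_task({"todo_count": 1, "page": "home"}, rule))  # True
```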
Stage 3 – Revision + Re-eval
- `src/stage3_0_revise.py`: revise websites using unsupported-task feedback, CUA failures, or integrated signals.
- `src/stage3_1_judge_v1.py`: re-run the Judge on v1 websites to update task support.
- `src/stage3_2_cua_test_v1.py`: re-run the CUA on v1 websites with oracle evaluation.
1) Generate Initial Websites (3 coder models × 52 apps)
```bash
python src/stage0_generate_websites.py \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --v0-dir full_52_apps
```

2) Generate Tasks (30 tasks per app via GPT-5)
```bash
python src/stage0_generate_tasks.py \
    --apps all \
    --v0-dir full_52_apps
```

3) Metric 1: Judge Initial Websites (Task Solvability)
```bash
python src/stage1_judge_v0.py \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --v0-dir full_52_apps
```

4) Metric 2: CUA Navigation Test (Initial)
```bash
python src/stage2_cua_test_v0.py \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --v0-dir full_52_apps \
    --cua-models uitars
```

5) Stage 3: Iterative Refinement (choose a revision strategy)
- Option A: CUA Revision (fix based on navigation failures)

```bash
python src/stage3_0_revise.py \
    --experiment exp_cua_fix \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --revision-type cua \
    --v0-dir full_52_apps
```
- Option B: Unsupported Task Revision (fix based on missing features)

```bash
python src/stage3_0_revise.py \
    --experiment exp_func_fix \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --revision-type unsupported \
    --v0-dir full_52_apps
```
- Option C: Integrated Revision (combine both; recommended)

```bash
python src/stage3_0_revise.py \
    --experiment exp_integrated \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --revision-type integrated \
    --v0-dir full_52_apps
```
6) Re-evaluate Revised Websites
```bash
# Re-Judge Task Solvability
python src/stage3_1_judge_v1.py \
    --experiment exp_integrated \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --revision-type integrated \
    --v0-dir full_52_apps

# Re-Run CUA Navigation Test
python src/stage3_2_cua_test_v1.py \
    --experiment exp_integrated \
    --models gpt5,qwen,gpt4o \
    --apps all \
    --revision-type integrated \
    --cua-models uitars \
    --v0-dir full_52_apps
```

Initial Data (Stage 0-2)
```
v0/{v0_dir}/
    websites/{app}/{model}/index.html          # Initial Generated Websites
    tasks/{app}/
        tasks.json                             # Generated Tasks
        states/{model}/rules.json              # Stage 1: Validation Rules
    v0_cua_results/{model}/{cua_model}/        # Stage 2: CUA Results
        results.json
        trajectories/task_{i}/step_*.png|json  # Trajectories
```
Experiments (Stage 3)
```
experiments/{experiment}/
    runs/{run_key}/
        stage3_0/{app}/{model}/v1_website/index.html  # Revised Websites
        stage3_1/{app}/{model}/rules.json             # Revised Rules
        stage3_2/{cua_model}/{app}/{model}/
            trajectories/{task_id}/                   # New Trajectories
        run_summary.json
```
- Function Completeness (FC): the percentage of tasks that are functionally supported by the UI, as determined by the Judge.
- CUA Success Rate (SR): the percentage of valid (Judge-supported) tasks that the CUA completes successfully. A sketch of both metrics follows this list.
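Concretely, with per-task flags from Stage 1 (`supported`) and Stage 2 (`success`), the two metrics reduce to the following sketch (one boolean per task is an assumption for illustration):

```python
# Sketch of both metrics; supported[i]/success[i] are per-task flags.
def function_completeness(supported: list[bool]) -> float:
    return 100 * sum(supported) / len(supported)

def cua_success_rate(supported: list[bool], success: list[bool]) -> float:
    valid = [ok for ok, s in zip(success, supported) if s]  # Judge-validated tasks only
    return 100 * sum(valid) / len(valid) if valid else 0.0
```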
Key Components:
- Verifier: a GPT-5-based judge that extracts rule-based checks to validate task solvability.
- CUA Dashboard: a visual summary tool that compresses long interaction trajectories into a single image, highlighting key failure points for the Coder (see the sketch after this list).
- Revision Strategies:
  - `unsupported`: adds missing features for unsolvable tasks.
  - `cua`: fixes usability issues preventing agent navigation (destylization, simplification).
  - `integrated`: combines both for maximum performance.
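To make the Dashboard idea concrete, here is a minimal sketch that tiles per-step screenshots from one trajectory into a single grid image; the grid layout and helper name are assumptions, not the repo's implementation:

```python
# Hypothetical dashboard: tile step_*.png screenshots from a trajectory into
# one grid image so the Coder can scan the whole run at a glance.
from pathlib import Path
from PIL import Image

def make_dashboard(traj_dir: str, cols: int = 4, thumb=(320, 200)) -> Image.Image:
    steps = sorted(Path(traj_dir).glob("step_*.png"))
    rows = (len(steps) + cols - 1) // cols
    board = Image.new("RGB", (cols * thumb[0], max(rows, 1) * thumb[1]), "white")
    for i, path in enumerate(steps):
        tile = Image.open(path).resize(thumb)
        board.paste(tile, ((i % cols) * thumb[0], (i // cols) * thumb[1]))
    return board

# Example path following the output layout above:
# make_dashboard("v0/full_52_apps/v0_cua_results/gpt5/uitars/trajectories/task_0").save("dashboard.png")
```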
If you find this project helpful, please consider citing our paper:
```bibtex
@misc{lin2025aui,
      title={Computer-Use Agents as Judges for Generative User Interface},
      author={Kevin Qinghong Lin and Siyuan Hu and Linjie Li and Zhengyuan Yang and Lijuan Wang and Philip Torr and Mike Zheng Shou},
      year={2025},
      eprint={2511.15567},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15567},
}
```

- Apps are adapted from OpenAI's coding examples.
- Thanks to the open-source community for browser automation (Playwright) and agent tooling.
