Agent-friendly UI

Computer-Use Agents as Judges for Generative User Interface
Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou
Torr Vision Group @ Oxford University, Show Lab @ National University of Singapore, Microsoft

📄 Paper   |   🤗 HF Daily Paper   |   🤗 HF Demo   |   🌐 Project

What does an agent-friendly UI look like? Check out the demo below:

[Demo video: aui_demo_video_h200.mp4]

The left UI is designed for 🧑🏻‍💻humans—prioritizing aesthetics.

The right UI is redesigned for 🤖agents—focused on clarity and functionality.

🔥 Update

  • [2025.11.20] Huggingface Demo is released.
  • [2025.11.19] Arxiv paper is released.
  • [2025.10.30] Code is released.

📖 TL;DR

Can Computer-Use Agents offer feedback to help Coders generate better UIs?


⚙️ Environments

1. Requirements

  • Use Python 3.10+ in an isolated environment:
    conda create -n aui python=3.10
    conda activate aui
  • Install dependencies and Playwright browsers:
    pip install -r requirements.txt
    python -m playwright install
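
After installation, a quick smoke test confirms that Playwright and its Chromium browser are usable (a minimal sketch, not part of the repo's scripts):

# Smoke test: launch the Chromium browser installed by Playwright above.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    print("Chromium", browser.version)
    browser.close()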

2. Configure Models

  • Local model servers are recommended (e.g., vLLM).
  • Edit configs/models.yaml to point to your endpoints:
    • Coder: Qwen3-Coder-30B (http://localhost:8001/v1)
    • CUA: UI-TARS-1.5-7B (http://localhost:8002/v1)
    • Verifier: GPT-5 or Qwen2.5-VL-72B
  • Export API keys (if using proprietary models):
    export AZURE_OPENAI_API_KEY="YOUR_KEY"
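
Before running the pipeline, you can sanity-check that a local endpoint is reachable via the OpenAI-compatible client (a minimal sketch; the URL matches the example Coder endpoint above, and the "EMPTY" key is a placeholder convention for vLLM servers that don't enforce authentication):

# List the models served at the local Coder endpoint.
from openai import OpenAI

coder = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
for model in coder.models.list():
    print(model.id)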

🚀 Quick Start

The AUI pipeline consists of the following stages:

  • 0️⃣ Preparation: Generate initial websites and tasks per app.
  • 1️⃣ Task Solvability Check: Judge extracts task-state rules on initial websites to determine task validity.
  • 2️⃣ CUA Navigation Test: CUA executes supported tasks; oracle evaluation is rule-based.
  • 3️⃣ Iterative Refinement:
    1. Revise: Update websites based on unsupported tasks (Task Solvability Feedback) and CUA failures (Navigation Feedback via Dashboard).
    2. ReJudge: Re-evaluate task solvability on revised websites.
    3. ReTest: CUA executes tasks on revised websites.

For typical usage, you only need the single entry point run.py, invoked from the repo root:

python run.py \
  --models gpt5,qwen,gpt4o \
  --apps all \
  --v0-dir full_52_apps \
  --experiment exp_integrated \
  --revision-type integrated \
  --cua-models uitars

This command runs Stage 0 → Stage 3 sequentially. Advanced users who want full control over each stage can use the per-stage breakdown and commands below.

Stage 0–3 details

Stage 0 – Preparation

  • src/stage0_generate_websites.py: generate initial v0 websites for all apps and coder models.
  • src/stage0_generate_tasks.py: generate 30 tasks per app (GPT-5) based on app labels.

Stage 1 – Judge v0 (Task Solvability)

  • src/stage1_judge_v0.py: Judge extracts state and completion rules for each task on v0 websites.

Stage 2 – CUA Test v0 (Navigation)

  • src/stage2_cua_test_v0.py: CUA runs only supported tasks (with rules) on v0 websites; success is evaluated by rules (oracle).

Stage 3 – Revision + Re-eval

  • src/stage3_0_revise.py: revise websites using unsupported-task feedback, CUA failures, or integrated signals.
  • src/stage3_1_judge_v1.py: re-run judge on v1 websites to update task support.
  • src/stage3_2_cua_test_v1.py: re-run CUA on v1 websites with oracle evaluation.

1) Generate Initial Websites (3 coder models × 52 apps)

python src/stage0_generate_websites.py \
  --models gpt5,qwen,gpt4o \
  --apps all \
  --v0-dir full_52_apps

2) Generate Tasks (30 tasks per app via GPT-5)

python src/stage0_generate_tasks.py \
  --apps all \
  --v0-dir full_52_apps

3) Metric 1: Judge Initial Websites (Task Solvability)

python src/stage1_judge_v0.py \
  --models gpt5,qwen,gpt4o \
  --apps all \
  --v0-dir full_52_apps

4) Metric 2: CUA Navigation Test (Initial)

python src/stage2_cua_test_v0.py \
  --models gpt5,qwen,gpt4o \
  --apps all \
  --v0-dir full_52_apps \
  --cua-models uitars

5) Stage 3: Iterative Refinement (Choose a revision strategy)

  • Option A: CUA Revision (Fix based on navigation failures)

    python src/stage3_0_revise.py \
      --experiment exp_cua_fix \
      --models gpt5,qwen,gpt4o \
      --apps all \
      --revision-type cua \
      --v0-dir full_52_apps
  • Option B: Unsupported Task Revision (Fix based on missing features)

    python src/stage3_0_revise.py \
      --experiment exp_func_fix \
      --models gpt5,qwen,gpt4o \
      --apps all \
      --revision-type unsupported \
      --v0-dir full_52_apps
  • Option C: Integrated Revision (Combine both – Recommended)

    python src/stage3_0_revise.py \
      --experiment exp_integrated \
      --models gpt5,qwen,gpt4o \
      --apps all \
      --revision-type integrated \
      --v0-dir full_52_apps

6) Re-evaluate Revised Websites

# Re-Judge Task Solvability
python src/stage3_1_judge_v1.py \
  --experiment exp_integrated \
  --models gpt5,qwen,gpt4o \
  --apps all \
  --revision-type integrated \
  --v0-dir full_52_apps

# Re-Run CUA Navigation Test
python src/stage3_2_cua_test_v1.py \
  --experiment exp_integrated \
  --models gpt5,qwen,gpt4o \
  --apps all \
  --revision-type integrated \
  --cua-models uitars \
  --v0-dir full_52_apps

🗂️ Data Layout

Initial Data (Stages 0–2)

v0/{v0_dir}/
  websites/{app}/{model}/index.html         # Initial Generated Websites
  tasks/{app}/
    tasks.json                              # Generated Tasks
    states/{model}/rules.json               # Stage 1: Validation Rules
    v0_cua_results/{model}/{cua_model}/     # Stage 2: CUA Results
      results.json
      trajectories/task_{i}/step_*.png|json # Trajectories

Experiments (Stage 3)

experiments/{experiment}/
  runs/{run_key}/
    stage3_0/{app}/{model}/v1_website/index.html    # Revised Websites
    stage3_1/{app}/{model}/rules.json               # Revised Rules
    stage3_2/{cua_model}/{app}/{model}/
      trajectories/{task_id}/                       # New Trajectories
      run_summary.json
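
For orientation, here is a short sketch that walks the v0 layout above and loads every Stage 2 results file for one coder/CUA pair (the paths follow the layout; no assumption is made about the internal schema of results.json):

# Collect Stage 2 CUA results across apps for one coder model / CUA model.
import json
from pathlib import Path

v0_dir, model, cua_model = "full_52_apps", "gpt5", "uitars"
tasks_root = Path("v0") / v0_dir / "tasks"

for path in sorted(tasks_root.glob(f"*/v0_cua_results/{model}/{cua_model}/results.json")):
    app = path.parts[3]  # v0/{v0_dir}/tasks/{app}/...
    print(app, json.loads(path.read_text()))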

📏 Evaluations

  1. Function Completeness (FC): Percentage of tasks that are functionally supported by the UI (determined by the Judge).
  2. CUA Success Rate (SR): Percentage of valid tasks successfully completed by the CUA.
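
For concreteness, a worked example with illustrative numbers (not results from the paper):

# Suppose one app has 30 generated tasks, the Judge marks 24 of them
# solvable, and the CUA completes 18 of those 24.
tasks_total = 30
tasks_supported = 24
cua_successes = 18

fc = tasks_supported / tasks_total    # Function Completeness
sr = cua_successes / tasks_supported  # Success Rate over valid tasks only
print(f"FC = {fc:.1%}, SR = {sr:.1%}")  # FC = 80.0%, SR = 75.0%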

Key Components:

  • Verifier: A GPT-5-based judge (Qwen2.5-VL-72B is also supported; see configs/models.yaml) that extracts rule-based checks to validate task solvability.
  • CUA Dashboard: A visual summary tool that compresses long interaction trajectories into a single image, highlighting key failure points for the Coder (a toy sketch of the tiling idea follows this list).
  • Revision Strategies:
    • unsupported: Adds missing features for unsolvable tasks.
    • cua: Fixes usability issues preventing agent navigation (destylization, simplification).
    • integrated: Combines both for maximum performance.
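
As a toy illustration of the dashboard's tiling idea, the sketch below pastes step screenshots from one trajectory into a single contact sheet using Pillow. This is an assumption-laden stand-in, not the repo's implementation; only the step_*.png naming comes from the data layout above.

# Hypothetical: tile a trajectory's step screenshots into one image.
from pathlib import Path
from PIL import Image

steps = sorted(Path("trajectories/task_0").glob("step_*.png"))
thumbs = [Image.open(p).convert("RGB").resize((320, 200)) for p in steps]
sheet = Image.new("RGB", (320 * max(len(thumbs), 1), 200), "white")
for i, thumb in enumerate(thumbs):
    sheet.paste(thumb, (i * 320, 0))
sheet.save("dashboard.png")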

🎓 Citations

If you find this project helpful, please consider citing our paper:

@misc{lin2025aui,
      title={Computer-Use Agents as Judges for Generative User Interface}, 
      author={Kevin Qinghong Lin and Siyuan Hu and Linjie Li and Zhengyuan Yang and Lijuan Wang and Philip Torr and Mike Zheng Shou},
      year={2025},
      eprint={2511.15567},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15567}, 
}

🙏 Acknowledgements

  • Apps are adapted from OpenAI's coding examples.
  • Thanks to the open-source community for browser automation (Playwright) and agent tooling.
