WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
WindowsWorld is a computer-use benchmark for cross-application workflows, designed to systematically assess GUI agents on complex, multi-step tasks that mirror real-world professional activities.
Fig. 1 Comparison of execution-based benchmarks. “Multi-app” indicates tasks with two or more applications; “Intermediate Checks” indicates tasks with intermediate-state checkpoints rather than result-only end-state evaluation. Among desktop benchmarks, WindowsWorld covers the most applications and focuses on multi-app tasks.
- 181 tasks across 17 desktop applications
- 4 difficulty levels (L1–L4): 21.5% / 44.2% / 27.6% / 6.6%
- 77.9% multi-app tasks, reflecting realistic cross-application workflows
- App-count distribution: 22.1% / 23.8% / 47.5% / 5.5% / 1.1% for tasks involving 1 / 2 / 3 / 4 / 5 apps
- 4.97 intermediate checkpoints per task on average for process-aware evaluation
- Grounded in 16 professional personas and diverse real-world office scenarios
Fig. 2 Model and agent performance on WindowsWorld. All large models use a unified PyAutoGUI action space, while UiPath employs the Computer_13 action space from OSWorld. Pure models are evaluated under Screenshot, Screenshot + Accessibility Tree, and Set-of-Mark inputs; agent-based systems (S3 and UiPath) use Screenshot input. In addition, S3 and UiPath integrate UI-TARS-1.5-7B as a grounding model. Each task is executed under a fixed maximum step budget that depends on task level: 15 (L1), 25 (L2), 40 (L3), and 20 (L4).
Supported: Windows 10/11, Windows Server 2022/2025
You may need to sign up for a (free) Broadcom account to download VMware Workstation. Any version is OK.
Note: newer versions do not support Chinese.
Requires vmrun in PATH.
The default installation path is `C:\Program Files (x86)\VMware\VMware Workstation\vmrun.exe`. Check it by running:

```
vmrun
```

It should print the usage of vmrun if it is correctly installed and added to PATH.
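If you prefer a programmatic check, here is a minimal Python sketch (illustrative, not part of this repo) that verifies `vmrun` is reachable on PATH:

```python
import shutil
import subprocess

# Look up vmrun on PATH; shutil.which returns None if it is not found.
vmrun_path = shutil.which("vmrun")
if vmrun_path is None:
    raise SystemExit(
        "vmrun not found on PATH; add the VMware Workstation folder "
        r"(e.g. C:\Program Files (x86)\VMware\VMware Workstation) to PATH."
    )
print(f"Found vmrun at: {vmrun_path}")

# Invoking vmrun with no arguments prints its usage text.
subprocess.run([vmrun_path], check=False)
```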
Requires (and validated on) Python 3.11+.
First, clone this repository:

```
git clone https://github.com/HITsz-TMG/WindowsWorld.git
cd WindowsWorld
```

Then, install dependencies:
```
# Create and activate a new conda environment
conda create -n windowsworld python=3.11 -y
conda activate windowsworld

# Install dependencies
pip install -r requirements.txt
```

Set the environment variables for keys:
| Model Type | KEY | URL |
|---|---|---|
| GPT | OPENAI_API_KEY | OPENAI_API_BASE |
| Gemini | GEMINI_API_KEY | GEMINI_API_BASE |
| Claude | ANTHROPIC_API_KEY | ANTHROPIC_API_BASE |
| Qwen | QWEN_API_KEY | QWEN_API_BASE |
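Before launching a run, you can sanity-check that the key pair for your model family is set. Below is a minimal Python sketch (illustrative, not part of this repo); the variable names come from the table above, and only the pair for the family you evaluate is needed:

```python
import os

# Key/URL variable pairs from the table above.
REQUIRED = {
    "GPT": ("OPENAI_API_KEY", "OPENAI_API_BASE"),
    "Gemini": ("GEMINI_API_KEY", "GEMINI_API_BASE"),
    "Claude": ("ANTHROPIC_API_KEY", "ANTHROPIC_API_BASE"),
    "Qwen": ("QWEN_API_KEY", "QWEN_API_BASE"),
}

family = "GPT"  # change to the model family you plan to evaluate
missing = [name for name in REQUIRED[family] if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables for {family}: {missing}")
print(f"All {family} variables are set.")
```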
Import the virtual machine by following this guide: ./Installation Guide.md.
The virtual machine's folder structure should be like this:
```
D:\Virtual Machines
├── Windows0
│   ├── Windows0-disk1.vmx
│   ├── Windows0.vmdk
│   └── ...
├── Windows1
│   ├── Windows1-disk1.vmx
│   ├── ...
```
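To confirm your layout matches, here is a small Python sketch (illustrative, not part of this repo) that checks each VM folder contains a `.vmx` file; adjust `vm_root` to your own path:

```python
from pathlib import Path

# Root folder holding the VM image folders (adjust to your setup).
vm_root = Path(r"D:\Virtual Machines")

for vm_dir in sorted(p for p in vm_root.iterdir() if p.is_dir()):
    vmx_files = list(vm_dir.glob("*.vmx"))
    status = vmx_files[0].name if vmx_files else "no .vmx found!"
    print(f"{vm_dir.name}: {status}")
```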
Run the benchmark:

```
python hf_run.py \
    -b benchmark.json \
    -v path_to_vm_image_folder \
    -m model_name \
    -a pyautogui/computer_13 \
    -o screenshot/som/a11y/screenshot_a11y \
    -c parallel_count
```

- `path_to_vm_image_folder` is the folder that contains the VM image you downloaded, such as `D:\Virtual Machines\WindowsWorld`.
- `model_name` selects which model API to use (see the code).
- `-a` sets the action space: `pyautogui` uses PyAutoGUI directly; `computer_13` follows `./mm_agents/prompts.py` (line 44).
- `-o` sets the observation type: `screenshot` uses only the screenshot as observation; `a11y` uses only accessibility information; `screenshot_a11y` uses both; `som` is Set-of-Mark.
- `-c` sets the number of parallel runs.
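For intuition, under the `pyautogui` action space the model emits raw PyAutoGUI calls. The snippet below is a hypothetical illustration of such actions, not an actual trace produced by this benchmark:

```python
import pyautogui

# Hypothetical actions a model might emit under the pyautogui action space.
pyautogui.click(x=120, y=240)                            # click a UI element by screen coordinates
pyautogui.write("quarterly_report.xlsx", interval=0.05)  # type into the focused field
pyautogui.hotkey("ctrl", "s")                            # save via keyboard shortcut
```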
Example:
```
python hf_run.py \
    -b benchmark.json \
    -v "D:\Virtual Machines" \
    -m gemini-3-flash-preview \
    -a computer_13 \
    -o screenshot \
    -c 1
```

You can use the following command to summarize results after running the benchmark:
```
# Default: read results from ./hf_result
python show_result.py
```

Figure 3: Benchmark analysis of WindowsWorld. (a) Distribution of tasks across difficulty levels (L1–L4), highlighting the prevalence of non-trivial multi-step workflows. (b) Distribution of the number of applications per task, highlighting the prevalence of multi-app workflows. (c) Distribution of task checkpoints by difficulty (L1–L3), showing increased checkpoint density for complex tasks.
This project builds upon OSWorld; a substantial portion of the evaluation framework is derived or adapted from it. We thank the OSWorld authors for open-sourcing their benchmark and infrastructure.
The OSWorld-derived portions of this repository remain subject to the Apache License 2.0.
Please cite our paper if you find this benchmark useful for your research:
@inproceedings{li2026windowsworld,
title={WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments},
author={Jinchao Li and Yunxin Li and Chenrui Zhao and Zhenran Xu and Baotian Hu and Min Zhang},
booktitle={Findings of the Association for Computational Linguistics: ACL 2026},
year={2026},
url={https://openreview.net/forum?id=qDZP06FdPl}
}


