WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
WindowsWorld is a computer-use benchmark for cross-application workflows, designed to systematically assess GUI agents on complex, multi-step tasks that mirror real-world professional activities.
Fig. 1 Comparison of execution-based benchmarks. “Multi-app” indicates tasks with two or more applications; “Intermediate Checks” indicates tasks with intermediate-state checkpoints rather than result-only end-state evaluation. Among desktop benchmarks, WindowsWorld covers the most applications and focuses on multi-app tasks.
- 181 tasks across 17 desktop applications
- 4 difficulty levels (L1–L4): 21.5% / 44.2% / 27.6% / 6.6%
- 77.9% multi-app tasks, reflecting realistic cross-application workflows
- App-count distribution: 22.1% / 23.8% / 47.5% / 5.5% / 1.1% for tasks involving 1 / 2 / 3 / 4 / 5 apps
- 4.97 intermediate checkpoints per task on average for process-aware evaluation
- Grounded in 16 professional personas and diverse real-world office scenarios
Fig. 2 Model and agent performance on WindowsWorld. All large models use a unified PyAutoGUI action space, while UiPath employs the Computer_13 action space from OSWorld. Pure models are evaluated under Screenshot, Screenshot + Accessibility Tree, and Set-of-Mark inputs; agent-based systems (S3 and UiPath) use Screenshot input. In addition, S3 and UiPath integrate UI-TARS-1.5-7B as a grounding model. Each task is executed under a fixed maximum step budget that depends on task level: 15 (L1), 25 (L2), 40 (L3), and 20 (L4).
Supported: Windows 10/11, Windows Server 2022/2025
You may need to sign up for a (free) Broadcom account to download VMware Workstation. Any version is OK.
Note: newer versions do not support Chinese.
Requires vmrun in PATH.
The default installation path is `C:\Program Files (x86)\VMware\VMware Workstation\vmrun.exe`. Check it by running:

```
vmrun
```

It should print the usage of vmrun if it is correctly installed and added to PATH.
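If you prefer a programmatic check, here is a minimal Python sketch (illustrative, not part of this repo) that verifies `vmrun` is reachable on PATH:

```python
import shutil
import subprocess

# Look up vmrun on PATH; shutil.which returns None if it is not found.
vmrun_path = shutil.which("vmrun")
if vmrun_path is None:
    raise SystemExit(
        "vmrun not found on PATH; add the VMware Workstation folder "
        r"(e.g. C:\Program Files (x86)\VMware\VMware Workstation) to PATH."
    )
print(f"Found vmrun at: {vmrun_path}")

# Invoking vmrun with no arguments prints its usage text.
subprocess.run([vmrun_path], check=False)
```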
Requires (and validated on) Python 3.11+.
First, clone this repository:

```
git clone https://github.com/HITsz-TMG/WindowsWorld.git
cd WindowsWorld
```

Then, install dependencies:
```
# Create and activate a new conda environment
conda create -n windowsworld python=3.11 -y
conda activate windowsworld

# Install dependencies
pip install -r requirements.txt
```

Set the environment variables for keys:
| Model Type | KEY | URL |
|---|---|---|
| GPT | OPENAI_API_KEY | OPENAI_API_BASE |
| Gemini | GEMINI_API_KEY | GEMINI_API_BASE |
| Claude | ANTHROPIC_API_KEY | ANTHROPIC_API_BASE |
| Qwen | QWEN_API_KEY | QWEN_API_BASE |
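Before launching a run, you can sanity-check that the key pair for your model family is set. Below is a minimal Python sketch (illustrative, not part of this repo); the variable names come from the table above, and only the pair for the family you evaluate is needed:

```python
import os

# Key/URL variable pairs from the table above.
REQUIRED = {
    "GPT": ("OPENAI_API_KEY", "OPENAI_API_BASE"),
    "Gemini": ("GEMINI_API_KEY", "GEMINI_API_BASE"),
    "Claude": ("ANTHROPIC_API_KEY", "ANTHROPIC_API_BASE"),
    "Qwen": ("QWEN_API_KEY", "QWEN_API_BASE"),
}

family = "GPT"  # change to the model family you plan to evaluate
missing = [name for name in REQUIRED[family] if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables for {family}: {missing}")
print(f"All {family} variables are set.")
```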
Import the virtual machine by following this guide: ./Installation Guide.md.
The virtual machine's folder structure should be like this:
```
D:\Virtual Machines
├── Windows0
│   ├── Windows0-disk1.vmx
│   ├── Windows0.vmdk
│   └── ...
├── Windows1
│   ├── Windows1-disk1.vmx
│   ├── ...
```
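To confirm your layout matches, here is a small Python sketch (illustrative, not part of this repo) that checks each VM folder contains a `.vmx` file; adjust `vm_root` to your own path:

```python
from pathlib import Path

# Root folder holding the VM image folders (adjust to your setup).
vm_root = Path(r"D:\Virtual Machines")

for vm_dir in sorted(p for p in vm_root.iterdir() if p.is_dir()):
    vmx_files = list(vm_dir.glob("*.vmx"))
    status = vmx_files[0].name if vmx_files else "no .vmx found!"
    print(f"{vm_dir.name}: {status}")
```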
Run the benchmark:

```
python hf_run.py \
    -b benchmark.json \
    -v path_to_vm_image_folder \
    -m model_name \
    -a pyautogui/computer_13 \
    -o screenshot/som/a11y/screenshot_a11y \
    -c parallel_count
```

- `path_to_vm_image_folder` is the folder that contains the VM image you downloaded, such as `D:\Virtual Machines\WindowsWorld`.
- `model_name` selects which model API to use (see the code).
- `-a` sets the action space: `pyautogui` uses PyAutoGUI directly; `computer_13` follows `./mm_agents/prompts.py` (line 44).
- `-o` sets the observation type: `screenshot` uses only the screenshot as observation; `a11y` uses only accessibility information; `screenshot_a11y` uses both; `som` is Set-of-Mark.
- `-c` sets the number of parallel runs.
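For intuition, under the `pyautogui` action space the model emits raw PyAutoGUI calls. The snippet below is a hypothetical illustration of such actions, not an actual trace produced by this benchmark:

```python
import pyautogui

# Hypothetical actions a model might emit under the pyautogui action space.
pyautogui.click(x=120, y=240)                            # click a UI element by screen coordinates
pyautogui.write("quarterly_report.xlsx", interval=0.05)  # type into the focused field
pyautogui.hotkey("ctrl", "s")                            # save via keyboard shortcut
```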
Example:
```
python hf_run.py \
    -b benchmark.json \
    -v "D:\Virtual Machines" \
    -m gemini-3-flash-preview \
    -a computer_13 \
    -o screenshot \
    -c 1
```

You can use the following command to summarize results after running the benchmark:
```
# Default: read results from ./hf_result
python show_result.py
```

Figure 3: Benchmark analysis of WindowsWorld. (a) Distribution of tasks across difficulty levels (L1–L4), highlighting the prevalence of non-trivial multi-step workflows. (b) Distribution of the number of applications per task, highlighting the prevalence of multi-app workflows. (c) Distribution of task checkpoints by difficulty (L1–L3), showing increased checkpoint density for complex tasks.
This project builds upon OSWorld; a substantial portion of the evaluation framework is derived or adapted from it. We thank the OSWorld authors for open-sourcing their benchmark and infrastructure.
The OSWorld-derived portions of this repository remain subject to the Apache License 2.0.
Please cite our paper if you find this benchmark useful for your research:
@inproceedings{li2026windowsworld,
title={WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments},
author={Jinchao Li and Yunxin Li and Chenrui Zhao and Zhenran Xu and Baotian Hu and Min Zhang},
booktitle={Findings of the Association for Computational Linguistics: ACL 2026},
year={2026},
url={https://openreview.net/forum?id=qDZP06FdPl}
}


