WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
Abstract
A cross-application workflow benchmark named WindowsWorld was developed to evaluate GUI agents on complex multi-step tasks requiring coordination across multiple software applications, revealing significant performance gaps in current agents when handling real-world professional workflows.
While GUI agents have shown impressive capabilities on common computer-use benchmarks such as OSWorld, current benchmarks focus mainly on isolated, single-application tasks. This overlooks a critical real-world requirement: coordinating across multiple applications to accomplish complex, profession-specific workflows. To bridge this gap, we present WindowsWorld, a computer-use benchmark for cross-application workflows, designed to systematically assess GUI agents on complex multi-step tasks that mirror real-world professional activities. Our methodology uses a multi-agent framework steered by 16 occupations to generate tasks at four difficulty levels with intermediate inspection; these tasks are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub-goals across 17 common desktop applications, of which 78% are inherently multi-application. Experimental results for leading large models and agents show that: 1) all computer-use agents perform poorly on multi-application tasks (< 21% success rate), far below their performance on simple single-app tasks; 2) they largely fail at tasks requiring conditional judgment and reasoning across ≥ 3 applications, stalling at early sub-goals; and 3) they exhibit low execution efficiency, often failing even after far exceeding human step limits. Code, benchmark data, and evaluation resources are available at github.com/HITsz-TMG/WindowsWorld.
Community
Sharing our ACL Findings paper, WindowsWorld.
TL;DR: We introduce a process-centric benchmark for evaluating GUI agents on professional cross-application Windows workflows.
WindowsWorld contains 181 tasks across 17 desktop applications, with 77.9% multi-application tasks and intermediate checkpoints for measuring partial progress. Experiments show that current GUI agents still struggle with long-horizon cross-application coordination, with the best setting reaching about 20% final success.
Code/data: https://github.com/HITsz-TMG/WindowsWorld
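As a rough illustration of the intermediate-checkpoint idea described above, a partial-progress score could be computed per task as the fraction of ordered sub-goal checkpoints an agent clears before its first failure. This is a hypothetical sketch, not the benchmark's actual evaluation API; the real evaluation code lives in the repository linked above.

```python
# Hypothetical sketch of a checkpoint-based partial-progress metric.
# Assumption: checkpoints are ordered sub-goal checks, and an agent that
# fails an early checkpoint cannot meaningfully claim later ones.
from typing import List

def subgoal_progress(checkpoints_passed: List[bool]) -> float:
    """Fraction of ordered sub-goal checkpoints completed before the
    first failure. Agents that stall at early sub-goals score low even
    if later checks would trivially pass."""
    completed = 0
    for passed in checkpoints_passed:
        if not passed:
            break
        completed += 1
    return completed / len(checkpoints_passed) if checkpoints_passed else 0.0

# Example: an agent that clears 2 of 5 sub-goals, then stalls.
print(subgoal_progress([True, True, False, True, False]))  # 0.4
```

Averaging such per-task scores would distinguish agents that make partial headway on long-horizon workflows from those that fail outright, which a binary final-success rate cannot.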
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- OSExpert: Computer-Use Agents Learning Professional Skills via Exploration (2026)
- EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings (2026)
- OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation (2026)
- Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification (2026)
- Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence? (2026)
- GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents (2026)
- AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation (2026)