WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
Abstract
A cross-application workflow benchmark named WindowsWorld was developed to evaluate GUI agents on complex multi-step tasks requiring coordination across multiple software applications, revealing significant performance gaps in current agents when handling real-world professional workflows.
While GUI agents have shown impressive capabilities on common computer-use benchmarks such as OSWorld, current benchmarks focus mainly on isolated, single-application tasks. This overlooks a critical real-world requirement: coordinating across multiple applications to accomplish complex, profession-specific workflows. To bridge this gap, we present WindowsWorld, a computer-use benchmark for cross-application workflows, designed to systematically assess GUI agents on complex multi-step tasks that mirror real-world professional activities. Our methodology uses a multi-agent framework steered by 16 occupations to generate tasks at four difficulty levels with intermediate inspection; these tasks are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub-goals across 17 common desktop applications, of which 78% are inherently multi-application. Experimental results for leading large models and agents show that: 1) all computer-use agents perform poorly on multi-application tasks (< 21% success rate), far below their performance on simple single-app tasks; 2) they largely fail at tasks requiring conditional judgment and reasoning across ≥ 3 applications, stalling at early sub-goals; and 3) they exhibit low execution efficiency, often failing even after far exceeding human step limits. Code, benchmark data, and evaluation resources are available at github.com/HITsz-TMG/WindowsWorld.
Community
Sharing our ACL Findings paper, WindowsWorld.
TL;DR: We introduce a process-centric benchmark for evaluating GUI agents on professional cross-application Windows workflows.
WindowsWorld contains 181 tasks across 17 desktop applications, with 77.9% multi-application tasks and intermediate checkpoints for measuring partial progress. Experiments show that current GUI agents still struggle with long-horizon cross-application coordination, with the best setting reaching about 20% final success.
Code/data: https://github.com/HITsz-TMG/WindowsWorld
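As a rough illustration of the intermediate-checkpoint idea described above, a partial-progress score could be computed per task as the fraction of ordered sub-goal checkpoints an agent clears before its first failure. This is a hypothetical sketch, not the benchmark's actual evaluation API; the real evaluation code lives in the repository linked above.

```python
# Hypothetical sketch of a checkpoint-based partial-progress metric.
# Assumption: checkpoints are ordered sub-goal checks, and an agent that
# fails an early checkpoint cannot meaningfully claim later ones.
from typing import List

def subgoal_progress(checkpoints_passed: List[bool]) -> float:
    """Fraction of ordered sub-goal checkpoints completed before the
    first failure. Agents that stall at early sub-goals score low even
    if later checks would trivially pass."""
    completed = 0
    for passed in checkpoints_passed:
        if not passed:
            break
        completed += 1
    return completed / len(checkpoints_passed) if checkpoints_passed else 0.0

# Example: an agent that clears 2 of 5 sub-goals, then stalls.
print(subgoal_progress([True, True, False, True, False]))  # 0.4
```

Averaging such per-task scores would distinguish agents that make partial headway on long-horizon workflows from those that fail outright, which a binary final-success rate cannot.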
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- OSExpert: Computer-Use Agents Learning Professional Skills via Exploration (2026)
- EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings (2026)
- OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation (2026)
- Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification (2026)
- Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence? (2026)
- GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents (2026)
- AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation (2026)