arxiv:2605.25160

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

Published on May 24

Submitted by Guohong Liu on May 26

Tsinghua University


Authors:

Guohong Liu ,

Abstract

AI-generated summary: A synthetic benchmark for mobile GUI agents with 120 challenging tasks is introduced, featuring high-fidelity virtual environments with automatic reward generation and revealing significant limitations in current agent performance on complex, long-horizon interactions.

Mobile GUI agents powered by large language models have progressed rapidly, creating an urgent need for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility but are often limited to open-source apps or file-operation tasks because of the difficulty of constructing rewards on real applications, leaving a gap between benchmark settings and real-world usage. Moreover, most benchmarks focus on basic grounding and navigation, with limited coverage of complex, long-horizon interactions. To address these limitations, we introduce SimuWoB, a fully synthetic benchmark for mobile GUI agents with 120 challenging tasks spanning diverse types and difficulty levels. We build a robust virtual environment generation framework that synthesizes high-fidelity tasks and environments, and automatically provides valid rewards for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We conduct comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, dropping to 17.82% on long-horizon tasks, which reveals substantial weaknesses in current agents under complex scenarios. A comparison of evaluation results with those on real-world sample tasks demonstrates that agent assessments based on our synthetic environment generalize well. We further provide diagnostic insights across key capability dimensions and discuss implications for future mobile GUI agent development.
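The abstract reports an average success rate overall and on long-horizon tasks, but this page does not say how the average is aggregated. As a rough illustration only, the sketch below assumes a macro-average: each agent's success rate is computed per task category, then averaged across agents. The function name `success_rates`, the category labels, and the toy records are all hypothetical and are not taken from the paper or its results.

```python
from collections import defaultdict


def success_rates(results):
    """Aggregate per-category success rates (in percent), averaged across agents.

    `results` is an iterable of (agent, category, succeeded) tuples.
    This record format is an assumption made for illustration.
    """
    # agent -> category -> list of boolean outcomes
    per_agent = defaultdict(lambda: defaultdict(list))
    for agent, category, succeeded in results:
        per_agent[agent][category].append(bool(succeeded))
        per_agent[agent]["overall"].append(bool(succeeded))

    # category -> list of per-agent success rates
    per_category = defaultdict(list)
    for categories in per_agent.values():
        for category, outcomes in categories.items():
            per_category[category].append(100.0 * sum(outcomes) / len(outcomes))

    # Macro-average: every agent counts equally within a category.
    return {c: sum(rates) / len(rates) for c, rates in per_category.items()}


# Made-up toy records, NOT results from the paper.
toy = [
    ("agent_a", "simple", True),
    ("agent_a", "simple", True),
    ("agent_a", "long_horizon", False),
    ("agent_a", "long_horizon", True),
    ("agent_b", "simple", True),
    ("agent_b", "simple", False),
    ("agent_b", "long_horizon", False),
    ("agent_b", "long_horizon", False),
]

rates = success_rates(toy)
print(rates)
```

A micro-average (pooling all task attempts before dividing) would give the same numbers here because every agent attempts the same number of tasks; the two differ only when agents are evaluated on different task counts.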


Community

Zacharyvixx

Paper author Paper submitter about 16 hours ago

This work introduces a fully synthetic benchmark for mobile GUI agents, designed to close the gap between existing evaluation settings and real-world mobile app usage. It generates backend-free webpage environments that simulate realistic mobile applications, along with valid automatic rewards, so tasks can be evaluated efficiently and reproducibly without heavy manual setup. The benchmark contains 120 tasks across 63 simulated app environments, covering simple, long-horizon, and math-related interactions drawn from diverse real-world usage scenarios. Experiments on state-of-the-art mobile GUI agents show that current systems still struggle substantially, especially on long-horizon tasks, which makes this benchmark useful both for measuring progress and for diagnosing where today’s agents still fall short.

avahal

about 4 hours ago

the two-stage synthetic app pipeline in simuwob is clever, but i worry about edge cases like non-deterministic ui flows and localization changes that could break validators or reward signals. an ablation on the human-in-the-loop repair vs fully automatic generation would help pin down how much reliability comes from human checks vs the codegen loop. btw the arxivlens breakdown does a nice job unpacking the method details and helped me follow the flow, for example https://arxivlens.com/PaperView/Details/simuwob-simulating-real-world-mobile-apps-for-fast-and-faithful-gui-agent-benchmarking-3027-4897d0c0. would these edge cases transfer when evaluating on unseen app families or real apps, i.e. does the synthetic set really generalize to more varied ui semantics?


