KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
Abstract
KnowU-Bench presents a comprehensive benchmark for personalized mobile agents that evaluates true preference inference and proactive assistance capabilities in real-world GUI environments.
Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts. Neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench covers the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, assessed through a hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution fall below 50% success under vague instructions requiring user preference inference or intervention calibration, even for frontier models like Claude Sonnet 4.6. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.
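The hybrid evaluation protocol described above can be illustrated with a minimal sketch: a rule-based check verifies the final device state, while an LLM-judge score rates consent handling and post-rejection restraint from the dialogue transcript, and the two are combined into an episode score. All function names, the toy judge heuristic, and the weights below are illustrative assumptions, not the benchmark's actual API (a real judge would call a model rather than match strings).

```python
# Hypothetical sketch of a hybrid scoring protocol in the spirit of
# KnowU-Bench: rule-based verification of the final GUI state plus an
# LLM-as-a-Judge interaction score. Names and weights are assumptions.

def rule_based_check(final_state: dict, expected: dict) -> bool:
    """Pass iff every expected key/value is present in the final GUI state."""
    return all(final_state.get(k) == v for k, v in expected.items())

def llm_judge_score(transcript: list[str]) -> float:
    """Stand-in for an LLM judge rating the interaction on a 0-1 scale.
    Toy heuristic: reward explicit consent requests, penalize acting
    again after the simulated user declines (post-rejection restraint)."""
    score, declined = 0.5, False
    for turn in transcript:
        if "may i" in turn.lower():
            score += 0.25  # agent sought consent before intervening
        if "no, don't" in turn.lower():
            declined = True
        elif declined and turn.startswith("agent_action:"):
            score -= 0.5   # acted anyway after an explicit rejection
    return max(0.0, min(1.0, score))

def hybrid_score(final_state, expected, transcript, w_rule=0.6, w_judge=0.4):
    """Weighted combination of state verification and judge score."""
    rule = 1.0 if rule_based_check(final_state, expected) else 0.0
    return w_rule * rule + w_judge * llm_judge_score(transcript)

# Example episode: correct final state, consent requested and granted.
state = {"alarm_set": "07:00", "ringtone": "silent"}
expected = {"alarm_set": "07:00"}
transcript = ["agent: May I set the ringtone to silent as well?",
              "user: yes, please"]
print(round(hybrid_score(state, expected, transcript), 2))
```

The weighting here is arbitrary; the point is that GUI-state correctness and interaction quality are scored by different mechanisms and only then combined, so an agent cannot compensate for ignoring a refusal by reaching the right screen.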
Community
We introduce KnowU-Bench, an online, interactive personalization benchmark for mobile agents built on a reproducible Android emulation environment.
We find that even frontier models like Claude Sonnet 4.6 struggle with our personalized and proactive tasks under vague instructions: the core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a large gap between GUI operation and trustworthy personal assistance.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent (2026)
- Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History (2026)
- PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents (2026)
- Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks (2026)
- LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation (2026)
- BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs (2026)
- PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments (2026)