Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents Paper • 2605.07630 • Published May 8 • 1
PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions Paper • 2606.14832 • Published 12 days ago • 12
PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models Paper • 2605.20873 • Published May 20 • 44
DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo Paper • 2605.16257 • Published May 15 • 55
PanoWorld: Towards Spatial Supersensing in 360^circ Panorama World Paper • 2605.13169 • Published May 13 • 21
AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward Paper • 2605.12495 • Published May 12 • 35
Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection Paper • 2512.16905 • Published Dec 18, 2025 • 32
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning Paper • 2512.12799 • Published Dec 14, 2025 • 12
TokBench: Evaluating Your Visual Tokenizer before Visual Generation Paper • 2505.18142 • Published May 23, 2025 • 2
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data Paper • 2509.15221 • Published Sep 18, 2025 • 111