SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use Paper • 2607.01874 • Published 2 days ago • 12
A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets Paper • 2606.13802 • Published 23 days ago • 1
Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players Paper • 2605.28816 • Published May 27 • 431
Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining Paper • 2605.14747 • Published May 14 • 147
SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution Paper • 2605.18401 • Published May 18 • 130
PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks Paper • 2605.10977 • Published May 9 • 10
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation Paper • 2605.10912 • Published May 11 • 46
RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time Paper • 2604.11626 • Published Apr 13 • 103
OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training Paper • 2603.28858 • Published Mar 30 • 9
FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization Paper • 2603.19835 • Published Mar 20 • 353
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models Paper • 2603.25716 • Published Mar 26 • 157
VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training Paper • 2602.10693 • Published Feb 11 • 221
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise Paper • 2602.12783 • Published Feb 13 • 246
NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control Paper • 2602.09070 • Published Feb 9 • 46