Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies Paper • 2605.03596 • Published 2 days ago • 2
HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness Paper • 2605.02396 • Published 3 days ago • 11
ClawGym: A Scalable Framework for Building Effective Claw Agents Paper • 2604.26904 • Published 8 days ago • 49
On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning Paper • 2604.01702 • Published Apr 4 • 3
How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings Paper • 2604.04323 • Published about 1 month ago • 41
Embarrassingly Simple Self-Distillation Improves Code Generation Paper • 2604.01193 • Published Apr 1 • 47 • 7
LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning Paper • 2603.21065 • Published Mar 22 • 77
How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities Paper • 2603.02578 • Published Mar 3 • 25
Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs Paper • 2505.18573 • Published May 24, 2025
Can Tool-Integrated Reinforcement Learning Generalize Across Diverse Domains? Paper • 2510.11184 • Published Oct 13, 2025 • 1
Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text Paper • 2601.10355 • Published Jan 15 • 39
Rethinking Expert Trajectory Utilization in LLM Post-training Paper • 2512.11470 • Published Dec 12, 2025 • 10 • 4