ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics Paper • 2606.10479 • Published 3 days ago • 18
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research Paper • 2606.07591 • Published 15 days ago • 85
Draft-OPD: On-Policy Distillation for Speculative Draft Models Paper • 2605.29343 • Published 15 days ago • 34
COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation Paper • 2605.31264 • Published 14 days ago • 111
π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows Paper • 2605.14678 • Published 24 days ago • 105
ACC: Compiling Agent Trajectories for Long-Context Training Paper • 2605.21850 • Published 22 days ago • 59
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale Paper • 2604.04771 • Published Apr 6 • 123
Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale Paper • 2603.25040 • Published Mar 26 • 133
Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports Paper • 2603.09896 • Published Mar 10 • 28
Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence Paper • 2603.07660 • Published Mar 8 • 87
The Trinity of Consistency as a Defining Principle for General World Models Paper • 2602.23152 • Published Feb 26 • 202
TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization Paper • 2601.16480 • Published Jan 23 • 50
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models Paper • 2601.14004 • Published Jan 20 • 49
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows Paper • 2512.16969 • Published Dec 18, 2025 • 120
SGI-Bench Collection Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows • 12 items • Updated May 6 • 33
Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning Paper • 2512.10534 • Published Dec 11, 2025 • 33
OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification Paper • 2512.10756 • Published Dec 11, 2025 • 35
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization Paper • 2511.15705 • Published Nov 19, 2025 • 98
PICABench: How Far Are We from Physically Realistic Image Editing? Paper • 2510.17681 • Published Oct 20, 2025 • 65