P1: Mastering Physics Olympiads with Reinforcement Learning Paper • 2511.13612 • Published Nov 17, 2025 • 134
The End of Manual Decoding: Towards Truly End-to-End Language Models Paper • 2510.26697 • Published Oct 30, 2025 • 117
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution Paper • 2510.25726 • Published Oct 29, 2025 • 46
Glyph: Scaling Context Windows via Visual-Text Compression Paper • 2510.17800 • Published Oct 20, 2025 • 68
R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth? Paper • 2510.08189 • Published Oct 9, 2025 • 27
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models Paper • 2510.05034 • Published Oct 6, 2025 • 50
From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones Paper • 2509.25123 • Published Sep 29, 2025 • 21
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? Paper • 2509.16941 • Published Sep 21, 2025 • 21
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction Paper • 2508.11987 • Published Aug 16, 2025 • 71
Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Delibration Paper • 2509.14760 • Published Sep 18, 2025 • 53
Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation Paper • 2509.15194 • Published Sep 18, 2025 • 33