CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents Paper • 2606.22883 • Published 4 days ago • 31
TVIR: Building Deep Research Agents Towards Text-Visual Interleaved Report Generation Paper • 2606.02320 • Published 25 days ago • 14
Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning Paper • 2606.07602 • Published 28 days ago • 6
WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models Paper • 2604.18224 • Published Apr 20 • 22
OProver: A Unified Framework for Agentic Formal Theorem Proving Paper • 2605.17283 • Published May 17 • 31
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding Paper • 2603.27064 • Published Mar 28 • 29
Justified or Just Convincing? Error Verifiability as a Dimension of LLM Quality Paper • 2604.04418 • Published Apr 6 • 1
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning Paper • 2601.06002 • Published Jan 9 • 60
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents Paper • 2512.12730 • Published Dec 14, 2025 • 52
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies Paper • 2602.09514 • Published Feb 10 • 11
Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization Paper • 2602.22675 • Published Feb 26 • 23