WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models • Paper • 2604.18224 • Published Apr 20 • 22
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding • Paper • 2603.27064 • Published Mar 28 • 28
Justified or Just Convincing? Error Verifiability as a Dimension of LLM Quality • Paper • 2604.04418 • Published Apr 6 • 1
Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization • Paper • 2602.22675 • Published Feb 26 • 23
O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL • Paper • 2601.03743 • Published Jan 7 • 3
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning • Paper • 2601.06002 • Published Jan 9 • 60
SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature • Paper • 2601.10108 • Published Jan 15 • 7
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies • Paper • 2602.09514 • Published Feb 10 • 11