Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models Paper • 2604.16593 • Published 21 days ago • 6
$OneMillion-Bench: How Far are Language Agents from Human Experts? Paper • 2603.07980 • Published Mar 9 • 27
MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier Paper • 2603.03756 • Published Mar 4 • 89
Understanding and Leveraging the Expert Specialization of Context Faithfulness in Mixture-of-Experts LLMs Paper • 2508.19594 • Published Aug 27, 2025 • 3
LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts Paper • 2602.14060 • Published Feb 15 • 2
TongSIM: A General Platform for Simulating Intelligent Machines Paper • 2512.20206 • Published Dec 23, 2025 • 28
ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection Paper • 2505.16475 • Published May 22, 2025 • 3
RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling Paper • 2506.08672 • Published Jun 10, 2025 • 30
Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning Paper • 2512.07461 • Published Dec 8, 2025 • 79
Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space Paper • 2505.13308 • Published May 19, 2025 • 27
Absolute Zero: Reinforced Self-play Reasoning with Zero Data Paper • 2505.03335 • Published May 6, 2025 • 191
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts Paper • 2503.22952 • Published Mar 29, 2025 • 17
Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models Paper • 2405.02861 • Published May 5, 2024 • 1
MindMap: Knowledge Graph Prompting Sparks Graph of Thoughts in Large Language Models Paper • 2308.09729 • Published Aug 17, 2023 • 6