NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents Paper • 2512.12730 • Published Dec 14, 2025 • 44
AutoMV: An Automatic Multi-Agent System for Music Video Generation Paper • 2512.12196 • Published Dec 13, 2025 • 6
Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements Paper • 2512.24867 • Published 26 days ago • 1
Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space Paper • 2512.24617 • Published 26 days ago • 61
AInsteinBench: Benchmarking Coding Agents on Scientific Repositories Paper • 2512.21373 • Published Dec 24, 2025
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning Paper • 2601.06002 • Published 17 days ago • 50
How Far Are We from Genuinely Useful Deep Research Agents? Paper • 2512.01948 • Published Dec 1, 2025 • 56
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence Paper • 2511.18538 • Published Nov 23, 2025 • 294
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity Paper • 2511.03146 • Published Nov 5, 2025 • 7
RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization Paper • 2511.04285 • Published Nov 6, 2025 • 8
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs Paper • 2511.07250 • Published Nov 10, 2025 • 18
DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains Paper • 2511.10984 • Published Nov 14, 2025 • 5
Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds Paper • 2511.08892 • Published Nov 12, 2025 • 208