Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces Paper • 2601.11868 • Published 10 days ago • 28
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development Paper • 2601.11077 • Published 10 days ago • 63
Deriving Character Logic from Storyline as Codified Decision Trees Paper • 2601.10080 • Published 11 days ago • 6
Lost in the Noise: How Reasoning Models Fail with Contextual Distractors Paper • 2601.07226 • Published 14 days ago • 32
OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs Paper • 2601.01592 • Published 22 days ago • 12
RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models Paper • 2601.03699 • Published 19 days ago • 6
UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement Paper • 2512.21185 • Published Dec 24, 2025 • 30
Are We on the Right Way to Assessing LLM-as-a-Judge? Paper • 2512.16041 • Published Dec 17, 2025 • 34
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management Paper • 2512.12967 • Published Dec 15, 2025 • 107
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models Paper • 2512.02556 • Published Dec 2, 2025 • 253