C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models Paper • 2305.08322 • Published May 15, 2023
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents Paper • 2401.13178 • Published Jan 24, 2024
When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search Paper • 2606.27669 • Published 5 days ago
PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation Paper • 2508.05976 • Published Aug 8, 2025
Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations Paper • 2602.05885 • Published Feb 5
INDUS: Effective and Efficient Language Models for Scientific Applications Paper • 2405.10725 • Published May 17, 2024