FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark Paper • 2409.19014 • Published Sep 24, 2024
Towards Fully-Automated Materials Discovery via Large-Scale Synthesis Dataset and Expert-Level LLM-as-a-Judge Paper • 2502.16457 • Published Feb 23, 2025
Overthinking Loops in Agents: A Structural Risk via MCP Tools Paper • 2602.14798 • Published Feb 16, 2026