Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents Paper • 2606.19704 • Published 8 days ago • 39
QueST: Persistent Queries as Semantic Monitors for Drift Suppression in Long-Horizon Tracking Paper • 2605.09513 • Published May 10 • 2