Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring • Paper • 2605.30834 • Published 5 days ago • 7
GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers • Paper • 2604.02648 • Published Apr 3 • 47
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise • Paper • 2602.12783 • Published Feb 13 • 246