TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents Paper • 2602.07274 • Published 8 days ago • 196
JudgeBench: A Benchmark for Evaluating LLM-based Judges Paper • 2410.12784 • Published Oct 16, 2024 • 47