Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces. Paper • 2601.11868 • Published 10 days ago
AI2 WildBench Leaderboard (V2) 🦁: Display and explore a leaderboard of language models