Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces (Paper • 2601.11868 • Published Jan 17)
AI2 WildBench Leaderboard (V2): Display and explore a leaderboard of language models