Agent Benchmarks
Note: SciBench is a benchmark of college-level scientific problems sourced from instructional textbooks.
google/frames-benchmark
Note: 824 challenging multi-hop questions, each requiring information from 2-15 Wikipedia articles. Adapting this to a search agent with no RAG component would be especially interesting.
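A no-RAG search-agent adaptation could look like the following minimal sketch, where `llm` and `search` are hypothetical stand-ins for a model call and a web-search tool; the agent gathers evidence hop by hop instead of retrieving it up front:

```python
# Minimal sketch of a no-RAG search-agent loop for a FRAMES-style
# multi-hop question. `llm` and `search` are stubs standing in for a
# real model call and a real search API.
from typing import Callable

def run_search_agent(question: str,
                     llm: Callable[[str], str],
                     search: Callable[[str], str],
                     max_hops: int = 5) -> str:
    """Let the model issue SEARCH: queries until it emits ANSWER:."""
    context = f"Question: {question}\n"
    for _ in range(max_hops):
        reply = llm(context)
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        if reply.startswith("SEARCH:"):
            query = reply[len("SEARCH:"):].strip()
            # Append the new evidence so the next hop can build on it.
            context += f"\nSearch results for '{query}':\n{search(query)}\n"
    return ""  # no answer within the hop budget
```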
gaia-benchmark/GAIA
HuggingFaceH4/MATH-500
Note: Can be nicely adapted to an agentic setting with code.
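For instance, a minimal grader for such an agentic harness might compare the agent's final answer against the dataset's ground-truth answer after stripping LaTeX `\boxed{}` wrappers. This is a hypothetical sketch; real graders typically use symbolic equivalence checks (e.g. via sympy) rather than string matching:

```python
# Hypothetical answer checker for a MATH-500 agent harness:
# normalize both sides by extracting the last \boxed{...} (if present)
# and dropping whitespace, then compare strings.
import re

def extract_boxed(text: str) -> str:
    """Return the content of the last \\boxed{...} in `text`, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else text

def is_correct(model_output: str, ground_truth: str) -> bool:
    norm = lambda s: extract_boxed(s).strip().replace(" ", "")
    return norm(model_output) == norm(ground_truth)
```

String matching will miss equivalent forms like `0.5` vs `1/2`, which is why production harnesses prefer symbolic comparison.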
smolagents/browse_comp
Note: A simple and challenging benchmark that measures the ability of AI agents to locate hard-to-find information online.
zai-org/ComplexFuncBench
Note: The ComplexFuncBench dataset encompasses 1,000 complex function-calling samples covering five aspects:
(1) multi-step function calling within a single turn;
(2) function calling with user-provided constraints;
(3) function calling that requires reasoning parameter values from implicit information;
(4) function calling with long parameter values that exceed 500 tokens;
(5) function calling with 128k long-context length.
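To make aspects (1) and (3) concrete, here is a hypothetical sample shape (all field names and function names invented for illustration, not the dataset's actual schema), where one user turn requires two chained calls and the second call's argument must be derived from the first call's result:

```python
# Invented illustration of a multi-step, single-turn function-calling
# sample: the booking call depends on an id produced by the search call.
sample = {
    "query": "Book the cheapest flight from SFO to JFK next Monday.",
    "expected_calls": [
        {"name": "search_flights",
         "arguments": {"origin": "SFO", "destination": "JFK",
                       "date": "next Monday"}},
        {"name": "book_flight",
         # the flight id is implicit: it comes from search_flights' result
         "arguments": {"flight_id": "<result of search_flights>"}},
    ],
}
```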
galileo-ai/agent-leaderboard
Note: Comprehensive evaluation across multiple domains and interaction types, leveraging diverse datasets:
- BFCL: mathematics, entertainment, education, and academic domains
- τ-bench: retail and airline industry scenarios
- xLAM: cross-domain data generation (21 domains)
- ToolACE: API interactions across 390 domains
zai-org/SWE-Dev-train
Note: Training traces that helped a 32B model achieve GPT-4o-level performance on SWE-bench. Uses the OpenHands framework.
SWE-Gym/SWE-Gym