Spaces:
Sleeping
A newer version of the Gradio SDK is available: 6.19.0
Benchmark reference
What each benchmark in slm-evals measures, where data comes from, and how to configure overrides.
All benchmarks extend BaseBenchmark (src/slm_evals/benchmarks/base.py):
load_dataset()β fetch samples (Hub or local JSONL)build_prompt(sample)β format the model inputevaluate_sample(sample, prediction)β return{passed, score, note}run()β iterate, callgenerate_fn, aggregate scores (inherited)
Summary table
| Key | Benchmark | Measures | Default dataset |
|---|---|---|---|
bfcl |
Berkeley Function-Calling Leaderboard v4 | Single-turn function call accuracy | gorilla-llm/Berkeley-Function-Calling-Leaderboard |
tau_bench |
Ο-bench | Multi-turn tool + user simulation | ShishirPatil/tau-bench |
gaia |
GAIA | End-to-end agent tasks (reasoning + tools) | gaia-benchmark/GAIA |
swe_bench |
SWE-bench Verified | Code patch generation for real issues | princeton-nlp/SWE-bench_Verified |
BFCL (bfcl)
Goal: Given a user request and a function schema, does the model emit a valid JSON tool call with the correct name and arguments?
Prompt style: System message lists available functions; model must reply with only:
{"name": "<function_name>", "arguments": {<key>: <value>}}
Scoring:
- Function name must match exactly
- Arguments: exact match if
strict: true, fuzzy match ifstrict: false(recommended for SLMs)
Config overrides (benchmark_overrides.bfcl):
| Key | Default | Description |
|---|---|---|
data_path |
Hub | Local JSONL instead of Hub download |
categories |
[] (all) |
Filter BFCL categories |
strict |
false |
Require perfect argument match |
Implementation: src/slm_evals/benchmarks/bfcl.py
Ο-bench (tau_bench)
Goal: Multi-turn dialogue where the model acts as a tool-using agent while a simulated user drives the conversation toward a goal (e.g. retail order change).
Scoring: Task success after up to max_turns exchanges β did the agent satisfy the user's underlying intent using the right tools?
Config overrides (benchmark_overrides.tau_bench):
| Key | Default | Description |
|---|---|---|
data_path |
Hub | Local JSONL |
domain |
retail |
retail, airline, or both |
max_turns |
15 |
Dialogue cap |
use_llm_user |
false |
true β GPT-4o user simulator (paid API) |
Notes:
- Default user simulator is rule-based β no API key required
- Small models often struggle on long horizons; start with
--max-samples 10
Implementation: src/slm_evals/benchmarks/tau_bench.py
GAIA (gaia)
Goal: Real-world assistant tasks requiring reasoning, optional tool use, and concise final answers (web search, files, calculation, etc.).
Prompt style: Question + level metadata; tool availability depends on tool_mode.
Scoring: Normalized answer match against GAIA reference (with level breakdown in aggregates).
Config overrides (benchmark_overrides.gaia):
| Key | Default | Description |
|---|---|---|
data_path |
Hub | Local JSONL |
split |
validation |
Public validation; test may need HF auth |
levels |
[1, 2] |
Difficulty levels 1β3 |
tool_mode |
describe |
describe = offline tool docs; none = no tools |
Notes:
tool_mode: describedoes not execute live tools β suitable for offline SLM scoring- For live tool eval, extend
gaia.pywith real tool backends
Implementation: src/slm_evals/benchmarks/gaia.py
SWE-bench Verified (swe_bench)
Goal: Given a GitHub issue and codebase context, produce a unified diff that fixes the bug.
Modes:
full_eval |
Behavior |
|---|---|
false (default) |
Generate patch text; score with lightweight heuristics / match checks β no Docker |
true |
Official SWE-bench harness β runs tests in containers (swebench + Docker) |
Config overrides (benchmark_overrides.swe_bench):
| Key | Default | Description |
|---|---|---|
data_path |
Hub | Local JSONL |
full_eval |
false |
Enable Docker harness |
context_lines |
80 |
Surrounding code context in prompt |
Notes:
- Full eval is slow and resource-heavy β use for final validation only
- SLMs typically score low; use
--max-samplesfor iterative prompt tuning
Implementation: src/slm_evals/benchmarks/swe_bench.py
Model loading
Shared loader: src/slm_evals/utils/model_loader.py
Returns a model_bundle dict passed to each benchmark:
generate_fn(prompt, max_new_tokens, temperature)β unified generation interfaceparam_countβ billions of parameters (for reporting)- Underlying
model/tokenizerhandles
Quantization (int8, int4) uses bitsandbytes when available.
Reporter output schema
Reporter.save() (src/slm_evals/utils/reporter.py) writes:
Per benchmark in JSON:
{
"name": "bfcl",
"total": 100,
"passed": 42,
"score": 0.42,
"samples": [...]
}
Aggregate fields:
experiment_name,model_path,timestampaggregate_scoreβ mean of benchmark scores
CSV columns: benchmark, total, passed, score.