# Benchmark reference

This page describes what each benchmark in `slm-evals` measures, where its data comes from, and which config overrides it accepts.

All benchmarks extend `BaseBenchmark` (`src/slm_evals/benchmarks/base.py`), which defines four methods:

1. `load_dataset()` — fetch samples (Hub or local JSONL)
2. `build_prompt(sample)` — format the model input
3. `evaluate_sample(sample, prediction)` — return `{passed, score, note}`
4. `run()` — iterate, call `generate_fn`, aggregate scores (inherited)
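
The four methods above can be sketched as follows. This is a minimal illustration, not the code in `base.py`: the `BaseBenchmark` class below is an inline stand-in whose constructor and signatures are assumptions, and `EchoBenchmark` with its two samples is entirely hypothetical.

```python
from typing import Any, Callable


class BaseBenchmark:
    """Inline stand-in for the real base class; signatures are assumptions."""

    def __init__(self, generate_fn: Callable[[str, int, float], str]) -> None:
        self.generate_fn = generate_fn

    def load_dataset(self) -> list[dict[str, Any]]:
        raise NotImplementedError

    def build_prompt(self, sample: dict[str, Any]) -> str:
        raise NotImplementedError

    def evaluate_sample(self, sample: dict[str, Any], prediction: str) -> dict[str, Any]:
        raise NotImplementedError

    def run(self) -> dict[str, Any]:
        # Iterate over samples, generate a prediction, score it, then aggregate.
        results = []
        for sample in self.load_dataset():
            prediction = self.generate_fn(self.build_prompt(sample), 64, 0.0)
            results.append(self.evaluate_sample(sample, prediction))
        total = len(results)
        passed = sum(1 for r in results if r["passed"])
        score = sum(r["score"] for r in results) / total if total else 0.0
        return {"total": total, "passed": passed, "score": score, "samples": results}


class EchoBenchmark(BaseBenchmark):
    """Hypothetical toy benchmark: the model must repeat a word."""

    def load_dataset(self) -> list[dict[str, Any]]:
        return [{"word": "alpha"}, {"word": "beta"}]

    def build_prompt(self, sample: dict[str, Any]) -> str:
        return f"Repeat exactly: {sample['word']}"

    def evaluate_sample(self, sample: dict[str, Any], prediction: str) -> dict[str, Any]:
        ok = prediction.strip() == sample["word"]
        return {"passed": ok, "score": 1.0 if ok else 0.0, "note": "" if ok else "mismatch"}


# A fake generate_fn that always answers "alpha", so one of two samples passes.
summary = EchoBenchmark(lambda prompt, max_new_tokens, temperature: "alpha").run()
```

A subclass only needs to supply the first three methods; `run()` is inherited unchanged.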

---

## Summary table

| Key | Benchmark | Measures | Default dataset |
| --- | --------- | -------- | --------------- |
| `bfcl` | Berkeley Function-Calling Leaderboard v4 | Single-turn function call accuracy | `gorilla-llm/Berkeley-Function-Calling-Leaderboard` |
| `tau_bench` | τ-bench | Multi-turn tool + user simulation | `ShishirPatil/tau-bench` |
| `gaia` | GAIA | End-to-end agent tasks (reasoning + tools) | `gaia-benchmark/GAIA` |
| `swe_bench` | SWE-bench Verified | Code patch generation for real issues | `princeton-nlp/SWE-bench_Verified` |

---

## BFCL (`bfcl`)

**Goal:** Given a user request and a function schema, does the model emit a valid JSON tool call with the correct name and arguments?

**Prompt style:** System message lists available functions; model must reply with only:

```json
{"name": "<function_name>", "arguments": {"<key>": <value>}}
```

**Scoring:**

- Function name must match exactly
- Arguments: exact match if `strict: true`, fuzzy match if `strict: false` (recommended for SLMs)
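
One way to picture the two scoring rules is the sketch below. The fuzzy rule shown here (trim and lowercase strings, compare integers and floats as floats) is an assumption chosen for illustration; the actual tolerance is whatever `bfcl.py` implements. `score_call` and `_normalize` are hypothetical names.

```python
import json
from typing import Any


def _normalize(value: Any) -> Any:
    """Loosen a value for fuzzy comparison (illustrative rule, an assumption)."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        return value.strip().lower()
    if isinstance(value, list):
        return [_normalize(v) for v in value]
    if isinstance(value, dict):
        return {k: _normalize(v) for k, v in value.items()}
    return value


def score_call(prediction: str, expected: dict[str, Any], strict: bool = False) -> dict[str, Any]:
    """Score one predicted tool call against the expected name and arguments."""
    try:
        call = json.loads(prediction)
    except json.JSONDecodeError:
        return {"passed": False, "score": 0.0, "note": "invalid JSON"}
    if not isinstance(call, dict):
        return {"passed": False, "score": 0.0, "note": "not a JSON object"}
    # The function name must always match exactly.
    if call.get("name") != expected["name"]:
        return {"passed": False, "score": 0.0, "note": "wrong function name"}
    got, want = call.get("arguments", {}), expected["arguments"]
    if not strict:
        got, want = _normalize(got), _normalize(want)
    ok = got == want
    return {"passed": ok, "score": 1.0 if ok else 0.0, "note": "" if ok else "argument mismatch"}


expected = {"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}
prediction = '{"name": "get_weather", "arguments": {"city": " paris ", "days": 3.0}}'
fuzzy = score_call(prediction, expected, strict=False)
exact = score_call(prediction, expected, strict=True)
```

Here the same prediction passes under the fuzzy rule and fails under the strict one, because the city string differs only in case and whitespace.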

**Config overrides** (`benchmark_overrides.bfcl`):

| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL instead of Hub download |
| `categories` | `[]` (all) | Filter BFCL categories |
| `strict` | `false` | Require exact argument match |
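
As a rough illustration of how overrides could be layered on top of the defaults in this table, the sketch below merges a user-supplied dict into a defaults dict. `resolve_overrides` is a hypothetical helper, `None` is used here to stand for "download from the Hub", and the category name is a placeholder; the real config loader may work differently.

```python
from typing import Any

# Defaults copied from the table above; None stands in for "use the Hub".
BFCL_DEFAULTS: dict[str, Any] = {
    "data_path": None,
    "categories": [],  # empty list means all categories
    "strict": False,
}


def resolve_overrides(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return defaults updated with overrides, rejecting unknown keys."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown override keys: {sorted(unknown)}")
    return {**defaults, **overrides}


# Hypothetical parsed config; "simple" is a placeholder category name.
config = {"benchmark_overrides": {"bfcl": {"strict": True, "categories": ["simple"]}}}
bfcl_cfg = resolve_overrides(BFCL_DEFAULTS, config["benchmark_overrides"].get("bfcl", {}))
```

Rejecting unknown keys is a design choice in this sketch: it turns a typo such as `strcit` into an immediate error instead of a silently ignored setting.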

**Implementation:** `src/slm_evals/benchmarks/bfcl.py`

---

## τ-bench (`tau_bench`)

**Goal:** Multi-turn dialogue where the model acts as a tool-using agent while a simulated user drives the conversation toward a goal (e.g. retail order change).

**Scoring:** Task success after up to `max_turns` exchanges — did the agent satisfy the user's underlying intent using the right tools?
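
The turn cap and success check can be sketched as a simple loop. Everything below is illustrative: `run_dialogue`, the toy user and agent functions, and the string-based success test are assumptions, not the simulator or scorer in `tau_bench.py`.

```python
from typing import Any, Callable

History = list[tuple[str, str]]  # (speaker, message) pairs


def run_dialogue(
    agent_fn: Callable[[History], str],
    user_fn: Callable[[History], str],
    is_success: Callable[[History], bool],
    max_turns: int = 15,
) -> dict[str, Any]:
    """Alternate user and agent messages until success or the turn cap."""
    history: History = []
    user_msg = user_fn(history)
    for turn in range(1, max_turns + 1):
        history.append(("user", user_msg))
        history.append(("agent", agent_fn(history)))
        if is_success(history):
            return {"passed": True, "score": 1.0, "note": f"solved in {turn} turns"}
        user_msg = user_fn(history)
    return {"passed": False, "score": 0.0, "note": f"hit max_turns={max_turns}"}


# Toy rule-based user: always asks for the same thing.
def toy_user(history: History) -> str:
    return "Please cancel order 42"


# Toy agent: asks a clarifying question first, then issues the tool call.
def toy_agent(history: History) -> str:
    return "cancel_order(42)" if len(history) >= 3 else "Which order?"


# Toy success check: some agent message contains the expected tool call.
def toy_success(history: History) -> bool:
    return any(who == "agent" and "cancel_order(42)" in msg for who, msg in history)


solved = run_dialogue(toy_agent, toy_user, toy_success, max_turns=15)
capped = run_dialogue(toy_agent, toy_user, toy_success, max_turns=1)
```

The second call shows why a low `max_turns` can fail an agent that would otherwise succeed: the toy agent needs two exchanges and is cut off after one.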

**Config overrides** (`benchmark_overrides.tau_bench`):

| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL |
| `domain` | `retail` | `retail`, `airline`, or `both` |
| `max_turns` | `15` | Dialogue cap |
| `use_llm_user` | `false` | `true` → GPT-4o user simulator (paid API) |

**Notes:**

- Default user simulator is rule-based — no API key required
- Small models often struggle on long horizons; start with `--max-samples 10`

**Implementation:** `src/slm_evals/benchmarks/tau_bench.py`

---

## GAIA (`gaia`)

**Goal:** Real-world assistant tasks requiring reasoning, optional tool use, and concise final answers (web search, files, calculation, etc.).

**Prompt style:** Question + level metadata; tool availability depends on `tool_mode`.

**Scoring:** Normalized answer match against GAIA reference (with level breakdown in aggregates).
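
To make "normalized answer match" concrete, here is a simplified stand-in. It is not the official GAIA scorer and not necessarily what `gaia.py` does; the normalization rules (numeric comparison after dropping thousands separators, otherwise lowercase with punctuation removed) and the helper names are assumptions.

```python
import re
import string
from collections import defaultdict


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def answers_match(prediction: str, reference: str) -> bool:
    """Compare numerically when both sides parse as numbers, else as normalized text."""
    try:
        return float(prediction.replace(",", "")) == float(reference.replace(",", ""))
    except ValueError:
        return normalize_answer(prediction) == normalize_answer(reference)


def level_breakdown(results: list[tuple[int, bool]]) -> dict[int, float]:
    """Pass rate per difficulty level from (level, passed) pairs."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for level, passed in results:
        buckets[level].append(1.0 if passed else 0.0)
    return {level: sum(vals) / len(vals) for level, vals in sorted(buckets.items())}


by_level = level_breakdown([(1, True), (1, False), (2, True)])
```

The numeric branch runs first so that a decimal point is not stripped as punctuation before comparison.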

**Config overrides** (`benchmark_overrides.gaia`):

| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL |
| `split` | `validation` | Public `validation`; `test` may need HF auth |
| `levels` | `[1, 2]` | Difficulty levels to include (valid values: 1–3) |
| `tool_mode` | `describe` | `describe` = offline tool docs; `none` = no tools |

**Notes:**

- `tool_mode: describe` does not execute live tools — suitable for offline SLM scoring
- For live tool eval, extend `gaia.py` with real tool backends

**Implementation:** `src/slm_evals/benchmarks/gaia.py`

---

## SWE-bench Verified (`swe_bench`)

**Goal:** Given a GitHub issue and codebase context, produce a unified diff that fixes the bug.

**Modes:**

| `full_eval` | Behavior |
| ----------- | -------- |
| `false` (default) | Generate patch text; score with lightweight heuristics / match checks — no Docker |
| `true` | Official SWE-bench harness — runs tests in containers (`swebench` + Docker) |
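
For the `full_eval: false` path, a lightweight check might look like the sketch below. The specific heuristics (require at least one unified-diff hunk header, then score by overlap of touched files with the reference patch) are assumptions for illustration; `swe_bench.py` may use different checks. `heuristic_score` and `touched_files` are hypothetical names.

```python
import re
from typing import Any

# Matches hunk headers such as "@@ -1,3 +1,3 @@".
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", re.MULTILINE)
# Captures the path from "+++ b/<path>" file headers.
FILE_RE = re.compile(r"^\+\+\+ b/(\S+)", re.MULTILINE)


def touched_files(patch: str) -> set[str]:
    """Return the set of file paths a unified diff modifies."""
    return set(FILE_RE.findall(patch))


def heuristic_score(predicted: str, gold: str) -> dict[str, Any]:
    """Score a predicted patch without running any tests."""
    if not HUNK_RE.search(predicted):
        return {"passed": False, "score": 0.0, "note": "no unified-diff hunk found"}
    gold_files = touched_files(gold)
    overlap = touched_files(predicted) & gold_files
    score = len(overlap) / len(gold_files) if gold_files else 0.0
    note = f"{len(overlap)}/{len(gold_files)} reference files touched"
    return {"passed": score == 1.0, "score": score, "note": note}


GOLD = """--- a/pkg/util.py
+++ b/pkg/util.py
@@ -1,3 +1,3 @@
-x = 1
+x = 2
"""

same_file = heuristic_score(GOLD, GOLD)
prose_only = heuristic_score("I think the fix is to change x.", GOLD)
```

A check like this only says whether the patch is well formed and aimed at the right files; it cannot confirm that the bug is fixed, which is what the Docker harness is for.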

**Config overrides** (`benchmark_overrides.swe_bench`):

| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL |
| `full_eval` | `false` | Enable Docker harness |
| `context_lines` | `80` | Surrounding code context in prompt |

**Notes:**

- Full eval is slow and resource-heavy — use for final validation only
- SLMs typically score low; use `--max-samples` for iterative prompt tuning

**Implementation:** `src/slm_evals/benchmarks/swe_bench.py`

---

## Model loading

Shared loader: `src/slm_evals/utils/model_loader.py`

Returns a `model_bundle` dict passed to each benchmark:

- `generate_fn(prompt, max_new_tokens, temperature)` — unified generation interface
- `param_count` — billions of parameters (for reporting)
- Underlying `model` / `tokenizer` handles

Quantization (`int8`, `int4`) uses `bitsandbytes` when available.
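
For unit-testing a benchmark without loading real weights, a stub bundle with the same shape can be handy. The sketch below is an assumption about that shape based on the list above; `make_stub_bundle` is a hypothetical helper, and the dict keys `model` and `tokenizer` are guesses at the handle names.

```python
from typing import Any


def make_stub_bundle(num_parameters: int) -> dict[str, Any]:
    """Build a fake model_bundle whose generate_fn echoes part of the prompt."""

    def generate_fn(prompt: str, max_new_tokens: int = 64, temperature: float = 0.0) -> str:
        # Placeholder generation: return at most max_new_tokens words of the prompt.
        # temperature is accepted for interface parity and ignored here.
        return " ".join(prompt.split()[:max_new_tokens])

    return {
        "generate_fn": generate_fn,
        "param_count": num_parameters / 1e9,  # reported in billions
        "model": None,       # no real model in the stub
        "tokenizer": None,   # no real tokenizer in the stub
    }


bundle = make_stub_bundle(1_100_000_000)
reply = bundle["generate_fn"]("hello world again", max_new_tokens=2, temperature=0.0)
```

Because the stub is deterministic, any change in a benchmark's score under it points at the benchmark code, not the model.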

---

## Reporter output schema

`Reporter.save()` (`src/slm_evals/utils/reporter.py`) writes:

**Per benchmark in JSON:**

```json
{
  "name": "bfcl",
  "total": 100,
  "passed": 42,
  "score": 0.42,
  "samples": [...]
}
```

**Aggregate fields:**

- `experiment_name`, `model_path`, `timestamp`
- `aggregate_score` — mean of benchmark scores

CSV columns: `benchmark`, `total`, `passed`, `score`.
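
The schema above can be assembled roughly as follows. This is a sketch, not `Reporter.save()` itself: the top-level `benchmarks` key, the helper names, and all sample values (including the model path and timestamp) are illustrative assumptions, and the mean is taken as a plain unweighted mean.

```python
import csv
import io
from statistics import mean
from typing import Any


def build_report(
    results: list[dict[str, Any]], experiment_name: str, model_path: str, timestamp: str
) -> dict[str, Any]:
    """Combine per-benchmark results with the aggregate fields."""
    return {
        "experiment_name": experiment_name,
        "model_path": model_path,
        "timestamp": timestamp,
        "aggregate_score": mean(r["score"] for r in results) if results else 0.0,
        "benchmarks": results,  # key name is an assumption
    }


def to_csv(results: list[dict[str, Any]]) -> str:
    """Render the four documented CSV columns, one row per benchmark."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["benchmark", "total", "passed", "score"], lineterminator="\n"
    )
    writer.writeheader()
    for r in results:
        writer.writerow(
            {"benchmark": r["name"], "total": r["total"], "passed": r["passed"], "score": r["score"]}
        )
    return buf.getvalue()


# Illustrative values only, not real results.
results = [
    {"name": "bfcl", "total": 100, "passed": 42, "score": 0.42, "samples": []},
    {"name": "gaia", "total": 50, "passed": 10, "score": 0.2, "samples": []},
]
report = build_report(results, "demo", "path/to/model", "2024-01-01T00:00:00Z")
csv_text = to_csv(results)
```

Note that an unweighted mean treats a 50-sample benchmark the same as a 100-sample one, so benchmarks run with different `--max-samples` values contribute equally to `aggregate_score` in this sketch.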