lesson-agent-dev / research /evals /docs /benchmarks.md
MSG
Merge pull request #4 from MSghais/experiment/small_model_building_testing
59e2c8a
|
Raw
History Blame Contribute Delete
5.33 kB
# Benchmark reference
What each benchmark in `slm-evals` measures, where data comes from, and how to configure overrides.
All benchmarks extend `BaseBenchmark` (`src/slm_evals/benchmarks/base.py`):
1. `load_dataset()` β€” fetch samples (Hub or local JSONL)
2. `build_prompt(sample)` β€” format the model input
3. `evaluate_sample(sample, prediction)` β€” return `{passed, score, note}`
4. `run()` β€” iterate, call `generate_fn`, aggregate scores (inherited)
---
## Summary table
| Key | Benchmark | Measures | Default dataset |
| --- | --------- | -------- | --------------- |
| `bfcl` | Berkeley Function-Calling Leaderboard v4 | Single-turn function call accuracy | `gorilla-llm/Berkeley-Function-Calling-Leaderboard` |
| `tau_bench` | Ο„-bench | Multi-turn tool + user simulation | `ShishirPatil/tau-bench` |
| `gaia` | GAIA | End-to-end agent tasks (reasoning + tools) | `gaia-benchmark/GAIA` |
| `swe_bench` | SWE-bench Verified | Code patch generation for real issues | `princeton-nlp/SWE-bench_Verified` |
---
## BFCL (`bfcl`)
**Goal:** Given a user request and a function schema, does the model emit a valid JSON tool call with the correct name and arguments?
**Prompt style:** System message lists available functions; model must reply with only:
```json
{"name": "<function_name>", "arguments": {<key>: <value>}}
```
**Scoring:**
- Function name must match exactly
- Arguments: exact match if `strict: true`, fuzzy match if `strict: false` (recommended for SLMs)
**Config overrides** (`benchmark_overrides.bfcl`):
| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL instead of Hub download |
| `categories` | `[]` (all) | Filter BFCL categories |
| `strict` | `false` | Require perfect argument match |
**Implementation:** `src/slm_evals/benchmarks/bfcl.py`
---
## Ο„-bench (`tau_bench`)
**Goal:** Multi-turn dialogue where the model acts as a tool-using agent while a simulated user drives the conversation toward a goal (e.g. retail order change).
**Scoring:** Task success after up to `max_turns` exchanges β€” did the agent satisfy the user's underlying intent using the right tools?
**Config overrides** (`benchmark_overrides.tau_bench`):
| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL |
| `domain` | `retail` | `retail`, `airline`, or `both` |
| `max_turns` | `15` | Dialogue cap |
| `use_llm_user` | `false` | `true` β†’ GPT-4o user simulator (paid API) |
**Notes:**
- Default user simulator is rule-based β€” no API key required
- Small models often struggle on long horizons; start with `--max-samples 10`
**Implementation:** `src/slm_evals/benchmarks/tau_bench.py`
---
## GAIA (`gaia`)
**Goal:** Real-world assistant tasks requiring reasoning, optional tool use, and concise final answers (web search, files, calculation, etc.).
**Prompt style:** Question + level metadata; tool availability depends on `tool_mode`.
**Scoring:** Normalized answer match against GAIA reference (with level breakdown in aggregates).
**Config overrides** (`benchmark_overrides.gaia`):
| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL |
| `split` | `validation` | Public `validation`; `test` may need HF auth |
| `levels` | `[1, 2]` | Difficulty levels 1–3 |
| `tool_mode` | `describe` | `describe` = offline tool docs; `none` = no tools |
**Notes:**
- `tool_mode: describe` does not execute live tools β€” suitable for offline SLM scoring
- For live tool eval, extend `gaia.py` with real tool backends
**Implementation:** `src/slm_evals/benchmarks/gaia.py`
---
## SWE-bench Verified (`swe_bench`)
**Goal:** Given a GitHub issue and codebase context, produce a unified diff that fixes the bug.
**Modes:**
| `full_eval` | Behavior |
| ----------- | -------- |
| `false` (default) | Generate patch text; score with lightweight heuristics / match checks β€” no Docker |
| `true` | Official SWE-bench harness β€” runs tests in containers (`swebench` + Docker) |
**Config overrides** (`benchmark_overrides.swe_bench`):
| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL |
| `full_eval` | `false` | Enable Docker harness |
| `context_lines` | `80` | Surrounding code context in prompt |
**Notes:**
- Full eval is slow and resource-heavy β€” use for final validation only
- SLMs typically score low; use `--max-samples` for iterative prompt tuning
**Implementation:** `src/slm_evals/benchmarks/swe_bench.py`
---
## Model loading
Shared loader: `src/slm_evals/utils/model_loader.py`
Returns a `model_bundle` dict passed to each benchmark:
- `generate_fn(prompt, max_new_tokens, temperature)` β€” unified generation interface
- `param_count` β€” billions of parameters (for reporting)
- Underlying `model` / `tokenizer` handles
Quantization (`int8`, `int4`) uses `bitsandbytes` when available.
---
## Reporter output schema
`Reporter.save()` (`src/slm_evals/utils/reporter.py`) writes:
**Per benchmark in JSON:**
```json
{
"name": "bfcl",
"total": 100,
"passed": 42,
"score": 0.42,
"samples": [...]
}
```
**Aggregate fields:**
- `experiment_name`, `model_path`, `timestamp`
- `aggregate_score` β€” mean of benchmark scores
CSV columns: `benchmark`, `total`, `passed`, `score`.