Spaces:
Sleeping
Sleeping
| # Benchmark reference | |
| What each benchmark in `slm-evals` measures, where data comes from, and how to configure overrides. | |
| All benchmarks extend `BaseBenchmark` (`src/slm_evals/benchmarks/base.py`): | |
| 1. `load_dataset()` β fetch samples (Hub or local JSONL) | |
| 2. `build_prompt(sample)` β format the model input | |
| 3. `evaluate_sample(sample, prediction)` β return `{passed, score, note}` | |
| 4. `run()` β iterate, call `generate_fn`, aggregate scores (inherited) | |
| --- | |
| ## Summary table | |
| | Key | Benchmark | Measures | Default dataset | | |
| | --- | --------- | -------- | --------------- | | |
| | `bfcl` | Berkeley Function-Calling Leaderboard v4 | Single-turn function call accuracy | `gorilla-llm/Berkeley-Function-Calling-Leaderboard` | | |
| | `tau_bench` | Ο-bench | Multi-turn tool + user simulation | `ShishirPatil/tau-bench` | | |
| | `gaia` | GAIA | End-to-end agent tasks (reasoning + tools) | `gaia-benchmark/GAIA` | | |
| | `swe_bench` | SWE-bench Verified | Code patch generation for real issues | `princeton-nlp/SWE-bench_Verified` | | |
| --- | |
| ## BFCL (`bfcl`) | |
| **Goal:** Given a user request and a function schema, does the model emit a valid JSON tool call with the correct name and arguments? | |
| **Prompt style:** System message lists available functions; model must reply with only: | |
| ```json | |
| {"name": "<function_name>", "arguments": {<key>: <value>}} | |
| ``` | |
| **Scoring:** | |
| - Function name must match exactly | |
| - Arguments: exact match if `strict: true`, fuzzy match if `strict: false` (recommended for SLMs) | |
| **Config overrides** (`benchmark_overrides.bfcl`): | |
| | Key | Default | Description | | |
| | --- | ------- | ----------- | | |
| | `data_path` | Hub | Local JSONL instead of Hub download | | |
| | `categories` | `[]` (all) | Filter BFCL categories | | |
| | `strict` | `false` | Require perfect argument match | | |
| **Implementation:** `src/slm_evals/benchmarks/bfcl.py` | |
| --- | |
| ## Ο-bench (`tau_bench`) | |
| **Goal:** Multi-turn dialogue where the model acts as a tool-using agent while a simulated user drives the conversation toward a goal (e.g. retail order change). | |
| **Scoring:** Task success after up to `max_turns` exchanges β did the agent satisfy the user's underlying intent using the right tools? | |
| **Config overrides** (`benchmark_overrides.tau_bench`): | |
| | Key | Default | Description | | |
| | --- | ------- | ----------- | | |
| | `data_path` | Hub | Local JSONL | | |
| | `domain` | `retail` | `retail`, `airline`, or `both` | | |
| | `max_turns` | `15` | Dialogue cap | | |
| | `use_llm_user` | `false` | `true` β GPT-4o user simulator (paid API) | | |
| **Notes:** | |
| - Default user simulator is rule-based β no API key required | |
| - Small models often struggle on long horizons; start with `--max-samples 10` | |
| **Implementation:** `src/slm_evals/benchmarks/tau_bench.py` | |
| --- | |
| ## GAIA (`gaia`) | |
| **Goal:** Real-world assistant tasks requiring reasoning, optional tool use, and concise final answers (web search, files, calculation, etc.). | |
| **Prompt style:** Question + level metadata; tool availability depends on `tool_mode`. | |
| **Scoring:** Normalized answer match against GAIA reference (with level breakdown in aggregates). | |
| **Config overrides** (`benchmark_overrides.gaia`): | |
| | Key | Default | Description | | |
| | --- | ------- | ----------- | | |
| | `data_path` | Hub | Local JSONL | | |
| | `split` | `validation` | Public `validation`; `test` may need HF auth | | |
| | `levels` | `[1, 2]` | Difficulty levels 1β3 | | |
| | `tool_mode` | `describe` | `describe` = offline tool docs; `none` = no tools | | |
| **Notes:** | |
| - `tool_mode: describe` does not execute live tools β suitable for offline SLM scoring | |
| - For live tool eval, extend `gaia.py` with real tool backends | |
| **Implementation:** `src/slm_evals/benchmarks/gaia.py` | |
| --- | |
| ## SWE-bench Verified (`swe_bench`) | |
| **Goal:** Given a GitHub issue and codebase context, produce a unified diff that fixes the bug. | |
| **Modes:** | |
| | `full_eval` | Behavior | | |
| | ----------- | -------- | | |
| | `false` (default) | Generate patch text; score with lightweight heuristics / match checks β no Docker | | |
| | `true` | Official SWE-bench harness β runs tests in containers (`swebench` + Docker) | | |
| **Config overrides** (`benchmark_overrides.swe_bench`): | |
| | Key | Default | Description | | |
| | --- | ------- | ----------- | | |
| | `data_path` | Hub | Local JSONL | | |
| | `full_eval` | `false` | Enable Docker harness | | |
| | `context_lines` | `80` | Surrounding code context in prompt | | |
| **Notes:** | |
| - Full eval is slow and resource-heavy β use for final validation only | |
| - SLMs typically score low; use `--max-samples` for iterative prompt tuning | |
| **Implementation:** `src/slm_evals/benchmarks/swe_bench.py` | |
| --- | |
| ## Model loading | |
| Shared loader: `src/slm_evals/utils/model_loader.py` | |
| Returns a `model_bundle` dict passed to each benchmark: | |
| - `generate_fn(prompt, max_new_tokens, temperature)` β unified generation interface | |
| - `param_count` β billions of parameters (for reporting) | |
| - Underlying `model` / `tokenizer` handles | |
| Quantization (`int8`, `int4`) uses `bitsandbytes` when available. | |
| --- | |
| ## Reporter output schema | |
| `Reporter.save()` (`src/slm_evals/utils/reporter.py`) writes: | |
| **Per benchmark in JSON:** | |
| ```json | |
| { | |
| "name": "bfcl", | |
| "total": 100, | |
| "passed": 42, | |
| "score": 0.42, | |
| "samples": [...] | |
| } | |
| ``` | |
| **Aggregate fields:** | |
| - `experiment_name`, `model_path`, `timestamp` | |
| - `aggregate_score` β mean of benchmark scores | |
| CSV columns: `benchmark`, `total`, `passed`, `score`. | |