# Benchmark reference
What each benchmark in `slm-evals` measures, where data comes from, and how to configure overrides.
All benchmarks extend `BaseBenchmark` (`src/slm_evals/benchmarks/base.py`):
1. `load_dataset()`: fetch samples (Hub or local JSONL)
2. `build_prompt(sample)`: format the model input
3. `evaluate_sample(sample, prediction)`: return `{passed, score, note}`
4. `run()`: iterate, call `generate_fn`, aggregate scores (inherited)
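
The four-step contract above can be sketched as follows. This is a minimal illustration, not the code in `base.py`: the class names `BaseBenchmarkSketch` and `EchoBenchmark`, the single-argument `generate_fn`, and the exact shape of the returned summary are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable


class BaseBenchmarkSketch(ABC):
    """Illustrative stand-in for BaseBenchmark (hypothetical, not the real class)."""

    def __init__(self, generate_fn: Callable[[str], str]):
        self.generate_fn = generate_fn

    @abstractmethod
    def load_dataset(self) -> list[dict[str, Any]]: ...

    @abstractmethod
    def build_prompt(self, sample: dict[str, Any]) -> str: ...

    @abstractmethod
    def evaluate_sample(self, sample: dict[str, Any], prediction: str) -> dict[str, Any]: ...

    def run(self) -> dict[str, Any]:
        # Inherited loop: build prompt, generate, evaluate, then aggregate.
        results = []
        for sample in self.load_dataset():
            prediction = self.generate_fn(self.build_prompt(sample))
            results.append(self.evaluate_sample(sample, prediction))
        total = len(results)
        passed = sum(1 for r in results if r["passed"])
        score = sum(r["score"] for r in results) / total if total else 0.0
        return {"total": total, "passed": passed, "score": score, "samples": results}


class EchoBenchmark(BaseBenchmarkSketch):
    """Toy benchmark: a sample passes if the model repeats the expected word."""

    def load_dataset(self):
        return [
            {"question": "Say yes", "answer": "yes"},
            {"question": "Say no", "answer": "no"},
        ]

    def build_prompt(self, sample):
        return sample["question"]

    def evaluate_sample(self, sample, prediction):
        ok = prediction.strip().lower() == sample["answer"]
        return {"passed": ok, "score": 1.0 if ok else 0.0, "note": ""}


# A fake model that always answers "yes" passes one of the two samples.
summary = EchoBenchmark(generate_fn=lambda prompt: "yes").run()
```

A subclass only supplies the three abstract methods; the loop and aggregation stay in one place, which is why `run()` is marked as inherited.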
---
## Summary table
| Key | Benchmark | Measures | Default dataset |
| --- | --------- | -------- | --------------- |
| `bfcl` | Berkeley Function-Calling Leaderboard v4 | Single-turn function call accuracy | `gorilla-llm/Berkeley-Function-Calling-Leaderboard` |
| `tau_bench` | τ-bench | Multi-turn tool + user simulation | `ShishirPatil/tau-bench` |
| `gaia` | GAIA | End-to-end agent tasks (reasoning + tools) | `gaia-benchmark/GAIA` |
| `swe_bench` | SWE-bench Verified | Code patch generation for real issues | `princeton-nlp/SWE-bench_Verified` |
---
## BFCL (`bfcl`)
**Goal:** Given a user request and a function schema, does the model emit a valid JSON tool call with the correct name and arguments?
**Prompt style:** System message lists available functions; model must reply with only:
```json
{"name": "<function_name>", "arguments": {"<key>": <value>}}
```
**Scoring:**
- Function name must match exactly
- Arguments: exact match if `strict: true`, fuzzy match if `strict: false` (recommended for SLMs)
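
To make the two argument-matching modes concrete, here is a hedged sketch. The document does not define "fuzzy match", so the normalization in `_loosen` (case-insensitive strings, numeric strings treated as numbers) is an assumption, and `score_tool_call` is a hypothetical helper rather than the function in `bfcl.py`.

```python
import json
from typing import Any


def _loosen(value: Any) -> Any:
    """Normalize a value for the assumed fuzzy comparison."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        text = value.strip().lower()
        try:
            return float(text)  # "3" and 3 compare equal in fuzzy mode
        except ValueError:
            return text
    if isinstance(value, list):
        return [_loosen(v) for v in value]
    if isinstance(value, dict):
        return {k: _loosen(v) for k, v in value.items()}
    return value


def score_tool_call(prediction: str, expected: dict, strict: bool = False) -> dict:
    """Score one predicted tool call against the expected name and arguments."""
    try:
        call = json.loads(prediction)
    except json.JSONDecodeError:
        return {"passed": False, "score": 0.0, "note": "invalid JSON"}
    # The function name must match exactly in both modes.
    if not isinstance(call, dict) or call.get("name") != expected["name"]:
        return {"passed": False, "score": 0.0, "note": "function name mismatch"}
    got, want = call.get("arguments", {}), expected["arguments"]
    ok = got == want if strict else _loosen(got) == _loosen(want)
    return {"passed": ok, "score": 1.0 if ok else 0.0, "note": "" if ok else "argument mismatch"}
```

Under these assumptions, a prediction with `"days": "3"` passes in fuzzy mode against an expected `3` but fails when `strict` is set.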
**Config overrides** (`benchmark_overrides.bfcl`):
| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL instead of Hub download |
| `categories` | `[]` (all) | Filter BFCL categories |
| `strict` | `false` | Require perfect argument match |
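
As a rough illustration of how an override block might be resolved against the defaults in the table, consider the sketch below. The on-disk config format is not shown in this document, so the plain-dict representation, the use of `None` to stand for "download from the Hub", the category name `simple`, and the helper `resolve_overrides` are all assumptions.

```python
# Defaults mirroring the table above; None stands in for "use the Hub dataset".
BFCL_DEFAULTS = {"data_path": None, "categories": [], "strict": False}


def resolve_overrides(defaults: dict, overrides: dict) -> dict:
    """Merge user overrides over defaults, rejecting keys the benchmark does not know."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown override keys: {sorted(unknown)}")
    return {**defaults, **overrides}


# Hypothetical parsed config: only the keys that differ from the defaults are listed.
config = {"benchmark_overrides": {"bfcl": {"categories": ["simple"], "strict": True}}}
bfcl_settings = resolve_overrides(BFCL_DEFAULTS, config["benchmark_overrides"]["bfcl"])
```

Rejecting unknown keys catches typos such as `stric` early instead of silently running with the default.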
**Implementation:** `src/slm_evals/benchmarks/bfcl.py`
---
## τ-bench (`tau_bench`)
**Goal:** Multi-turn dialogue where the model acts as a tool-using agent while a simulated user drives the conversation toward a goal (e.g. retail order change).
**Scoring:** Task success after up to `max_turns` exchanges: did the agent satisfy the user's underlying intent using the right tools?
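
The capped dialogue loop described here could look roughly like the sketch below. It is not the logic in `tau_bench.py`: `run_dialogue`, the scripted user, the toy agent, and the `TOOL update_order` message convention are hypothetical stand-ins chosen to keep the example self-contained.

```python
from typing import Callable

History = list[tuple[str, str]]  # (role, message) pairs


def run_dialogue(
    agent_fn: Callable[[History], str],
    user_fn: Callable[[History], str],
    is_done: Callable[[History], bool],
    max_turns: int = 15,
) -> dict:
    """Alternate user and agent messages until the goal is met or the cap is hit."""
    history: History = []
    for turn in range(1, max_turns + 1):
        history.append(("user", user_fn(history)))
        history.append(("agent", agent_fn(history)))
        if is_done(history):
            return {"passed": True, "score": 1.0, "note": f"solved in {turn} turns"}
    return {"passed": False, "score": 0.0, "note": f"hit max_turns={max_turns}"}


def scripted_user(history: History) -> str:
    # Rule-based simulator: state the request first, then confirm.
    return "Please change my order" if not history else "Yes, go ahead"


def toy_agent(history: History) -> str:
    # Calls the (pretend) tool only after the user has confirmed.
    last_user_msg = history[-1][1]
    return "TOOL update_order" if "go ahead" in last_user_msg else "Shall I update the order?"


def order_updated(history: History) -> bool:
    return any(role == "agent" and msg.startswith("TOOL update_order") for role, msg in history)


result = run_dialogue(toy_agent, scripted_user, order_updated)
```

With a cap of one turn the same agent fails, which shows why long-horizon tasks are sensitive to `max_turns`.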
**Config overrides** (`benchmark_overrides.tau_bench`):
| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL |
| `domain` | `retail` | `retail`, `airline`, or `both` |
| `max_turns` | `15` | Dialogue cap |
| `use_llm_user` | `false` | `true` = GPT-4o user simulator (paid API) |
**Notes:**
- Default user simulator is rule-based, so no API key is required
- Small models often struggle on long horizons; start with `--max-samples 10`
**Implementation:** `src/slm_evals/benchmarks/tau_bench.py`
---
## GAIA (`gaia`)
**Goal:** Real-world assistant tasks requiring reasoning, optional tool use, and concise final answers (web search, files, calculation, etc.).
**Prompt style:** Question + level metadata; tool availability depends on `tool_mode`.
**Scoring:** Normalized answer match against GAIA reference (with level breakdown in aggregates).
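
A normalized match with a per-level breakdown might be sketched as below. The normalization rules (lowercasing, dropping punctuation other than decimal points, collapsing whitespace) are an assumption for illustration and may differ from both the official GAIA scorer and `gaia.py`; `normalize_answer` and `score_by_level` are hypothetical names.

```python
import re
from collections import defaultdict


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation except '.', collapse whitespace, trim trailing dots."""
    text = re.sub(r"[^\w\s.]", "", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    return text.rstrip(". ")


def score_by_level(samples: list[dict]) -> dict[int, float]:
    """Return accuracy per difficulty level for samples with level/prediction/answer keys."""
    totals: dict[int, int] = defaultdict(int)
    hits: dict[int, int] = defaultdict(int)
    for s in samples:
        totals[s["level"]] += 1
        if normalize_answer(s["prediction"]) == normalize_answer(s["answer"]):
            hits[s["level"]] += 1
    return {level: hits[level] / totals[level] for level in sorted(totals)}


breakdown = score_by_level([
    {"level": 1, "prediction": "Paris.", "answer": "paris"},
    {"level": 2, "prediction": "42", "answer": "43"},
    {"level": 2, "prediction": "The  Answer", "answer": "the answer"},
])
```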
**Config overrides** (`benchmark_overrides.gaia`):
| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL |
| `split` | `validation` | Public `validation`; `test` may need HF auth |
| `levels` | `[1, 2]` | Difficulty levels 1–3 |
| `tool_mode` | `describe` | `describe` = offline tool docs; `none` = no tools |
**Notes:**
- `tool_mode: describe` does not execute live tools, which makes it suitable for offline SLM scoring
- For live tool eval, extend `gaia.py` with real tool backends
**Implementation:** `src/slm_evals/benchmarks/gaia.py`
---
## SWE-bench Verified (`swe_bench`)
**Goal:** Given a GitHub issue and codebase context, produce a unified diff that fixes the bug.
**Modes:**
| `full_eval` | Behavior |
| ----------- | -------- |
| `false` (default) | Generate patch text; score with lightweight heuristics / match checks, no Docker required |
| `true` | Official SWE-bench harness: runs tests in containers (`swebench` + Docker) |
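
The document does not spell out the lightweight heuristics used when `full_eval` is `false`. One plausible check, shown here purely as an assumption, is to confirm that the output contains a unified-diff hunk and that it touches the same files as the reference patch. `heuristic_patch_score` and `touched_files` are hypothetical helpers, not functions from `swe_bench.py`.

```python
import re

# "+++ b/<path>" marks the post-change file in a git-style unified diff.
_FILE_HEADER = re.compile(r"^\+\+\+ b/(\S+)", re.MULTILINE)
# "@@ -start,count +start,count @@" marks a hunk.
_HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", re.MULTILINE)


def touched_files(patch: str) -> set[str]:
    """Collect the file paths modified by a unified diff."""
    return set(_FILE_HEADER.findall(patch))


def heuristic_patch_score(prediction: str, reference: str) -> dict:
    """Score by file overlap with the reference patch; no tests are executed."""
    if not _HUNK_HEADER.search(prediction):
        return {"passed": False, "score": 0.0, "note": "no unified-diff hunk found"}
    want, got = touched_files(reference), touched_files(prediction)
    overlap = len(want & got) / len(want) if want else 0.0
    return {"passed": overlap == 1.0, "score": overlap, "note": f"file overlap {overlap:.2f}"}


reference_patch = "--- a/pkg/util.py\n+++ b/pkg/util.py\n@@ -1,2 +1,2 @@\n-x = 1\n+x = 2\n"
candidate_patch = "--- a/pkg/util.py\n+++ b/pkg/util.py\n@@ -1,2 +1,2 @@\n-x = 1\n+x = 3\n"
outcome = heuristic_patch_score(candidate_patch, reference_patch)
```

A check like this only says the patch is well-formed and aimed at the right files; whether it fixes the issue still requires the Docker harness.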
**Config overrides** (`benchmark_overrides.swe_bench`):
| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL |
| `full_eval` | `false` | Enable Docker harness |
| `context_lines` | `80` | Surrounding code context in prompt |
**Notes:**
- Full eval is slow and resource-heavy; use it for final validation only
- SLMs typically score low; use `--max-samples` for iterative prompt tuning
**Implementation:** `src/slm_evals/benchmarks/swe_bench.py`
---
## Model loading
Shared loader: `src/slm_evals/utils/model_loader.py`
Returns a `model_bundle` dict passed to each benchmark:
- `generate_fn(prompt, max_new_tokens, temperature)`: unified generation interface
- `param_count`: billions of parameters (for reporting)
- Underlying `model` / `tokenizer` handles
Quantization (`int8`, `int4`) uses `bitsandbytes` when available.
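
For unit-testing a benchmark without loading real weights, a stub with the same shape as `model_bundle` can be handy. The sketch below assumes the dict keys are literally `generate_fn`, `param_count`, `model`, and `tokenizer`; `ModelBundle` and `make_stub_bundle` are hypothetical names, and the stub's "generation" simply echoes the first words of the prompt.

```python
from typing import Any, Callable, TypedDict


class ModelBundle(TypedDict):
    generate_fn: Callable[[str, int, float], str]
    param_count: float  # billions of parameters
    model: Any
    tokenizer: Any


def make_stub_bundle() -> ModelBundle:
    """Build a bundle whose generate_fn is deterministic and needs no GPU."""

    def generate_fn(prompt: str, max_new_tokens: int = 64, temperature: float = 0.0) -> str:
        # Echo at most max_new_tokens whitespace-separated words; temperature is ignored.
        return " ".join(prompt.split()[:max_new_tokens])

    return {"generate_fn": generate_fn, "param_count": 0.0, "model": None, "tokenizer": None}


bundle = make_stub_bundle()
reply = bundle["generate_fn"]("alpha beta gamma delta", 2, 0.0)
```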
---
## Reporter output schema
`Reporter.save()` (`src/slm_evals/utils/reporter.py`) writes:
**Per benchmark in JSON:**
```json
{
"name": "bfcl",
"total": 100,
"passed": 42,
"score": 0.42,
"samples": [...]
}
```
**Aggregate fields:**
- `experiment_name`, `model_path`, `timestamp`
- `aggregate_score`: mean of benchmark scores
CSV columns: `benchmark`, `total`, `passed`, `score`.
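
The aggregate and the CSV rows follow directly from the per-benchmark records. The sketch below is an illustration under assumptions rather than the code in `reporter.py`: `aggregate_score` and `to_csv` are hypothetical free functions, and the second benchmark record is made-up sample data.

```python
import csv
import io
from statistics import mean


def aggregate_score(results: list[dict]) -> float:
    """Unweighted mean of the per-benchmark scores."""
    return mean(r["score"] for r in results)


def to_csv(results: list[dict]) -> str:
    """Render one row per benchmark using the documented column names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["benchmark", "total", "passed", "score"], lineterminator="\n"
    )
    writer.writeheader()
    for r in results:
        writer.writerow(
            {"benchmark": r["name"], "total": r["total"], "passed": r["passed"], "score": r["score"]}
        )
    return buffer.getvalue()


results = [
    {"name": "bfcl", "total": 100, "passed": 42, "score": 0.42},
    {"name": "gaia", "total": 50, "passed": 10, "score": 0.2},  # illustrative numbers
]
overall = aggregate_score(results)
table = to_csv(results)
```

Because the mean is unweighted, a benchmark with 50 samples counts as much as one with 100.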