# Benchmark reference

What each benchmark in `slm-evals` measures, where data comes from, and how to configure overrides.

All benchmarks extend `BaseBenchmark` (`src/slm_evals/benchmarks/base.py`):

1. `load_dataset()`: fetch samples (Hub or local JSONL)
2. `build_prompt(sample)`: format the model input
3. `evaluate_sample(sample, prediction)`: return `{passed, score, note}`
4. `run()`: iterate, call `generate_fn`, aggregate scores (inherited)
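
The four-step contract above can be sketched with a toy subclass. This is a minimal illustration, not the code in `base.py`: the `BaseBenchmark` below is a stand-in written for this example, `EchoBenchmark` is hypothetical, and the real constructor and `run()` signatures may differ.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List


class BaseBenchmark(ABC):
    """Stand-in for the real base class (assumed shape, for illustration)."""

    def __init__(self, generate_fn: Callable[[str], str]) -> None:
        self.generate_fn = generate_fn

    @abstractmethod
    def load_dataset(self) -> List[Dict[str, Any]]: ...

    @abstractmethod
    def build_prompt(self, sample: Dict[str, Any]) -> str: ...

    @abstractmethod
    def evaluate_sample(self, sample: Dict[str, Any], prediction: str) -> Dict[str, Any]: ...

    def run(self) -> Dict[str, Any]:
        # Iterate over samples, generate a prediction, then aggregate.
        results = []
        for sample in self.load_dataset():
            prediction = self.generate_fn(self.build_prompt(sample))
            results.append(self.evaluate_sample(sample, prediction))
        total = len(results)
        passed = sum(1 for r in results if r["passed"])
        score = sum(r["score"] for r in results) / total if total else 0.0
        return {"total": total, "passed": passed, "score": score, "samples": results}


class EchoBenchmark(BaseBenchmark):
    """Hypothetical toy benchmark: the model must repeat a word."""

    def load_dataset(self) -> List[Dict[str, Any]]:
        return [{"word": "alpha"}, {"word": "beta"}]

    def build_prompt(self, sample: Dict[str, Any]) -> str:
        return f"Repeat exactly: {sample['word']}"

    def evaluate_sample(self, sample: Dict[str, Any], prediction: str) -> Dict[str, Any]:
        passed = prediction.strip() == sample["word"]
        return {"passed": passed, "score": 1.0 if passed else 0.0, "note": ""}


# A fake generate_fn that always answers "alpha": one of two samples passes.
report = EchoBenchmark(generate_fn=lambda prompt: "alpha").run()
print(report["passed"], report["total"], report["score"])
```

Only the three abstract methods vary per benchmark; the loop and aggregation live in one place, which is why `run()` is listed as inherited.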

---

## Summary table

| Key | Benchmark | Measures | Default dataset |
| --- | --------- | -------- | --------------- |
| `bfcl` | Berkeley Function-Calling Leaderboard v4 | Single-turn function call accuracy | `gorilla-llm/Berkeley-Function-Calling-Leaderboard` |
| `tau_bench` | τ-bench | Multi-turn tool + user simulation | `ShishirPatil/tau-bench` |
| `gaia` | GAIA | End-to-end agent tasks (reasoning + tools) | `gaia-benchmark/GAIA` |
| `swe_bench` | SWE-bench Verified | Code patch generation for real issues | `princeton-nlp/SWE-bench_Verified` |

---

## BFCL (`bfcl`)

**Goal:** Given a user request and a function schema, does the model emit a valid JSON tool call with the correct name and arguments?

**Prompt style:** System message lists available functions; model must reply with only:

```json
{"name": "<function_name>", "arguments": {<key>: <value>}}
```

**Scoring:**

- Function name must match exactly
- Arguments: exact match if `strict: true`, fuzzy match if `strict: false` (recommended for SLMs)
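
The difference between the two modes can be sketched as follows. The document does not define what "fuzzy" means, so the rule used here (numbers compared by value, strings compared ignoring case and surrounding whitespace) is an assumption, and `score_tool_call` is a hypothetical helper rather than the function in `bfcl.py`.

```python
import json
from typing import Any, Dict


def _loosely_equal(expected: Any, actual: Any) -> bool:
    # Assumed fuzzy rule: numbers compare by value, strings ignore case and
    # surrounding whitespace, everything else falls back to ==.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return expected == actual
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return abs(expected - actual) < 1e-9
    if isinstance(expected, str) and isinstance(actual, str):
        return expected.strip().lower() == actual.strip().lower()
    return expected == actual


def score_tool_call(prediction: str, expected: Dict[str, Any], strict: bool = False) -> Dict[str, Any]:
    def result(passed: bool, note: str = "") -> Dict[str, Any]:
        return {"passed": passed, "score": 1.0 if passed else 0.0, "note": note}

    try:
        call = json.loads(prediction)
    except json.JSONDecodeError:
        return result(False, "invalid JSON")
    # The function name must match exactly in both modes.
    if not isinstance(call, dict) or call.get("name") != expected["name"]:
        return result(False, "function name mismatch")

    got, want = call.get("arguments"), expected["arguments"]
    if not isinstance(got, dict):
        return result(False, "arguments is not an object")
    if strict:
        ok = got == want
    else:
        ok = got.keys() == want.keys() and all(_loosely_equal(want[k], got[k]) for k in want)
    return result(ok, "" if ok else "argument mismatch")


expected_call = {"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}
model_output = '{"name": "get_weather", "arguments": {"city": " paris ", "days": 3.0}}'
print(score_tool_call(model_output, expected_call, strict=False)["passed"])
print(score_tool_call(model_output, expected_call, strict=True)["passed"])
```

Under these assumptions the same output passes in fuzzy mode and fails in strict mode, which is the kind of near miss small models tend to produce.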

**Config overrides** (`benchmark_overrides.bfcl`):

| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL instead of Hub download |
| `categories` | `[]` (all) | Filter BFCL categories |
| `strict` | `false` | Require perfect argument match |
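
As a rough sketch of how such overrides might be layered over the defaults in the table, the snippet below represents the parsed config as a Python dict. The on-disk config format, the use of `None` to mean "download from the Hub", the `"simple"` category value, and the `resolve_overrides` helper are all assumptions made for illustration.

```python
from typing import Any, Dict

# Defaults from the table above; None stands in for "Hub" (assumption).
BFCL_DEFAULTS: Dict[str, Any] = {"data_path": None, "categories": [], "strict": False}


def resolve_overrides(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    # Reject keys the benchmark does not know, then layer overrides on top.
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown override keys: {sorted(unknown)}")
    return {**defaults, **overrides}


# Hypothetical parsed config with a `benchmark_overrides.bfcl` section.
config = {"benchmark_overrides": {"bfcl": {"strict": True, "categories": ["simple"]}}}
resolved = resolve_overrides(BFCL_DEFAULTS, config["benchmark_overrides"]["bfcl"])
print(resolved)
```

Failing on unknown keys is a design choice in this sketch: a misspelled override would otherwise be ignored silently and the run would use the default.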

**Implementation:** `src/slm_evals/benchmarks/bfcl.py`

---

## τ-bench (`tau_bench`)

**Goal:** Multi-turn dialogue where the model acts as a tool-using agent while a simulated user drives the conversation toward a goal (e.g. retail order change).

**Scoring:** Task success after up to `max_turns` exchanges: did the agent satisfy the user's underlying intent using the right tools?
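
A capped dialogue loop of this kind could look like the sketch below. All names (`run_dialogue`, `scripted_user`, `toy_agent`, `order_updated`) and the `TOOL ...` message convention are hypothetical; the real implementation in `tau_bench.py` may structure turns and success checks differently.

```python
from typing import Any, Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # (speaker, message) pairs


def run_dialogue(
    agent_fn: Callable[[History], str],
    user_fn: Callable[[History], str],
    is_done: Callable[[History], bool],
    max_turns: int = 15,
) -> Dict[str, Any]:
    history: History = []
    user_msg = user_fn(history)
    for turn in range(1, max_turns + 1):
        history.append(("user", user_msg))
        history.append(("agent", agent_fn(history)))
        if is_done(history):
            return {"passed": True, "turns": turn}
        user_msg = user_fn(history)
    # The cap was reached without satisfying the user's goal.
    return {"passed": False, "turns": max_turns}


def scripted_user(history: History) -> str:
    # Rule-based user: states the goal first, then confirms.
    return "Please change my order." if not history else "Yes, go ahead."


def toy_agent(history: History) -> str:
    user_turns = sum(1 for speaker, _ in history if speaker == "user")
    return "Which order?" if user_turns == 1 else "TOOL update_order"


def order_updated(history: History) -> bool:
    return history[-1][1].startswith("TOOL update_order")


print(run_dialogue(toy_agent, scripted_user, order_updated, max_turns=15))
print(run_dialogue(toy_agent, scripted_user, order_updated, max_turns=1))
```

With the cap at 1 the toy agent never reaches its tool call, which illustrates why a low `max_turns` can turn otherwise solvable tasks into failures.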

**Config overrides** (`benchmark_overrides.tau_bench`):

| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL |
| `domain` | `retail` | `retail`, `airline`, or `both` |
| `max_turns` | `15` | Dialogue cap |
| `use_llm_user` | `false` | `true` → GPT-4o user simulator (paid API) |

**Notes:**

- Default user simulator is rule-based, so no API key is required
- Small models often struggle on long horizons; start with `--max-samples 10`

**Implementation:** `src/slm_evals/benchmarks/tau_bench.py`

---

## GAIA (`gaia`)

**Goal:** Real-world assistant tasks requiring reasoning, optional tool use, and concise final answers (web search, files, calculation, etc.).

**Prompt style:** Question + level metadata; tool availability depends on `tool_mode`.

**Scoring:** Normalized answer match against GAIA reference (with level breakdown in aggregates).
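
One plausible reading of "normalized answer match" is sketched below. The normalization steps (lowercasing, dropping punctuation and English articles, comparing numeric answers by value) are assumptions chosen for illustration; they are not taken from `gaia.py` or from the official GAIA scorer.

```python
import re
import string
from typing import Optional


def normalize_answer(text: str) -> str:
    # Lowercase, drop punctuation and articles, collapse whitespace.
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def _as_number(text: str) -> Optional[float]:
    # Accept forms like "1,000", "$12.50", or "40%".
    cleaned = text.strip().replace(",", "").lstrip("$").rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        return None


def answers_match(prediction: str, reference: str) -> bool:
    ref_number = _as_number(reference)
    if ref_number is not None:
        # Numeric references are compared by value, not by spelling.
        return _as_number(prediction) == ref_number
    return normalize_answer(prediction) == normalize_answer(reference)


print(answers_match("The Eiffel Tower.", "eiffel tower"))
print(answers_match("1,000", "1000"))
print(answers_match("3.14", "314"))
```

Numbers are handled before punctuation is stripped so that a decimal such as `3.14` is not collapsed into `314`.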

**Config overrides** (`benchmark_overrides.gaia`):

| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL |
| `split` | `validation` | Public `validation`; `test` may need HF auth |
| `levels` | `[1, 2]` | Difficulty levels 1–3 |
| `tool_mode` | `describe` | `describe` = offline tool docs; `none` = no tools |

**Notes:**

- `tool_mode: describe` does not execute live tools, which makes it suitable for offline SLM scoring
- For live tool eval, extend `gaia.py` with real tool backends

**Implementation:** `src/slm_evals/benchmarks/gaia.py`

---

## SWE-bench Verified (`swe_bench`)

**Goal:** Given a GitHub issue and codebase context, produce a unified diff that fixes the bug.

**Modes:**

| `full_eval` | Behavior |
| ----------- | -------- |
| `false` (default) | Generate patch text; score with lightweight heuristics / match checks (no Docker) |
| `true` | Official SWE-bench harness, which runs tests in containers (`swebench` + Docker) |
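
The document does not spell out which lightweight heuristics are used when `full_eval` is `false`. As one example of what such a check could be, the sketch below tests whether the model output is shaped like a unified diff at all; `looks_like_unified_diff` is a hypothetical helper and says nothing about whether the patch is correct.

```python
import re

# A hunk header such as "@@ -1,3 +1,3 @@".
HUNK_RE = re.compile(r"^@@ -\d+(,\d+)? \+\d+(,\d+)? @@", re.MULTILINE)


def looks_like_unified_diff(patch: str) -> bool:
    # Require an old-file header, a new-file header, and at least one hunk.
    has_old = re.search(r"^--- \S+", patch, re.MULTILINE) is not None
    has_new = re.search(r"^\+\+\+ \S+", patch, re.MULTILINE) is not None
    return has_old and has_new and HUNK_RE.search(patch) is not None


sample_patch = """--- a/pkg/util.py
+++ b/pkg/util.py
@@ -1,3 +1,3 @@
 def add(a, b):
-    return a - b
+    return a + b
"""

print(looks_like_unified_diff(sample_patch))
print(looks_like_unified_diff("Here is my fix: change - to +"))
```

A format check like this is cheap enough for iterative prompt tuning, while the Docker harness remains the only mode that actually runs the tests.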

**Config overrides** (`benchmark_overrides.swe_bench`):

| Key | Default | Description |
| --- | ------- | ----------- |
| `data_path` | Hub | Local JSONL |
| `full_eval` | `false` | Enable Docker harness |
| `context_lines` | `80` | Surrounding code context in prompt |

**Notes:**

- Full eval is slow and resource-heavy; use it for final validation only
- SLMs typically score low; use `--max-samples` for iterative prompt tuning

**Implementation:** `src/slm_evals/benchmarks/swe_bench.py`

---

## Model loading

Shared loader: `src/slm_evals/utils/model_loader.py`

Returns a `model_bundle` dict passed to each benchmark:

- `generate_fn(prompt, max_new_tokens, temperature)`: unified generation interface
- `param_count`: billions of parameters (for reporting)
- Underlying `model` / `tokenizer` handles

Quantization (`int8`, `int4`) uses `bitsandbytes` when available.
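
To exercise a benchmark without loading real weights, a stand-in bundle with the same shape can be handy. The sketch below assumes the bundle keys are named exactly `generate_fn`, `param_count`, `model`, and `tokenizer`; `make_dummy_bundle` is hypothetical and is not part of `model_loader.py`.

```python
from typing import Any, Dict


def make_dummy_bundle() -> Dict[str, Any]:
    def generate_fn(prompt: str, max_new_tokens: int = 64, temperature: float = 0.0) -> str:
        # Echo the first `max_new_tokens` whitespace-separated words of the
        # prompt; a real loader would call the model here.
        return " ".join(prompt.split()[:max_new_tokens])

    return {
        "generate_fn": generate_fn,
        "param_count": 0.0,  # billions of parameters; zero for the dummy
        "model": None,
        "tokenizer": None,
    }


bundle = make_dummy_bundle()
print(bundle["generate_fn"]("one two three four", max_new_tokens=2, temperature=0.0))
```

Because benchmarks only see `generate_fn`, swapping in a deterministic fake like this keeps unit tests fast and independent of GPU availability.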

---

## Reporter output schema

`Reporter.save()` (`src/slm_evals/utils/reporter.py`) writes:

**Per benchmark in JSON:**

```json
{
  "name": "bfcl",
  "total": 100,
  "passed": 42,
  "score": 0.42,
  "samples": [...]
}
```

**Aggregate fields:**

- `experiment_name`, `model_path`, `timestamp`
- `aggregate_score`: mean of benchmark scores

CSV columns: `benchmark`, `total`, `passed`, `score`.
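
The aggregate and the CSV layout can be reproduced in a few lines. This is a sketch of the schema described above, not the code in `reporter.py`; the `gaia` numbers are made up for the example, and the result is written to an in-memory buffer rather than a file.

```python
import csv
import io
from statistics import mean

# Per-benchmark results in the JSON shape shown above (samples omitted).
results = [
    {"name": "bfcl", "total": 100, "passed": 42, "score": 0.42},
    {"name": "gaia", "total": 50, "passed": 10, "score": 0.20},  # illustrative
]

# aggregate_score is the unweighted mean of the benchmark scores.
aggregate_score = mean(r["score"] for r in results)

# Write the CSV columns: benchmark, total, passed, score.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["benchmark", "total", "passed", "score"])
for r in results:
    writer.writerow([r["name"], r["total"], r["passed"], r["score"]])

csv_lines = buffer.getvalue().splitlines()
print(round(aggregate_score, 4))
print(csv_lines)
```

Note that an unweighted mean gives a 50-sample benchmark the same influence as a 100-sample one, which matters when `--max-samples` differs between runs.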