	# Benchmark reference

	What each benchmark in `slm-evals` measures, where data comes from, and how to configure overrides.

	All benchmarks extend `BaseBenchmark` (`src/slm_evals/benchmarks/base.py`):

	1. `load_dataset()` — fetch samples (Hub or local JSONL)
	2. `build_prompt(sample)` — format the model input
	3. `evaluate_sample(sample, prediction)` — return `{passed, score, note}`
	4. `run()` — iterate, call `generate_fn`, aggregate scores (inherited)
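
The four-step lifecycle above can be sketched as follows. This is a minimal illustration, not the actual contents of `base.py`: `SketchBenchmark` and `EchoBenchmark` are hypothetical names, the constructor signature is assumed, and `generate_fn` is reduced to a single-argument callable.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable


class SketchBenchmark(ABC):
    """Hypothetical stand-in for BaseBenchmark; the real class may differ."""

    def __init__(self, generate_fn: Callable[[str], str]) -> None:
        self.generate_fn = generate_fn

    @abstractmethod
    def load_dataset(self) -> list[dict[str, Any]]: ...

    @abstractmethod
    def build_prompt(self, sample: dict[str, Any]) -> str: ...

    @abstractmethod
    def evaluate_sample(self, sample: dict[str, Any], prediction: str) -> dict[str, Any]: ...

    def run(self) -> dict[str, Any]:
        # Inherited loop: build prompt, generate, evaluate, then aggregate.
        results = []
        for sample in self.load_dataset():
            prediction = self.generate_fn(self.build_prompt(sample))
            results.append(self.evaluate_sample(sample, prediction))
        total = len(results)
        passed = sum(1 for r in results if r["passed"])
        score = sum(r["score"] for r in results) / total if total else 0.0
        return {"total": total, "passed": passed, "score": score, "samples": results}


class EchoBenchmark(SketchBenchmark):
    """Toy benchmark: a sample passes if the model repeats the expected word."""

    def load_dataset(self):
        return [{"word": "alpha"}, {"word": "beta"}]

    def build_prompt(self, sample):
        return f"Repeat this word: {sample['word']}"

    def evaluate_sample(self, sample, prediction):
        ok = prediction.strip() == sample["word"]
        return {"passed": ok, "score": 1.0 if ok else 0.0, "note": "" if ok else "mismatch"}


# Stub generator that always answers "alpha", so one of the two samples passes.
summary = EchoBenchmark(generate_fn=lambda prompt: "alpha").run()
print(summary["passed"], summary["total"], summary["score"])
```

A subclass only supplies the three abstract methods; the aggregation in `run()` is shared, which is why every benchmark reports the same `total` / `passed` / `score` fields.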

	---

	## Summary table

	| Key | Benchmark | Measures | Default dataset |
	| --- | --------- | -------- | --------------- |
	| `bfcl` | Berkeley Function-Calling Leaderboard v4 | Single-turn function call accuracy | `gorilla-llm/Berkeley-Function-Calling-Leaderboard` |
	| `tau_bench` | τ-bench | Multi-turn tool + user simulation | `ShishirPatil/tau-bench` |
	| `gaia` | GAIA | End-to-end agent tasks (reasoning + tools) | `gaia-benchmark/GAIA` |
	| `swe_bench` | SWE-bench Verified | Code patch generation for real issues | `princeton-nlp/SWE-bench_Verified` |

	---

	## BFCL (`bfcl`)

	Goal: Given a user request and a function schema, does the model emit a valid JSON tool call with the correct name and arguments?

	Prompt style: System message lists available functions; model must reply with only:

	```json
	{"name": "<function_name>", "arguments": {<key>: <value>}}
	```

	Scoring:

	- Function name must match exactly
	- Arguments: exact match if `strict: true`, fuzzy match if `strict: false` (recommended for SLMs)
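
The source does not spell out what "fuzzy" means, so the sketch below shows one possible reading: strings are compared case- and whitespace-insensitively, and numeric strings are treated as numbers. `score_call` and `_normalize` are hypothetical helpers, not functions from `bfcl.py`.

```python
import json
import math
from typing import Any


def _normalize(value: Any) -> Any:
    """Loosen a value for fuzzy comparison (assumed rule, not the project's)."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        text = value.strip().lower()
        try:
            number = float(text)
        except ValueError:
            return text
        # Keep "nan"/"inf" as text so they still compare equal to themselves.
        return number if math.isfinite(number) else text
    if isinstance(value, dict):
        return {key: _normalize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [_normalize(item) for item in value]
    return value


def score_call(prediction: str, expected: dict, strict: bool = False) -> dict:
    """Score one predicted tool call against the expected name and arguments."""
    try:
        call = json.loads(prediction)
    except json.JSONDecodeError:
        return {"passed": False, "score": 0.0, "note": "invalid JSON"}
    # The function name must always match exactly.
    if not isinstance(call, dict) or call.get("name") != expected["name"]:
        return {"passed": False, "score": 0.0, "note": "wrong function name"}
    got, want = call.get("arguments", {}), expected["arguments"]
    if not strict:
        got, want = _normalize(got), _normalize(want)
    ok = got == want
    return {"passed": ok, "score": 1.0 if ok else 0.0, "note": "" if ok else "argument mismatch"}


expected = {"name": "get_weather", "arguments": {"city": "paris", "days": 3}}
prediction = '{"name": "get_weather", "arguments": {"city": " Paris ", "days": "3"}}'
print(score_call(prediction, expected, strict=False)["passed"])
print(score_call(prediction, expected, strict=True)["passed"])
```

The same prediction passes in fuzzy mode and fails in strict mode, which illustrates why the non-strict setting is the gentler choice for small models that get formatting details slightly wrong.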

	Config overrides (`benchmark_overrides.bfcl`):

	| Key | Default | Description |
	| --- | ------- | ----------- |
	| `data_path` | Hub | Local JSONL instead of Hub download |
	| `categories` | `[]` (all) | Filter BFCL categories |
	| `strict` | `false` | Require exact argument match (disables fuzzy matching) |
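
As a rough illustration of how overrides like these might be layered on top of the defaults, the sketch below merges a `benchmark_overrides.bfcl` mapping into a defaults dict. `resolve_overrides` and `BFCL_DEFAULTS` are hypothetical names, `None` is an assumed stand-in for "download from the Hub", and the category name `simple` is only an example value.

```python
from typing import Any

# Defaults copied from the table above; None stands in for the Hub download (assumption).
BFCL_DEFAULTS: dict[str, Any] = {"data_path": None, "categories": [], "strict": False}


def resolve_overrides(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return defaults updated with overrides, rejecting keys the benchmark does not know."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown override keys: {sorted(unknown)}")
    return {**defaults, **overrides}


# Parsed experiment config, shown here as a plain dict.
config = {"benchmark_overrides": {"bfcl": {"strict": True, "categories": ["simple"]}}}
bfcl_settings = resolve_overrides(BFCL_DEFAULTS, config["benchmark_overrides"].get("bfcl", {}))
print(bfcl_settings)
```

Rejecting unknown keys is a design choice in this sketch: a misspelled override fails loudly instead of silently falling back to the default.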

	Implementation: `src/slm_evals/benchmarks/bfcl.py`

	---

	## τ-bench (`tau_bench`)

	Goal: Multi-turn dialogue where the model acts as a tool-using agent while a simulated user drives the conversation toward a goal (e.g. retail order change).

	Scoring: Task success after up to `max_turns` exchanges — did the agent satisfy the user's underlying intent using the right tools?

	Config overrides (`benchmark_overrides.tau_bench`):

	| Key | Default | Description |
	| --- | ------- | ----------- |
	| `data_path` | Hub | Local JSONL |
	| `domain` | `retail` | `retail`, `airline`, or `both` |
	| `max_turns` | `15` | Dialogue cap |
	| `use_llm_user` | `false` | `true` switches to a GPT-4o user simulator (paid API) |

	Notes:

	- Default user simulator is rule-based — no API key required
	- Small models often struggle on long horizons; start with `--max-samples 10`
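
To make the interaction between the rule-based user and the `max_turns` cap concrete, here is a toy dialogue loop. It is a sketch under stated assumptions: `rule_based_user`, `run_dialogue`, `toy_agent`, and the tool names are all hypothetical, the agent is reduced to returning at most one tool name per turn, and success is approximated as "all required tools were called".

```python
from typing import Callable, Optional


def rule_based_user(script: list[str], turn: int) -> Optional[str]:
    """Scripted user: emits one request per turn, then stops."""
    return script[turn] if turn < len(script) else None


def run_dialogue(
    agent_fn: Callable[[str], Optional[str]],
    script: list[str],
    required_tools: set[str],
    max_turns: int = 15,
) -> dict:
    """Run up to max_turns exchanges and check whether every required tool was used."""
    used: set[str] = set()
    turns = 0
    for turn in range(max_turns):
        message = rule_based_user(script, turn)
        if message is None:
            break  # the user has nothing left to ask
        turns += 1
        tool = agent_fn(message)
        if tool is not None:
            used.add(tool)
    missing = sorted(required_tools - used)
    passed = not missing
    return {"passed": passed, "score": 1.0 if passed else 0.0, "note": f"turns={turns}, missing={missing}"}


def toy_agent(message: str) -> Optional[str]:
    """Keyword-matching agent standing in for the model under test."""
    if "look up" in message:
        return "get_order"
    if "change" in message:
        return "update_address"
    return None


script = ["Please look up order 123", "Now change the shipping address"]
required = {"get_order", "update_address"}
full = run_dialogue(toy_agent, script, required)
capped = run_dialogue(toy_agent, script, required, max_turns=1)
print(full["passed"], capped["passed"])
```

With the default cap the toy agent completes both steps; with `max_turns=1` the dialogue is cut off before the second tool call, so the task fails.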

	Implementation: `src/slm_evals/benchmarks/tau_bench.py`

	---

	## GAIA (`gaia`)

	Goal: Real-world assistant tasks that require reasoning, optional tool use (web search, files, calculation, etc.), and a concise final answer.

	Prompt style: Question + level metadata; tool availability depends on `tool_mode`.

	Scoring: Normalized answer match against GAIA reference (with level breakdown in aggregates).
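
A simplified picture of normalized matching with a per-level breakdown is sketched below. The normalization rules here (lowercasing, stripping punctuation and thousands separators) are assumptions for illustration; the official GAIA scorer and the code in `gaia.py` may apply different rules. `normalize_answer` and `score_by_level` are hypothetical names.

```python
import re
from collections import defaultdict


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation except decimal points, and collapse whitespace."""
    text = text.strip().lower()
    text = re.sub(r"[^\w\s.]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(".")


def score_by_level(rows: list[tuple[int, str, str]]) -> dict[int, float]:
    """Compute accuracy per difficulty level from (level, prediction, reference) rows."""
    buckets: dict[int, list[bool]] = defaultdict(list)
    for level, prediction, reference in rows:
        buckets[level].append(normalize_answer(prediction) == normalize_answer(reference))
    return {level: sum(hits) / len(hits) for level, hits in sorted(buckets.items())}


rows = [
    (1, "Paris.", "paris"),
    (1, "42", "43"),
    (2, " 1,000 ", "1000"),
]
breakdown = score_by_level(rows)
print(breakdown)
```

Reporting the breakdown alongside the overall score makes it easier to see whether a small model is failing uniformly or only on the harder levels.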

	Config overrides (`benchmark_overrides.gaia`):

	| Key | Default | Description |
	| --- | ------- | ----------- |
	| `data_path` | Hub | Local JSONL |
	| `split` | `validation` | Public `validation`; `test` may need HF auth |
	| `levels` | `[1, 2]` | Difficulty levels to include (from 1–3) |
	| `tool_mode` | `describe` | `describe` = offline tool docs; `none` = no tools |

	Notes:

	- `tool_mode: describe` does not execute live tools — suitable for offline SLM scoring
	- For live tool eval, extend `gaia.py` with real tool backends

	Implementation: `src/slm_evals/benchmarks/gaia.py`

	---

	## SWE-bench Verified (`swe_bench`)

	Goal: Given a GitHub issue and codebase context, produce a unified diff that fixes the bug.

	Modes:

	| `full_eval` | Behavior |
	| ----------- | -------- |
	| `false` (default) | Generate patch text; score with lightweight heuristics / match checks; no Docker |
	| `true` | Official SWE-bench harness; runs tests in containers (`swebench` + Docker) |
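
The exact heuristics used when `full_eval` is `false` are not described here, so the following is only one plausible example of a Docker-free check: confirm the prediction contains a unified-diff hunk header, then measure how many of the files touched by the reference patch are also touched by the prediction. `heuristic_score` and `touched_files` are hypothetical helpers, and the file paths are made up.

```python
import re

# "+++ b/<path>" lines name the files a unified diff modifies.
FILE_HEADER = re.compile(r"^\+\+\+ b/(\S+)", re.MULTILINE)
# "@@ -start[,count] +start[,count] @@" marks the start of a hunk.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", re.MULTILINE)


def touched_files(patch: str) -> set[str]:
    """Return the set of file paths modified by a unified diff."""
    return set(FILE_HEADER.findall(patch))


def heuristic_score(predicted: str, reference: str) -> dict:
    """Score a predicted patch by diff shape and file overlap, without running tests."""
    if not HUNK_HEADER.search(predicted):
        return {"passed": False, "score": 0.0, "note": "no hunk header; not a unified diff"}
    want, got = touched_files(reference), touched_files(predicted)
    overlap = len(want & got) / len(want) if want else 0.0
    return {"passed": overlap == 1.0, "score": overlap, "note": f"file overlap {overlap:.2f}"}


reference = "\n".join([
    "--- a/pkg/core.py",
    "+++ b/pkg/core.py",
    "@@ -10 +10 @@",
    "-    return x",
    "+    return x + 1",
])
same_file = reference.replace("x + 1", "x + 2")
other_file = reference.replace("pkg/core.py", "pkg/utils.py")

print(heuristic_score(same_file, reference)["score"])
print(heuristic_score(other_file, reference)["score"])
print(heuristic_score("I would fix the bug.", reference)["note"])
```

A check like this only says whether the patch looks structurally plausible. It cannot tell whether the fix is correct, which is what the container-based harness is for.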

	Config overrides (`benchmark_overrides.swe_bench`):

	| Key | Default | Description |
	| --- | ------- | ----------- |
	| `data_path` | Hub | Local JSONL |
	| `full_eval` | `false` | Enable Docker harness |
	| `context_lines` | `80` | Surrounding code context in prompt |

	Notes:

	- Full eval is slow and resource-heavy — use for final validation only
	- SLMs typically score low; use `--max-samples` for iterative prompt tuning

	Implementation: `src/slm_evals/benchmarks/swe_bench.py`

	---

	## Model loading

	Shared loader: `src/slm_evals/utils/model_loader.py`

	Returns a `model_bundle` dict passed to each benchmark:

	- `generate_fn(prompt, max_new_tokens, temperature)` — unified generation interface
	- `param_count` — billions of parameters (for reporting)
	- Underlying `model` / `tokenizer` handles
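
For testing a benchmark without loading real weights, a stub bundle with the same shape can be handy. The sketch below assumes the dict keys are named `generate_fn`, `param_count`, `model`, and `tokenizer` as listed above; `make_stub_bundle` is a hypothetical helper and the echo behavior is purely illustrative.

```python
from typing import Any


def make_stub_bundle(param_count_b: float) -> dict[str, Any]:
    """Build a model_bundle-shaped dict backed by a trivial echo generator."""

    def generate_fn(prompt: str, max_new_tokens: int = 64, temperature: float = 0.0) -> str:
        # Stand-in for real generation: return the first max_new_tokens words of the prompt.
        # temperature is accepted for interface compatibility but ignored here.
        return " ".join(prompt.split()[:max_new_tokens])

    return {
        "generate_fn": generate_fn,
        "param_count": param_count_b,  # billions of parameters, for reporting
        "model": None,                 # no real model in the stub
        "tokenizer": None,             # no real tokenizer in the stub
    }


bundle = make_stub_bundle(0.5)
reply = bundle["generate_fn"]("hello world again", max_new_tokens=2, temperature=0.0)
print(reply, bundle["param_count"])
```

Because benchmarks only depend on the `generate_fn` signature, swapping in a stub like this keeps unit tests fast and independent of GPU availability.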

	Quantization (`int8`, `int4`) uses `bitsandbytes` when available.

	---

	## Reporter output schema

	`Reporter.save()` (`src/slm_evals/utils/reporter.py`) writes:

	Per benchmark in JSON:

	```json
	{
	  "name": "bfcl",
	  "total": 100,
	  "passed": 42,
	  "score": 0.42,
	  "samples": [...]
	}
	```

	Aggregate fields:

	- `experiment_name`, `model_path`, `timestamp`
	- `aggregate_score` — mean of benchmark scores

	CSV columns: `benchmark`, `total`, `passed`, `score`.
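
The aggregate and CSV outputs described above can be approximated as follows. This is a sketch, not the contents of `reporter.py`: `build_report` and `to_csv` are hypothetical functions, the top-level `benchmarks` key is an assumption, and the numbers are illustrative rather than real results.

```python
import csv
import io
from statistics import mean


def build_report(experiment_name: str, model_path: str, timestamp: str, benchmarks: list[dict]) -> dict:
    """Combine per-benchmark results with the aggregate fields."""
    return {
        "experiment_name": experiment_name,
        "model_path": model_path,
        "timestamp": timestamp,
        # Unweighted mean: each benchmark counts equally regardless of sample count.
        "aggregate_score": mean(b["score"] for b in benchmarks) if benchmarks else 0.0,
        "benchmarks": benchmarks,
    }


def to_csv(benchmarks: list[dict]) -> str:
    """Render one CSV row per benchmark with the documented columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["benchmark", "total", "passed", "score"])
    for b in benchmarks:
        writer.writerow([b["name"], b["total"], b["passed"], b["score"]])
    return buffer.getvalue()


# Illustrative values only.
results = [
    {"name": "bfcl", "total": 100, "passed": 42, "score": 0.42, "samples": []},
    {"name": "gaia", "total": 50, "passed": 10, "score": 0.2, "samples": []},
]
report = build_report("demo-run", "path/to/model", "2024-01-01T00:00:00", results)
csv_text = to_csv(results)
print(round(report["aggregate_score"], 2))
print(csv_text)
```

Since the aggregate is a mean of benchmark scores rather than of samples, a benchmark run with `--max-samples 10` weighs as much as one run on the full dataset.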
