Spaces:

MSGEncrypted
/

lesson-agent-dev

Sleeping

App Files Files Community

lesson-agent-dev / research /evals /docs /benchmarks.md

MSG

Merge pull request #4 from MSghais/experiment/small_model_building_testing

59e2c8a 17 days ago

preview code

Raw

History Blame Contribute Delete

5.33 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

Benchmark reference

What each benchmark in slm-evals measures, where data comes from, and how to configure overrides.

All benchmarks extend BaseBenchmark (src/slm_evals/benchmarks/base.py):

load_dataset() — fetch samples (Hub or local JSONL)
build_prompt(sample) — format the model input
evaluate_sample(sample, prediction) — return {passed, score, note}
run() — iterate, call generate_fn, aggregate scores (inherited)

Summary table

Key	Benchmark	Measures	Default dataset
`bfcl`	Berkeley Function-Calling Leaderboard v4	Single-turn function call accuracy	`gorilla-llm/Berkeley-Function-Calling-Leaderboard`
`tau_bench`	τ-bench	Multi-turn tool + user simulation	`ShishirPatil/tau-bench`
`gaia`	GAIA	End-to-end agent tasks (reasoning + tools)	`gaia-benchmark/GAIA`
`swe_bench`	SWE-bench Verified	Code patch generation for real issues	`princeton-nlp/SWE-bench_Verified`

BFCL (`bfcl`)

Goal: Given a user request and a function schema, does the model emit a valid JSON tool call with the correct name and arguments?

Prompt style: System message lists available functions; model must reply with only:

{"name": "<function_name>", "arguments": {<key>: <value>}}

Scoring:

Function name must match exactly
Arguments: exact match if strict: true, fuzzy match if strict: false (recommended for SLMs)

Config overrides (benchmark_overrides.bfcl):

Key	Default	Description
`data_path`	Hub	Local JSONL instead of Hub download
`categories`	`[]` (all)	Filter BFCL categories
`strict`	`false`	Require perfect argument match

Implementation: src/slm_evals/benchmarks/bfcl.py

τ-bench (`tau_bench`)

Goal: Multi-turn dialogue where the model acts as a tool-using agent while a simulated user drives the conversation toward a goal (e.g. retail order change).

Scoring: Task success after up to max_turns exchanges — did the agent satisfy the user's underlying intent using the right tools?

Config overrides (benchmark_overrides.tau_bench):

Key	Default	Description
`data_path`	Hub	Local JSONL
`domain`	`retail`	`retail`, `airline`, or `both`
`max_turns`	`15`	Dialogue cap
`use_llm_user`	`false`	`true` → GPT-4o user simulator (paid API)

Notes:

Default user simulator is rule-based — no API key required
Small models often struggle on long horizons; start with --max-samples 10

Implementation: src/slm_evals/benchmarks/tau_bench.py

GAIA (`gaia`)

Goal: Real-world assistant tasks requiring reasoning, optional tool use, and concise final answers (web search, files, calculation, etc.).

Prompt style: Question + level metadata; tool availability depends on tool_mode.

Scoring: Normalized answer match against GAIA reference (with level breakdown in aggregates).

Config overrides (benchmark_overrides.gaia):

Key	Default	Description
`data_path`	Hub	Local JSONL
`split`	`validation`	Public `validation`; `test` may need HF auth
`levels`	`[1, 2]`	Difficulty levels 1–3
`tool_mode`	`describe`	`describe` = offline tool docs; `none` = no tools

Notes:

tool_mode: describe does not execute live tools — suitable for offline SLM scoring
For live tool eval, extend gaia.py with real tool backends

Implementation: src/slm_evals/benchmarks/gaia.py

SWE-bench Verified (`swe_bench`)

Goal: Given a GitHub issue and codebase context, produce a unified diff that fixes the bug.

Modes:

`full_eval`	Behavior
`false` (default)	Generate patch text; score with lightweight heuristics / match checks — no Docker
`true`	Official SWE-bench harness — runs tests in containers (`swebench` + Docker)

Config overrides (benchmark_overrides.swe_bench):

Key	Default	Description
`data_path`	Hub	Local JSONL
`full_eval`	`false`	Enable Docker harness
`context_lines`	`80`	Surrounding code context in prompt

Notes:

Full eval is slow and resource-heavy — use for final validation only
SLMs typically score low; use --max-samples for iterative prompt tuning

Implementation: src/slm_evals/benchmarks/swe_bench.py

Model loading

Shared loader: src/slm_evals/utils/model_loader.py

Returns a model_bundle dict passed to each benchmark:

generate_fn(prompt, max_new_tokens, temperature) — unified generation interface
param_count — billions of parameters (for reporting)
Underlying model / tokenizer handles

Quantization (int8, int4) uses bitsandbytes when available.

Reporter output schema

Reporter.save() (src/slm_evals/utils/reporter.py) writes:

Per benchmark in JSON:

{
  "name": "bfcl",
  "total": 100,
  "passed": 42,
  "score": 0.42,
  "samples": [...]
}

Aggregate fields:

experiment_name, model_path, timestamp
aggregate_score — mean of benchmark scores

CSV columns: benchmark, total, passed, score.

Benchmark reference

Summary table

BFCL (bfcl)

τ-bench (tau_bench)

GAIA (gaia)

SWE-bench Verified (swe_bench)

Model loading

Reporter output schema

BFCL (`bfcl`)

τ-bench (`tau_bench`)

GAIA (`gaia`)

SWE-bench Verified (`swe_bench`)