SLM Agentic Benchmark Suite
A uv workspace package to evaluate local HuggingFace models against agentic and academic benchmarks.
Docs: USAGE.md (commands and workflows) · docs/benchmarks.md (per-benchmark reference) · ../USAGE.md (full research tree)
| Suite | CLI | What it measures |
|---|---|---|
| Agentic | slm-benchmark | BFCL, τ-bench, GAIA, SWE-bench |
| Academic | slm-lm-eval | ARC, HellaSwag, GSM8K, … (lm-evaluation-harness) |
Install
From the repo root:
uv sync --group evals
uv sync --group lm-eval # optional: slm-lm-eval academic benchmarks
Quickstart
# From repo root (recommended)
uv run --package slm-evals slm-benchmark \
--model openbmb/MiniCPM5-1B \
--benchmarks bfcl tau_bench \
--max-samples 20
# Or as a module
uv run --package slm-evals python -m slm_evals.run_benchmark \
--model openbmb/MiniCPM5-1B \
--benchmarks bfcl tau_bench \
--max-samples 20
# YAML config
uv run --package slm-evals slm-benchmark \
--config research/evals/configs/experiment_001.yaml
Project structure
research/evals/
├── pyproject.toml
├── configs/
│   └── experiment_001.yaml
├── src/slm_evals/
│   ├── run_benchmark.py
│   ├── benchmarks/
│   │   ├── base.py
│   │   ├── bfcl.py
│   │   ├── tau_bench.py
│   │   ├── gaia.py
│   │   └── swe_bench.py
│   └── utils/
│       ├── model_loader.py
│       ├── reporter.py
│       └── config_loader.py
└── results/                 # created at runtime (relative to cwd)
CLI reference
--model            Path to local HF model dir (or Hub ID)
--benchmarks       Space-separated: bfcl tau_bench gaia swe_bench all
--config           YAML config file (overrides CLI flags)
--max-samples      Cap samples per benchmark
--output-dir       Results directory (default: ./results)
--experiment-name  Tag for this run
--device           auto | cpu | cuda | cuda:0
--dtype            float32 | float16 | bfloat16 | int8 | int4
--max-new-tokens   Max tokens per generation (default: 512)
--temperature      Sampling temp (default: 0.0 = greedy)
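When runs are launched from a script rather than typed by hand, it can help to assemble the command line programmatically. The sketch below is one way to do that using only the flags listed above; `build_benchmark_command` is a hypothetical helper and is not part of the package.

```python
import shlex
from typing import List, Optional, Sequence


def build_benchmark_command(
    model: str,
    benchmarks: Sequence[str],
    max_samples: Optional[int] = None,
    output_dir: Optional[str] = None,
    experiment_name: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for slm-benchmark (hypothetical helper)."""
    cmd = [
        "uv", "run", "--package", "slm-evals", "slm-benchmark",
        "--model", model,
        "--benchmarks", *benchmarks,  # space-separated, as in the CLI reference
    ]
    # Optional flags are appended only when a value was supplied.
    if max_samples is not None:
        cmd += ["--max-samples", str(max_samples)]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if experiment_name is not None:
        cmd += ["--experiment-name", experiment_name]
    return cmd


cmd = build_benchmark_command(
    "openbmb/MiniCPM5-1B", ["bfcl", "tau_bench"], max_samples=20
)
print(shlex.join(cmd))  # shell-safe rendering of the same command
```

Passing the list to `subprocess.run(cmd, check=True)` from the repo root would launch the run; keeping the arguments as a list avoids shell-quoting problems with model paths.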
Adding a custom benchmark
- Create src/slm_evals/benchmarks/my_bench.py and subclass BaseBenchmark.
- Register it in src/slm_evals/run_benchmark.py → BENCHMARK_REGISTRY.
- Run:
uv run --package slm-evals slm-benchmark --model ./my-model --benchmarks my_bench
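The shape of such a subclass might look roughly like the sketch below. The real interface of `BaseBenchmark` is not shown in this README, so the stand-in base class, its method names (`load_samples`, `score`, `run`), and the dict-shaped registry are all assumptions made for illustration; check `base.py` and `run_benchmark.py` for the actual contract.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Optional


class BaseBenchmark(ABC):
    """Stand-in for the package's BaseBenchmark (interface assumed)."""

    name = "base"

    def __init__(self, max_samples: Optional[int] = None) -> None:
        self.max_samples = max_samples

    @abstractmethod
    def load_samples(self) -> List[dict]:
        """Return the evaluation samples."""

    @abstractmethod
    def score(self, sample: dict, output: str) -> float:
        """Return a score in [0, 1] for one model output."""

    def run(self, generate: Callable[[str], str]) -> dict:
        # Slicing with None keeps every sample, mirroring an unset --max-samples.
        samples = self.load_samples()[: self.max_samples]
        scores = [self.score(s, generate(s["prompt"])) for s in samples]
        accuracy = sum(scores) / len(scores) if scores else 0.0
        return {"accuracy": accuracy, "n": len(scores)}


class MyBench(BaseBenchmark):
    """Toy benchmark: substring match against a reference answer."""

    name = "my_bench"

    def load_samples(self) -> List[dict]:
        return [
            {"prompt": "2 + 2 =", "answer": "4"},
            {"prompt": "Capital of France?", "answer": "Paris"},
        ]

    def score(self, sample: dict, output: str) -> float:
        return 1.0 if sample["answer"].lower() in output.lower() else 0.0


# Assumed registry shape: benchmark name -> benchmark class.
BENCHMARK_REGISTRY = {"my_bench": MyBench}


def fake_generate(prompt: str) -> str:
    """Placeholder for a model call; answers only the arithmetic prompt."""
    return "4" if "2 + 2" in prompt else "I am not sure"


result = BENCHMARK_REGISTRY["my_bench"](max_samples=20).run(fake_generate)
print(result)
```

Keeping scoring in a per-sample method makes it straightforward to swap the toy substring match for a stricter metric later.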
Output formats
Results are written under <output-dir>/<experiment_name>/:
- results.json: full structured dump
- results.csv: one row per benchmark
- report.md: human-readable summary
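For post-processing, the JSON and CSV files can be read with the standard library alone. The sketch below makes no assumption about the keys or column names inside the files, since the schema is not documented here; `load_results` is a hypothetical helper, and the `experiment_001` directory name is a placeholder.

```python
import csv
import json
from pathlib import Path


def load_results(run_dir):
    """Return (parsed results.json, results.csv rows as a list of dicts)."""
    run_dir = Path(run_dir)
    with open(run_dir / "results.json", encoding="utf-8") as f:
        full = json.load(f)
    # newline="" is the csv module's recommended way to open CSV files.
    with open(run_dir / "results.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return full, rows


# Placeholder path: <output-dir>/<experiment_name> with the default output dir.
run_dir = Path("results") / "experiment_001"
if (run_dir / "results.json").exists() and (run_dir / "results.csv").exists():
    full, rows = load_results(run_dir)
    print(f"{len(rows)} benchmark row(s) in {run_dir}")
    for row in rows:
        print(row)
```

Because `csv.DictReader` takes its keys from the header row, the loop prints whatever columns the reporter wrote without hard-coding them.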
Notes
τ-bench user simulator: Default is a lightweight rule-based simulator. Set use_llm_user: true in config for the GPT-4o user agent (API cost).
SWE-bench full eval: Set full_eval: true to run the official Docker harness (pip install swebench docker).
GAIA tools: Offline by default (tool_mode: describe). Wire real tools in gaia.py for live eval.