Evals usage
Run the SLM Agentic Benchmark Suite (slm-evals) against a local HuggingFace model directory or Hub id.
Benchmark details: docs/benchmarks.md. Package overview: README.md.
Install
From the repo root:
uv sync --group evals
For academic benchmarks (lm-evaluation-harness):
uv sync --group lm-eval
This installs the slm-evals workspace package and registers the slm-benchmark and slm-lm-eval console scripts.
Quick start
# Two benchmarks, capped samples (good first run)
uv run --package slm-evals slm-benchmark \
--model openbmb/MiniCPM5-1B \
--benchmarks bfcl tau_bench \
--max-samples 20
# All four benchmarks
uv run --package slm-evals slm-benchmark \
--model ./models/finetuned/minicpm5-1b-lora \
--benchmarks all \
--max-samples 50
# Equivalent module invocation
uv run --package slm-evals python -m slm_evals.run_benchmark \
--model openbmb/MiniCPM5-1B \
--benchmarks bfcl \
--max-samples 10
Config-driven runs
Copy and edit the template, then pass --config:
cp research/evals/configs/experiment_001.yaml research/evals/configs/my_run.yaml
# edit model_path, benchmarks, max_samples, overrides
uv run --package slm-evals slm-benchmark \
--config research/evals/configs/my_run.yaml
When --config is set, YAML values override CLI flags. Use configs for reproducible experiment names and per-benchmark settings.
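The precedence rule can be pictured as a dictionary merge in which the YAML side wins. The sketch below is illustrative only: `merge_settings` is a hypothetical helper, not part of slm-evals, and it assumes that a key missing from the YAML falls back to the CLI value.

```python
def merge_settings(cli: dict, yaml_cfg: dict) -> dict:
    """Combine CLI flags with YAML values; YAML wins on conflicts."""
    merged = dict(cli)       # start from the CLI flags
    merged.update(yaml_cfg)  # any key present in the YAML replaces the CLI value
    return merged


# Hypothetical inputs, named after the template fields.
cli_flags = {"model_path": "openbmb/MiniCPM5-1B", "max_samples": 20, "temperature": 0.0}
yaml_values = {"max_samples": 50, "experiment_name": "my_run"}

settings = merge_settings(cli_flags, yaml_values)
print(settings["max_samples"])  # 50: the YAML value replaced the CLI flag
```

If a flag seems to be ignored, check whether the same key is set in the YAML file.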
Template fields
| Key | Description |
|---|---|
| model_path | Local directory or HF Hub id |
| device | auto, cpu, cuda, cuda:0, … |
| dtype | float32, float16, bfloat16, int8, int4 |
| max_new_tokens | Cap per generation (default 512) |
| temperature | 0.0 = greedy (recommended for evals) |
| experiment_name | Folder name under output_dir |
| output_dir | Root for results (default results) |
| benchmarks | List: bfcl, tau_bench, gaia, swe_bench |
| max_samples | Cap per benchmark; omit or null for full split |
| benchmark_overrides | Per-benchmark dict (see docs/benchmarks.md) |
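As a rough aid for editing the template, the fields above can be mirrored in a plain Python dict and sanity-checked before a long run. This is a sketch under stated assumptions: `check_config` is a hypothetical helper, the allowed values are copied from the table, and the real loader may validate differently or not at all.

```python
ALLOWED_DTYPES = {"float32", "float16", "bfloat16", "int8", "int4"}
KNOWN_BENCHMARKS = {"bfcl", "tau_bench", "gaia", "swe_bench"}


def check_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the dict looks sane."""
    problems = []
    if not cfg.get("model_path"):
        problems.append("model_path is required")
    dtype = cfg.get("dtype")
    if dtype is not None and dtype not in ALLOWED_DTYPES:
        problems.append(f"unknown dtype: {dtype}")
    unknown = set(cfg.get("benchmarks", [])) - KNOWN_BENCHMARKS
    if unknown:
        problems.append(f"unknown benchmarks: {sorted(unknown)}")
    max_samples = cfg.get("max_samples")
    if max_samples is not None and max_samples <= 0:
        problems.append("max_samples must be positive, or null for the full split")
    return problems


# Example values only; they mirror the keys in the table above.
example = {
    "model_path": "openbmb/MiniCPM5-1B",
    "device": "auto",
    "dtype": "bfloat16",
    "max_new_tokens": 512,
    "temperature": 0.0,
    "experiment_name": "my_run",
    "output_dir": "results",
    "benchmarks": ["bfcl", "tau_bench"],
    "max_samples": 20,
    "benchmark_overrides": {"bfcl": {"strict": False}},
}
print(check_config(example))
```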
CLI reference
slm-benchmark [OPTIONS]
  --list-benchmarks        Show agentic benchmark keys and preset suites
  --model PATH             Local HF dir or Hub id (required unless --config)
  --benchmarks NAMES       bfcl tau_bench gaia swe_bench all (default: all)
  --config PATH            YAML config (overrides other flags)
  --max-samples N          Cap samples per benchmark
  --output-dir DIR         Results root (default: ./results)
  --experiment-name TAG    Run folder name (auto timestamp if omitted)
  --device MAP             auto | cpu | cuda | cuda:0
  --dtype TYPE             float32 | float16 | bfloat16 | int8 | int4
  --max-new-tokens N       Default 512
  --temperature T          Default 0.0
Results
Each run writes to <output_dir>/<experiment_name>/:
| File | Contents |
|---|---|
| results.json | Full structured payload (per-sample + aggregates) |
| results.csv | One row per benchmark |
| report.md | Human-readable summary |
Example layout:
results/
└── minicpm5-1b__bfcl-tau__v1/
    ├── results.json
    ├── results.csv
    └── report.md
output_dir is resolved relative to the current working directory. Run from the repo root so paths stay predictable, or set an absolute output_dir in the YAML config.
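To see where a run will land from a given working directory, and whether the three files are present afterwards, a small check along these lines may help. `run_dir` and `missing_files` are hypothetical helpers written for this note; only the file names and the `<output_dir>/<experiment_name>` layout come from the text above.

```python
from pathlib import Path

EXPECTED_FILES = ("results.json", "results.csv", "report.md")


def run_dir(output_dir: str, experiment_name: str) -> Path:
    """Resolve <output_dir>/<experiment_name> against the current working directory."""
    return (Path(output_dir) / experiment_name).resolve()


def missing_files(directory: Path) -> list[str]:
    """List the expected result files that are not present in the directory."""
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]


target = run_dir("results", "minicpm5-1b__bfcl-tau__v1")
print(target)                 # absolute path; depends on where this is run
print(missing_files(target))  # lists whatever is absent from that directory
```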
Per-benchmark tips
BFCL (function calling)
- Default: downloads from gorilla-llm/Berkeley-Function-Calling-Leaderboard
- strict: false in YAML → fuzzy argument matching (better for small models)
- Local JSONL: set benchmark_overrides.bfcl.data_path
τ-bench (multi-turn tools)
- Domains: retail, airline, or both
- use_llm_user: false → free rule-based user simulator (default)
- use_llm_user: true → GPT-4o user agent (API cost)
GAIA
- Default split: validation (public)
- tool_mode: describe → offline tool descriptions (no live web)
- Level filter: levels: [1, 2] or [1, 2, 3]
SWE-bench Verified
- Default: lightweight patch-generation scoring (no Docker)
- full_eval: true → official harness (pip install swebench docker)
See docs/benchmarks.md for scoring semantics.
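The tips above can be gathered into one `benchmark_overrides` mapping. The sketch below writes it as a Python dict with the same nesting the YAML would use. The option names come from this page; placing the τ-bench, GAIA, and SWE-bench options under their benchmark keys is an assumption extrapolated from `benchmark_overrides.bfcl.data_path`, so check docs/benchmarks.md before relying on it.

```python
import json

# Option names are taken from the tips above; the nesting per benchmark key is assumed.
benchmark_overrides = {
    "bfcl": {"strict": False},               # fuzzy argument matching
    "tau_bench": {"use_llm_user": False},    # rule-based user simulator, no API cost
    "gaia": {
        "split": "validation",               # public split
        "tool_mode": "describe",             # offline tool descriptions
        "levels": [1, 2],
    },
    "swe_bench": {"full_eval": False},       # lightweight scoring, no Docker
}

print(json.dumps(benchmark_overrides, indent=2))
```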
lm-evaluation-harness (slm-lm-eval)
Run standard academic benchmarks (ARC, HellaSwag, PIQA, BoolQ, GSM8K) via EleutherAI lm-evaluation-harness.
Install: uv sync --group lm-eval
Full profile guide: docs/eval_profiles.md
Discover profiles and tasks
# Claim-matched lm-eval profiles (reasoning, code, smoke, …)
uv run --package slm-evals slm-lm-eval --list-profiles
# Also show agentic suites + external benchmark notes
uv run --package slm-evals slm-lm-eval --list-profiles-all
# lm-eval task names
uv run --package slm-evals slm-lm-eval --list-tasks
# Agentic benchmarks (BFCL, τ-bench, GAIA, SWE)
uv run --package slm-evals slm-benchmark --list-benchmarks
Quick start
# By profile name (recommended)
uv run --package slm-evals slm-lm-eval \
--profile reasoning \
--preset minicpm5-1b \
--experiment-name minicpm5-1b__reasoning-baseline
# Smoke profile (25 samples)
uv run --package slm-evals slm-lm-eval \
--profile smoke \
--preset minicpm5-1b \
--experiment-name minicpm5-1b__smoke
# LoRA adapter via preset (base + peft resolved automatically)
uv run --package slm-evals slm-lm-eval \
--config research/evals/configs/lm_eval_minicpm5.yaml \
--preset minicpm5-1b-lesson-lora \
--experiment-name minicpm5-1b-lora__v1
# Explicit base + adapter
uv run --package slm-evals slm-lm-eval \
--config research/evals/configs/lm_eval_smoke.yaml \
--model openbmb/MiniCPM5-1B \
--adapter ./models/finetuned/minicpm5-1b-lora \
--experiment-name minicpm5-1b-lora__manual
Compare baseline vs candidate
Use the same config for both runs and change only --preset and --experiment-name; on the candidate run, add --compare-to pointing at the baseline results.json:
uv run --package slm-evals slm-lm-eval \
--config research/evals/configs/lm_eval_compare_study.yaml \
--preset minicpm5-1b \
--experiment-name minicpm5-1b__baseline
uv run --package slm-evals slm-lm-eval \
--config research/evals/configs/lm_eval_compare_study.yaml \
--preset minicpm5-1b-lesson-lora \
--experiment-name minicpm5-1b-lora__v1 \
--compare-to results/lm_eval/minicpm5-1b__baseline/results.json
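Conceptually, the comparison subtracts baseline metrics from candidate metrics task by task. The sketch below is not the tool's implementation. It assumes each results.json carries a top-level `results` mapping of task to metrics, as lm-evaluation-harness payloads usually do, and the numbers are made up for illustration.

```python
def metric_deltas(baseline: dict, candidate: dict) -> dict:
    """Return {task: {metric: candidate - baseline}} for numeric metrics present in both runs."""
    deltas = {}
    for task, cand_metrics in candidate["results"].items():
        base_metrics = baseline["results"].get(task, {})
        for metric, cand_value in cand_metrics.items():
            base_value = base_metrics.get(metric)
            # Skip non-numeric entries such as task aliases.
            if isinstance(cand_value, (int, float)) and isinstance(base_value, (int, float)):
                deltas.setdefault(task, {})[metric] = cand_value - base_value
    return deltas


# Made-up payloads, not real results.
baseline = {"results": {"arc_easy": {"acc,none": 0.50, "alias": "arc_easy"}}}
candidate = {"results": {"arc_easy": {"acc,none": 0.75, "alias": "arc_easy"}}}

print(metric_deltas(baseline, candidate))  # {'arc_easy': {'acc,none': 0.25}}
```

Keeping the config identical between the two runs is what makes such deltas meaningful: tasks, few-shot count, limit, and seed all match.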
Config templates
Catalog: configs/eval_profiles.yaml, which maps claim → profile → tasks.
| Profile (--profile) | Config file | Purpose |
|---|---|---|
| smoke | lm_eval_smoke.yaml | Fast validation (limit: 25, 2 tasks) |
| reasoning | lm_eval_reasoning.yaml | Math + commonsense (GSM8K, ARC, HellaSwag) |
| understanding | lm_eval_understanding.yaml | NLU (BoolQ, PIQA, COPA, RTE) |
| code | lm_eval_code.yaml | HumanEval + MBPP |
| instructions | lm_eval_instructions.yaml | IFEval instruction following |
| general_slm | lm_eval_minicpm5.yaml | Full ~1B SLM profile (6 tasks) |
| compare_study | lm_eval_compare_study.yaml | Baseline vs finetune comparison defaults |

| Key | Description |
|---|---|
| tasks | lm-eval task names (e.g. arc_easy, gsm8k) |
| num_fewshot | Few-shot count (gsm8k may use task default 8) |
| limit | Max samples per task; null = full split |
| seed | Random seed (applied to all lm-eval RNGs) |
| batch_size | auto or integer |
| device | auto, cpu, cuda, … |
| dtype | bfloat16, float16, int4, … |
| trust_remote_code | Required for MiniCPM / Gemma presets |
| output_dir | Root for runs (default results/lm_eval) |
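One way to read "applied to all lm-eval RNGs" is that a single `seed` value is fanned out to the four seed arguments accepted by `lm_eval.simple_evaluate`. The sketch below only builds the keyword dict and does not import lm-eval. `lm_eval_kwargs` is a hypothetical helper, the fallback seed of 0 is an assumption, and how slm-lm-eval actually maps `device` and `dtype` is not shown because this page does not describe it.

```python
def lm_eval_kwargs(cfg: dict) -> dict:
    """Translate a few config keys into keyword arguments for lm_eval.simple_evaluate."""
    seed = cfg.get("seed", 0)  # fallback value is an assumption
    return {
        "tasks": list(cfg["tasks"]),
        "num_fewshot": cfg.get("num_fewshot"),  # None lets each task use its own default
        "limit": cfg.get("limit"),              # None evaluates the full split
        "batch_size": cfg.get("batch_size", "auto"),
        # One seed value is passed to every RNG lm-eval exposes.
        "random_seed": seed,
        "numpy_random_seed": seed,
        "torch_random_seed": seed,
        "fewshot_random_seed": seed,
    }


cfg = {"tasks": ["arc_easy", "gsm8k"], "limit": 25, "seed": 1234, "batch_size": "auto"}
kwargs = lm_eval_kwargs(cfg)
print(kwargs["limit"], kwargs["torch_random_seed"])  # 25 1234
```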
CLI reference
slm-lm-eval [OPTIONS]
  --list-profiles          Show claim-matched profiles and example commands
  --list-profiles-all      Include agentic suites and external benchmark notes
  --list-tasks             List lm-eval task names (catalog fallback if not installed)
  --list-tasks-all         Full lm-eval task list
  --profile NAME           Shorthand for --config (reasoning, code, smoke, …)
  --config PATH            YAML config (tasks, seed, limit, …)
  --preset KEY             models.yaml preset (base, LoRA, merged)
  --model PATH             HF Hub id or merged checkpoint dir
  --adapter PATH           LoRA adapter (alternative to preset adapter_path)
  --tasks NAMES            Override task list
  --num-fewshot N
  --limit N                Cap samples per task
  --seed N
  --batch-size VALUE
  --device MAP
  --dtype TYPE
  --output-dir DIR         Default: results/lm_eval
  --experiment-name TAG    Run folder name
  --compare-to PATH        Baseline results.json for delta table
Results
Each run writes to <output_dir>/<experiment_name>/:
| File | Contents |
|---|---|
| results.json | lm-eval native payload + run_meta |
| summary.md | Task → metric table |
| run_meta.json | Preset, base model, adapter, tasks, seed |
| comparison.md | Delta table (when --compare-to set) |
PEFT / LoRA
lm-eval expects pretrained=<base>,peft=<adapter>. The preset resolver handles this for keys like minicpm5-1b-lesson-lora. Merged checkpoints use --preset minicpm5-1b-lesson-merged or --model ./models/finetuned/...-merged.
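The `pretrained=<base>,peft=<adapter>` string is a comma-separated list of key=value pairs. A minimal sketch of how such a string could be assembled follows; `build_model_args` is a hypothetical helper, not the preset resolver itself, and the `trust_remote_code` flag is included only because the config table marks it as required for some presets.

```python
from typing import Optional


def build_model_args(base: str, adapter: Optional[str] = None,
                     trust_remote_code: bool = False) -> str:
    """Assemble an lm-eval style model_args string for a base model and optional adapter."""
    parts = [f"pretrained={base}"]
    if adapter:
        parts.append(f"peft={adapter}")  # LoRA adapter applied on top of the base model
    if trust_remote_code:
        parts.append("trust_remote_code=True")
    return ",".join(parts)


print(build_model_args("openbmb/MiniCPM5-1B", "./models/finetuned/minicpm5-1b-lora"))
# pretrained=openbmb/MiniCPM5-1B,peft=./models/finetuned/minicpm5-1b-lora
```

A merged checkpoint needs no `peft=` entry, which is why it can be passed directly through --model.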
Adding a custom benchmark
1. Create src/slm_evals/benchmarks/my_bench.py subclassing BaseBenchmark:
   - load_dataset() → list of sample dicts
   - build_prompt(sample) → prompt string
   - evaluate_sample(sample, prediction) → {passed, score, note}
2. Register in src/slm_evals/run_benchmark.py → BENCHMARK_REGISTRY.
3. Run:
uv run --package slm-evals slm-benchmark \
--model ./my-model --benchmarks my_bench --max-samples 10
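The steps above can be sketched as follows. Because the real `BaseBenchmark` and `BENCHMARK_REGISTRY` are not shown on this page, the sketch defines stand-ins for both so it runs on its own; in the repo you would import the real base class instead, and its constructor or method signatures may differ from what is assumed here. The toy dataset is invented for illustration.

```python
from abc import ABC, abstractmethod


class BaseBenchmark(ABC):
    """Stand-in for slm_evals' BaseBenchmark; the real class may differ."""

    @abstractmethod
    def load_dataset(self) -> list:
        ...

    @abstractmethod
    def build_prompt(self, sample: dict) -> str:
        ...

    @abstractmethod
    def evaluate_sample(self, sample: dict, prediction: str) -> dict:
        ...


class MyBench(BaseBenchmark):
    """Toy exact-match benchmark with two hard-coded samples."""

    def load_dataset(self) -> list:
        return [
            {"question": "2 + 2", "answer": "4"},
            {"question": "3 * 3", "answer": "9"},
        ]

    def build_prompt(self, sample: dict) -> str:
        return f"Compute {sample['question']}. Reply with the number only."

    def evaluate_sample(self, sample: dict, prediction: str) -> dict:
        passed = prediction.strip() == sample["answer"]
        note = "" if passed else f"expected {sample['answer']}"
        return {"passed": passed, "score": 1.0 if passed else 0.0, "note": note}


# Hypothetical registry shape: benchmark key -> class.
BENCHMARK_REGISTRY = {"my_bench": MyBench}

bench = BENCHMARK_REGISTRY["my_bench"]()
sample = bench.load_dataset()[0]
print(bench.build_prompt(sample))
print(bench.evaluate_sample(sample, " 4 "))
```

Returning a short `note` on failure keeps the per-sample entries in results.json easy to scan when debugging a new benchmark.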
Suggested workflows
Smoke (CPU/GPU, ~5 min)
uv run --package slm-evals slm-benchmark \
--model openbmb/MiniCPM5-1B \
--benchmarks bfcl \
--max-samples 5 \
--device cpu
Before / after fine-tune
BASE=openbmb/MiniCPM5-1B
ADAPTER=./models/finetuned/minicpm5-1b-lora
for M in "$BASE" "$ADAPTER"; do
uv run --package slm-evals slm-benchmark \
--model "$M" \
--benchmarks bfcl tau_bench \
--max-samples 100 \
--experiment-name "$(basename "$M")__bfcl-tau"
done
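The same loop can be driven from Python, which makes the derived experiment names easy to inspect before anything runs. This is a sketch: `benchmark_command` is a hypothetical helper that only builds the argument list, mirroring the shell loop above, and executing it would require something like `subprocess.run(cmd, check=True)` from the repo root.

```python
from pathlib import PurePosixPath


def benchmark_command(model: str) -> list[str]:
    """Build the slm-benchmark invocation for one model, tagged by its basename."""
    tag = f"{PurePosixPath(model).name}__bfcl-tau"
    return [
        "uv", "run", "--package", "slm-evals", "slm-benchmark",
        "--model", model,
        "--benchmarks", "bfcl", "tau_bench",
        "--max-samples", "100",
        "--experiment-name", tag,
    ]


for model in ("openbmb/MiniCPM5-1B", "./models/finetuned/minicpm5-1b-lora"):
    cmd = benchmark_command(model)
    print(cmd[-1])  # experiment name: MiniCPM5-1B__bfcl-tau, then minicpm5-1b-lora__bfcl-tau
```

Because each name is derived from the model's basename, the two runs write to separate folders under the results root.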
Full experiment (YAML)
Edit research/evals/configs/experiment_001.yaml with your model_path and experiment_name, then:
uv run --package slm-evals slm-benchmark \
--config research/evals/configs/experiment_001.yaml
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| error: --model is required | No --config and no --model | Pass one of them |
| CUDA OOM | Model too large for VRAM | --dtype int4 or --device cpu |
| HF dataset 401 on GAIA test | Gated split | Use split: validation |
| τ-bench hangs / costs | LLM user enabled | Set use_llm_user: false |
| Empty results/ | Wrong cwd | Run from repo root or use absolute output_dir |
| Import errors | Evals group not synced | uv sync --group evals |
Entry points
| Path | Role |
|---|---|
| slm-benchmark | Agentic benchmarks (BFCL, τ-bench, GAIA, SWE) |
| slm-lm-eval | Academic benchmarks via lm-evaluation-harness |
| python -m slm_evals.run_benchmark | Same as slm-benchmark |
| python -m slm_evals.run_lm_eval | Same as slm-lm-eval |
| research/evals/run_benchmark.py | Thin shim for backward compatibility |