# Evals usage

Run the **SLM Agentic Benchmark Suite** (`slm-evals`) against a local HuggingFace model directory or Hub id.

Benchmark details: [docs/benchmarks.md](docs/benchmarks.md). Package overview: [README.md](README.md).

## Install

From the repo root:

```bash
uv sync --group evals
```

For academic benchmarks (lm-evaluation-harness):

```bash
uv sync --group lm-eval
```

This installs the `slm-evals` workspace package and registers the `slm-benchmark` and `slm-lm-eval` console scripts.

## Quick start

```bash
# Two benchmarks, capped samples (good first run)
uv run --package slm-evals slm-benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# All four benchmarks
uv run --package slm-evals slm-benchmark \
  --model ./models/finetuned/minicpm5-1b-lora \
  --benchmarks all \
  --max-samples 50

# Equivalent module invocation
uv run --package slm-evals python -m slm_evals.run_benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl \
  --max-samples 10
```

## Config-driven runs

Copy and edit the template, then pass `--config`:

```bash
cp research/evals/configs/experiment_001.yaml research/evals/configs/my_run.yaml
# edit model_path, benchmarks, max_samples, overrides
uv run --package slm-evals slm-benchmark \
  --config research/evals/configs/my_run.yaml
```

When `--config` is set, **YAML values override CLI flags**. Use configs for reproducible experiment names and per-benchmark settings.
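The precedence rule can be pictured as a plain dictionary merge. This is only a sketch, assuming settings are held as flat dicts: `merge_settings` is a hypothetical helper, not part of `slm-evals`, and the real loader may treat missing or `null` keys differently.

```python
from typing import Any


def merge_settings(cli_args: dict[str, Any], yaml_cfg: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: YAML values win over CLI flags when both are set."""
    merged = dict(cli_args)   # start from the CLI flags
    merged.update(yaml_cfg)   # any key present in the YAML replaces the CLI value
    return merged


# Illustrative values only; the keys mirror the YAML template fields.
cli = {"model_path": "openbmb/MiniCPM5-1B", "max_samples": 20, "device": "cpu"}
cfg = {"model_path": "./models/finetuned/minicpm5-1b-lora", "max_samples": 50}

settings = merge_settings(cli, cfg)
print(settings)
```

In this sketch, a flag that the YAML does not mention (here `device`) keeps its CLI value, while `model_path` and `max_samples` come from the config.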
### Template fields

| Key | Description |
| --- | ----------- |
| `model_path` | Local directory or HF Hub id |
| `device` | `auto`, `cpu`, `cuda`, `cuda:0`, … |
| `dtype` | `float32`, `float16`, `bfloat16`, `int8`, `int4` |
| `max_new_tokens` | Cap per generation (default 512) |
| `temperature` | `0.0` = greedy (recommended for evals) |
| `experiment_name` | Folder name under `output_dir` |
| `output_dir` | Root for results (default `results`) |
| `benchmarks` | List: `bfcl`, `tau_bench`, `gaia`, `swe_bench` |
| `max_samples` | Cap per benchmark; omit or `null` for full split |
| `benchmark_overrides` | Per-benchmark dict (see [docs/benchmarks.md](docs/benchmarks.md)) |

---

## CLI reference

```
slm-benchmark [OPTIONS]

  --list-benchmarks       Show agentic benchmark keys and preset suites
  --model PATH            Local HF dir or Hub id (required unless --config)
  --benchmarks NAMES      bfcl tau_bench gaia swe_bench all (default: all)
  --config PATH           YAML config (overrides other flags)
  --max-samples N         Cap samples per benchmark
  --output-dir DIR        Results root (default: ./results)
  --experiment-name TAG   Run folder name (auto timestamp if omitted)
  --device MAP            auto | cpu | cuda | cuda:0
  --dtype TYPE            float32 | float16 | bfloat16 | int8 | int4
  --max-new-tokens N      Default 512
  --temperature T         Default 0.0
```

---

## Results

Each run writes to `<output_dir>/<experiment_name>/`:

| File | Contents |
| ---- | -------- |
| `results.json` | Full structured payload (per-sample + aggregates) |
| `results.csv` | One row per benchmark |
| `report.md` | Human-readable summary |

Example layout:

```text
results/
└── minicpm5-1b__bfcl-tau__v1/
    ├── results.json
    ├── results.csv
    └── report.md
```

`output_dir` is relative to the **current working directory**. Run from the repo root so paths stay predictable, or set an absolute `output_dir` in YAML.
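To inspect a finished run from Python, something like the following may help. It is a sketch that relies only on the file names and folder layout described above; `run_dir` and `load_summary_rows` are hypothetical helpers, and since the columns of `results.csv` are not documented here, the code prints whatever header the file has.

```python
import csv
from pathlib import Path


def run_dir(output_dir: str, experiment_name: str) -> Path:
    """Resolve the run folder (experiment name under output_dir) against the cwd.

    An absolute output_dir is used as-is, because joining onto Path.cwd()
    with an absolute path discards the cwd part.
    """
    return (Path.cwd() / output_dir / experiment_name).resolve()


def load_summary_rows(directory: Path) -> list[dict[str, str]]:
    """Read results.csv (one row per benchmark) without assuming column names."""
    with (directory / "results.csv").open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Folder name taken from the example layout; adjust to your own run.
    target = run_dir("results", "minicpm5-1b__bfcl-tau__v1")
    if (target / "results.csv").exists():
        for row in load_summary_rows(target):
            print(row)
    else:
        print(f"No results.csv under {target}; check the working directory.")
```

Printing the resolved path makes a wrong working directory easy to spot when the results folder looks empty.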
---

## Per-benchmark tips

### BFCL (function calling)

- Default: downloads from `gorilla-llm/Berkeley-Function-Calling-Leaderboard`
- `strict: false` in YAML enables fuzzy argument matching (better for small models)
- Local JSONL: set `benchmark_overrides.bfcl.data_path`

### τ-bench (multi-turn tools)

- Domains: `retail`, `airline`, or `both`
- `use_llm_user: false` uses the free rule-based user simulator (default)
- `use_llm_user: true` uses a GPT-4o user agent (**API cost**)

### GAIA

- Default split: `validation` (public)
- `tool_mode: describe` uses offline tool descriptions (no live web)
- Level filter: `levels: [1, 2]` or `[1, 2, 3]`

### SWE-bench Verified

- Default: lightweight patch-generation scoring (no Docker)
- `full_eval: true` runs the official harness (`pip install swebench docker`)

See [docs/benchmarks.md](docs/benchmarks.md) for scoring semantics.

---

## lm-evaluation-harness (`slm-lm-eval`)

Run standard academic benchmarks (ARC, HellaSwag, PIQA, BoolQ, GSM8K) via [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
Install: `uv sync --group lm-eval`

Full profile guide: [docs/eval_profiles.md](docs/eval_profiles.md)

### Discover profiles and tasks

```bash
# Claim-matched lm-eval profiles (reasoning, code, smoke, …)
uv run --package slm-evals slm-lm-eval --list-profiles

# Also show agentic suites + external benchmark notes
uv run --package slm-evals slm-lm-eval --list-profiles-all

# lm-eval task names
uv run --package slm-evals slm-lm-eval --list-tasks

# Agentic benchmarks (BFCL, τ-bench, GAIA, SWE)
uv run --package slm-evals slm-benchmark --list-benchmarks
```

### Quick start

```bash
# By profile name (recommended)
uv run --package slm-evals slm-lm-eval \
  --profile reasoning \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__reasoning-baseline

# Smoke profile (25 samples)
uv run --package slm-evals slm-lm-eval \
  --profile smoke \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__smoke

# LoRA adapter via preset (base + peft resolved automatically)
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_minicpm5.yaml \
  --preset minicpm5-1b-lesson-lora \
  --experiment-name minicpm5-1b-lora__v1

# Explicit base + adapter
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_smoke.yaml \
  --model openbmb/MiniCPM5-1B \
  --adapter ./models/finetuned/minicpm5-1b-lora \
  --experiment-name minicpm5-1b-lora__manual
```

### Compare baseline vs candidate

Use the **same config** for both runs; only change `--preset` / `--experiment-name`:

```bash
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_compare_study.yaml \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__baseline

uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_compare_study.yaml \
  --preset minicpm5-1b-lesson-lora \
  --experiment-name minicpm5-1b-lora__v1 \
  --compare-to results/lm_eval/minicpm5-1b__baseline/results.json
```

### Config templates

Catalog: `configs/eval_profiles.yaml`, which maps **claim → profile →
tasks**.

| Profile (`--profile`) | Config file | Purpose |
| --------------------- | ----------- | ------- |
| `smoke` | `lm_eval_smoke.yaml` | Fast validation (`limit: 25`, 2 tasks) |
| `reasoning` | `lm_eval_reasoning.yaml` | Math + commonsense (GSM8K, ARC, HellaSwag) |
| `understanding` | `lm_eval_understanding.yaml` | NLU (BoolQ, PIQA, COPA, RTE) |
| `code` | `lm_eval_code.yaml` | HumanEval + MBPP |
| `instructions` | `lm_eval_instructions.yaml` | IFEval instruction following |
| `general_slm` | `lm_eval_minicpm5.yaml` | Full ~1B SLM profile (6 tasks) |
| `compare_study` | `lm_eval_compare_study.yaml` | Baseline vs finetune comparison defaults |

| Key | Description |
| --- | ----------- |
| `tasks` | lm-eval task names (e.g. `arc_easy`, `gsm8k`) |
| `num_fewshot` | Few-shot count (gsm8k may use task default 8) |
| `limit` | Max samples per task; `null` = full split |
| `seed` | Random seed (applied to all lm-eval RNGs) |
| `batch_size` | `auto` or integer |
| `device` | `auto`, `cpu`, `cuda`, … |
| `dtype` | `bfloat16`, `float16`, `int4`, … |
| `trust_remote_code` | Required for MiniCPM / Gemma presets |
| `output_dir` | Root for runs (default `results/lm_eval`) |

### CLI reference

```
slm-lm-eval [OPTIONS]

  --list-profiles         Show claim-matched profiles and example commands
  --list-profiles-all     Include agentic suites and external benchmark notes
  --list-tasks            List lm-eval task names (catalog fallback if not installed)
  --list-tasks-all        Full lm-eval task list
  --profile NAME          Shorthand for --config (reasoning, code, smoke, …)
  --config PATH           YAML config (tasks, seed, limit, …)
  --preset KEY            models.yaml preset (base, LoRA, merged)
  --model PATH            HF Hub id or merged checkpoint dir
  --adapter PATH          LoRA adapter (alternative to preset adapter_path)
  --tasks NAMES           Override task list
  --num-fewshot N
  --limit N               Cap samples per task
  --seed N
  --batch-size VALUE
  --device MAP
  --dtype TYPE
  --output-dir DIR        Default: results/lm_eval
  --experiment-name TAG   Run folder name
  --compare-to PATH       Baseline results.json for delta table
```

### Results

Each run writes to `<output_dir>/<experiment_name>/`:

| File | Contents |
| ---- | -------- |
| `results.json` | lm-eval native payload + `run_meta` |
| `summary.md` | Task → metric table |
| `run_meta.json` | Preset, base model, adapter, tasks, seed |
| `comparison.md` | Delta table (when `--compare-to` set) |

### PEFT / LoRA

lm-eval expects `pretrained=<base>,peft=<adapter>`. The preset resolver handles this for keys like `minicpm5-1b-lesson-lora`. Merged checkpoints use `--preset minicpm5-1b-lesson-merged` or `--model ./models/finetuned/...-merged`.

---

## Adding a custom benchmark

1. Create `src/slm_evals/benchmarks/my_bench.py` subclassing `BaseBenchmark`:
   - `load_dataset()` → list of sample dicts
   - `build_prompt(sample)` → prompt string
   - `evaluate_sample(sample, prediction)` → `{passed, score, note}`
2. Register in `src/slm_evals/run_benchmark.py` → `BENCHMARK_REGISTRY`.
3. Run:

   ```bash
   uv run --package slm-evals slm-benchmark \
     --model ./my-model --benchmarks my_bench --max-samples 10
   ```

---

## Suggested workflows

### Smoke (CPU/GPU, ~5 min)

```bash
uv run --package slm-evals slm-benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl \
  --max-samples 5 \
  --device cpu
```

### Before / after fine-tune

```bash
BASE=openbmb/MiniCPM5-1B
ADAPTER=./models/finetuned/minicpm5-1b-lora

for M in "$BASE" "$ADAPTER"; do
  uv run --package slm-evals slm-benchmark \
    --model "$M" \
    --benchmarks bfcl tau_bench \
    --max-samples 100 \
    --experiment-name "$(basename "$M")__bfcl-tau"
done
```

### Full experiment (YAML)

Edit `configs/experiment_001.yaml` with your `model_path` and `experiment_name`, then:

```bash
uv run --package slm-evals slm-benchmark \
  --config research/evals/configs/experiment_001.yaml
```

---

## Troubleshooting

| Symptom | Likely cause | Fix |
| ------- | ------------ | --- |
| `error: --model is required` | No `--config` and no `--model` | Pass one of them |
| CUDA OOM | Model too large for VRAM | `--dtype int4` or `--device cpu` |
| HF dataset 401 on GAIA test | Gated split | Use `split: validation` |
| τ-bench hangs / costs | LLM user enabled | Set `use_llm_user: false` |
| Empty `results/` | Wrong cwd | Run from repo root or use absolute `output_dir` |
| Import errors | Evals group not synced | `uv sync --group evals` |

---

## Entry points

| Path | Role |
| ---- | ---- |
| `slm-benchmark` | Agentic benchmarks (BFCL, τ-bench, GAIA, SWE) |
| `slm-lm-eval` | Academic benchmarks via lm-evaluation-harness |
| `python -m slm_evals.run_benchmark` | Same as `slm-benchmark` |
| `python -m slm_evals.run_lm_eval` | Same as `slm-lm-eval` |
| `research/evals/run_benchmark.py` | Thin shim for backward compatibility |