| # Evals usage | |
| Run the **SLM Agentic Benchmark Suite** (`slm-evals`) against a local HuggingFace model directory or Hub id. | |
| Benchmark details: [docs/benchmarks.md](docs/benchmarks.md). Package overview: [README.md](README.md). | |
| ## Install | |
| From the repo root: | |
| ```bash | |
| uv sync --group evals | |
| ``` | |
| For academic benchmarks (lm-evaluation-harness): | |
| ```bash | |
| uv sync --group lm-eval | |
| ``` | |
| This installs the `slm-evals` workspace package and registers the `slm-benchmark` and `slm-lm-eval` console scripts. | |
| ## Quick start | |
| ```bash | |
| # Two benchmarks, capped samples (good first run) | |
| uv run --package slm-evals slm-benchmark \ | |
| --model openbmb/MiniCPM5-1B \ | |
| --benchmarks bfcl tau_bench \ | |
| --max-samples 20 | |
| # All four benchmarks | |
| uv run --package slm-evals slm-benchmark \ | |
| --model ./models/finetuned/minicpm5-1b-lora \ | |
| --benchmarks all \ | |
| --max-samples 50 | |
| # Equivalent module invocation | |
| uv run --package slm-evals python -m slm_evals.run_benchmark \ | |
| --model openbmb/MiniCPM5-1B \ | |
| --benchmarks bfcl \ | |
| --max-samples 10 | |
| ``` | |
| ## Config-driven runs | |
| Copy and edit the template, then pass `--config`: | |
| ```bash | |
| cp research/evals/configs/experiment_001.yaml research/evals/configs/my_run.yaml | |
| # edit model_path, benchmarks, max_samples, overrides | |
| uv run --package slm-evals slm-benchmark \ | |
| --config research/evals/configs/my_run.yaml | |
| ``` | |
| When `--config` is set, **YAML values override CLI flags**. Use configs for reproducible experiment names and per-benchmark settings. | |
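The precedence rule above can be pictured as a dictionary merge in which YAML keys win. This is only a sketch of the described behavior, not the package's actual code: `merge_settings` and the sample values are hypothetical, and whether CLI flags absent from the YAML file still apply is an assumption.

```python
def merge_settings(cli_flags: dict, yaml_values: dict) -> dict:
    """Return effective settings, letting YAML values win over CLI flags.

    Hypothetical helper: assumes keys missing from the YAML file
    fall back to whatever was passed on the command line.
    """
    merged = dict(cli_flags)
    # update() overwrites existing keys, so YAML takes precedence.
    merged.update(yaml_values)
    return merged


# Illustrative inputs only.
cli = {"model_path": "openbmb/MiniCPM5-1B", "max_samples": 20, "device": "auto"}
cfg = {"max_samples": 50, "experiment_name": "my_run"}

effective = merge_settings(cli, cfg)
print(effective)
```

Here `max_samples` ends up as the YAML value, while `device` keeps the CLI value because the YAML file does not mention it.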
| ### Template fields | |
| | Key | Description | | |
| | --- | ----------- | | |
| | `model_path` | Local directory or HF Hub id | | |
| `device` | `auto`, `cpu`, `cuda`, `cuda:0`, … | |
| | `dtype` | `float32`, `float16`, `bfloat16`, `int8`, `int4` | | |
| | `max_new_tokens` | Cap per generation (default 512) | | |
| | `temperature` | `0.0` = greedy (recommended for evals) | | |
| | `experiment_name` | Folder name under `output_dir` | | |
| | `output_dir` | Root for results (default `results`) | | |
| | `benchmarks` | List: `bfcl`, `tau_bench`, `gaia`, `swe_bench` | | |
| | `max_samples` | Cap per benchmark; omit or `null` for full split | | |
| | `benchmark_overrides` | Per-benchmark dict (see [docs/benchmarks.md](docs/benchmarks.md)) | | |
| --- | |
| ## CLI reference | |
| ``` | |
| slm-benchmark [OPTIONS] | |
|   --list-benchmarks      Show agentic benchmark keys and preset suites | |
|   --model PATH           Local HF dir or Hub id (required unless --config) | |
|   --benchmarks NAMES     bfcl tau_bench gaia swe_bench all (default: all) | |
|   --config PATH          YAML config (overrides other flags) | |
|   --max-samples N        Cap samples per benchmark | |
|   --output-dir DIR       Results root (default: ./results) | |
|   --experiment-name TAG  Run folder name (auto timestamp if omitted) | |
|   --device MAP           auto | cpu | cuda | cuda:0 | |
|   --dtype TYPE           float32 | float16 | bfloat16 | int8 | int4 | |
|   --max-new-tokens N     Default 512 | |
|   --temperature T        Default 0.0 | |
| ``` | |
| --- | |
| ## Results | |
| Each run writes to `<output_dir>/<experiment_name>/`: | |
| | File | Contents | | |
| | ---- | -------- | | |
| | `results.json` | Full structured payload (per-sample + aggregates) | | |
| | `results.csv` | One row per benchmark | | |
| | `report.md` | Human-readable summary | | |
| Example layout: | |
| ```text | |
| results/ | |
| └── minicpm5-1b__bfcl-tau__v1/ | |
|     ├── results.json | |
|     ├── results.csv | |
|     └── report.md | |
| ``` | |
| `output_dir` is resolved relative to the **current working directory**. Run from the repo root so paths stay predictable, or set an absolute `output_dir` in YAML. | |
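To make the path rule concrete, the following sketch resolves a run folder the way the text describes. `resolve_run_dir` is a hypothetical helper, not part of `slm-evals`, and `PurePosixPath` is used only so the printed paths are the same on every platform.

```python
from pathlib import PurePosixPath


def resolve_run_dir(output_dir: str, experiment_name: str, cwd: PurePosixPath) -> PurePosixPath:
    """Mirror the <output_dir>/<experiment_name>/ layout (hypothetical helper)."""
    base = PurePosixPath(output_dir)
    # A relative output_dir is anchored at the working directory;
    # an absolute one is used unchanged.
    if not base.is_absolute():
        base = cwd / base
    return base / experiment_name


# Placeholder directories for illustration.
repo_root = PurePosixPath("/work/repo")
relative_run = resolve_run_dir("results", "minicpm5-1b__bfcl-tau__v1", repo_root)
absolute_run = resolve_run_dir("/data/evals", "minicpm5-1b__bfcl-tau__v1", repo_root)
print(relative_run)
print(absolute_run)
```

Launching the same command from a different directory changes only the first result, which is why an absolute `output_dir` is the safer choice for scripted runs.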
| --- | |
| ## Per-benchmark tips | |
| ### BFCL (function calling) | |
| - Default: downloads from `gorilla-llm/Berkeley-Function-Calling-Leaderboard` | |
| - `strict: false` in YAML → fuzzy argument matching (better for small models) | |
| - Local JSONL: set `benchmark_overrides.bfcl.data_path` | |
| ### τ-bench (multi-turn tools) | |
| - Domains: `retail`, `airline`, or `both` | |
| - `use_llm_user: false` → free rule-based user simulator (default) | |
| - `use_llm_user: true` → GPT-4o user agent (**API cost**) | |
| ### GAIA | |
| - Default split: `validation` (public) | |
| - `tool_mode: describe` → offline tool descriptions (no live web) | |
| - Level filter: `levels: [1, 2]` or `[1, 2, 3]` | |
| ### SWE-bench Verified | |
| - Default: lightweight patch-generation scoring (no Docker) | |
| - `full_eval: true` → official harness (`pip install swebench docker`) | |
| See [docs/benchmarks.md](docs/benchmarks.md) for scoring semantics. | |
| --- | |
| ## lm-evaluation-harness (`slm-lm-eval`) | |
| Run standard academic benchmarks (ARC, HellaSwag, PIQA, BoolQ, GSM8K) via [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). | |
| Install: `uv sync --group lm-eval` | |
| Full profile guide: [docs/eval_profiles.md](docs/eval_profiles.md) | |
| ### Discover profiles and tasks | |
| ```bash | |
| # Claim-matched lm-eval profiles (reasoning, code, smoke, …) | |
| uv run --package slm-evals slm-lm-eval --list-profiles | |
| # Also show agentic suites + external benchmark notes | |
| uv run --package slm-evals slm-lm-eval --list-profiles-all | |
| # lm-eval task names | |
| uv run --package slm-evals slm-lm-eval --list-tasks | |
| # Agentic benchmarks (BFCL, τ-bench, GAIA, SWE) | |
| uv run --package slm-evals slm-benchmark --list-benchmarks | |
| ``` | |
| ### Quick start | |
| ```bash | |
| # By profile name (recommended) | |
| uv run --package slm-evals slm-lm-eval \ | |
| --profile reasoning \ | |
| --preset minicpm5-1b \ | |
| --experiment-name minicpm5-1b__reasoning-baseline | |
| # Smoke profile (25 samples) | |
| uv run --package slm-evals slm-lm-eval \ | |
| --profile smoke \ | |
| --preset minicpm5-1b \ | |
| --experiment-name minicpm5-1b__smoke | |
| # LoRA adapter via preset (base + peft resolved automatically) | |
| uv run --package slm-evals slm-lm-eval \ | |
| --config research/evals/configs/lm_eval_minicpm5.yaml \ | |
| --preset minicpm5-1b-lesson-lora \ | |
| --experiment-name minicpm5-1b-lora__v1 | |
| # Explicit base + adapter | |
| uv run --package slm-evals slm-lm-eval \ | |
| --config research/evals/configs/lm_eval_smoke.yaml \ | |
| --model openbmb/MiniCPM5-1B \ | |
| --adapter ./models/finetuned/minicpm5-1b-lora \ | |
| --experiment-name minicpm5-1b-lora__manual | |
| ``` | |
| ### Compare baseline vs candidate | |
| Use the **same config** for both runs; only change `--preset` / `--experiment-name`: | |
| ```bash | |
| uv run --package slm-evals slm-lm-eval \ | |
| --config research/evals/configs/lm_eval_compare_study.yaml \ | |
| --preset minicpm5-1b \ | |
| --experiment-name minicpm5-1b__baseline | |
| uv run --package slm-evals slm-lm-eval \ | |
| --config research/evals/configs/lm_eval_compare_study.yaml \ | |
| --preset minicpm5-1b-lesson-lora \ | |
| --experiment-name minicpm5-1b-lora__v1 \ | |
| --compare-to results/lm_eval/minicpm5-1b__baseline/results.json | |
| ``` | |
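The delta table produced by `--compare-to` boils down to a per-task difference between two runs. The sketch below shows that arithmetic on flat `task -> score` dictionaries; the flat shape, the `metric_deltas` helper, and the numbers are all assumptions for illustration and do not reflect the real `results.json` schema or any measured result.

```python
def metric_deltas(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for every task present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    # Round to suppress floating-point noise in the displayed delta.
    return {task: round(candidate[task] - baseline[task], 4) for task in shared}


# Synthetic placeholder scores, not measured results.
baseline = {"arc_easy": 0.50, "gsm8k": 0.25}
candidate = {"arc_easy": 0.55, "gsm8k": 0.20, "hellaswag": 0.40}

deltas = metric_deltas(baseline, candidate)
for task, delta in deltas.items():
    print(f"{task:<10} {delta:+.4f}")
```

Tasks that appear in only one run are skipped, which is one more reason to keep the config identical across the two runs.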
| ### Config templates | |
| Catalog: `configs/eval_profiles.yaml` maps **claim → profile → tasks**. | |
| | Profile (`--profile`) | Config file | Purpose | | |
| | --------------------- | ----------- | ------- | | |
| | `smoke` | `lm_eval_smoke.yaml` | Fast validation (`limit: 25`, 2 tasks) | | |
| | `reasoning` | `lm_eval_reasoning.yaml` | Math + commonsense (GSM8K, ARC, HellaSwag) | | |
| | `understanding` | `lm_eval_understanding.yaml` | NLU (BoolQ, PIQA, COPA, RTE) | | |
| | `code` | `lm_eval_code.yaml` | HumanEval + MBPP | | |
| | `instructions` | `lm_eval_instructions.yaml` | IFEval instruction following | | |
| | `general_slm` | `lm_eval_minicpm5.yaml` | Full ~1B SLM profile (6 tasks) | | |
| | `compare_study` | `lm_eval_compare_study.yaml` | Baseline vs finetune comparison defaults | | |
| | Key | Description | | |
| | --- | ----------- | | |
| | `tasks` | lm-eval task names (e.g. `arc_easy`, `gsm8k`) | | |
| | `num_fewshot` | Few-shot count (gsm8k may use task default 8) | | |
| | `limit` | Max samples per task; `null` = full split | | |
| | `seed` | Random seed (applied to all lm-eval RNGs) | | |
| | `batch_size` | `auto` or integer | | |
| `device` | `auto`, `cpu`, `cuda`, … | |
| `dtype` | `bfloat16`, `float16`, `int4`, … | |
| | `trust_remote_code` | Required for MiniCPM / Gemma presets | | |
| | `output_dir` | Root for runs (default `results/lm_eval`) | | |
| ### CLI reference | |
| ``` | |
| slm-lm-eval [OPTIONS] | |
|   --list-profiles        Show claim-matched profiles and example commands | |
|   --list-profiles-all    Include agentic suites and external benchmark notes | |
|   --list-tasks           List lm-eval task names (catalog fallback if not installed) | |
|   --list-tasks-all       Full lm-eval task list | |
|   --profile NAME         Shorthand for --config (reasoning, code, smoke, …) | |
|   --config PATH          YAML config (tasks, seed, limit, …) | |
|   --preset KEY           models.yaml preset (base, LoRA, merged) | |
|   --model PATH           HF Hub id or merged checkpoint dir | |
|   --adapter PATH         LoRA adapter (alternative to preset adapter_path) | |
|   --tasks NAMES          Override task list | |
|   --num-fewshot N | |
|   --limit N              Cap samples per task | |
|   --seed N | |
|   --batch-size VALUE | |
|   --device MAP | |
|   --dtype TYPE | |
|   --output-dir DIR       Default: results/lm_eval | |
|   --experiment-name TAG  Run folder name | |
|   --compare-to PATH      Baseline results.json for delta table | |
| ``` | |
| ### Results | |
| Each run writes to `<output_dir>/<experiment_name>/`: | |
| | File | Contents | | |
| | ---- | -------- | | |
| | `results.json` | lm-eval native payload + `run_meta` | | |
| `summary.md` | Task → metric table | |
| | `run_meta.json` | Preset, base model, adapter, tasks, seed | | |
| | `comparison.md` | Delta table (when `--compare-to` set) | | |
| ### PEFT / LoRA | |
| lm-eval expects `pretrained=<base>,peft=<adapter>`. The preset resolver handles this for keys like `minicpm5-1b-lesson-lora`. Merged checkpoints use `--preset minicpm5-1b-lesson-merged` or `--model ./models/finetuned/...-merged`. | |
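As a rough illustration of the `pretrained=<base>,peft=<adapter>` form, the sketch below assembles such a string. `build_model_args` is a hypothetical helper and not the preset resolver itself; the merged-checkpoint path is a placeholder.

```python
from typing import Optional


def build_model_args(base: str, adapter: Optional[str] = None) -> str:
    """Build an lm-eval style model_args string (hypothetical helper)."""
    parts = [f"pretrained={base}"]
    # A merged checkpoint needs no adapter, so peft= is added only when given.
    if adapter:
        parts.append(f"peft={adapter}")
    return ",".join(parts)


lora_args = build_model_args("openbmb/MiniCPM5-1B", "./models/finetuned/minicpm5-1b-lora")
merged_args = build_model_args("./path/to/merged-checkpoint")  # placeholder path
print(lora_args)
print(merged_args)
```

The base-plus-adapter case yields both keys, while a merged checkpoint is passed as a plain `pretrained=` value.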
| --- | |
| ## Adding a custom benchmark | |
| 1. Create `src/slm_evals/benchmarks/my_bench.py` subclassing `BaseBenchmark`: | |
| - `load_dataset()` → list of sample dicts | |
| - `build_prompt(sample)` → prompt string | |
| - `evaluate_sample(sample, prediction)` → `{passed, score, note}` | |
| 2. Register in `src/slm_evals/run_benchmark.py` → `BENCHMARK_REGISTRY`. | |
| 3. Run: | |
| ```bash | |
| uv run --package slm-evals slm-benchmark \ | |
| --model ./my-model --benchmarks my_bench --max-samples 10 | |
| ``` | |
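The three methods listed in step 1 could look roughly like the toy benchmark below. The `BaseBenchmark` class here is a stand-in written for this example so the snippet runs on its own; the real base class in `slm_evals` may require a constructor, extra methods, or different return types, so treat every signature as an assumption.

```python
from abc import ABC, abstractmethod


class BaseBenchmark(ABC):
    """Stand-in for the slm_evals base class (assumed interface, not the real one)."""

    @abstractmethod
    def load_dataset(self) -> list: ...

    @abstractmethod
    def build_prompt(self, sample: dict) -> str: ...

    @abstractmethod
    def evaluate_sample(self, sample: dict, prediction: str) -> dict: ...


class MyBench(BaseBenchmark):
    """Toy exact-match benchmark over two hard-coded arithmetic samples."""

    def load_dataset(self) -> list:
        return [
            {"question": "2 + 2", "answer": "4"},
            {"question": "3 * 3", "answer": "9"},
        ]

    def build_prompt(self, sample: dict) -> str:
        return f"Compute {sample['question']}. Reply with the number only."

    def evaluate_sample(self, sample: dict, prediction: str) -> dict:
        # Exact match after trimming surrounding whitespace.
        passed = prediction.strip() == sample["answer"]
        return {
            "passed": passed,
            "score": 1.0 if passed else 0.0,
            "note": "" if passed else f"expected {sample['answer']!r}",
        }


bench = MyBench()
sample = bench.load_dataset()[0]
prompt = bench.build_prompt(sample)
good = bench.evaluate_sample(sample, " 4\n")
bad = bench.evaluate_sample(sample, "five")
print(prompt)
print(good)
print(bad)
```

In the actual package, the stand-in class would be replaced by an import of the real `BaseBenchmark`, and the class would still need an entry in `BENCHMARK_REGISTRY` under the key passed to `--benchmarks`.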
| --- | |
| ## Suggested workflows | |
| ### Smoke (CPU/GPU, ~5 min) | |
| ```bash | |
| uv run --package slm-evals slm-benchmark \ | |
| --model openbmb/MiniCPM5-1B \ | |
| --benchmarks bfcl \ | |
| --max-samples 5 \ | |
| --device cpu | |
| ``` | |
| ### Before / after fine-tune | |
| ```bash | |
| BASE=openbmb/MiniCPM5-1B | |
| ADAPTER=./models/finetuned/minicpm5-1b-lora | |
| for M in "$BASE" "$ADAPTER"; do | |
| uv run --package slm-evals slm-benchmark \ | |
| --model "$M" \ | |
| --benchmarks bfcl tau_bench \ | |
| --max-samples 100 \ | |
| --experiment-name "$(basename "$M")__bfcl-tau" | |
| done | |
| ``` | |
| ### Full experiment (YAML) | |
| Edit `research/evals/configs/experiment_001.yaml` with your `model_path` and `experiment_name`, then: | |
| ```bash | |
| uv run --package slm-evals slm-benchmark \ | |
| --config research/evals/configs/experiment_001.yaml | |
| ``` | |
| --- | |
| ## Troubleshooting | |
| | Symptom | Likely cause | Fix | | |
| | ------- | ------------ | --- | | |
| | `error: --model is required` | No `--config` and no `--model` | Pass one of them | | |
| | CUDA OOM | Model too large for VRAM | `--dtype int4` or `--device cpu` | | |
| | HF dataset 401 on GAIA test | Gated split | Use `split: validation` | | |
| τ-bench hangs / costs | LLM user enabled | Set `use_llm_user: false` | |
| | Empty `results/` | Wrong cwd | Run from repo root or use absolute `output_dir` | | |
| | Import errors | Evals group not synced | `uv sync --group evals` | | |
| --- | |
| ## Entry points | |
| | Entry point | Role | | |
| | ---- | ---- | | |
| `slm-benchmark` | Agentic benchmarks (BFCL, τ-bench, GAIA, SWE) | |
| | `slm-lm-eval` | Academic benchmarks via lm-evaluation-harness | | |
| | `python -m slm_evals.run_benchmark` | Same as `slm-benchmark` | | |
| | `python -m slm_evals.run_lm_eval` | Same as `slm-lm-eval` | | |
| | `research/evals/run_benchmark.py` | Thin shim for backward compatibility | | |