# Evals usage

Run the **SLM Agentic Benchmark Suite** (`slm-evals`) against a local HuggingFace model directory or Hub id.

Benchmark details: [docs/benchmarks.md](docs/benchmarks.md). Package overview: [README.md](README.md).

## Install

From the repo root:

```bash
uv sync --group evals
```

For academic benchmarks (lm-evaluation-harness):

```bash
uv sync --group lm-eval
```

This installs the `slm-evals` workspace package and registers the `slm-benchmark` and `slm-lm-eval` console scripts.

## Quick start

```bash
# Two benchmarks, capped samples (good first run)
uv run --package slm-evals slm-benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# All four benchmarks
uv run --package slm-evals slm-benchmark \
  --model ./models/finetuned/minicpm5-1b-lora \
  --benchmarks all \
  --max-samples 50

# Equivalent module invocation
uv run --package slm-evals python -m slm_evals.run_benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl \
  --max-samples 10
```

## Config-driven runs

Copy and edit the template, then pass `--config`:

```bash
cp research/evals/configs/experiment_001.yaml research/evals/configs/my_run.yaml
# edit model_path, benchmarks, max_samples, overrides

uv run --package slm-evals slm-benchmark \
  --config research/evals/configs/my_run.yaml
```

When `--config` is set, **YAML values override CLI flags**. Use configs for reproducible experiment names and per-benchmark settings.

### Template fields

| Key | Description |
| --- | ----------- |
| `model_path` | Local directory or HF Hub id |
| `device` | `auto`, `cpu`, `cuda`, `cuda:0`, … |
| `dtype` | `float32`, `float16`, `bfloat16`, `int8`, `int4` |
| `max_new_tokens` | Cap per generation (default 512) |
| `temperature` | `0.0` = greedy (recommended for evals) |
| `experiment_name` | Folder name under `output_dir` |
| `output_dir` | Root for results (default `results`) |
| `benchmarks` | List: `bfcl`, `tau_bench`, `gaia`, `swe_bench` |
| `max_samples` | Cap per benchmark; omit or `null` for full split |
| `benchmark_overrides` | Per-benchmark dict (see [docs/benchmarks.md](docs/benchmarks.md)) |

---

## CLI reference

```
slm-benchmark [OPTIONS]

--list-benchmarks       Show agentic benchmark keys and preset suites
--model PATH            Local HF dir or Hub id (required unless --config)
--benchmarks NAMES      bfcl tau_bench gaia swe_bench all  (default: all)
--config PATH           YAML config (overrides other flags)
--max-samples N         Cap samples per benchmark
--output-dir DIR        Results root (default: ./results)
--experiment-name TAG   Run folder name (auto timestamp if omitted)
--device MAP            auto | cpu | cuda | cuda:0
--dtype TYPE            float32 | float16 | bfloat16 | int8 | int4
--max-new-tokens N      Default 512
--temperature T         Default 0.0
```

---

## Results

Each run writes to `<output_dir>/<experiment_name>/`:

| File | Contents |
| ---- | -------- |
| `results.json` | Full structured payload (per-sample + aggregates) |
| `results.csv` | One row per benchmark |
| `report.md` | Human-readable summary |

Example layout:

```text
results/
└── minicpm5-1b__bfcl-tau__v1/
    ├── results.json
    ├── results.csv
    └── report.md
```

`output_dir` is relative to **current working directory**. Run from repo root so paths stay predictable, or set an absolute `output_dir` in YAML.

---

## Per-benchmark tips

### BFCL (function calling)

- Default: downloads from `gorilla-llm/Berkeley-Function-Calling-Leaderboard`
- `strict: false` in YAML — fuzzy argument matching (better for small models)
- Local JSONL: set `benchmark_overrides.bfcl.data_path`

### τ-bench (multi-turn tools)

- Domains: `retail`, `airline`, or `both`
- `use_llm_user: false` — free rule-based user simulator (default)
- `use_llm_user: true` — GPT-4o user agent (**API cost**)

### GAIA

- Default split: `validation` (public)
- `tool_mode: describe` — offline tool descriptions (no live web)
- Level filter: `levels: [1, 2]` or `[1, 2, 3]`

### SWE-bench Verified

- Default: lightweight patch-generation scoring (no Docker)
- `full_eval: true` — official harness (`pip install swebench docker`)

See [docs/benchmarks.md](docs/benchmarks.md) for scoring semantics.

---

## lm-evaluation-harness (`slm-lm-eval`)

Run standard academic benchmarks (ARC, HellaSwag, PIQA, BoolQ, GSM8K) via [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

Install: `uv sync --group lm-eval`

Full profile guide: [docs/eval_profiles.md](docs/eval_profiles.md)

### Discover profiles and tasks

```bash
# Claim-matched lm-eval profiles (reasoning, code, smoke, …)
uv run --package slm-evals slm-lm-eval --list-profiles

# Also show agentic suites + external benchmark notes
uv run --package slm-evals slm-lm-eval --list-profiles-all

# lm-eval task names
uv run --package slm-evals slm-lm-eval --list-tasks

# Agentic benchmarks (BFCL, τ-bench, GAIA, SWE)
uv run --package slm-evals slm-benchmark --list-benchmarks
```

### Quick start

```bash
# By profile name (recommended)
uv run --package slm-evals slm-lm-eval \
  --profile reasoning \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__reasoning-baseline

# Smoke profile (25 samples)
uv run --package slm-evals slm-lm-eval \
  --profile smoke \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__smoke

# LoRA adapter via preset (base + peft resolved automatically)
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_minicpm5.yaml \
  --preset minicpm5-1b-lesson-lora \
  --experiment-name minicpm5-1b-lora__v1

# Explicit base + adapter
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_smoke.yaml \
  --model openbmb/MiniCPM5-1B \
  --adapter ./models/finetuned/minicpm5-1b-lora \
  --experiment-name minicpm5-1b-lora__manual
```

### Compare baseline vs candidate

Use the **same config** for both runs; only change `--preset` / `--experiment-name`:

```bash
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_compare_study.yaml \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__baseline

uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_compare_study.yaml \
  --preset minicpm5-1b-lesson-lora \
  --experiment-name minicpm5-1b-lora__v1 \
  --compare-to results/lm_eval/minicpm5-1b__baseline/results.json
```

### Config templates

Catalog: `configs/eval_profiles.yaml` — maps **claim → profile → tasks**.

| Profile (`--profile`) | Config file | Purpose |
| --------------------- | ----------- | ------- |
| `smoke` | `lm_eval_smoke.yaml` | Fast validation (`limit: 25`, 2 tasks) |
| `reasoning` | `lm_eval_reasoning.yaml` | Math + commonsense (GSM8K, ARC, HellaSwag) |
| `understanding` | `lm_eval_understanding.yaml` | NLU (BoolQ, PIQA, COPA, RTE) |
| `code` | `lm_eval_code.yaml` | HumanEval + MBPP |
| `instructions` | `lm_eval_instructions.yaml` | IFEval instruction following |
| `general_slm` | `lm_eval_minicpm5.yaml` | Full ~1B SLM profile (6 tasks) |
| `compare_study` | `lm_eval_compare_study.yaml` | Baseline vs finetune comparison defaults |

| Key | Description |
| --- | ----------- |
| `tasks` | lm-eval task names (e.g. `arc_easy`, `gsm8k`) |
| `num_fewshot` | Few-shot count (gsm8k may use task default 8) |
| `limit` | Max samples per task; `null` = full split |
| `seed` | Random seed (applied to all lm-eval RNGs) |
| `batch_size` | `auto` or integer |
| `device` | `auto`, `cpu`, `cuda`, … |
| `dtype` | `bfloat16`, `float16`, `int4`, … |
| `trust_remote_code` | Required for MiniCPM / Gemma presets |
| `output_dir` | Root for runs (default `results/lm_eval`) |

### CLI reference

```
slm-lm-eval [OPTIONS]

--list-profiles         Show claim-matched profiles and example commands
--list-profiles-all     Include agentic suites and external benchmark notes
--list-tasks            List lm-eval task names (catalog fallback if not installed)
--list-tasks-all        Full lm-eval task list
--profile NAME          Shorthand for --config (reasoning, code, smoke, …)
--config PATH           YAML config (tasks, seed, limit, …)
--preset KEY            models.yaml preset (base, LoRA, merged)
--model PATH            HF Hub id or merged checkpoint dir
--adapter PATH          LoRA adapter (alternative to preset adapter_path)
--tasks NAMES           Override task list
--num-fewshot N
--limit N               Cap samples per task
--seed N
--batch-size VALUE
--device MAP
--dtype TYPE
--output-dir DIR        Default: results/lm_eval
--experiment-name TAG   Run folder name
--compare-to PATH       Baseline results.json for delta table
```

### Results

Each run writes to `<output_dir>/<experiment_name>/`:

| File | Contents |
| ---- | -------- |
| `results.json` | lm-eval native payload + `run_meta` |
| `summary.md` | Task → metric table |
| `run_meta.json` | Preset, base model, adapter, tasks, seed |
| `comparison.md` | Delta table (when `--compare-to` set) |

### PEFT / LoRA

lm-eval expects `pretrained=<base>,peft=<adapter>`. The preset resolver handles this for keys like `minicpm5-1b-lesson-lora`. Merged checkpoints use `--preset minicpm5-1b-lesson-merged` or `--model ./models/finetuned/...-merged`.

---

## Adding a custom benchmark

1. Create `src/slm_evals/benchmarks/my_bench.py` subclassing `BaseBenchmark`:
   - `load_dataset()` → list of sample dicts
   - `build_prompt(sample)` → prompt string
   - `evaluate_sample(sample, prediction)` → `{passed, score, note}`

2. Register in `src/slm_evals/run_benchmark.py` → `BENCHMARK_REGISTRY`.

3. Run:
   ```bash
   uv run --package slm-evals slm-benchmark \
     --model ./my-model --benchmarks my_bench --max-samples 10
   ```

---

## Suggested workflows

### Smoke (CPU/GPU, ~5 min)

```bash
uv run --package slm-evals slm-benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl \
  --max-samples 5 \
  --device cpu
```

### Before / after fine-tune

```bash
BASE=openbmb/MiniCPM5-1B
ADAPTER=./models/finetuned/minicpm5-1b-lora

for M in "$BASE" "$ADAPTER"; do
  uv run --package slm-evals slm-benchmark \
    --model "$M" \
    --benchmarks bfcl tau_bench \
    --max-samples 100 \
    --experiment-name "$(basename "$M")__bfcl-tau"
done
```

### Full experiment (YAML)

Edit `configs/experiment_001.yaml` with your `model_path` and `experiment_name`, then:

```bash
uv run --package slm-evals slm-benchmark \
  --config research/evals/configs/experiment_001.yaml
```

---

## Troubleshooting

| Symptom | Likely cause | Fix |
| ------- | ------------ | --- |
| `error: --model is required` | No `--config` and no `--model` | Pass one of them |
| CUDA OOM | Model too large for VRAM | `--dtype int4` or `--device cpu` |
| HF dataset 401 on GAIA test | Gated split | Use `split: validation` |
| τ-bench hangs / costs | LLM user enabled | Set `use_llm_user: false` |
| Empty `results/` | Wrong cwd | Run from repo root or use absolute `output_dir` |
| Import errors | Evals group not synced | `uv sync --group evals` |

---

## Entry points

| Path | Role |
| ---- | ---- |
| `slm-benchmark` | Agentic benchmarks (BFCL, τ-bench, GAIA, SWE) |
| `slm-lm-eval` | Academic benchmarks via lm-evaluation-harness |
| `python -m slm_evals.run_benchmark` | Same as `slm-benchmark` |
| `python -m slm_evals.run_lm_eval` | Same as `slm-lm-eval` |
| `research/evals/run_benchmark.py` | Thin shim for backward compatibility |