lesson-agent-dev / research /evals /docs /eval_profiles.md
MSG
Feat/last hour (#24)
bbff1ca
|
Raw
History Blame Contribute Delete
5.11 kB
# Eval profiles guide
Match what your model is supposed to improve to the right benchmark profile, then run it with one command.
Catalog file: [`configs/eval_profiles.yaml`](../configs/eval_profiles.yaml)
## Quick commands
```bash
# See all lm-eval profiles (reasoning, code, smoke, …)
uv run --package slm-evals slm-lm-eval --list-profiles
# Include agentic suites (slm-benchmark) and external notes
uv run --package slm-evals slm-lm-eval --list-profiles-all
# List lm-eval task names available in the harness
uv run --package slm-evals slm-lm-eval --list-tasks
# Agentic benchmark keys (BFCL, τ-bench, GAIA, SWE)
uv run --package slm-evals slm-benchmark --list-benchmarks
# Run a profile by name
uv run --package slm-evals slm-lm-eval \
--profile reasoning \
--preset minicpm5-1b \
--experiment-name minicpm5-1b__reasoning-baseline
```
Install lm-eval extras first: `uv sync --group lm-eval`
---
## Three eval systems in this repo
| System | CLI | What it measures |
| ------ | --- | ---------------- |
| Academic (lm-eval harness) | `slm-lm-eval` | ARC, GSM8K, HumanEval, IFEval, … |
| Agentic | `slm-benchmark` | Function calling, multi-turn tools, GAIA, SWE |
| Ensemble-specific | `jepa_harness`, `world_harness` | JEPA draft selection, world-model energy ranking |
Use **one profile per claim**. Do not compare training loss to lm-eval accuracy.
---
## Match your claim → profile
| If you claim… | Profile / suite | Tool | Tasks or benchmarks |
| ------------- | ----------------- | ---- | ------------------- |
| Quick sanity check | `smoke` | `slm-lm-eval` | `arc_easy`, `hellaswag` (limit 25) |
| Better reasoning | `reasoning` | `slm-lm-eval` | `gsm8k`, `arc_easy`, `arc_challenge`, `hellaswag` |
| Better language understanding | `understanding` | `slm-lm-eval` | `boolq`, `piqa`, `copa`, `rte` |
| Better code generation | `code` | `slm-lm-eval` | `humaneval`, `mbpp` |
| Better instruction following | `instructions` | `slm-lm-eval` | `ifeval` |
| Better French / translation | `french` | `slm-lm-eval` | `french_bench_xnli`, `belebele_fra_Latn`, `wmt14-en-fr`, … |
| Better multilingual understanding | `multilingual` | `slm-lm-eval` | `xnli`, `xcopa`, `xwinograd` |
| General ~1B SLM baseline | `general_slm` | `slm-lm-eval` | 6-task mix (full splits) |
| Baseline vs finetune study | `compare_study` | `slm-lm-eval` | Same 6 tasks, limit 100 |
| Tool use / function calling | `agentic_tool_use` | `slm-benchmark` | `bfcl`, `tau_bench` |
| End-to-end assistant tasks | `agentic_gaia` | `slm-benchmark` | `gaia` |
| Real-world code repair | `agentic_code` | `slm-benchmark` | `swe_bench` |
| JEPA / selector quality | `jepa_selector` | `jepa_harness` | Domain QA + draft ablations |
| World model / planning | `world_model` | `world_harness` | Energy-ranked drafts on QA |
| Better embeddings | `embeddings_mteb` | external (MTEB) | Not in this repo |
| Chat quality (judge-based) | `chat_judge` | external | MT-Bench, AlpacaEval |
---
## Profile YAML files
| Profile key | Config file |
| ----------- | ----------- |
| `smoke` | `lm_eval_smoke.yaml` |
| `reasoning` | `lm_eval_reasoning.yaml` |
| `understanding` | `lm_eval_understanding.yaml` |
| `code` | `lm_eval_code.yaml` |
| `instructions` | `lm_eval_instructions.yaml` |
| `french` | `lm_eval_french.yaml` |
| `multilingual` | `lm_eval_multilingual.yaml` |
| `general_slm` | `lm_eval_minicpm5.yaml` |
| `compare_study` | `lm_eval_compare_study.yaml` |
Equivalent to `--profile reasoning`:
```bash
uv run --package slm-evals slm-lm-eval \
--config research/evals/configs/lm_eval_reasoning.yaml \
--preset minicpm5-1b
```
---
## Baseline vs candidate workflow
Use the **same profile** for both runs; only change preset and experiment name:
```bash
PROFILE=reasoning
BASE=minicpm5-1b__reasoning-baseline
CAND=minicpm5-1b-lora__reasoning
uv run --package slm-evals slm-lm-eval \
--profile "$PROFILE" --preset minicpm5-1b --experiment-name "$BASE"
uv run --package slm-evals slm-lm-eval \
--profile "$PROFILE" --preset minicpm5-1b-lesson-lora \
--experiment-name "$CAND" \
--compare-to "results/lm_eval/${BASE}/results.json"
```
Or after finetune:
```bash
uv run python research/finetune.py --preset minicpm5-1b --mode lora \
--lm-eval-after \
--lm-eval-config research/evals/configs/lm_eval_reasoning.yaml \
--lm-eval-baseline minicpm5-1b
```
---
## Results layout
**slm-lm-eval**`results/lm_eval/<experiment-name>/`
| File | Contents |
| ---- | -------- |
| `results.json` | Full lm-eval payload + `run_meta` |
| `summary.md` | Task → metric table |
| `run_meta.json` | Profile tasks, preset, seed |
| `comparison.md` | Delta vs baseline (with `--compare-to`) |
**slm-benchmark**`results/<experiment-name>/` (`results.json`, `results.csv`, `report.md`)
---
## Custom tasks
Override tasks on any profile:
```bash
uv run --package slm-evals slm-lm-eval \
--profile smoke \
--tasks gsm8k arc_easy \
--preset minicpm5-1b
```
Browse all harness tasks: `slm-lm-eval --list-tasks-all`
See also: [USAGE.md](../USAGE.md), [benchmarks.md](benchmarks.md)