# Eval profiles guide
Pick the benchmark profile that matches what your model is supposed to improve, then run it with one command.
Catalog file: [`configs/eval_profiles.yaml`](../configs/eval_profiles.yaml)
## Quick commands
```bash
# See all lm-eval profiles (reasoning, code, smoke, …)
uv run --package slm-evals slm-lm-eval --list-profiles

# Include agentic suites (slm-benchmark) and external notes
uv run --package slm-evals slm-lm-eval --list-profiles-all

# List lm-eval task names available in the harness
uv run --package slm-evals slm-lm-eval --list-tasks

# Agentic benchmark keys (BFCL, τ-bench, GAIA, SWE)
uv run --package slm-evals slm-benchmark --list-benchmarks

# Run a profile by name
uv run --package slm-evals slm-lm-eval \
  --profile reasoning \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__reasoning-baseline
```
Install the lm-eval extras before running any `slm-lm-eval` command: `uv sync --group lm-eval`
---
## Three eval systems in this repo
| System | CLI | What it measures |
| ------ | --- | ---------------- |
| Academic (lm-eval harness) | `slm-lm-eval` | ARC, GSM8K, HumanEval, IFEval, … |
| Agentic | `slm-benchmark` | Function calling, multi-turn tools, GAIA, SWE |
| Ensemble-specific | `jepa_harness`, `world_harness` | JEPA draft selection, world-model energy ranking |
Use **one profile per claim**. Do not compare training loss to lm-eval accuracy; they are different metrics and are not interchangeable.
---
## Match your claim → profile
| If you claim… | Profile / suite | Tool | Tasks or benchmarks |
| ------------- | ----------------- | ---- | ------------------- |
| Quick sanity check | `smoke` | `slm-lm-eval` | `arc_easy`, `hellaswag` (limit 25) |
| Better reasoning | `reasoning` | `slm-lm-eval` | `gsm8k`, `arc_easy`, `arc_challenge`, `hellaswag` |
| Better language understanding | `understanding` | `slm-lm-eval` | `boolq`, `piqa`, `copa`, `rte` |
| Better code generation | `code` | `slm-lm-eval` | `humaneval`, `mbpp` |
| Better instruction following | `instructions` | `slm-lm-eval` | `ifeval` |
| Better French / translation | `french` | `slm-lm-eval` | `french_bench_xnli`, `belebele_fra_Latn`, `wmt14-en-fr`, … |
| Better multilingual understanding | `multilingual` | `slm-lm-eval` | `xnli`, `xcopa`, `xwinograd` |
| General ~1B SLM baseline | `general_slm` | `slm-lm-eval` | 6-task mix (full splits) |
| Baseline vs finetune study | `compare_study` | `slm-lm-eval` | Same 6 tasks, limit 100 |
| Tool use / function calling | `agentic_tool_use` | `slm-benchmark` | `bfcl`, `tau_bench` |
| End-to-end assistant tasks | `agentic_gaia` | `slm-benchmark` | `gaia` |
| Real-world code repair | `agentic_code` | `slm-benchmark` | `swe_bench` |
| JEPA / selector quality | `jepa_selector` | `jepa_harness` | Domain QA + draft ablations |
| World model / planning | `world_model` | `world_harness` | Energy-ranked drafts on QA |
| Better embeddings | `embeddings_mteb` | external (MTEB) | Not in this repo |
| Chat quality (judge-based) | `chat_judge` | external | MT-Bench, AlpacaEval |
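
The `limit` values in the table affect how much weight a score can carry. As a rough guide, and assuming the limit caps the number of examples evaluated per task (as `--limit` does in the lm-eval harness), the sampling noise of an accuracy score shrinks only with the square root of the example count. The sketch below is illustrative arithmetic, not code from this repo, and `accuracy_stderr` is a hypothetical helper. It suggests why `smoke` (limit 25) suits sanity checks, while `compare_study` (limit 100) still leaves a worst-case standard error of about 0.05.

```python
import math


def accuracy_stderr(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy measured on n_examples items."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


# Worst case is accuracy = 0.5; compare the two limits used in the table.
for limit in (25, 100):
    print(f"limit={limit}: stderr <= {accuracy_stderr(0.5, limit):.3f}")
# prints:
# limit=25: stderr <= 0.100
# limit=100: stderr <= 0.050
```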
---
## Profile YAML files
| Profile key | Config file |
| ----------- | ----------- |
| `smoke` | `lm_eval_smoke.yaml` |
| `reasoning` | `lm_eval_reasoning.yaml` |
| `understanding` | `lm_eval_understanding.yaml` |
| `code` | `lm_eval_code.yaml` |
| `instructions` | `lm_eval_instructions.yaml` |
| `french` | `lm_eval_french.yaml` |
| `multilingual` | `lm_eval_multilingual.yaml` |
| `general_slm` | `lm_eval_minicpm5.yaml` |
| `compare_study` | `lm_eval_compare_study.yaml` |
Passing a profile's config file directly is equivalent to naming the profile. For example, this matches `--profile reasoning`:
```bash
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_reasoning.yaml \
  --preset minicpm5-1b
```
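
The profile-to-config table can be read as a simple lookup. The sketch below is not how `slm-lm-eval` resolves profiles (the catalog file is the source of truth). It mirrors the table above and assumes every config sits in the same `research/evals/configs/` directory as `lm_eval_reasoning.yaml`. `PROFILE_CONFIGS` and `config_for` are hypothetical names used only for illustration.

```python
from pathlib import PurePosixPath

# Assumption: all profile configs share the directory used in the example above.
CONFIG_DIR = PurePosixPath("research/evals/configs")

# Copied from the "Profile YAML files" table.
PROFILE_CONFIGS = {
    "smoke": "lm_eval_smoke.yaml",
    "reasoning": "lm_eval_reasoning.yaml",
    "understanding": "lm_eval_understanding.yaml",
    "code": "lm_eval_code.yaml",
    "instructions": "lm_eval_instructions.yaml",
    "french": "lm_eval_french.yaml",
    "multilingual": "lm_eval_multilingual.yaml",
    "general_slm": "lm_eval_minicpm5.yaml",
    "compare_study": "lm_eval_compare_study.yaml",
}


def config_for(profile: str) -> PurePosixPath:
    """Return the config path for an lm-eval profile key, or raise ValueError."""
    try:
        return CONFIG_DIR / PROFILE_CONFIGS[profile]
    except KeyError:
        known = ", ".join(sorted(PROFILE_CONFIGS))
        raise ValueError(f"unknown lm-eval profile {profile!r}; known: {known}") from None


print(config_for("reasoning"))  # research/evals/configs/lm_eval_reasoning.yaml
```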
---
## Baseline vs candidate workflow
Use the **same profile** for both runs; only change preset and experiment name:
```bash
PROFILE=reasoning
BASE=minicpm5-1b__reasoning-baseline
CAND=minicpm5-1b-lora__reasoning

# Baseline run
uv run --package slm-evals slm-lm-eval \
  --profile "$PROFILE" --preset minicpm5-1b --experiment-name "$BASE"

# Candidate run, compared against the baseline results
uv run --package slm-evals slm-lm-eval \
  --profile "$PROFILE" --preset minicpm5-1b-lesson-lora \
  --experiment-name "$CAND" \
  --compare-to "results/lm_eval/${BASE}/results.json"
```
Alternatively, run the evaluation automatically at the end of a finetune:
```bash
uv run python research/finetune.py --preset minicpm5-1b --mode lora \
  --lm-eval-after \
  --lm-eval-config research/evals/configs/lm_eval_reasoning.yaml \
  --lm-eval-baseline minicpm5-1b
```
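
`--compare-to` produces the comparison for you. If you want to recompute the deltas by hand, a minimal sketch follows. It assumes `results.json` keeps the lm-eval harness layout: a top-level `results` object mapping each task to a metric dictionary with keys such as `acc,none`. `load_results` and `metric_deltas` are hypothetical helpers, not part of `slm-evals`, and the paths in the usage comment are examples only.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_results(path: str | Path) -> dict:
    """Read a results.json file into a dictionary."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def metric_deltas(baseline: dict, candidate: dict) -> dict[str, dict[str, float]]:
    """Return candidate minus baseline for each numeric metric both runs share."""
    base_results = baseline.get("results", {})
    cand_results = candidate.get("results", {})
    deltas: dict[str, dict[str, float]] = {}
    for task in sorted(base_results.keys() & cand_results.keys()):
        base_metrics, cand_metrics = base_results[task], cand_results[task]
        task_deltas: dict[str, float] = {}
        for name in sorted(base_metrics.keys() & cand_metrics.keys()):
            b, c = base_metrics[name], cand_metrics[name]
            # Skip stderr entries and booleans; strings such as "alias" fail the check below.
            if "stderr" in name or isinstance(b, bool) or isinstance(c, bool):
                continue
            if isinstance(b, (int, float)) and isinstance(c, (int, float)):
                task_deltas[name] = c - b
        if task_deltas:
            deltas[task] = task_deltas
    return deltas


# Usage (paths follow the results layout described in this guide):
# base = load_results("results/lm_eval/minicpm5-1b__reasoning-baseline/results.json")
# cand = load_results("results/lm_eval/minicpm5-1b-lora__reasoning/results.json")
# for task, metrics in metric_deltas(base, cand).items():
#     print(task, metrics)
```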
---
## Results layout
**slm-lm-eval** → `results/lm_eval/<experiment-name>/`
| File | Contents |
| ---- | -------- |
| `results.json` | Full lm-eval payload + `run_meta` |
| `summary.md` | Task → metric table |
| `run_meta.json` | Profile tasks, preset, seed |
| `comparison.md` | Delta vs baseline (with `--compare-to`) |
**slm-benchmark** → `results/<experiment-name>/` (`results.json`, `results.csv`, `report.md`)
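
A quick way to confirm that an `slm-lm-eval` run finished writing its outputs is to check the folder against the file list above. This is a small sketch under that assumption only; `missing_outputs` is a hypothetical helper and the experiment name in the usage comment is an example.

```python
from __future__ import annotations

from pathlib import Path

# File names taken from the slm-lm-eval results table.
ALWAYS_WRITTEN = ("results.json", "summary.md", "run_meta.json")
WITH_COMPARE = ("comparison.md",)  # only expected when --compare-to was passed


def missing_outputs(experiment_dir: Path, compared: bool = False) -> list[str]:
    """List expected files that are absent from an slm-lm-eval results folder."""
    expected = ALWAYS_WRITTEN + (WITH_COMPARE if compared else ())
    return [name for name in expected if not (experiment_dir / name).is_file()]


# Usage:
# missing_outputs(Path("results/lm_eval/minicpm5-1b__reasoning-baseline"))
```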
---
## Custom tasks
Override tasks on any profile:
```bash
uv run --package slm-evals slm-lm-eval \
  --profile smoke \
  --tasks gsm8k arc_easy \
  --preset minicpm5-1b
```
Browse all harness tasks: `slm-lm-eval --list-tasks-all`
See also: [USAGE.md](../USAGE.md), [benchmarks.md](benchmarks.md)