# Eval profiles guide

Match the capability your model is meant to improve to the right benchmark profile, then run that profile with one command.

Catalog file: [`configs/eval_profiles.yaml`](../configs/eval_profiles.yaml)

## Quick commands

```bash
# See all lm-eval profiles (reasoning, code, smoke, …)
uv run --package slm-evals slm-lm-eval --list-profiles

# Include agentic suites (slm-benchmark) and external notes
uv run --package slm-evals slm-lm-eval --list-profiles-all

# List lm-eval task names available in the harness
uv run --package slm-evals slm-lm-eval --list-tasks

# Agentic benchmark keys (BFCL, τ-bench, GAIA, SWE)
uv run --package slm-evals slm-benchmark --list-benchmarks

# Run a profile by name
uv run --package slm-evals slm-lm-eval \
  --profile reasoning \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__reasoning-baseline
```

Install lm-eval extras first: `uv sync --group lm-eval`

---

## Three eval systems in this repo

| System | CLI | What it measures |
| ------ | --- | ---------------- |
| Academic (lm-eval harness) | `slm-lm-eval` | ARC, GSM8K, HumanEval, IFEval, … |
| Agentic | `slm-benchmark` | Function calling, multi-turn tools, GAIA, SWE |
| Ensemble-specific | `jepa_harness`, `world_harness` | JEPA draft selection, world-model energy ranking |

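Use **one profile per claim**. Do not compare training loss to lm-eval accuracy: they are different metrics computed on different data, so a change in one does not substantiate a change in the other.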

---

## Match your claim → profile

| If you claim… | Profile / suite | Tool | Tasks or benchmarks |
| ------------- | ----------------- | ---- | ------------------- |
| Quick sanity check | `smoke` | `slm-lm-eval` | `arc_easy`, `hellaswag` (limit 25) |
| Better reasoning | `reasoning` | `slm-lm-eval` | `gsm8k`, `arc_easy`, `arc_challenge`, `hellaswag` |
| Better language understanding | `understanding` | `slm-lm-eval` | `boolq`, `piqa`, `copa`, `rte` |
| Better code generation | `code` | `slm-lm-eval` | `humaneval`, `mbpp` |
| Better instruction following | `instructions` | `slm-lm-eval` | `ifeval` |
| Better French / translation | `french` | `slm-lm-eval` | `french_bench_xnli`, `belebele_fra_Latn`, `wmt14-en-fr`, … |
| Better multilingual understanding | `multilingual` | `slm-lm-eval` | `xnli`, `xcopa`, `xwinograd` |
| General ~1B SLM baseline | `general_slm` | `slm-lm-eval` | 6-task mix (full splits) |
| Baseline vs finetune study | `compare_study` | `slm-lm-eval` | Same 6 tasks, limit 100 |
| Tool use / function calling | `agentic_tool_use` | `slm-benchmark` | `bfcl`, `tau_bench` |
| End-to-end assistant tasks | `agentic_gaia` | `slm-benchmark` | `gaia` |
| Real-world code repair | `agentic_code` | `slm-benchmark` | `swe_bench` |
| JEPA / selector quality | `jepa_selector` | `jepa_harness` | Domain QA + draft ablations |
| World model / planning | `world_model` | `world_harness` | Energy-ranked drafts on QA |
| Better embeddings | `embeddings_mteb` | external (MTEB) | Not in this repo |
| Chat quality (judge-based) | `chat_judge` | external | MT-Bench, AlpacaEval |

---

## Profile YAML files

| Profile key | Config file |
| ----------- | ----------- |
| `smoke` | `lm_eval_smoke.yaml` |
| `reasoning` | `lm_eval_reasoning.yaml` |
| `understanding` | `lm_eval_understanding.yaml` |
| `code` | `lm_eval_code.yaml` |
| `instructions` | `lm_eval_instructions.yaml` |
| `french` | `lm_eval_french.yaml` |
| `multilingual` | `lm_eval_multilingual.yaml` |
| `general_slm` | `lm_eval_minicpm5.yaml` |
| `compare_study` | `lm_eval_compare_study.yaml` |

Equivalent to `--profile reasoning`:

```bash
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_reasoning.yaml \
  --preset minicpm5-1b
```
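
The mapping in the table above can be sketched as a small lookup, for example to script runs with `--config` instead of `--profile`. This is an illustration only: `PROFILE_CONFIGS`, `CONFIG_DIR`, and `config_path_for` are hypothetical names, the directory is assumed from the `--config` example above, and the CLI itself presumably resolves profiles from the catalog file rather than from a table like this.

```python
from pathlib import Path

# Profile key -> config file name, copied from the table above.
PROFILE_CONFIGS = {
    "smoke": "lm_eval_smoke.yaml",
    "reasoning": "lm_eval_reasoning.yaml",
    "understanding": "lm_eval_understanding.yaml",
    "code": "lm_eval_code.yaml",
    "instructions": "lm_eval_instructions.yaml",
    "french": "lm_eval_french.yaml",
    "multilingual": "lm_eval_multilingual.yaml",
    "general_slm": "lm_eval_minicpm5.yaml",
    "compare_study": "lm_eval_compare_study.yaml",
}

# Assumption: configs live in the directory used by the --config example.
CONFIG_DIR = Path("research/evals/configs")


def config_path_for(profile: str) -> Path:
    """Return the config path for a profile key, or raise on an unknown key."""
    try:
        return CONFIG_DIR / PROFILE_CONFIGS[profile]
    except KeyError:
        known = ", ".join(sorted(PROFILE_CONFIGS))
        raise ValueError(f"unknown profile {profile!r}; known: {known}") from None


print(config_path_for("reasoning").as_posix())
```

Note that `general_slm` is the one key whose file name does not follow the `lm_eval_<key>.yaml` pattern, which is why the sketch uses an explicit table rather than string formatting.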

---

## Baseline vs candidate workflow

Use the **same profile** for both runs; change only the preset and experiment name, and pass `--compare-to` on the candidate run:

```bash
PROFILE=reasoning
BASE=minicpm5-1b__reasoning-baseline
CAND=minicpm5-1b-lora__reasoning

uv run --package slm-evals slm-lm-eval \
  --profile "$PROFILE" --preset minicpm5-1b --experiment-name "$BASE"

uv run --package slm-evals slm-lm-eval \
  --profile "$PROFILE" --preset minicpm5-1b-lesson-lora \
  --experiment-name "$CAND" \
  --compare-to "results/lm_eval/${BASE}/results.json"
```

Or run the evaluation automatically after fine-tuning:

```bash
uv run python research/finetune.py --preset minicpm5-1b --mode lora \
  --lm-eval-after \
  --lm-eval-config research/evals/configs/lm_eval_reasoning.yaml \
  --lm-eval-baseline minicpm5-1b
```
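
To inspect a baseline/candidate pair by hand, a per-task delta can be computed from the two `results.json` payloads. The sketch below is not the logic behind `--compare-to`; it assumes the payload keeps the usual lm-eval shape, a top-level `results` object mapping each task to a dict of metrics with keys such as `acc,none`. The function name and the sample numbers are invented for illustration.

```python
def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def metric_deltas(baseline: dict, candidate: dict) -> dict:
    """Return {task: {metric: candidate - baseline}} for shared numeric metrics."""
    base_results = baseline.get("results", {})
    cand_results = candidate.get("results", {})
    deltas = {}
    # Only tasks present in both runs can be compared.
    for task in sorted(base_results.keys() & cand_results.keys()):
        task_deltas = {}
        for name, base_value in base_results[task].items():
            cand_value = cand_results[task].get(name)
            # Skip stderr entries and non-numeric fields such as "alias".
            if "stderr" in name:
                continue
            if _is_number(base_value) and _is_number(cand_value):
                task_deltas[name] = cand_value - base_value
        if task_deltas:
            deltas[task] = task_deltas
    return deltas


# Toy payloads with made-up scores, standing in for two loaded results.json files.
baseline = {"results": {"arc_easy": {"alias": "arc_easy", "acc,none": 0.5}}}
candidate = {"results": {"arc_easy": {"alias": "arc_easy", "acc,none": 0.75}}}
print(metric_deltas(baseline, candidate))
```

With real runs, load each file with `json.load` and pass the two dicts in the same order: baseline first, candidate second.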

---

## Results layout

**slm-lm-eval** → `results/lm_eval/<experiment-name>/`

| File | Contents |
| ---- | -------- |
| `results.json` | Full lm-eval payload + `run_meta` |
| `summary.md` | Task → metric table |
| `run_meta.json` | Profile tasks, preset, seed |
| `comparison.md` | Delta vs baseline (with `--compare-to`) |

**slm-benchmark** → `results/<experiment-name>/` (`results.json`, `results.csv`, `report.md`)
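
The layout above can be turned into a quick check that a run wrote what it should. This is a minimal sketch: `expected_outputs` and its parameters are hypothetical names, and the `results` root is assumed to be relative to the directory the commands were run from.

```python
from pathlib import Path

# File names taken from the layout described above.
LM_EVAL_FILES = ("results.json", "summary.md", "run_meta.json")
BENCHMARK_FILES = ("results.json", "results.csv", "report.md")


def expected_outputs(tool: str, experiment_name: str,
                     compared: bool = False,
                     root: Path = Path("results")) -> list[Path]:
    """List the files a run is expected to write for the given tool."""
    if tool == "slm-lm-eval":
        run_dir = root / "lm_eval" / experiment_name
        names = list(LM_EVAL_FILES)
        if compared:
            # comparison.md is only written when --compare-to was passed.
            names.append("comparison.md")
    elif tool == "slm-benchmark":
        run_dir = root / experiment_name
        names = list(BENCHMARK_FILES)
    else:
        raise ValueError(f"unknown tool {tool!r}")
    return [run_dir / name for name in names]


# Report any expected file that is not on disk yet.
for path in expected_outputs("slm-lm-eval", "minicpm5-1b__reasoning-baseline"):
    if not path.exists():
        print(f"missing: {path.as_posix()}")
```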

---

## Custom tasks

Override tasks on any profile:

```bash
uv run --package slm-evals slm-lm-eval \
  --profile smoke \
  --tasks gsm8k arc_easy \
  --preset minicpm5-1b
```
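
When scripting several overrides, it can help to build the command as an argument list rather than a shell string. The helper below is a hypothetical sketch that uses only the flags shown in this guide (`--profile`, `--preset`, `--tasks`, `--experiment-name`); it does not run anything by itself.

```python
import shlex
from typing import Optional, Sequence


def lm_eval_command(profile: str, preset: str,
                    tasks: Optional[Sequence[str]] = None,
                    experiment_name: Optional[str] = None) -> list[str]:
    """Build the argv for an slm-lm-eval run, with an optional task override."""
    cmd = ["uv", "run", "--package", "slm-evals", "slm-lm-eval",
           "--profile", profile, "--preset", preset]
    if tasks:
        # Task names are passed space-separated after --tasks, as in the example above.
        cmd += ["--tasks", *tasks]
    if experiment_name:
        cmd += ["--experiment-name", experiment_name]
    return cmd


cmd = lm_eval_command("smoke", "minicpm5-1b", tasks=["gsm8k", "arc_easy"])
print(shlex.join(cmd))
```

To execute the run, pass the list to `subprocess.run(cmd, check=True)`; keeping it as a list avoids quoting problems with experiment names.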

Browse all harness tasks: `slm-lm-eval --list-tasks-all`

See also: [USAGE.md](../USAGE.md), [benchmarks.md](benchmarks.md)