Eval profiles guide

Match what your model is supposed to improve to the right benchmark profile, then run it with one command.

Catalog file: configs/eval_profiles.yaml

Quick commands

```bash
# See all lm-eval profiles (reasoning, code, smoke, …)
uv run --package slm-evals slm-lm-eval --list-profiles

# Include agentic suites (slm-benchmark) and external notes
uv run --package slm-evals slm-lm-eval --list-profiles-all

# List lm-eval task names available in the harness
uv run --package slm-evals slm-lm-eval --list-tasks

# Agentic benchmark keys (BFCL, τ-bench, GAIA, SWE)
uv run --package slm-evals slm-benchmark --list-benchmarks

# Run a profile by name
uv run --package slm-evals slm-lm-eval \
  --profile reasoning \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__reasoning-baseline
```

Install lm-eval extras first: uv sync --group lm-eval


Three eval systems in this repo

| System | CLI | What it measures |
| --- | --- | --- |
| Academic (lm-eval harness) | slm-lm-eval | ARC, GSM8K, HumanEval, IFEval, … |
| Agentic | slm-benchmark | Function calling, multi-turn tools, GAIA, SWE |
| Ensemble-specific | jepa_harness, world_harness | JEPA draft selection, world-model energy ranking |

Use one profile per claim. Do not compare training loss to lm-eval accuracy.


Match your claim → profile

| If you claim… | Profile / suite | Tool | Tasks or benchmarks |
| --- | --- | --- | --- |
| Quick sanity check | smoke | slm-lm-eval | arc_easy, hellaswag (limit 25) |
| Better reasoning | reasoning | slm-lm-eval | gsm8k, arc_easy, arc_challenge, hellaswag |
| Better language understanding | understanding | slm-lm-eval | boolq, piqa, copa, rte |
| Better code generation | code | slm-lm-eval | humaneval, mbpp |
| Better instruction following | instructions | slm-lm-eval | ifeval |
| Better French / translation | french | slm-lm-eval | french_bench_xnli, belebele_fra_Latn, wmt14-en-fr, … |
| Better multilingual understanding | multilingual | slm-lm-eval | xnli, xcopa, xwinograd |
| General ~1B SLM baseline | general_slm | slm-lm-eval | 6-task mix (full splits) |
| Baseline vs finetune study | compare_study | slm-lm-eval | Same 6 tasks, limit 100 |
| Tool use / function calling | agentic_tool_use | slm-benchmark | bfcl, tau_bench |
| End-to-end assistant tasks | agentic_gaia | slm-benchmark | gaia |
| Real-world code repair | agentic_code | slm-benchmark | swe_bench |
| JEPA / selector quality | jepa_selector | jepa_harness | Domain QA + draft ablations |
| World model / planning | world_model | world_harness | Energy-ranked drafts on QA |
| Better embeddings | embeddings_mteb | external (MTEB) | Not in this repo |
| Chat quality (judge-based) | chat_judge | external | MT-Bench, AlpacaEval |

Profile YAML files

| Profile key | Config file |
| --- | --- |
| smoke | lm_eval_smoke.yaml |
| reasoning | lm_eval_reasoning.yaml |
| understanding | lm_eval_understanding.yaml |
| code | lm_eval_code.yaml |
| instructions | lm_eval_instructions.yaml |
| french | lm_eval_french.yaml |
| multilingual | lm_eval_multilingual.yaml |
| general_slm | lm_eval_minicpm5.yaml |
| compare_study | lm_eval_compare_study.yaml |

Equivalent to --profile reasoning:

```bash
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_reasoning.yaml \
  --preset minicpm5-1b
```
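
The profile-to-config mapping can be written out as a small lookup, which is handy in scripts that need the explicit `--config` path. This is only a sketch built from the table and the example path above; it is not how `slm-lm-eval` resolves profiles internally (the catalog file remains the source of truth), and the names `PROFILE_CONFIGS`, `CONFIG_DIR`, and `config_for_profile` are hypothetical.

```python
from pathlib import PurePosixPath

# Profile key -> config file name, copied from the table above.
PROFILE_CONFIGS = {
    "smoke": "lm_eval_smoke.yaml",
    "reasoning": "lm_eval_reasoning.yaml",
    "understanding": "lm_eval_understanding.yaml",
    "code": "lm_eval_code.yaml",
    "instructions": "lm_eval_instructions.yaml",
    "french": "lm_eval_french.yaml",
    "multilingual": "lm_eval_multilingual.yaml",
    "general_slm": "lm_eval_minicpm5.yaml",  # does not follow the lm_eval_<key> pattern
    "compare_study": "lm_eval_compare_study.yaml",
}

# Directory taken from the --config example above.
CONFIG_DIR = PurePosixPath("research/evals/configs")


def config_for_profile(profile: str) -> str:
    """Return the config path that a profile key stands for."""
    if profile not in PROFILE_CONFIGS:
        known = ", ".join(sorted(PROFILE_CONFIGS))
        raise ValueError(f"unknown lm-eval profile {profile!r}; known: {known}")
    return str(CONFIG_DIR / PROFILE_CONFIGS[profile])


print(config_for_profile("reasoning"))
# research/evals/configs/lm_eval_reasoning.yaml
```

A lookup is used rather than string formatting because `general_slm` maps to `lm_eval_minicpm5.yaml`, so the file name cannot be derived from the key.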

Baseline vs candidate workflow

Use the same profile for both runs; only change preset and experiment name:

```bash
PROFILE=reasoning
BASE=minicpm5-1b__reasoning-baseline
CAND=minicpm5-1b-lora__reasoning

uv run --package slm-evals slm-lm-eval \
  --profile "$PROFILE" --preset minicpm5-1b --experiment-name "$BASE"

uv run --package slm-evals slm-lm-eval \
  --profile "$PROFILE" --preset minicpm5-1b-lesson-lora \
  --experiment-name "$CAND" \
  --compare-to "results/lm_eval/${BASE}/results.json"
```

Or run the eval automatically after a finetune:

```bash
uv run python research/finetune.py --preset minicpm5-1b --mode lora \
  --lm-eval-after \
  --lm-eval-config research/evals/configs/lm_eval_reasoning.yaml \
  --lm-eval-baseline minicpm5-1b
```
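
For intuition about what a baseline-versus-candidate delta involves, here is a minimal Python sketch. It is not the implementation behind `--compare-to`. It assumes that `results.json` follows the upstream lm-eval layout, with a top-level `"results"` object mapping each task to its metrics (for example `"acc,none"`); the function names are hypothetical and the numbers are toy placeholders, not measured scores.

```python
import json
from pathlib import Path


def numeric_metrics(payload: dict) -> dict:
    """Keep numeric, non-stderr metrics per task from an lm-eval style payload."""
    out = {}
    for task, metrics in payload.get("results", {}).items():
        out[task] = {
            name: value
            for name, value in metrics.items()
            if isinstance(value, (int, float))
            and not isinstance(value, bool)
            and "stderr" not in name
        }
    return out


def load_metrics(path) -> dict:
    """Read a results.json file (assumed schema) and extract its metrics."""
    return numeric_metrics(json.loads(Path(path).read_text()))


def deltas(baseline: dict, candidate: dict) -> list:
    """Return (task, metric, baseline, candidate, delta) for shared metrics only."""
    rows = []
    for task in sorted(baseline.keys() & candidate.keys()):
        for metric in sorted(baseline[task].keys() & candidate[task].keys()):
            b, c = baseline[task][metric], candidate[task][metric]
            rows.append((task, metric, b, c, c - b))
    return rows


# Toy payloads standing in for two results.json files (placeholder values).
base = {"results": {"arc_easy": {"alias": "arc_easy", "acc,none": 0.5, "acc_stderr,none": 0.01}}}
cand = {"results": {"arc_easy": {"alias": "arc_easy", "acc,none": 0.75, "acc_stderr,none": 0.01}}}

for task, metric, b, c, d in deltas(numeric_metrics(base), numeric_metrics(cand)):
    print(f"{task} {metric}: {b:.3f} -> {c:.3f} ({d:+.3f})")
```

Only tasks and metrics present in both runs are compared, which is why both runs should use the same profile. With real files, `load_metrics("results/lm_eval/<experiment-name>/results.json")` would replace the toy payloads.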

Results layout

slm-lm-eval → results/lm_eval/<experiment-name>/

| File | Contents |
| --- | --- |
| results.json | Full lm-eval payload + run_meta |
| summary.md | Task → metric table |
| run_meta.json | Profile tasks, preset, seed |
| comparison.md | Delta vs baseline (with --compare-to) |

slm-benchmark → results/<experiment-name>/ (results.json, results.csv, report.md)
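
A script that consumes these outputs may want to check that a run directory is complete before reading it. The sketch below encodes the layout described above; it assumes the `results/` root is relative to the working directory, and `run_dir` and `missing_outputs` are hypothetical helper names, not part of the repo.

```python
from pathlib import Path

# File names copied from the layout described above.
LM_EVAL_FILES = ("results.json", "summary.md", "run_meta.json")
BENCHMARK_FILES = ("results.json", "results.csv", "report.md")


def run_dir(tool: str, experiment_name: str, root: str = "results") -> Path:
    """Directory where a given tool is expected to write an experiment."""
    if tool == "slm-lm-eval":
        return Path(root) / "lm_eval" / experiment_name
    if tool == "slm-benchmark":
        return Path(root) / experiment_name
    raise ValueError(f"unknown tool: {tool!r}")


def missing_outputs(tool, experiment_name, root="results", compared=False):
    """List expected output files that are not on disk yet."""
    base = run_dir(tool, experiment_name, root)
    names = list(LM_EVAL_FILES if tool == "slm-lm-eval" else BENCHMARK_FILES)
    if tool == "slm-lm-eval" and compared:
        # comparison.md is only written when --compare-to was passed.
        names.append("comparison.md")
    return [base / name for name in names if not (base / name).is_file()]


print(run_dir("slm-lm-eval", "minicpm5-1b__reasoning-baseline").as_posix())
# results/lm_eval/minicpm5-1b__reasoning-baseline
```

An empty list from `missing_outputs(...)` means every expected file is present.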


Custom tasks

Override tasks on any profile:

```bash
uv run --package slm-evals slm-lm-eval \
  --profile smoke \
  --tasks gsm8k arc_easy \
  --preset minicpm5-1b
```

Browse all harness tasks: slm-lm-eval --list-tasks-all
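
When task overrides are driven from a script (for example, sweeping several task lists), it can help to build the command as an argument list rather than a shell string. This is a minimal sketch that uses only the flags shown in this guide; `lm_eval_command` is a hypothetical helper, and it assumes `--tasks` accepts space-separated task names as in the command above.

```python
import shlex


def lm_eval_command(profile, preset, tasks=None, experiment_name=None, compare_to=None):
    """Build the argv list for an slm-lm-eval run (nothing is executed here)."""
    argv = ["uv", "run", "--package", "slm-evals", "slm-lm-eval", "--profile", profile]
    if tasks:
        # Override the profile's task list.
        argv += ["--tasks", *tasks]
    argv += ["--preset", preset]
    if experiment_name:
        argv += ["--experiment-name", experiment_name]
    if compare_to:
        argv += ["--compare-to", compare_to]
    return argv


cmd = lm_eval_command("smoke", "minicpm5-1b", tasks=["gsm8k", "arc_easy"])
print(shlex.join(cmd))
# uv run --package slm-evals slm-lm-eval --profile smoke --tasks gsm8k arc_easy --preset minicpm5-1b
```

To actually launch the run, the list can be passed to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell quoting problems with experiment names or paths.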

See also: USAGE.md, benchmarks.md