Evals usage

Run the SLM Agentic Benchmark Suite (slm-evals) against a local HuggingFace model directory or Hub id.

Benchmark details: docs/benchmarks.md. Package overview: README.md.

Install

From the repo root:

uv sync --group evals

For academic benchmarks (lm-evaluation-harness):

uv sync --group lm-eval

This installs the slm-evals workspace package and registers the slm-benchmark and slm-lm-eval console scripts.

Quick start

# Two benchmarks, capped samples (good first run)
uv run --package slm-evals slm-benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# All four benchmarks
uv run --package slm-evals slm-benchmark \
  --model ./models/finetuned/minicpm5-1b-lora \
  --benchmarks all \
  --max-samples 50

# Equivalent module invocation
uv run --package slm-evals python -m slm_evals.run_benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl \
  --max-samples 10

Config-driven runs

Copy and edit the template, then pass --config:

cp research/evals/configs/experiment_001.yaml research/evals/configs/my_run.yaml
# edit model_path, benchmarks, max_samples, overrides

uv run --package slm-evals slm-benchmark \
  --config research/evals/configs/my_run.yaml

When --config is set, YAML values override CLI flags. Use configs for reproducible experiment names and per-benchmark settings.
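The precedence rule can be pictured as a dictionary merge in which the YAML values are applied last. The sketch below is only an illustration: `resolve_settings` is a hypothetical helper, not a function from slm-evals, and it assumes that keys absent from the YAML keep their CLI value.

```python
from typing import Any, Dict


def resolve_settings(cli_args: Dict[str, Any], yaml_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Merge CLI flags with YAML values; YAML wins when both set a key.

    Hypothetical helper for illustration; not part of slm-evals.
    """
    merged = dict(cli_args)
    # Applying the YAML mapping last lets it override any CLI value.
    # Keys the YAML does not mention keep the CLI value (an assumption).
    merged.update(yaml_cfg)
    return merged


cli = {"model_path": "openbmb/MiniCPM5-1B", "max_samples": 10, "temperature": 0.0}
cfg = {"max_samples": 50, "experiment_name": "my_run"}
settings = resolve_settings(cli, cfg)
print(settings["max_samples"])  # 50: the YAML value replaced the CLI value
print(settings["model_path"])   # openbmb/MiniCPM5-1B: only the CLI set this key
```

In practice this means a flag such as --max-samples has no effect once the same key appears in the config file.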

Template fields

Key                  Description
model_path           Local directory or HF Hub id
device               auto, cpu, cuda, cuda:0, …
dtype                float32, float16, bfloat16, int8, int4
max_new_tokens       Cap per generation (default 512)
temperature          0.0 = greedy (recommended for evals)
experiment_name      Folder name under output_dir
output_dir           Root for results (default results)
benchmarks           List: bfcl, tau_bench, gaia, swe_bench
max_samples          Cap per benchmark; omit or null for full split
benchmark_overrides  Per-benchmark dict (see docs/benchmarks.md)
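Before launching a long run it can help to sanity-check a config against the fields listed above. The following is a minimal sketch that works on an already-parsed config dict. `check_config` is a hypothetical helper, and the allowed values are copied from the table, so custom benchmarks registered later would need to be added to `KNOWN_BENCHMARKS`.

```python
ALLOWED_DTYPES = {"float32", "float16", "bfloat16", "int8", "int4"}
KNOWN_BENCHMARKS = {"bfcl", "tau_bench", "gaia", "swe_bench"}


def check_config(cfg: dict) -> list:
    """Return a list of problems found in a config dict (empty if none).

    Hypothetical helper for illustration; not part of slm-evals.
    """
    problems = []
    if not cfg.get("model_path"):
        problems.append("model_path is missing")
    dtype = cfg.get("dtype")
    if dtype is not None and dtype not in ALLOWED_DTYPES:
        problems.append(f"unknown dtype: {dtype}")
    unknown = set(cfg.get("benchmarks", [])) - KNOWN_BENCHMARKS
    if unknown:
        problems.append(f"unknown benchmarks: {sorted(unknown)}")
    max_samples = cfg.get("max_samples")
    # None (YAML null) or a missing key means "use the full split"
    if max_samples is not None and (not isinstance(max_samples, int) or max_samples < 1):
        problems.append("max_samples must be a positive integer or null")
    return problems


example = {
    "model_path": "openbmb/MiniCPM5-1B",
    "dtype": "bfloat16",
    "temperature": 0.0,
    "benchmarks": ["bfcl", "tau_bench"],
    "max_samples": 20,
}
print(check_config(example))  # []
print(check_config({"dtype": "fp8", "benchmarks": ["mmlu"]}))
```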

CLI reference

slm-benchmark [OPTIONS]

--list-benchmarks       Show agentic benchmark keys and preset suites
--model PATH            Local HF dir or Hub id (required unless --config)
--benchmarks NAMES      bfcl tau_bench gaia swe_bench all  (default: all)
--config PATH           YAML config (overrides other flags)
--max-samples N         Cap samples per benchmark
--output-dir DIR        Results root (default: ./results)
--experiment-name TAG   Run folder name (auto timestamp if omitted)
--device MAP            auto | cpu | cuda | cuda:0
--dtype TYPE            float32 | float16 | bfloat16 | int8 | int4
--max-new-tokens N      Default 512
--temperature T         Default 0.0

Results

Each run writes to <output_dir>/<experiment_name>/:

File          Contents
results.json  Full structured payload (per-sample + aggregates)
results.csv   One row per benchmark
report.md     Human-readable summary

Example layout:

results/
└── minicpm5-1b__bfcl-tau__v1/
    ├── results.json
    ├── results.csv
    └── report.md

output_dir is resolved relative to the current working directory. Run from the repo root so paths stay predictable, or set an absolute output_dir in YAML.
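To post-process a run, the per-benchmark summary can be loaded from the layout above. This is a minimal sketch with two hypothetical helpers, `run_dir` and `read_summary`. It assumes results.csv starts with a header row and makes no assumption about which columns it contains.

```python
import csv
from pathlib import Path
from typing import Dict, List


def run_dir(output_dir: str, experiment_name: str) -> Path:
    """Return <output_dir>/<experiment_name>/ as an absolute path."""
    # resolve() anchors a relative output_dir to the current working directory
    return (Path(output_dir) / experiment_name).resolve()


def read_summary(output_dir: str, experiment_name: str) -> List[Dict[str, str]]:
    """Load results.csv (one row per benchmark) as a list of dicts."""
    csv_path = run_dir(output_dir, experiment_name) / "results.csv"
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


target = run_dir("results", "minicpm5-1b__bfcl-tau__v1")
print(target)
if (target / "results.csv").exists():
    for row in read_summary("results", "minicpm5-1b__bfcl-tau__v1"):
        print(row)
```

Printing the resolved path first is a quick way to confirm which directory a relative output_dir actually points to.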


Per-benchmark tips

BFCL (function calling)

  • Default: downloads from gorilla-llm/Berkeley-Function-Calling-Leaderboard
  • strict: false in YAML enables fuzzy argument matching (better for small models)
  • Local JSONL: set benchmark_overrides.bfcl.data_path
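The difference between strict and fuzzy argument matching can be illustrated with a small comparison function. This is one plausible interpretation, not the scorer used by slm-evals: `args_match` and `_normalize` are hypothetical, and the normalization rules (case, surrounding whitespace, numeric strings) are assumptions. docs/benchmarks.md defines the actual scoring semantics.

```python
def _normalize(value):
    """Loosen a value for fuzzy comparison (assumed rules, for illustration)."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        text = value.strip().lower()
        try:
            return float(text)  # lets "3" compare equal to 3
        except ValueError:
            return text
    return value


def args_match(expected: dict, predicted: dict, strict: bool = True) -> bool:
    """Compare function-call arguments exactly or after normalization."""
    if strict:
        return expected == predicted
    if expected.keys() != predicted.keys():
        return False
    return all(_normalize(expected[k]) == _normalize(predicted[k]) for k in expected)


expected = {"city": "Paris", "days": 3}
predicted = {"city": " paris ", "days": "3"}
print(args_match(expected, predicted, strict=True))   # False
print(args_match(expected, predicted, strict=False))  # True
```

Small models often produce the right call with minor formatting drift, which is why a looser comparison of this kind tends to give a more informative signal for them.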

τ-bench (multi-turn tools)

  • Domains: retail, airline, or both
  • use_llm_user: false (default) uses the free rule-based user simulator
  • use_llm_user: true uses a GPT-4o user agent (incurs API cost)

GAIA

  • Default split: validation (public)
  • tool_mode: describe uses offline tool descriptions (no live web)
  • Level filter: levels: [1, 2] or [1, 2, 3]

SWE-bench Verified

  • Default: lightweight patch-generation scoring (no Docker)
  • full_eval: true runs the official harness (pip install swebench docker)

See docs/benchmarks.md for scoring semantics.


lm-evaluation-harness (slm-lm-eval)

Run standard academic benchmarks (ARC, HellaSwag, PIQA, BoolQ, GSM8K) via EleutherAI lm-evaluation-harness.

Install: uv sync --group lm-eval

Full profile guide: docs/eval_profiles.md

Discover profiles and tasks

# Claim-matched lm-eval profiles (reasoning, code, smoke, …)
uv run --package slm-evals slm-lm-eval --list-profiles

# Also show agentic suites + external benchmark notes
uv run --package slm-evals slm-lm-eval --list-profiles-all

# lm-eval task names
uv run --package slm-evals slm-lm-eval --list-tasks

# Agentic benchmarks (BFCL, τ-bench, GAIA, SWE)
uv run --package slm-evals slm-benchmark --list-benchmarks

Quick start

# By profile name (recommended)
uv run --package slm-evals slm-lm-eval \
  --profile reasoning \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__reasoning-baseline

# Smoke profile (25 samples)
uv run --package slm-evals slm-lm-eval \
  --profile smoke \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__smoke

# LoRA adapter via preset (base + peft resolved automatically)
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_minicpm5.yaml \
  --preset minicpm5-1b-lesson-lora \
  --experiment-name minicpm5-1b-lora__v1

# Explicit base + adapter
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_smoke.yaml \
  --model openbmb/MiniCPM5-1B \
  --adapter ./models/finetuned/minicpm5-1b-lora \
  --experiment-name minicpm5-1b-lora__manual

Compare baseline vs candidate

Use the same config for both runs; only change --preset / --experiment-name:

uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_compare_study.yaml \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__baseline

uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_compare_study.yaml \
  --preset minicpm5-1b-lesson-lora \
  --experiment-name minicpm5-1b-lora__v1 \
  --compare-to results/lm_eval/minicpm5-1b__baseline/results.json
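The delta table produced by --compare-to can be approximated by hand from two results.json files. The sketch below assumes the lm-eval payload keeps per-task metrics under a top-level "results" key with names such as "acc,none". `load_results` and `metric_deltas` are hypothetical helpers and may not match how slm-lm-eval builds comparison.md.

```python
import json
from pathlib import Path
from typing import Dict


def load_results(path: str) -> dict:
    """Read a results.json file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def metric_deltas(baseline: dict, candidate: dict) -> Dict[str, Dict[str, float]]:
    """Return candidate minus baseline for every numeric metric both runs share."""
    deltas: Dict[str, Dict[str, float]] = {}
    base_tasks = baseline.get("results", {})
    for task, cand_metrics in candidate.get("results", {}).items():
        base_metrics = base_tasks.get(task, {})
        for name, cand_value in cand_metrics.items():
            if "stderr" in name:
                continue  # standard errors are not meaningful to subtract
            base_value = base_metrics.get(name)
            # Non-numeric entries such as "alias" are skipped here
            if isinstance(cand_value, (int, float)) and isinstance(base_value, (int, float)):
                deltas.setdefault(task, {})[name] = cand_value - base_value
    return deltas


baseline = {"results": {"arc_easy": {"alias": "arc_easy", "acc,none": 0.5, "acc_stderr,none": 0.01}}}
candidate = {"results": {"arc_easy": {"alias": "arc_easy", "acc,none": 0.75, "acc_stderr,none": 0.01}}}
print(metric_deltas(baseline, candidate))  # {'arc_easy': {'acc,none': 0.25}}
```

A comparison like this is only meaningful when both runs share the same tasks, limit, seed, and few-shot settings, which is why the same config is used for both commands.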

Config templates

Catalog: configs/eval_profiles.yaml maps claim → profile → tasks.

Profile (--profile)  Config file                 Purpose
smoke                lm_eval_smoke.yaml          Fast validation (limit: 25, 2 tasks)
reasoning            lm_eval_reasoning.yaml      Math + commonsense (GSM8K, ARC, HellaSwag)
understanding        lm_eval_understanding.yaml  NLU (BoolQ, PIQA, COPA, RTE)
code                 lm_eval_code.yaml           HumanEval + MBPP
instructions         lm_eval_instructions.yaml   IFEval instruction following
general_slm          lm_eval_minicpm5.yaml       Full ~1B SLM profile (6 tasks)
compare_study        lm_eval_compare_study.yaml  Baseline vs finetune comparison defaults

Keys available in each config file:

Key                Description
tasks              lm-eval task names (e.g. arc_easy, gsm8k)
num_fewshot        Few-shot count (gsm8k may use task default 8)
limit              Max samples per task; null = full split
seed               Random seed (applied to all lm-eval RNGs)
batch_size         auto or integer
device             auto, cpu, cuda, …
dtype              bfloat16, float16, int4, …
trust_remote_code  Required for MiniCPM / Gemma presets
output_dir         Root for runs (default results/lm_eval)

CLI reference

slm-lm-eval [OPTIONS]

--list-profiles         Show claim-matched profiles and example commands
--list-profiles-all     Include agentic suites and external benchmark notes
--list-tasks            List lm-eval task names (catalog fallback if not installed)
--list-tasks-all        Full lm-eval task list
--profile NAME          Shorthand for --config (reasoning, code, smoke, …)
--config PATH           YAML config (tasks, seed, limit, …)
--preset KEY            models.yaml preset (base, LoRA, merged)
--model PATH            HF Hub id or merged checkpoint dir
--adapter PATH          LoRA adapter (alternative to preset adapter_path)
--tasks NAMES           Override task list
--num-fewshot N
--limit N               Cap samples per task
--seed N
--batch-size VALUE
--device MAP
--dtype TYPE
--output-dir DIR        Default: results/lm_eval
--experiment-name TAG   Run folder name
--compare-to PATH       Baseline results.json for delta table

Results

Each run writes to <output_dir>/<experiment_name>/:

File           Contents
results.json   lm-eval native payload + run_meta
summary.md     Task → metric table
run_meta.json  Preset, base model, adapter, tasks, seed
comparison.md  Delta table (when --compare-to set)

PEFT / LoRA

lm-eval expects pretrained=<base>,peft=<adapter>. The preset resolver handles this for keys like minicpm5-1b-lesson-lora. Merged checkpoints use --preset minicpm5-1b-lesson-merged or --model ./models/finetuned/...-merged.
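The pretrained=<base>,peft=<adapter> string is the comma-separated model_args format that lm-evaluation-harness accepts for Hugging Face models. As a rough sketch of what a resolver might emit, the hypothetical `build_model_args` helper below assembles it. The actual preset resolver in slm-evals may add further arguments.

```python
from typing import Optional


def build_model_args(base: str, adapter: Optional[str] = None,
                     trust_remote_code: bool = False) -> str:
    """Build an lm-eval model_args string for a base model and optional adapter.

    Hypothetical helper for illustration; not part of slm-evals.
    """
    parts = [f"pretrained={base}"]
    if adapter:
        parts.append(f"peft={adapter}")  # LoRA weights applied on top of the base
    if trust_remote_code:
        parts.append("trust_remote_code=True")
    return ",".join(parts)


print(build_model_args("openbmb/MiniCPM5-1B", "./models/finetuned/minicpm5-1b-lora"))
# pretrained=openbmb/MiniCPM5-1B,peft=./models/finetuned/minicpm5-1b-lora
```

A merged checkpoint needs no peft entry, so only the pretrained part is emitted when the adapter is omitted.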


Adding a custom benchmark

  1. Create src/slm_evals/benchmarks/my_bench.py subclassing BaseBenchmark:

    • load_dataset() → list of sample dicts
    • build_prompt(sample) → prompt string
    • evaluate_sample(sample, prediction) → {passed, score, note}

  2. Register it in BENCHMARK_REGISTRY in src/slm_evals/run_benchmark.py.

  3. Run:

    uv run --package slm-evals slm-benchmark \
      --model ./my-model --benchmarks my_bench --max-samples 10
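The three methods from step 1 might look like the sketch below. So that the example runs on its own, it defines a stand-in `BaseBenchmark`. The real class in slm_evals may take constructor arguments or require extra hooks, and the toy dataset and `MyBench` name are purely illustrative.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseBenchmark(ABC):
    """Stand-in for the slm_evals base class (its real interface may differ)."""

    @abstractmethod
    def load_dataset(self) -> List[Dict]: ...

    @abstractmethod
    def build_prompt(self, sample: Dict) -> str: ...

    @abstractmethod
    def evaluate_sample(self, sample: Dict, prediction: str) -> Dict: ...


class MyBench(BaseBenchmark):
    """Toy arithmetic benchmark scored by exact match."""

    def load_dataset(self) -> List[Dict]:
        return [
            {"question": "2 + 2", "answer": "4"},
            {"question": "3 * 5", "answer": "15"},
        ]

    def build_prompt(self, sample: Dict) -> str:
        return f"Compute {sample['question']}. Reply with the number only."

    def evaluate_sample(self, sample: Dict, prediction: str) -> Dict:
        # Ignore surrounding whitespace before comparing with the reference
        passed = prediction.strip() == sample["answer"]
        return {"passed": passed, "score": 1.0 if passed else 0.0, "note": ""}


bench = MyBench()
sample = bench.load_dataset()[0]
print(bench.build_prompt(sample))            # Compute 2 + 2. Reply with the number only.
print(bench.evaluate_sample(sample, " 4 "))  # {'passed': True, 'score': 1.0, 'note': ''}
```

In the real package the class would import BaseBenchmark from slm_evals instead of redefining it, and the key used in BENCHMARK_REGISTRY is what --benchmarks my_bench refers to.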
    

Suggested workflows

Smoke (CPU/GPU, ~5 min)

uv run --package slm-evals slm-benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl \
  --max-samples 5 \
  --device cpu

Before / after fine-tune

BASE=openbmb/MiniCPM5-1B
ADAPTER=./models/finetuned/minicpm5-1b-lora

for M in "$BASE" "$ADAPTER"; do
  uv run --package slm-evals slm-benchmark \
    --model "$M" \
    --benchmarks bfcl tau_bench \
    --max-samples 100 \
    --experiment-name "$(basename "$M")__bfcl-tau"
done

Full experiment (YAML)

Edit research/evals/configs/experiment_001.yaml with your model_path and experiment_name, then:

uv run --package slm-evals slm-benchmark \
  --config research/evals/configs/experiment_001.yaml

Troubleshooting

Symptom                      Likely cause                Fix
error: --model is required   No --config and no --model  Pass one of them
CUDA OOM                     Model too large for VRAM    --dtype int4 or --device cpu
HF dataset 401 on GAIA test  Gated split                 Use split: validation
τ-bench hangs / costs        LLM user enabled            Set use_llm_user: false
Empty results/               Wrong cwd                   Run from repo root or use absolute output_dir
Import errors                Evals group not synced      uv sync --group evals
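For the CUDA OOM row, a back-of-the-envelope estimate shows why a smaller dtype helps: weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes about 1e9 parameters for a "1B" model and counts weights only. Activations, the KV cache, and quantization overhead come on top, so treat the numbers as a lower bound.

```python
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float32", "bfloat16", "int4"):
    print(f"{dtype}: {weight_memory_gib(1e9, dtype):.2f} GiB")
# float32: 3.73 GiB
# bfloat16: 1.86 GiB
# int4: 0.47 GiB
```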

Entry points

Path                               Role
slm-benchmark                      Agentic benchmarks (BFCL, τ-bench, GAIA, SWE)
slm-lm-eval                        Academic benchmarks via lm-evaluation-harness
python -m slm_evals.run_benchmark  Same as slm-benchmark
python -m slm_evals.run_lm_eval    Same as slm-lm-eval
research/evals/run_benchmark.py    Thin shim for backward compatibility