Research usage
How to run fine-tuning and agentic benchmarks under research/. All commands assume the repo root as the working directory unless noted.
The Lesson Agent app lives in apps/gradio-space/ — see root USAGE.md. Research code is optional and isolated here.
Prerequisites
- uv and Python 3.12
- GPU recommended for real-model runs (CPU works for smoke tests)
- Hugging Face Hub access for model downloads and some benchmark datasets
Install dependency groups
# All research tooling
uv sync --group finetune --group evals --group lm-eval
# Or one at a time
uv sync --group finetune
uv sync --group evals
uv sync --group lm-eval
| Group | Package / script | What it adds |
|---|---|---|
| finetune | research/finetune.py | peft, datasets, bitsandbytes (QLoRA) |
| evals | slm-evals workspace member | slm-benchmark CLI |
| lm-eval | slm-evals[lm-eval] | slm-lm-eval CLI (GSM8K, ARC, HellaSwag, …) |
| modal | research/modal/finetune_app.py | Cloud GPU train + eval via Modal |
| modal | research/modal/server_app.py | Long-lived warm GPU worker for human/AI iteration loops |
0. Modal cloud GPU (research/modal/)
Run a skill-matrix of QLoRA fine-tunes without local CUDA: each job in
modal/experiments.yaml trains one adapter for a
category (math, science, coding, reasoning, teaching, instructions), evaluates
it against a matching slm-lm-eval profile vs. a per-profile baseline, checks
the result against goals, and — only if the gate passes — publishes the
adapter to the Hugging Face Hub. Adapters + results are saved to Modal Volume
slm-finetune.
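The gate step described above can be pictured with a minimal sketch. It assumes each goal is a minimum gain over the per-profile baseline for one metric; the real schema in experiments.yaml may differ (for example, absolute thresholds), and `gate_passes` and the sample scores are hypothetical.

```python
from typing import Mapping


def gate_passes(
    candidate: Mapping[str, float],
    baseline: Mapping[str, float],
    goals: Mapping[str, float],
) -> bool:
    """Return True only if every goal metric beats the baseline by its target gain."""
    if not goals:
        # Jobs without goals are trained and evaluated but never gated or published.
        return False
    for metric, min_gain in goals.items():
        if metric not in candidate or metric not in baseline:
            return False  # a missing metric cannot satisfy a goal
        if candidate[metric] - baseline[metric] < min_gain:
            return False
    return True


# Made-up scores, for illustration only.
baseline = {"gsm8k": 0.25}
candidate = {"gsm8k": 0.5}
print(gate_passes(candidate, baseline, {"gsm8k": 0.125}))  # gain of 0.25 meets the goal
print(gate_passes(candidate, baseline, {"gsm8k": 0.5}))    # gain of 0.25 falls short
print(gate_passes(candidate, baseline, {}))                # no goals, so no publish
```

Publishing only on a passing gate keeps adapters that regress against the baseline off the Hub.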
uv sync --group modal
modal setup
modal secret create huggingface HF_TOKEN=<token> # needs write access for Hub publish
# Smoke run for one skill: baseline -> train -> eval -> gate -> publish -> pull
modal run research/modal/finetune_app.py --job math-lora --max-steps 20
# Whole skill matrix
modal run research/modal/finetune_app.py
# One category, train+eval only (no Hub push)
modal run research/modal/finetune_app.py --category science --no-publish
# Re-check the gate and publish an already-evaluated job
modal run research/modal/finetune_app.py::publish_only --job math-lora
# Pull adapters + lm-eval results without re-running anything
modal run research/modal/finetune_app.py::pull --category math
Set real values for defaults.hub_org and each job's publish.hub_repo in
experiments.yaml (placeholder: your-hf-username) before publishing — repos
are created automatically. Jobs with no goals (e.g. alpaca-lora) are
trained/evaluated but never gated or published (local-only).
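A quick pre-publish check for leftover placeholders might look like the sketch below. It works on an already-parsed config dict and assumes a top-level `jobs` list whose entries carry a `name` key; only `defaults.hub_org` and `publish.hub_repo` come from the text above, and the rest of the layout is hypothetical.

```python
PLACEHOLDER = "your-hf-username"


def find_placeholders(config: dict) -> list[str]:
    """List config paths that still contain the placeholder Hub username."""
    hits = []
    if PLACEHOLDER in str(config.get("defaults", {}).get("hub_org", "")):
        hits.append("defaults.hub_org")
    # Assumed layout: a list of job dicts, each with an optional "publish" mapping.
    for job in config.get("jobs", []):
        repo = job.get("publish", {}).get("hub_repo", "")
        if PLACEHOLDER in str(repo):
            hits.append(f"jobs[{job.get('name')}].publish.hub_repo")
    return hits


# Hypothetical config, shaped only for this example.
example = {
    "defaults": {"hub_org": "your-hf-username"},
    "jobs": [
        {"name": "math-lora", "publish": {"hub_repo": "your-hf-username/math-lora"}},
        {"name": "alpaca-lora"},  # no publish block: local-only job
    ],
}
print(find_placeholders(example))
```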
For a multi-hour session on one warm GPU (iterative human/AI loop without
re-downloading weights each run), use research/modal/server_app.py instead —
same skill-matrix pipeline (--job/--category/--pipeline/--publish-only)
on a deployed GpuWorker.
Full guide: modal/README.md · Agent loop: modal/SERVER.md · Modal Volumes · Modal Notebooks
Iterative loop (one warm GPU, many runs):
modal deploy research/modal/server_app.py
modal run -d research/modal/server_app.py --hours 6 # keep worker alive
modal run research/modal/server_app.py --ping # verify
modal run research/modal/server_app.py --job lesson-lora --max-steps 20
modal app stop slm-gpu-worker -y # when done
Interactive notebook: upload research/notebook/minicpm5-modal-finetune.ipynb at modal.com/notebooks, attach GPU + Volume slm-finetune + Secret huggingface.
1. Fine-tuning (research/finetune.py)
Single script for full, LoRA, and QLoRA training. Defaults to the lesson-agent chat dataset at research/data/education-lesson-chat.jsonl and writes checkpoints under models/finetuned/.
Model resolution (first match wins)
- --model <hf-id-or-path>
- --preset <key> from root models.yaml
- Env: FINETUNE_MODEL, MODEL_ID, or BASE
- ACTIVE_MODEL preset from .env
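The first-match-wins order can be sketched as follows. This is an illustration, not the logic in research/finetune.py: `resolve_model` is a hypothetical helper, presets are modeled as a plain key-to-model-id mapping (an assumption about models.yaml), and the third environment variable is left out because its name is truncated in this document.

```python
import os
from typing import Mapping, Optional


def resolve_model(
    cli_model: Optional[str],
    cli_preset: Optional[str],
    presets: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the first model source that is set, in priority order."""
    if cli_model:                      # 1. explicit --model
        return cli_model
    if cli_preset:                     # 2. --preset looked up in the preset table
        return presets[cli_preset]
    for var in ("FINETUNE_MODEL", "MODEL_ID"):  # 3. direct env overrides
        if env.get(var):
            return env[var]
    active = env.get("ACTIVE_MODEL")   # 4. fallback preset
    if active:
        return presets[active]
    raise ValueError("no model given via flags or environment")


# Assumed mapping; the real models.yaml entry may hold more fields.
presets = {"minicpm5-1b": "openbmb/MiniCPM5-1B"}
print(resolve_model(None, "minicpm5-1b", presets, {"MODEL_ID": "ignored/model"}))
print(resolve_model(None, None, presets, {"ACTIVE_MODEL": "minicpm5-1b"}))
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.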
Quick start
# LoRA on default lesson chat data, 1 epoch
uv run python research/finetune.py --preset minicpm5-1b --mode lora --epochs 1
# Smoke run (50 steps)
uv run python research/finetune.py --mode lora --max_steps 50
# QLoRA on a Hub instruction dataset
uv run python research/finetune.py \
--model Qwen/Qwen2.5-0.5B-Instruct \
--dataset tatsu-lab/alpaca --format alpaca \
--mode qlora --epochs 1
# Merge LoRA adapter into standalone weights
uv run python research/finetune.py \
--merge ./models/finetuned/minicpm5-1b-lora \
--out ./models/finetuned/minicpm5-1b-merged
Dataset formats (--format)
| Format | Expected columns |
|---|---|
| chat | messages: [{"role": "...", "content": "..."}] |
| alpaca | instruction, optional input, output |
| prompt | prompt / completion (or response) |
| text | text, or a plain .txt file |
Local files: .json, .jsonl, .csv, .txt. Hub ids: any datasets repo id.
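Before a long run, it can help to check that a record carries the columns its format expects. The sketch below encodes only the table above; `missing_columns` is a hypothetical helper, and how research/finetune.py actually validates or renders records is not shown in this document.

```python
def missing_columns(record: dict, fmt: str) -> list[str]:
    """Return the expected columns that a record lacks for the given --format."""
    if fmt == "chat":
        messages = record.get("messages")
        ok = (
            isinstance(messages, list)
            and len(messages) > 0
            and all(isinstance(m, dict) and {"role", "content"} <= m.keys() for m in messages)
        )
        return [] if ok else ["messages"]
    if fmt == "alpaca":
        # "input" is optional, so only instruction and output are required.
        return [k for k in ("instruction", "output") if k not in record]
    if fmt == "prompt":
        missing = [] if "prompt" in record else ["prompt"]
        if "completion" not in record and "response" not in record:
            missing.append("completion")
        return missing
    if fmt == "text":
        return [] if "text" in record else ["text"]
    raise ValueError(f"unknown format: {fmt}")


print(missing_columns({"instruction": "Add 2 and 3.", "output": "5"}, "alpaca"))
print(missing_columns({"prompt": "Define osmosis."}, "prompt"))
```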
Outputs
Training writes to <out>/ (default ./models/finetuned/<preset>-<mode>/):
- Adapter or full weights
- training_results.json: train/eval loss, perplexity, result_score (0–100)
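For orientation, perplexity is conventionally the exponential of the mean per-token cross-entropy loss (in nats). The sketch below assumes the script follows that standard definition; the `eval_loss` key name and the sample value are hypothetical, not the confirmed layout of training_results.json.

```python
import json
import math


def perplexity(mean_loss: float) -> float:
    """Standard definition: exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_loss)


# Hypothetical file contents; the real key names may differ.
results = json.loads('{"eval_loss": 1.0}')
print(round(perplexity(results["eval_loss"]), 3))
```

A loss of 0 corresponds to a perplexity of 1, the best possible value, so lower is better for both numbers.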
Env vars
| Variable | Description |
|---|---|
| FINETUNE_PRESET | Preset key from models.yaml |
| FINETUNE_DATASET | Override dataset path or Hub id |
| FINETUNE_DATASET_CONFIG | Hub config name |
| FINETUNE_DATASET_SPLIT | Hub split (e.g. train[:500]) |
| ACTIVE_MODEL | Fallback preset when --preset omitted |
2. Agentic benchmarks (research/evals/)
Evaluate local Hugging Face checkpoints on BFCL, τ-bench, GAIA, and SWE-bench Verified.
Install: uv sync --group evals
# Smoke test (20 samples, two benchmarks)
uv run --package slm-evals slm-benchmark \
--model openbmb/MiniCPM5-1B \
--benchmarks bfcl tau_bench \
--max-samples 20
# Full config-driven run
uv run --package slm-evals slm-benchmark \
--config research/evals/configs/experiment_001.yaml
Full reference: evals/USAGE.md.
3. Academic benchmarks (slm-lm-eval)
Standard lm-evaluation-harness tasks (ARC, HellaSwag, GSM8K, …) for base presets, LoRA adapters, and merged checkpoints.
Install: uv sync --group lm-eval
Profile guide: evals/docs/eval_profiles.md
# List claim-matched profiles (reasoning, code, understanding, …)
uv run --package slm-evals slm-lm-eval --list-profiles
# Run by profile name
uv run --package slm-evals slm-lm-eval \
--profile reasoning \
--preset minicpm5-1b \
--experiment-name minicpm5-1b__reasoning-baseline
# Smoke (25 samples, arc_easy + hellaswag)
uv run --package slm-evals slm-lm-eval \
--profile smoke \
--preset minicpm5-1b \
--experiment-name minicpm5-1b__smoke
# Full profile
uv run --package slm-evals slm-lm-eval \
--config research/evals/configs/lm_eval_minicpm5.yaml \
--preset minicpm5-1b-lesson-lora \
--experiment-name minicpm5-1b-lora__v1 \
--compare-to results/lm_eval/minicpm5-1b__baseline/results.json
Post-training hook:
uv run python research/finetune.py \
--preset minicpm5-1b --mode lora --max_steps 50 \
--lm-eval-after \
--lm-eval-baseline minicpm5-1b
Full reference: evals/USAGE.md.
Shared data (research/data/)
| File | Used by | Format |
|---|---|---|
| education-lesson-chat.jsonl | finetune.py default | Chat messages for lesson agent |
| benchmark-qa.jsonl | Optional domain QA evals | question, answer, domain |
| benchmark-kb.jsonl | Optional retrieval snippets | KB entries for domain QA |
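A small sanity check for the QA file could look like this. It assumes one JSON object per line with exactly the three fields listed in the table; `check_qa_lines` is a hypothetical helper, and the sample rows are made up.

```python
import json

REQUIRED = ("question", "answer", "domain")


def check_qa_lines(lines) -> list[tuple[int, str]]:
    """Return (line_number, problem) pairs for rows that are not valid QA records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "invalid JSON"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED if key not in row]
        if missing:
            problems.append((number, "missing: " + ", ".join(missing)))
    return problems


sample = [
    '{"question": "What is 2 + 2?", "answer": "4", "domain": "math"}',
    '{"question": "Boiling point of water at sea level?", "answer": "100 C"}',
    "not json",
]
print(check_qa_lines(sample))
```

To check a real file, pass an open file handle, for example `check_qa_lines(open("research/data/benchmark-qa.jsonl", encoding="utf-8"))`.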
Suggested end-to-end pipeline
Baseline lm-eval (academic benchmarks on the base preset, pinned seed):
uv run --package slm-evals slm-lm-eval \
--config research/evals/configs/lm_eval_compare_study.yaml \
--preset minicpm5-1b \
--experiment-name minicpm5-1b__baseline
Baseline agentic eval (optional):
uv run --package slm-evals slm-benchmark \
--model openbmb/MiniCPM5-1B --benchmarks bfcl --max-samples 50
Fine-tune on lesson data:
uv run python research/finetune.py --preset minicpm5-1b --mode lora --epochs 1
Re-eval candidate with the same lm-eval config:
uv run --package slm-evals slm-lm-eval \
--config research/evals/configs/lm_eval_compare_study.yaml \
--preset minicpm5-1b-lesson-lora \
--experiment-name minicpm5-1b-lora__v1 \
--compare-to results/lm_eval/minicpm5-1b__baseline/results.json
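One way to keep the baseline and candidate runs on the same config is to generate the commands from a single place. The sketch below only prints the four commands from this section as a dry run; `lm_eval_cmd` and `pipeline` are hypothetical helpers, and deriving the `--compare-to` path from the experiment name is an inference from the example path, not documented behaviour.

```python
import shlex
from typing import Optional

CONFIG = "research/evals/configs/lm_eval_compare_study.yaml"


def lm_eval_cmd(preset: str, experiment: str, compare_to: Optional[str] = None) -> list[str]:
    """Build one slm-lm-eval invocation; both runs share CONFIG by construction."""
    cmd = [
        "uv", "run", "--package", "slm-evals", "slm-lm-eval",
        "--config", CONFIG,
        "--preset", preset,
        "--experiment-name", experiment,
    ]
    if compare_to:
        cmd += ["--compare-to", compare_to]
    return cmd


def pipeline() -> list[list[str]]:
    baseline = "minicpm5-1b__baseline"
    return [
        lm_eval_cmd("minicpm5-1b", baseline),
        ["uv", "run", "--package", "slm-evals", "slm-benchmark",
         "--model", "openbmb/MiniCPM5-1B", "--benchmarks", "bfcl", "--max-samples", "50"],
        ["uv", "run", "python", "research/finetune.py",
         "--preset", "minicpm5-1b", "--mode", "lora", "--epochs", "1"],
        # Assumed path pattern: results/lm_eval/<experiment-name>/results.json
        lm_eval_cmd("minicpm5-1b-lesson-lora", "minicpm5-1b-lora__v1",
                    compare_to=f"results/lm_eval/{baseline}/results.json"),
    ]


for command in pipeline():
    print(shlex.join(command))  # dry run: print only, nothing is executed
```

To actually run the steps, each list could be handed to `subprocess.run(command, check=True)` from the repo root.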
Verification checklist
- Use the same lm-eval YAML (tasks, num_fewshot, limit, seed) for baseline and candidate runs.
- Compare lm-eval results.json files with --compare-to; do not compare training_results.json result_score to lm-eval accuracy.
- For LoRA checkpoints, prefer --preset minicpm5-1b-lesson-lora (base + adapter) over passing the adapter dir alone to --model.
- Report mean ± std only after multiple training seeds; single-seed deltas are indicative, not conclusive.
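Two of these checks are easy to automate. The sketch below assumes the lm-eval settings are available as plain dicts keyed by the four names in the first bullet, and that you have one score per training seed; `config_mismatches`, `mean_std`, and all sample values are hypothetical.

```python
import statistics

COMPARABLE_KEYS = ("tasks", "num_fewshot", "limit", "seed")


def config_mismatches(baseline_cfg: dict, candidate_cfg: dict) -> list[str]:
    """Return the settings that differ between the baseline and candidate runs."""
    return [k for k in COMPARABLE_KEYS if baseline_cfg.get(k) != candidate_cfg.get(k)]


def mean_std(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across training seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two training seeds to report a spread")
    return statistics.mean(scores), statistics.stdev(scores)


# Made-up settings and scores, for illustration only.
base_cfg = {"tasks": ["arc_easy", "hellaswag"], "num_fewshot": 0, "limit": 25, "seed": 1}
cand_cfg = {"tasks": ["arc_easy", "hellaswag"], "num_fewshot": 5, "limit": 25, "seed": 1}
print(config_mismatches(base_cfg, cand_cfg))

mean, std = mean_std([0.25, 0.5, 0.75])
print(f"{mean:.2f} ± {std:.2f}")
```

Refusing to report a spread from a single seed mirrors the last bullet: one run gives a delta, not an estimate of its variability.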
Troubleshooting
| Symptom | Fix |
|---|---|
| slm-benchmark: command not found | uv sync --group evals |
| slm-lm-eval: command not found | uv sync --group lm-eval |
| CUDA OOM during finetune | Use --mode qlora or reduce batch size in script args |
| BFCL / GAIA download slow | Set max_samples low first; cache HF datasets under ~/.cache/huggingface |
| SWE-bench Docker errors | Keep full_eval: false in YAML unless swebench + Docker are installed |
| τ-bench API costs | Keep use_llm_user: false (rule-based user simulator) |