Spaces:

MSGEncrypted
/

lesson-agent-dev

Sleeping

App Files Files Community

lesson-agent-dev / research /USAGE.md

MSG

Feat/finetuning model (#18)

6cea344 15 days ago

preview code

Raw

History Blame Contribute Delete

10.2 kB

	# Research usage

	How to run fine-tuning and agentic benchmarks under `research/`. All commands assume the repo root as the working directory unless noted.

	The Lesson Agent app lives in `apps/gradio-space/` — see root [USAGE.md](../USAGE.md). Research code is optional and isolated here.

	## Prerequisites

	- [uv](https://docs.astral.sh/uv/) and Python 3.12
	- GPU recommended for real-model runs (CPU works for smoke tests)
	- Hugging Face Hub access for model downloads and some benchmark datasets

	## Install dependency groups

	```bash
	# All research tooling
	uv sync --group finetune --group evals --group lm-eval

	# Or one at a time
	uv sync --group finetune
	uv sync --group evals
	uv sync --group lm-eval
	```

	\| Group \| Package / script \| What it adds \|
	\| ----- \| ---------------- \| ------------ \|
	\| `finetune` \| `research/finetune.py` \| `peft`, `datasets`, `bitsandbytes` (QLoRA) \|
	\| `evals` \| `slm-evals` workspace member \| `slm-benchmark` CLI \|
	\| `lm-eval` \| `slm-evals[lm-eval]` \| `slm-lm-eval` CLI (GSM8K, ARC, HellaSwag, …) \|
	\| `modal` \| `research/modal/finetune_app.py` \| Cloud GPU train + eval via [Modal](https://modal.com/docs/guide) \|
	\| `modal` \| `research/modal/server_app.py` \| Long-lived warm GPU worker for human/AI iteration loops \|

	---

	## 0. Modal cloud GPU (`research/modal/`)

	Run a skill-matrix of QLoRA fine-tunes without local CUDA: each job in
	[`modal/experiments.yaml`](modal/experiments.yaml) trains one adapter for a
	category (math, science, coding, reasoning, teaching, instructions), evaluates
	it against a matching `slm-lm-eval` profile vs. a per-profile baseline, checks
	the result against `goals`, and — only if the gate passes — publishes the
	adapter to the Hugging Face Hub. Adapters + results are saved to Modal Volume
	`slm-finetune`.

	```bash
	uv sync --group modal
	modal setup
	modal secret create huggingface HF_TOKEN=<token> # needs write access for Hub publish

	# Smoke run for one skill: baseline -> train -> eval -> gate -> publish -> pull
	modal run research/modal/finetune_app.py --job math-lora --max-steps 20

	# Whole skill matrix
	modal run research/modal/finetune_app.py

	# One category, train+eval only (no Hub push)
	modal run research/modal/finetune_app.py --category science --no-publish

	# Re-check the gate and publish an already-evaluated job
	modal run research/modal/finetune_app.py::publish_only --job math-lora

	# Pull adapters + lm-eval results without re-running anything
	modal run research/modal/finetune_app.py::pull --category math
	```

	Set real values for `defaults.hub_org` and each job's `publish.hub_repo` in
	`experiments.yaml` (placeholder: `your-hf-username`) before publishing — repos
	are created automatically. Jobs with no `goals` (e.g. `alpaca-lora`) are
	trained/evaluated but never gated or published (local-only).

	For a multi-hour session on one warm GPU (iterative human/AI loop without
	re-downloading weights each run), use `research/modal/server_app.py` instead —
	same skill-matrix pipeline (`--job`/`--category`/`--pipeline`/`--publish-only`)
	on a deployed `GpuWorker`.

	Full guide: [modal/README.md](modal/README.md) · Agent loop: [modal/SERVER.md](modal/SERVER.md) · [Modal Volumes](https://modal.com/docs/guide/volumes) · [Modal Notebooks](https://modal.com/docs/guide/notebooks)

	Iterative loop (one warm GPU, many runs):

	```bash
	modal deploy research/modal/server_app.py
	modal run -d research/modal/server_app.py --hours 6 # keep worker alive
	modal run research/modal/server_app.py --ping # verify
	modal run research/modal/server_app.py --job lesson-lora --max-steps 20
	modal app stop slm-gpu-worker -y # when done
	```

	Interactive notebook: upload [`research/notebook/minicpm5-modal-finetune.ipynb`](notebook/minicpm5-modal-finetune.ipynb) at [modal.com/notebooks](https://modal.com/notebooks), attach GPU + Volume `slm-finetune` + Secret `huggingface`.

	---

	## 1. Fine-tuning (`research/finetune.py`)

	Single script for full, LoRA, and QLoRA training. Defaults to the lesson-agent chat dataset at `research/data/education-lesson-chat.jsonl` and writes checkpoints under `models/finetuned/`.

	### Model resolution (first match wins)

	1. `--model <hf-id-or-path>`
	2. `--preset <key>` from root `models.yaml`
	3. Env: `FINETUNE_MODEL`, `MODEL_ID`, or `BASE`
	4. `ACTIVE_MODEL` preset from `.env`

	### Quick start

	```bash
	# LoRA on default lesson chat data, 1 epoch
	uv run python research/finetune.py --preset minicpm5-1b --mode lora --epochs 1

	# Smoke run (50 steps)
	uv run python research/finetune.py --mode lora --max_steps 50

	# QLoRA on a Hub instruction dataset
	uv run python research/finetune.py \
	--model Qwen/Qwen2.5-0.5B-Instruct \
	--dataset tatsu-lab/alpaca --format alpaca \
	--mode qlora --epochs 1

	# Merge LoRA adapter into standalone weights
	uv run python research/finetune.py \
	--merge ./models/finetuned/minicpm5-1b-lora \
	--out ./models/finetuned/minicpm5-1b-merged
	```

	### Dataset formats (`--format`)

	\| Format \| Expected columns \|
	\| ------ \| ---------------- \|
	\| `chat` \| `messages`: `[{"role": "...", "content": "..."}]` \|
	\| `alpaca` \| `instruction`, optional `input`, `output` \|
	\| `prompt` \| `prompt` / `completion` (or `response`) \|
	\| `text` \| `text`, or a plain `.txt` file \|

	Local files: `.json`, `.jsonl`, `.csv`, `.txt`. Hub ids: any `datasets` repo id.

	### Outputs

	Training writes to `<out>/` (default `./models/finetuned/<preset>-<mode>/`):

	- Adapter or full weights
	- `training_results.json` — train/eval loss, perplexity, `result_score` (0–100)

	### Env vars

	\| Variable \| Description \|
	\| -------- \| ----------- \|
	\| `FINETUNE_PRESET` \| Preset key from `models.yaml` \|
	\| `FINETUNE_DATASET` \| Override dataset path or Hub id \|
	\| `FINETUNE_DATASET_CONFIG` \| Hub config name \|
	\| `FINETUNE_DATASET_SPLIT` \| Hub split (e.g. `train[:500]`) \|
	\| `ACTIVE_MODEL` \| Fallback preset when `--preset` omitted \|

	---

	## 2. Agentic benchmarks (`research/evals/`)

	Evaluate local HuggingFace checkpoints on BFCL, τ-bench, GAIA, and SWE-bench Verified.

	Install: `uv sync --group evals`

	```bash
	# Smoke test (20 samples, two benchmarks)
	uv run --package slm-evals slm-benchmark \
	--model openbmb/MiniCPM5-1B \
	--benchmarks bfcl tau_bench \
	--max-samples 20

	# Full config-driven run
	uv run --package slm-evals slm-benchmark \
	--config research/evals/configs/experiment_001.yaml
	```

	Full reference: [evals/USAGE.md](evals/USAGE.md).

	---

	## 3. Academic benchmarks (`slm-lm-eval`)

	Standard lm-evaluation-harness tasks (ARC, HellaSwag, GSM8K, …) for base presets, LoRA adapters, and merged checkpoints.

	Install: `uv sync --group lm-eval`

	Profile guide: [evals/docs/eval_profiles.md](evals/docs/eval_profiles.md)

	```bash
	# List claim-matched profiles (reasoning, code, understanding, …)
	uv run --package slm-evals slm-lm-eval --list-profiles

	# Run by profile name
	uv run --package slm-evals slm-lm-eval \
	--profile reasoning \
	--preset minicpm5-1b \
	--experiment-name minicpm5-1b__reasoning-baseline

	# Smoke (25 samples, arc_easy + hellaswag)
	uv run --package slm-evals slm-lm-eval \
	--profile smoke \
	--preset minicpm5-1b \
	--experiment-name minicpm5-1b__smoke

	# Full profile
	uv run --package slm-evals slm-lm-eval \
	--config research/evals/configs/lm_eval_minicpm5.yaml \
	--preset minicpm5-1b-lesson-lora \
	--experiment-name minicpm5-1b-lora__v1 \
	--compare-to results/lm_eval/minicpm5-1b__baseline/results.json
	```

	Post-training hook:

	```bash
	uv run python research/finetune.py \
	--preset minicpm5-1b --mode lora --max_steps 50 \
	--lm-eval-after \
	--lm-eval-baseline minicpm5-1b
	```

	Full reference: [evals/USAGE.md](evals/USAGE.md#lm-evaluation-harness-slm-lm-eval).

	---

	## Shared data (`research/data/`)

	\| File \| Used by \| Format \|
	\| ---- \| ------- \| ------ \|
	\| `education-lesson-chat.jsonl` \| `finetune.py` default \| Chat messages for lesson agent \|
	\| `benchmark-qa.jsonl` \| Optional domain QA evals \| `question`, `answer`, `domain` \|
	\| `benchmark-kb.jsonl` \| Optional retrieval snippets \| KB entries for domain QA \|

	---

	## Suggested end-to-end pipeline

	1. Baseline lm-eval — academic benchmarks on the base preset (pinned seed):
	```bash
	uv run --package slm-evals slm-lm-eval \
	--config research/evals/configs/lm_eval_compare_study.yaml \
	--preset minicpm5-1b \
	--experiment-name minicpm5-1b__baseline
	```

	2. Baseline agentic eval (optional):
	```bash
	uv run --package slm-evals slm-benchmark \
	--model openbmb/MiniCPM5-1B --benchmarks bfcl --max-samples 50
	```

	3. Fine-tune on lesson data:
	```bash
	uv run python research/finetune.py --preset minicpm5-1b --mode lora --epochs 1
	```

	4. Re-eval candidate with the same lm-eval config:
	```bash
	uv run --package slm-evals slm-lm-eval \
	--config research/evals/configs/lm_eval_compare_study.yaml \
	--preset minicpm5-1b-lesson-lora \
	--experiment-name minicpm5-1b-lora__v1 \
	--compare-to results/lm_eval/minicpm5-1b__baseline/results.json
	```

	### Verification checklist

	- Use the same lm-eval YAML (`tasks`, `num_fewshot`, `limit`, `seed`) for baseline and candidate runs.
	- Compare lm-eval `results.json` files with `--compare-to`; do not compare `training_results.json` `result_score` to lm-eval accuracy.
	- For LoRA checkpoints, prefer `--preset minicpm5-1b-lesson-lora` (base + adapter) over passing the adapter dir alone to `--model`.
	- Report mean ± std only after multiple training seeds; single-seed deltas are indicative, not conclusive.

	---

	## Troubleshooting

	\| Symptom \| Fix \|
	\| ------- \| --- \|
	\| `slm-benchmark: command not found` \| `uv sync --group evals` \|
	\| `slm-lm-eval: command not found` \| `uv sync --group lm-eval` \|
	\| CUDA OOM during finetune \| Use `--mode qlora` or reduce batch size in script args \|
	\| BFCL / GAIA download slow \| Set `max_samples` low first; cache HF datasets under `~/.cache/huggingface` \|
	\| SWE-bench Docker errors \| Keep `full_eval: false` in YAML unless `swebench` + Docker are installed \|
	\| τ-bench API costs \| Keep `use_llm_user: false` (rule-based user simulator) \|