	# Modal finetune + benchmark

	GPU fine-tuning + benchmarking + Hub publishing on [Modal](https://modal.com/docs/guide) for `openbmb/MiniCPM5-1B`, wrapping the existing [`research/finetune.py`](../finetune.py) and `slm-lm-eval`.

	Use this when you have no local CUDA but want a hackathon-quality
	train → eval → gate → publish loop for a whole skill matrix of QLoRA
	adapters (math, science, coding, reasoning, teaching, instructions).

	\| Track \| What you ship \|
	\| ----- \| ------------- \|
	\| Modal \| `modal run` skill-matrix pipeline, Volume artifacts, optional Modal Notebook \|
	\| Well-Tuned \| Per-skill before/after `lm-eval` + gated Hub publish for each LoRA \|

	---

	## Layout

	```text
	research/modal/
	├── _common.py # Shared image, volumes, command builders, gate + publish helpers
	├── finetune_app.py # One-shot batch pipeline (slm-finetune-benchmark): main, publish_only, pull
	├── server_app.py # Long-lived GPU worker (slm-gpu-worker): GpuWorker.run_pipeline
	├── experiments.yaml # Skill matrix: jobs, eval_profile, goals, publish
	├── README.md # Full Modal docs (this file)
	└── SERVER.md # Human + AI agent loop runbook (quick reference)
	```

	Interactive path: [`research/notebook/minicpm5-modal-finetune.ipynb`](../notebook/minicpm5-modal-finetune.ipynb) (Modal GPU Notebook).

	### Which app to use

	\| App \| CLI \| Best for \|
	\| --- \| --- \| --- \|
	\| `finetune_app.py` \| `modal run research/modal/finetune_app.py` \| Full sweep, CI-style batch, parallel jobs \|
	\| `server_app.py` \| `modal deploy` + `modal run research/modal/server_app.py` \| Multi-hour session, iterative human/AI loops on one warm GPU \|

	Both apps share [`_common.py`](_common.py): same image, `hf-cache` / `slm-finetune` volumes, and wrappers around [`research/finetune.py`](../finetune.py) + `slm-lm-eval`.

	---

	## One-time setup

	```bash
	# Modal CLI + auth
	pip install modal
	modal setup

	# HF token (downloads + Hub upload). Same token as huggingface-cli login.
	modal secret create huggingface HF_TOKEN=<your-hf-token>

	# Optional: validate deps before first image build
	uv sync --group finetune --group lm-eval --package slm-evals
	uv sync --group modal # local orchestration only
	```

	`HF_TOKEN` must be a [write token](https://huggingface.co/settings/tokens) if you plan to push adapters to the Hub.

	---

	## Run training + benchmarks

	Run all commands from the repo root. `finetune_app.py` runs the full **skill-matrix
	pipeline**: per-profile base-model baseline lm-eval (no adapter) → finetune each job's QLoRA adapter →
	post-train lm-eval vs. that baseline → check `goals` (gate) → publish to the
	Hugging Face Hub if the gate passes → pull adapter + results to your laptop.

	```bash
	# Full sweep: every job in experiments.yaml
	modal run research/modal/finetune_app.py

	# One skill (cheap smoke run)
	modal run research/modal/finetune_app.py --job math-lora --max-steps 20

	# One category (e.g. all "science" jobs)
	modal run research/modal/finetune_app.py --category science

	# Re-run lm-eval (+ gate + publish) only — adapter already on Volume
	modal run research/modal/finetune_app.py --eval-only --job math-lora

	# Train + eval but skip the Hub push and the local download
	modal run research/modal/finetune_app.py --no-publish --no-pull

	# Train/eval jobs in parallel (one GPU per job — higher cost)
	modal run research/modal/finetune_app.py --parallel

	# Re-run just the gate + Hub publish for an already-evaluated job
	modal run research/modal/finetune_app.py::publish_only --job math-lora

	# Pull adapters + lm-eval results for a category without re-running anything
	modal run research/modal/finetune_app.py::pull --category math
	```

	Jobs live in [`experiments.yaml`](experiments.yaml) — a skill matrix, one
	QLoRA adapter per category, each evaluated against the matching
	`eval_profile` from [`research/evals/configs/eval_profiles.yaml`](../evals/configs/eval_profiles.yaml):

	\| Job \| Category \| Dataset (format) \| Eval profile \| `goals` task \| Publish \|
	\| --- \| -------- \| ----------------- \| ------------ \| ------------- \| ------- \|
	\| `teaching-lora` \| teaching \| `research/data/education-lesson-chat.jsonl` (`chat`) \| `instructions` \| `ifeval` \| ✅ \|
	\| `science-lora` \| science \| `research/data/science-tutor-chat.jsonl` (`chat`) \| `science` \| `sciq` (+ `arc_challenge` guard) \| ✅ \|
	\| `math-lora` \| math \| `TIGER-Lab/MathInstruct` (`alpaca`) \| `math` \| `gsm8k` (+ `arc_challenge` guard) \| ✅ \|
	\| `coding-lora` \| coding \| `iamtarun/python_code_instructions_18k_alpaca` (`alpaca`) \| `code` \| `mbpp` \| ✅ \|
	\| `reasoning-lora` \| reasoning \| `HuggingFaceTB/smoltalk` (`chat`) \| `reasoning` \| `gsm8k` (+ `hellaswag` guard) \| ✅ \|
	\| `language-lesson-lora` \| language \| `language-lesson-fr/ar.jsonl` (`chat`) \| `multilingual` \| `xnli` (+ `hellaswag` guard) \| ✅ \|
	\| `french-lora` \| french \| `FrancophonIA/english_french` (`prompt`) + FR chat \| `french` \| `french_bench_xnli` (+ `hellaswag` guard) \| ✅ \|
	\| `alpaca-lora` \| instructions \| `tatsu-lab/alpaca` (`alpaca`) \| `instructions` \| — (no `goals`) \| local-only \|

	Before publishing, replace `defaults.hub_org` and each job's `publish.hub_repo`
	in `experiments.yaml` with your Hugging Face username/org (defaults to the
	placeholder `your-hf-username`).

	Edit `defaults.max_steps`, per-job `gpu`, or per-job `max_samples` /
	`dataset_split` in `experiments.yaml` to balance cost vs quality. See
	[Benchmark gate & Hugging Face Hub publish](#benchmark-gate--hugging-face-hub-publish)
	for the `goals`/`publish` schema.

	### CLI flags (`finetune_app.py`)

	`main` (default entrypoint — full pipeline):

	\| Flag \| Default \| Meaning \|
	\| ---- \| ------- \| ------- \|
	\| `--train` / `--no-train` \| train on \| Run finetune jobs \|
	\| `--eval-only` \| off \| Skip train; eval existing Volume checkpoints (still runs missing base-model baselines) \|
	\| `--parallel` \| off \| `finetune_one.spawn()` per job instead of sequential \|
	\| `--job` \| all jobs \| Run one job name from `experiments.yaml` \|
	\| `--category` \| all categories \| Run all jobs with this `category` \|
	\| `--max-steps` \| from YAML \| Override training steps \|
	\| `--publish` / `--no-publish` \| publish on \| Push to `publish.hub_repo` if the gate passes \|
	\| `--pull` / `--no-pull` \| pull on \| `modal volume get` the adapter + lm-eval results after each job \|
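
As a rough illustration of how `--job` and `--category` narrow the matrix, the sketch below filters a list of job dicts. The `select_jobs` helper and the in-memory `JOBS` list are hypothetical stand-ins for whatever `finetune_app.py` does after loading `experiments.yaml`; only the job names and categories are taken from the job table above.

```python
from typing import Optional

# Hypothetical in-memory copy of a few experiments.yaml entries. Names and
# categories come from the job table; every other field is omitted.
JOBS = [
    {"name": "teaching-lora", "category": "teaching"},
    {"name": "science-lora", "category": "science"},
    {"name": "math-lora", "category": "math"},
    {"name": "alpaca-lora", "category": "instructions"},
]


def select_jobs(jobs: list, job: Optional[str] = None,
                category: Optional[str] = None) -> list:
    """Return jobs matching --job and/or --category (all jobs if neither is set)."""
    selected = [
        j for j in jobs
        if (job is None or j["name"] == job)
        and (category is None or j["category"] == category)
    ]
    if not selected:
        # Failing loudly here is a design choice of this sketch.
        raise ValueError(f"no job matches job={job!r} category={category!r}")
    return selected


print([j["name"] for j in select_jobs(JOBS, job="math-lora")])
print([j["name"] for j in select_jobs(JOBS, category="science")])
print(len(select_jobs(JOBS)), "jobs in the full sweep")
```

Raising on an empty selection is only one option; the real entrypoint may instead log a warning and exit.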

	`publish_only` (separate entrypoint — `::publish_only`):

	\| Flag \| Default \| Meaning \|
	\| ---- \| ------- \| ------- \|
	\| `--job` \| required \| Re-check the gate against existing results and publish if it passes \|

	`pull` (separate entrypoint — `::pull`):

	\| Flag \| Default \| Meaning \|
	\| ---- \| ------- \| ------- \|
	\| `--job` \| — \| Pull one job's adapter + results \|
	\| `--category` \| — \| Pull all jobs in a category \|
	\| `--dest` \| `models/finetuned` \| Local destination directory \|

	---

	## GPU worker (`server_app.py`) — human + AI agent loops

	Use this when you want one warm A10G container for several hours and many train/eval commands without reinstalling deps or re-downloading HF weights each time.

	Quick runbook: see [`SERVER.md`](SERVER.md) (copy-paste commands for humans and coding agents).

	### Deploy once

	```bash
	modal deploy research/modal/server_app.py
	```

	App name: `slm-gpu-worker`. Dashboard: `modal app list` or the URL printed after deploy.

	`GpuWorker` keeps `min_containers=1` while deployed, mounts `hf-cache` + `slm-finetune`, and reuses the same container for sequential `.remote()` calls when possible.

	### Two-terminal loop (recommended)

	Terminal 1 — keep worker alive (default 4h; blocks unless detached):

	```bash
	modal run research/modal/server_app.py
	# or free your terminal:
	modal run -d research/modal/server_app.py --hours 6
	```

	Terminal 2 — run experiments on the warm GPU (repeat as often as you like):

	```bash
	# Full skill-matrix pipeline for one job on the warm container:
	# per-profile baseline → train → eval → gate → publish → pull
	modal run research/modal/server_app.py --job math-lora --max-steps 20

	# All jobs in a category
	modal run research/modal/server_app.py --category science

	# Whole matrix, but skip the Hub push
	modal run research/modal/server_app.py --pipeline --no-publish

	# Re-eval (+ gate + publish) an existing adapter on Volume
	modal run research/modal/server_app.py --eval-only --job math-lora

	# Re-check the gate and publish using already-computed results
	modal run research/modal/server_app.py --publish-only --job math-lora

	# Arbitrary command in /repo (same env as finetune.py)
	modal run research/modal/server_app.py --cmd "uv run python research/finetune.py --help"

	# Health check
	modal run research/modal/server_app.py --ping
	```

	Task flags (`--job`, `--category`, `--cmd`, `--pipeline`, `--eval-only`, `--publish-only`, `--ping`) automatically disable the default keep-alive mode.
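
The rule above can be sketched as a small pure function. `resolve_serve` is a hypothetical name, and the parameter list simply mirrors the task flags listed in this README; the actual entrypoint in `server_app.py` may be structured differently.

```python
from typing import Optional


def resolve_serve(serve: bool = True, *, job: Optional[str] = None,
                  category: Optional[str] = None, cmd: Optional[str] = None,
                  pipeline: bool = False, eval_only: bool = False,
                  publish_only: bool = False, ping: bool = False) -> bool:
    """Return True only when keep-alive is requested and no task flag is set."""
    task_requested = any(
        [job, category, cmd, pipeline, eval_only, publish_only, ping]
    )
    return serve and not task_requested


print(resolve_serve())                 # bare `modal run`: keep-alive mode
print(resolve_serve(job="math-lora"))  # task flag set: run the task instead
print(resolve_serve(ping=True))        # health check: no keep-alive
```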

	### CLI flags (`server_app.py`)

	\| Flag \| Default \| Meaning \|
	\| ---- \| ------- \| ------- \|
	\| (none) \| `serve=True` \| Keep `GpuWorker` alive (`keep_alive`) \|
	\| `--hours` \| `4` \| Keep-alive duration \|
	\| `--no-serve` \| — \| Skip keep-alive (auto when any task flag is set) \|
	\| `--job` \| — \| Run the skill-matrix pipeline for one job \|
	\| `--category` \| — \| Run the skill-matrix pipeline for all jobs in a category \|
	\| `--pipeline` \| off \| Run the skill-matrix pipeline for all jobs \|
	\| `--max-steps` \| from YAML \| Override training steps \|
	\| `--eval-only` \| off \| Pipeline eval/gate/publish only (skip train; still runs missing base-model baselines) \|
	\| `--publish` / `--no-publish` \| publish on \| Push to `publish.hub_repo` if the gate passes \|
	\| `--publish-only` \| off \| Re-check the gate against existing results and publish (requires `--job`) \|
	\| `--pull` / `--no-pull` \| pull on \| `modal volume get` adapter + results after the pipeline \|
	\| `--cmd` \| — \| Shell command (parsed with `shlex`) \|
	\| `--ping` \| off \| Return worker status JSON \|

	### `GpuWorker` methods (for notebooks / Python callers)

	After `modal deploy`, call from Python:

	```python
	import modal

	Worker = modal.Cls.from_name("slm-gpu-worker", "GpuWorker")
	w = Worker()

	w.ping.remote()
	w.finetune.remote({"name": "math-lora", "dataset": "...", "format": "alpaca", "max_steps": 20})
	w.lm_eval.remote(experiment_name="math-lora__math", config="research/evals/configs/lm_eval_math.yaml", adapter_path="/vol/finetuned/math-lora")
	w.exec_cmd.remote(["uv", "run", "python", "research/finetune.py", "--help"])
	w.run_pipeline.remote(job_names=["math-lora"], max_steps=20)

	# Gate + publish (only pushes to the Hub if gate_result["passed"])
	gate = w.check_gate.remote(
	    candidate_results_path="/vol/finetuned/results/lm_eval/math-lora__math/results.json",
	    baseline_results_path="/vol/finetuned/results/lm_eval/minicpm5-1b__baseline__math/results.json",
	    goals={"task": "gsm8k", "min_score": 0.05, "min_improve": 0.02},
	)
	w.publish_adapter.remote(job=..., adapter_dir="/vol/finetuned/math-lora", gate_result=gate, ...)
	```

	Inside the class, `run_pipeline` chains `lm_eval` (baselines) → `finetune` → `lm_eval` (candidate) → `check_gate` → `publish_adapter` via `.local()`, so everything runs in the same container without extra cold starts.

	### Persistence (what survives between commands)

	\| Layer \| Survives \| Notes \|
	\| ----- \| -------- \| ----- \|
	\| Image (`uv sync` baked in) \| Across all runs \| Rebuilds only when image definition changes \|
	\| `hf-cache` Volume \| Across runs \| Base weights + datasets; committed after each job \|
	\| `slm-finetune` Volume \| Across runs \| Adapters + lm-eval results \|
	\| Warm container \| While deployed + idle < `scaledown_window` \| `min_containers=1`; max idle grace 3600s (Modal limit) \|
	\| `keep_alive` loop \| Up to `--hours` \| Container stays active; no scale-down during loop \|

	### Stop / logs

	```bash
	modal app logs slm-gpu-worker -f # stream logs
	modal app stop slm-gpu-worker # stop deployed app + warm pool
	modal app stop slm-gpu-worker -y # no confirmation prompt
	```

	Refs: [`modal app`](https://modal.com/docs/reference/cli/app) · [`modal run`](https://modal.com/docs/reference/cli/run) · [`modal shell`](https://modal.com/docs/reference/cli/shell)

	### Agent loop pattern

	For an AI agent iterating on finetune hyperparameters or eval configs:

	1. Ensure worker is up: `modal run research/modal/server_app.py --ping` → `{"status": "ok"}`.
	2. If ping fails, human or agent runs `modal deploy research/modal/server_app.py` then `modal run -d research/modal/server_app.py --hours 6`.
	3. Agent runs smoke train+eval+gate (no publish yet): `--job math-lora --max-steps 5 --no-publish`.
	4. Agent re-evals without retraining: `--eval-only --job math-lora`.
	5. Agent reads results: `modal volume get slm-finetune results/lm_eval/math-lora__math ./results/lm_eval/math-lora__math` or `modal volume ls slm-finetune`.
	6. Agent adjusts `experiments.yaml`'s `goals`/`max_steps`/`max_samples`, repeats from step 3.
	7. Once the gate passes and `hub_org`/`hub_repo` are real: `--publish-only --job math-lora`, or just drop `--no-publish`.
	8. When done: `modal app stop slm-gpu-worker` (optional, stops GPU billing from warm pool).

	See [`SERVER.md`](SERVER.md) for a structured checklist and error recovery table.
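
For an agent that drives the loop from Python, steps 1, 3, 4 and 7 can be expressed as argv lists. This is a minimal sketch: `worker_cmd` and `LOOP` are hypothetical names, and only flags documented in this README are used. The commands are printed rather than executed.

```python
import shlex

SERVER_APP = "research/modal/server_app.py"


def worker_cmd(*flags: str) -> list:
    """Build the argv for one `modal run` call against the warm worker."""
    return ["modal", "run", SERVER_APP, *flags]


# Steps 1, 3, 4 and 7 of the agent loop, in order.
LOOP = [
    worker_cmd("--ping"),
    worker_cmd("--job", "math-lora", "--max-steps", "5", "--no-publish"),
    worker_cmd("--eval-only", "--job", "math-lora"),
    worker_cmd("--publish-only", "--job", "math-lora"),
]

for argv in LOOP:
    # An agent would call subprocess.run(argv, check=True) here instead.
    print(shlex.join(argv))
```

Keeping each command as a list avoids shell-quoting problems if a job name ever contains unusual characters.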

	---

	## What gets saved on Modal

	Modal persists artifacts on [Volumes](https://modal.com/docs/guide/volumes) — a distributed filesystem optimized for write-once, read-many workloads like model checkpoints. Files written only to the container disk (outside the mount path) are not saved.

	\| Volume \| Mount in container \| Contents \|
	\| ------ \| ------------------ \| -------- \|
	\| `slm-finetune` \| `/vol/finetuned` \| LoRA adapters, `training_results.json`, lm-eval `results/` \|
	\| `hf-cache` \| `/root/.cache/huggingface` \| Cached base weights + datasets \|

	Volumes are created lazily on first run (`create_if_missing=True` in [`finetune_app.py`](finetune_app.py)).

	### Commits and visibility

	Per the [Volumes guide](https://modal.com/docs/guide/volumes):

	- `volume.commit()` — persist writes so other containers and `modal volume get` can see them. Our workers call this after each train/eval job.
	- Background commits — Modal also snapshots attached Volumes every few seconds and on container shutdown, but explicit `commit()` is safest before download.
	- `volume.reload()` — needed only if the same container must see writes from another container without restarting. Each `finetune_one.remote()` / `run_lm_eval.remote()` starts fresh and mounts the latest committed state.

	Training writes under `/vol/finetuned/...` (the mount), not `/repo/models/...`. That matches Modal’s [model checkpointing](https://modal.com/docs/guide/volumes#model-checkpointing) pattern: point `finetune.py --out` at the Volume path.

	### Per-job adapter layout

	Each finetune job writes to a Volume path named after the job (e.g. `math-lora/`).
	lm-eval results live under `results/lm_eval/`, named
	`<job_name>__<eval_profile>` for candidates and `<preset>__baseline__<eval_profile>`
	for the shared per-profile baselines:

	```text
	slm-finetune (Volume)
	├── math-lora/
	│   ├── adapter_config.json
	│   ├── adapter_model.safetensors   # or adapter_model.bin
	│   ├── tokenizer files…
	│   ├── training_results.json
	│   └── README.md                   # model card, written by publish_adapter
	├── science-lora/
	├── coding-lora/
	├── reasoning-lora/
	├── teaching-lora/
	├── alpaca-lora/
	└── results/lm_eval/
	    ├── minicpm5-1b__baseline__math/          # shared by all "math" profile jobs
	    ├── minicpm5-1b__baseline__science/
	    ├── minicpm5-1b__baseline__instructions/
	    ├── math-lora__math/
	    ├── science-lora__science/
	    └── ...
	```

	Because `eval_profile` is shared across jobs (e.g. `teaching-lora` and
	`alpaca-lora` both use `instructions`), the `instructions` baseline is computed
	once per pipeline run and reused for both jobs' gates.
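
The naming scheme and the baseline sharing can be illustrated with a short sketch. The helper names (`candidate_run`, `baseline_run`, `baselines_needed`) are hypothetical; the directory patterns and the `minicpm5-1b` preset are the ones shown in the layout above.

```python
def candidate_run(job_name: str, eval_profile: str) -> str:
    """Directory name for a candidate run: <job_name>__<eval_profile>."""
    return f"{job_name}__{eval_profile}"


def baseline_run(preset: str, eval_profile: str) -> str:
    """Directory name for a shared baseline: <preset>__baseline__<eval_profile>."""
    return f"{preset}__baseline__{eval_profile}"


def baselines_needed(jobs: list, preset: str = "minicpm5-1b") -> list:
    """One baseline per distinct eval_profile, in first-seen order."""
    profiles = []
    for job in jobs:
        if job["eval_profile"] not in profiles:
            profiles.append(job["eval_profile"])
    return [baseline_run(preset, p) for p in profiles]


jobs = [
    {"name": "teaching-lora", "eval_profile": "instructions"},
    {"name": "alpaca-lora", "eval_profile": "instructions"},
    {"name": "math-lora", "eval_profile": "math"},
]

print(baselines_needed(jobs))  # two baselines cover three jobs
print([candidate_run(j["name"], j["eval_profile"]) for j in jobs])
```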

	---

	## Volume CLI (browse, download, upload)

	Official reference: [Modal Volumes guide](https://modal.com/docs/guide/volumes) · [CLI reference](https://modal.com/docs/reference/cli/volume)

	### Create or list volumes

	```bash
	modal volume list
	modal volume create slm-finetune # optional; app creates on first run
	modal volume ls slm-finetune
	modal volume ls slm-finetune lesson-lora
	```

	### Browse in a shell

	Volumes are mounted under `/mnt` in an interactive shell:

	```bash
	modal shell --volume slm-finetune
	# inside shell:
	ls /mnt/slm-finetune
	ls /mnt/slm-finetune/lesson-lora
	du -sh /mnt/slm-finetune/lesson-lora
	```

	Use `du` for size — Volumes do not report accurate `df` / `disk_usage()` values ([docs](https://modal.com/docs/guide/volumes#disk-usage-reporting)).

	### Download LoRA to your machine

	Use the CLI for adapter weights. The Modal web UI only supports downloads up to 16 MB per file; `adapter_model.safetensors` is usually larger ([docs](https://modal.com/docs/guide/volumes#downloading-a-file-from-a-volume)).

	```bash
	mkdir -p ./models/finetuned

	# One job folder → local path expected by models.yaml
	modal volume get slm-finetune lesson-lora ./models/finetuned/minicpm5-1b-lora

	# lm-eval artifacts
	mkdir -p ./results
	modal volume get slm-finetune results/lm_eval ./results/lm_eval

	# Entire volume (large)
	modal volume get slm-finetune / ./modal-artifacts
	```

	Job folders use the job name from `experiments.yaml` (`lesson-lora`), not `minicpm5-1b-lora`. Root [`models.yaml`](../../models.yaml) preset `minicpm5-1b-lesson-lora` expects `./models/finetuned/minicpm5-1b-lora`.

	If you downloaded to a different folder name:

	```bash
	modal volume get slm-finetune lesson-lora ./models/finetuned/lesson-lora
	cp -r ./models/finetuned/lesson-lora ./models/finetuned/minicpm5-1b-lora
	```

	### Upload to a Volume from local

	Push a local adapter or merged checkpoint back to Modal ([`modal volume put`](https://modal.com/docs/reference/cli/volume)):

	```bash
	modal volume put slm-finetune ./models/finetuned/minicpm5-1b-lora lesson-lora
	```

	Or from Python ([`batch_upload`](https://modal.com/docs/guide/volumes#using-a-volume-from-local-code)):

	```python
	import modal

	vol = modal.Volume.from_name("slm-finetune")
	with vol.batch_upload() as batch:
	    batch.put_directory(
	        "./models/finetuned/minicpm5-1b-lora",
	        "/lesson-lora",
	    )
	```

	### Copy within a Volume

	```bash
	modal volume cp slm-finetune lesson-lora lesson-lora-backup
	```

	### Parallel training note

	With `--parallel`, multiple jobs write to different folders on the same Volume. On Volumes v1, avoid more than ~5 concurrent writers/commits ([docs](https://modal.com/docs/guide/volumes#volume-commits-and-reloads)). Prefer sequential runs unless you use Volumes v2 (`modal volume create --version=2`).
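
If parallel runs are still wanted on Volumes v1, one option is to cap the number of in-flight jobs on the client side. The sketch below is an assumption about how that could be done, not what `--parallel` does today: `run_bounded` is a hypothetical helper, and `fake_job` stands in for a blocking remote call such as a per-job train function.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WRITERS = 5  # Volumes v1 guidance quoted above


def run_bounded(job_names, run_one, max_writers=MAX_WRITERS):
    """Run run_one(name) for every job with at most max_writers in flight."""
    with ThreadPoolExecutor(max_workers=max_writers) as pool:
        return list(pool.map(run_one, job_names))


# Stand-in for a remote job: it records peak concurrency instead of training.
state = {"active": 0, "peak": 0}
lock = threading.Lock()


def fake_job(name):
    with lock:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.01)  # pretend to train and commit
    with lock:
        state["active"] -= 1
    return f"{name}: done"


results = run_bounded([f"job-{i}" for i in range(12)], fake_job)
print(len(results), "jobs finished; peak concurrency", state["peak"])
```

`pool.map` returns results in input order, so the summary stays aligned with the job list even though jobs finish out of order.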

	---

	## Use downloaded weights locally

	```bash
	# Gradio / inference preset
	export ACTIVE_MODEL=minicpm5-1b-lesson-lora

	uv run --package gradio-space python -m gradio_space.app

	# lm-eval on downloaded adapter
	uv run --package slm-evals slm-lm-eval \
	--config research/evals/configs/lm_eval_smoke.yaml \
	--preset minicpm5-1b-lesson-lora \
	--experiment-name minicpm5-1b-lora__local-check
	```

	### Optional: merge LoRA into full weights locally

	Adapters are small; merged weights are easier for some deploy targets.

	```bash
	uv run python research/finetune.py \
	--merge ./models/finetuned/minicpm5-1b-lora \
	--out ./models/finetuned/minicpm5-1b-lora-merged
	```

	Then use preset `minicpm5-1b-lesson-merged` or `--model ./models/finetuned/minicpm5-1b-lora-merged`.

	---

	## Benchmark gate & Hugging Face Hub publish

	`finetune_app.py` / `server_app.py` publish adapters to the Hub automatically,
	but only when a job's lm-eval results pass its `goals`. This is the
	"only ship it if it's actually better" gate.

	### `goals` schema (per job in `experiments.yaml`)

	```yaml
	goals:
	  task: gsm8k          # lm-eval task name, scored via primary_metric() (same as summary.md)
	  min_score: 0.05      # candidate score must be >= this
	  min_improve: 0.02    # candidate - baseline must be >= this (baseline = per-profile baseline run)
	  guard_tasks:         # optional regression guards: must NOT regress more than max_regress
	    - task: arc_challenge
	      max_regress: 0.03
	```

	Publishable jobs also run a general eval (`defaults.general_eval_profile`, default
	`compare_study`: arc_easy, arc_challenge, hellaswag, piqa, boolq, gsm8k) and must pass
	`defaults.general_goals` regression guards so skill tuning does not wash out general
	capability. The publish gate requires both skill `goals` and `general_goals` to pass.

	A job with no `goals` (e.g. `alpaca-lora`) is never gated and never published —
	it's local-only (still trained, evaluated, and pulled to your laptop).
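
To make the skill-gate arithmetic concrete, here is a minimal sketch of the three checks. It assumes each results file has already been reduced to a flat `task -> score` dict; `sketch_gate` is a hypothetical name (the real logic lives in `_common.py`), the general-capability gate is left out, and the scores are made-up illustration values, not measured results.

```python
from typing import Optional


def sketch_gate(candidate: dict, baseline: dict, goals: Optional[dict]) -> dict:
    """Apply min_score, min_improve and guard_tasks for one job."""
    if not goals:
        # No goals: the job is local-only, so it never counts as passed.
        return {"passed": False, "gated": False, "checks": []}

    task = goals["task"]
    score = candidate[task]
    checks = []
    if "min_score" in goals:
        checks.append({"check": "min_score", "task": task, "value": score,
                       "ok": score >= goals["min_score"]})
    if "min_improve" in goals:
        gain = score - baseline[task]
        checks.append({"check": "min_improve", "task": task, "value": gain,
                       "ok": gain >= goals["min_improve"]})
    for guard in goals.get("guard_tasks", []):
        drop = baseline[guard["task"]] - candidate[guard["task"]]
        checks.append({"check": "max_regress", "task": guard["task"],
                       "value": drop, "ok": drop <= guard["max_regress"]})
    return {"passed": all(c["ok"] for c in checks), "gated": True,
            "checks": checks}


goals = {
    "task": "gsm8k", "min_score": 0.05, "min_improve": 0.02,
    "guard_tasks": [{"task": "arc_challenge", "max_regress": 0.03}],
}
baseline = {"gsm8k": 0.04, "arc_challenge": 0.30}   # illustrative numbers
good = {"gsm8k": 0.10, "arc_challenge": 0.29}
regressed = {"gsm8k": 0.10, "arc_challenge": 0.20}

print(sketch_gate(good, baseline, goals)["passed"])
print(sketch_gate(regressed, baseline, goals)["passed"])
```

In the second call the skill score still clears both thresholds, but the guard task drops by more than `max_regress`, so the gate fails.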

	### `publish` schema (per job)

	```yaml
	publish:
	  hub_repo: your-hf-username/minicpm5-1b-math-lora
	  private: false   # public so judges can verify the Well-Tuned badge; set true to keep it hidden
	```

	### What happens on a passing gate

	1. `run_lm_eval` writes skill results to `results/lm_eval/<job>__<profile>/results.json`.
	2. For publishable jobs, a second run writes general results to
	   `results/lm_eval/<job>__<general_eval_profile>/results.json`.
	3. `check_gate` compares skill results against `results/lm_eval/<preset>__baseline__<profile>/results.json`
	   and general results against `results/lm_eval/<preset>__baseline__<general_eval_profile>/results.json`
	   using `goals` + `general_goals` → `{"passed": bool, "skill": {...}, "general": {...}, "checks": [...]}`.
	4. If `passed` and `publish` is set, `publish_adapter`:
	   - renders a model card (`README.md`) into the adapter directory: base model,
	     gate checks table, full lm-eval baseline-vs-candidate-vs-delta table,
	     training stats, and a PEFT load snippet
	   - calls `huggingface_hub.HfApi().create_repo(..., exist_ok=True)` +
	     `upload_folder(...)` to `publish.hub_repo`

	If the gate fails, nothing is pushed — rerun with different `max_steps` /
	dataset / `goals`, then `modal run research/modal/finetune_app.py::publish_only --job <name>`
	once it passes (re-checks the gate against the latest results before publishing).
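
The publish decision can be sketched as follows. `publish_if_passed` and `RecordingApi` are hypothetical names for illustration, the model-card step is omitted, and the reason string for a missing `publish` block is this sketch's own wording. `create_repo` and `upload_folder` are the `huggingface_hub.HfApi` methods named above; the dry run uses a recording stand-in so nothing touches the network.

```python
from typing import Any, Optional


def publish_if_passed(gate_result: dict, publish_cfg: Optional[dict],
                      adapter_dir: str, api: Any = None) -> dict:
    """Create the Hub repo and upload the adapter only on a passing gate."""
    if not publish_cfg:
        return {"published": False, "reason": "job has no publish block"}
    if not gate_result.get("passed"):
        return {"published": False, "reason": "gate failed"}
    if api is None:
        # Imported lazily so the dry run below needs no extra dependency.
        from huggingface_hub import HfApi
        api = HfApi()  # uses the locally configured token, e.g. HF_TOKEN
    repo_id = publish_cfg["hub_repo"]
    api.create_repo(repo_id, repo_type="model",
                    private=publish_cfg.get("private", False), exist_ok=True)
    api.upload_folder(repo_id=repo_id, folder_path=adapter_dir,
                      repo_type="model")
    return {"published": True, "hub_repo": repo_id}


class RecordingApi:
    """Stand-in for HfApi that records calls instead of uploading."""

    def __init__(self):
        self.calls = []

    def create_repo(self, repo_id, **kwargs):
        self.calls.append(("create_repo", repo_id))

    def upload_folder(self, **kwargs):
        self.calls.append(("upload_folder", kwargs["repo_id"]))


fake = RecordingApi()
cfg = {"hub_repo": "your-hf-username/minicpm5-1b-math-lora", "private": False}
print(publish_if_passed({"passed": False}, cfg, "/vol/finetuned/math-lora", api=fake))
print(publish_if_passed({"passed": True}, cfg, "/vol/finetuned/math-lora", api=fake))
print(fake.calls)
```

Passing the API object in as a parameter keeps the gate logic testable without credentials.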

	### Setup

	```bash
	huggingface-cli login
	# or: export HF_TOKEN=hf_... (needs write access; same token as `modal secret create huggingface`)
	```

	Set real values for `defaults.hub_org` and each job's `publish.hub_repo` in
	`experiments.yaml` before running with `--publish` (the default). Repos are
	created automatically (`exist_ok=True`) — no need to pre-create them on huggingface.co.

	---

	## Manual Hugging Face Hub publish (fallback)

	Use this if you'd rather download an adapter and push it yourself — e.g. for
	merged full weights, or adapters trained before the gate/publish pipeline
	existed.

	### Prerequisites

	```bash
	huggingface-cli login
	# or: export HF_TOKEN=hf_...
	```

	Create an empty model repo on Hugging Face (e.g. `your-user/minicpm5-1b-lesson-lora`).

	### Option A — Upload LoRA adapter (recommended)

	After `modal volume get`:

	```bash
	ADAPTER=./models/finetuned/minicpm5-1b-lora
	REPO=your-user/minicpm5-1b-lesson-lora

	huggingface-cli upload "$REPO" "$ADAPTER" . \
	--repo-type model \
	--commit-message "Lesson LoRA from Modal finetune"
	```

	Add a minimal `README.md` in the adapter folder before upload (or edit on the Hub) documenting the base model:

	```markdown
	# MiniCPM5-1B lesson LoRA

	- Base model: [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B)
	- Dataset: education lesson chat (Build Small hackathon)
	- Load with PEFT: `PeftModel.from_pretrained(base, "your-user/minicpm5-1b-lesson-lora")`
	```

	Load from Hub in Python:

	```python
	from peft import PeftModel
	from transformers import AutoModelForCausalLM, AutoTokenizer

	base = "openbmb/MiniCPM5-1B"
	adapter = "your-user/minicpm5-1b-lesson-lora"

	tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	base, torch_dtype="auto", device_map="auto", trust_remote_code=True
	)
	model = PeftModel.from_pretrained(model, adapter)
	```

	### Option B — Upload merged weights

	```bash
	uv run python research/finetune.py \
	--merge ./models/finetuned/minicpm5-1b-lora \
	--out ./models/finetuned/minicpm5-1b-lora-merged

	huggingface-cli upload your-user/minicpm5-1b-lesson-merged \
	./models/finetuned/minicpm5-1b-lora-merged . \
	--repo-type model
	```

	Consumers set `MODEL_ID=your-user/minicpm5-1b-lesson-merged` with no adapter.

	### Option C — Upload from Modal shell (no local download)

	Browse the Volume in a shell ([docs](https://modal.com/docs/guide/volumes#using-a-volume-from-outside-of-modal)):

	```bash
	modal shell --volume slm-finetune
	```

	Inside the shell (volume at `/mnt/slm-finetune`):

	```bash
	pip install huggingface_hub
	export HF_TOKEN=... # write token
	huggingface-cli upload your-user/minicpm5-1b-lesson-lora \
	/mnt/slm-finetune/lesson-lora . --repo-type model
	```

	Downloading to your laptop first (Option A) is usually easier to review before publish.

	### Use on Hugging Face Space

	LoRA on Space (Gradio SDK):

	1. Upload adapter repo (Option A).
	2. In Space Settings → Repository secrets, set `HF_TOKEN` if the base model needs it.
	3. In Space env vars:

	```bash
	ACTIVE_MODEL=minicpm5-1b
	# Override adapter via custom preset or env — e.g. add to models.yaml on Space:
	# adapter_path: your-user/minicpm5-1b-lesson-lora # Hub id works if peft resolves it
	```

	For the shipped Space, the reliable path is: download adapter → commit into repo under `models/finetuned/` → `ACTIVE_MODEL=minicpm5-1b-lesson-lora`, or upload merged weights and point `MODEL_ID` at your Hub repo.

	Merged on Space:

	```bash
	ACTIVE_MODEL=custom
	MODEL_ID=your-user/minicpm5-1b-lesson-merged
	TRUST_REMOTE_CODE=true
	```

	---

	## Modal Notebooks (interactive GPU)

	Official guide: [Modal Notebooks](https://modal.com/docs/guide/notebooks)

	Use a hosted Jupyter kernel on Modal for demos, pair programming, and quick experiments. For reproducible sweeps and CI-style runs, prefer `modal run research/modal/finetune_app.py`.

	### Getting started

	1. Open [modal.com/notebooks](https://modal.com/notebooks) and upload [`research/notebook/minicpm5-modal-finetune.ipynb`](../notebook/minicpm5-modal-finetune.ipynb) (or create a notebook and copy the cells).
	2. In the sidebar → Compute profile, enable a GPU (e.g. A10G). Notebooks are serverless: you pay only while the kernel runs; idle shutdown defaults to 10 minutes.
	3. Attach resources in the sidebar Files panel:
	   - Volume `slm-finetune` → appears under `/mnt/slm-finetune` (share checkpoints with `modal run` jobs)
	   - Secret `huggingface` → injects `HF_TOKEN` for Hub downloads
	4. Run cells top to bottom.

	The default notebook image includes PyTorch, Transformers, and NumPy. Install extras with:

	```python
	%uv pip install uv peft bitsandbytes datasets
	```

	### Persist checkpoints on a Volume

	The container filesystem is ephemeral. Anything under `/root` is lost when the kernel stops. Write adapters to an attached Volume:

	```python
	OUT = "/mnt/slm-finetune/lesson-lora-notebook" # survives kernel restarts
	```

	After training, download from the Files panel (⬇) or locally:

	```bash
	modal volume get slm-finetune lesson-lora-notebook ./models/finetuned/minicpm5-1b-lora
	```

	### Custom image (optional, full repo deps)

	To match the `modal run` environment exactly, deploy the app image once:

	```bash
	modal deploy research/modal/finetune_app.py
	```

	Then in the notebook sidebar, search for function `finetune_one` from app `slm-finetune-benchmark` and select that image as the kernel.

	Or call deployed functions from a cell with [`%modal` magic](https://modal.com/docs/guide/notebooks#cell-magic):

	```python
	%modal from slm-finetune-benchmark import finetune_one

	finetune_one.remote({
	"name": "lesson-lora",
	"dataset": "research/data/education-lesson-chat.jsonl",
	"format": "chat",
	"max_steps": 20,
	})
	```

	(Requires `modal deploy` and the repo baked into the image.)

	### Share for hackathon judges

	Use Share in the notebook editor → public unlisted link → Can view and run so reviewers can fork and execute without a Modal account ([docs](https://modal.com/docs/guide/notebooks#access-and-sharing)).

	### Notebook vs `modal run`

	\| \| Modal Notebook \| `modal run finetune_app.py` \|
	\| --- \| --- \| --- \|
	\| Best for \| Demo video, exploration \| Reproducible sweep, Volume + lm-eval pipeline \|
	\| GPU \| Sidebar compute profile \| `gpu="A10G"` on functions \|
	\| Persistence \| Attach Volume in sidebar \| `slm-finetune` Volume auto-mounted \|
	\| Cost \| Per kernel uptime \| Per function invocation \|

	---

	## Architecture

	```mermaid
	flowchart LR
	subgraph batch [finetune_app.py — batch]
	laptop1["modal run finetune_app\n--job/--category"] --> base["run_lm_eval\n(per-profile baseline)"]
	laptop1 --> train["finetune_one"]
	train --> eval["run_lm_eval\n(candidate)"]
	eval --> gate["check_gate\n(goals)"]
	gate -- passed --> pub["publish_adapter"]
	end
	subgraph worker [server_app.py — warm loop]
	laptop2["modal run server_app\n--job/--category/--pipeline"] --> gpu["GpuWorker A10G"]
	gpu --> rp["run_pipeline\n(baseline -> train -> eval -> gate -> publish)"]
	end
	base --> vol["Volume slm-finetune"]
	train --> vol
	eval --> vol
	gate --> vol
	rp --> vol
	gpu --> hfc["Volume hf-cache"]
	pub --> hub["Hugging Face Hub\n(publish.hub_repo)"]
	rp --> hub
	vol --> get["modal volume get\n(pull)"]
	get --> local["models/finetuned/<job>"]
	local --> space["HF Space ACTIVE_MODEL"]
	```

	\| Resource \| Role \|
	\| -------- \| ---- \|
	\| App `slm-finetune-benchmark` \| One-shot batch pipeline (`finetune_app.py`): `main`, `publish_only`, `pull` \|
	\| App `slm-gpu-worker` \| Long-lived GPU worker (`server_app.py`): `GpuWorker.run_pipeline` \|
	\| GPU `A10G` (or per-job `gpu:` override) \| Default for train + eval \|
	\| Secret `huggingface` \| `HF_TOKEN` for HF downloads + Hub publish \|
	\| [`_common.py`](_common.py) \| Shared image, volumes, command builders, gate (`evaluate_gate`/`check_gate_files`), publish (`publish_adapter_files`, `render_model_card`) \|
	\| [`experiments.yaml`](experiments.yaml) \| Skill matrix: jobs, `eval_profile`, `goals`, `publish` \|
	\| [`eval_profiles.yaml`](../evals/configs/eval_profiles.yaml) \| Maps `eval_profile` → lm-eval config + task list \|
	\| [`finetune.py`](../finetune.py) \| Training logic (unchanged) \|
	\| `slm-lm-eval` \| Academic benchmarks \|

	---

	## Troubleshooting

	\| Symptom \| Fix \|
	\| ------- \| --- \|
	\| `Secret huggingface not found` \| `modal secret create huggingface HF_TOKEN=...` \|
	\| Volume empty after run \| Job may have failed; `modal volume ls slm-finetune`; ensure writes went to `/vol/finetuned` not `/repo` \|
	\| `modal volume get` missing files \| Make sure `volume.commit()` completed before downloading; a running container that must see another container's writes needs `volume.reload()` \|
	\| Large file won't download in UI \| Use `modal volume get` CLI (16 MB UI limit) \|
	\| `modal volume get` path wrong \| Job name = top-level folder (e.g. `math-lora`, not `minicpm5-1b-lora`) \|
	\| Gate fails / `published: false, reason: "gate failed"` \| Check `gate.checks` in the output; adjust `goals` (`min_score`/`min_improve`/`guard_tasks`), `max_steps`, or dataset, then rerun \|
	\| `published: false, reason: "no publish config..."` \| Job has no `publish:` block in `experiments.yaml` (intentional for local-only jobs like `alpaca-lora`) \|
	\| `Unknown eval_profile ...` \| Check `eval_profile` in `experiments.yaml` matches a key in `research/evals/configs/eval_profiles.yaml` \|
	\| Hub upload 403 \| Use a write `HF_TOKEN`; repos are created automatically (`exist_ok=True`), no need to pre-create \|
	\| Still publishing to `your-hf-username/...` \| Edit `defaults.hub_org` and each job's `publish.hub_repo` in `experiments.yaml` \|
	\| Space cannot find adapter \| Use merged weights or copy adapter into repo `models/finetuned/` \|
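	\| Image build slow \| Expected on the first build (`uv sync` is baked in); the image rebuilds only when its definition changes, and the `hf-cache` Volume caches weights across runs \|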
	\| OOM on GPU \| `--mode qlora` in `experiments.yaml`; lower `max_len` in finetune; or set a per-job `gpu:` with more VRAM \|
	\| `scaledown_window` deploy error \| Must be 2–3600s (we use 3600); see `_common.py` \|
	\| `server_app` ping fails \| `modal deploy research/modal/server_app.py`; start keep-alive: `modal run -d research/modal/server_app.py` \|
	\| Jobs hit different containers \| Deploy first; use `server_app.py` not `finetune_app.py` for warm loop \|
	\| Worker still billing after done \| `modal app stop slm-gpu-worker` \|

	---

	## Hackathon checklist

	1. Link or screenshot of Modal app run (`slm-finetune-benchmark` or `slm-gpu-worker`), including the `--- summary ---` table (skill, category, gate, published, hub_repo).
	2. `results/lm_eval/<job>__<profile>/comparison.md` — baseline vs candidate per skill.
	3. At least one adapter with `goals` that passed the gate and published to the Hub (model card auto-generated).
	4. Adapter on Volume or Hub + `ACTIVE_MODEL=minicpm5-1b-<skill>-lora` on Space.
	5. Optional: Notebook recording of smoke train cell.

	See also: [`SERVER.md`](SERVER.md) · [research/USAGE.md](../USAGE.md) · [Modal Volumes](https://modal.com/docs/guide/volumes) · [Modal Notebooks](https://modal.com/docs/guide/notebooks) · [Modal CUDA](https://modal.com/docs/guide/cuda)