# Modal finetune + benchmark
GPU fine-tuning + benchmarking + Hub publishing on [Modal](https://modal.com/docs/guide) for `openbmb/MiniCPM5-1B`, wrapping the existing [`research/finetune.py`](../finetune.py) and `slm-lm-eval`.
Use this when you have no local CUDA but want a hackathon-quality
**train → eval → gate → publish** loop for a whole **skill matrix** of QLoRA
adapters (math, science, coding, reasoning, teaching, instructions).
| Track | What you ship |
| ----- | ------------- |
| **Modal** | `modal run` skill-matrix pipeline, Volume artifacts, optional Modal Notebook |
| **Well-Tuned** | Per-skill before/after `lm-eval` + gated Hub publish for each LoRA |
---
## Layout
```text
research/modal/
├── _common.py # Shared image, volumes, command builders, gate + publish helpers
├── finetune_app.py # One-shot batch pipeline (slm-finetune-benchmark): main, publish_only, pull
├── server_app.py # Long-lived GPU worker (slm-gpu-worker): GpuWorker.run_pipeline
├── experiments.yaml # Skill matrix: jobs, eval_profile, goals, publish
├── README.md # Full Modal docs (this file)
└── SERVER.md # Human + AI agent loop runbook (quick reference)
```
Interactive path: [`research/notebook/minicpm5-modal-finetune.ipynb`](../notebook/minicpm5-modal-finetune.ipynb) (Modal GPU Notebook).
### Which app to use
| App | CLI | Best for |
| --- | --- | --- |
| **`finetune_app.py`** | `modal run research/modal/finetune_app.py` | Full sweep, CI-style batch, parallel jobs |
| **`server_app.py`** | `modal deploy` + `modal run research/modal/server_app.py` | Multi-hour session, iterative human/AI loops on **one warm GPU** |
Both apps share [`_common.py`](_common.py): same image, `hf-cache` / `slm-finetune` volumes, and wrappers around [`research/finetune.py`](../finetune.py) + `slm-lm-eval`.
---
## One-time setup
```bash
# Modal CLI + auth
pip install modal
modal setup
# HF token (downloads + Hub upload). Same token as huggingface-cli login.
modal secret create huggingface HF_TOKEN=<your-hf-token>
# Optional: validate deps before first image build
uv sync --group finetune --group lm-eval --package slm-evals
uv sync --group modal # local orchestration only
```
`HF_TOKEN` must be a [write token](https://huggingface.co/settings/tokens) if you plan to push adapters to the Hub.
---
## Run training + benchmarks
Run all commands from the **repo root**. `finetune_app.py` runs the full **skill-matrix
pipeline**: per-profile **base-model** baseline lm-eval (no adapter) → finetune each job's QLoRA adapter →
post-train lm-eval vs. that baseline → check `goals` (gate) → publish to the
Hugging Face Hub if the gate passes → pull adapter + results to your laptop.
```bash
# Full sweep: every job in experiments.yaml
modal run research/modal/finetune_app.py
# One skill (cheap smoke run)
modal run research/modal/finetune_app.py --job math-lora --max-steps 20
# One category (e.g. all "science" jobs)
modal run research/modal/finetune_app.py --category science
# Re-run lm-eval (+ gate + publish) only — adapter already on Volume
modal run research/modal/finetune_app.py --eval-only --job math-lora
# Train + eval but skip the Hub push and the local download
modal run research/modal/finetune_app.py --no-publish --no-pull
# Train/eval jobs in parallel (one GPU per job — higher cost)
modal run research/modal/finetune_app.py --parallel
# Re-run just the gate + Hub publish for an already-evaluated job
modal run research/modal/finetune_app.py::publish_only --job math-lora
# Pull adapters + lm-eval results for a category without re-running anything
modal run research/modal/finetune_app.py::pull --category math
```
Jobs live in [`experiments.yaml`](experiments.yaml) — a **skill matrix**, one
QLoRA adapter per category, each evaluated against the matching
`eval_profile` from [`research/evals/configs/eval_profiles.yaml`](../evals/configs/eval_profiles.yaml):
| Job | Category | Dataset (format) | Eval profile | `goals` task | Publish |
| --- | -------- | ----------------- | ------------ | ------------- | ------- |
| `teaching-lora` | teaching | `research/data/education-lesson-chat.jsonl` (`chat`) | `instructions` | `ifeval` | ✅ |
| `science-lora` | science | `research/data/science-tutor-chat.jsonl` (`chat`) | `science` | `sciq` (+ `arc_challenge` guard) | ✅ |
| `math-lora` | math | `TIGER-Lab/MathInstruct` (`alpaca`) | `math` | `gsm8k` (+ `arc_challenge` guard) | ✅ |
| `coding-lora` | coding | `iamtarun/python_code_instructions_18k_alpaca` (`alpaca`) | `code` | `mbpp` | ✅ |
| `reasoning-lora` | reasoning | `HuggingFaceTB/smoltalk` (`chat`) | `reasoning` | `gsm8k` (+ `hellaswag` guard) | ✅ |
| `language-lesson-lora` | language | `language-lesson-fr/ar.jsonl` (`chat`) | `multilingual` | `xnli` (+ `hellaswag` guard) | ✅ |
| `french-lora` | french | `FrancophonIA/english_french` (`prompt`) + FR chat | `french` | `french_bench_xnli` (+ `hellaswag` guard) | ✅ |
| `alpaca-lora` | instructions | `tatsu-lab/alpaca` (`alpaca`) | `instructions` | — (no `goals`) | local-only |
Before publishing, replace `defaults.hub_org` and each job's `publish.hub_repo`
in `experiments.yaml` with your Hugging Face username/org (defaults to the
placeholder `your-hf-username`).
Edit `defaults.max_steps`, per-job `gpu`, or per-job `max_samples` /
`dataset_split` in `experiments.yaml` to balance cost vs quality. See
[Benchmark gate & Hugging Face Hub publish](#benchmark-gate--hugging-face-hub-publish)
for the `goals`/`publish` schema.
### CLI flags (`finetune_app.py`)
`main` (default entrypoint — full pipeline):
| Flag | Default | Meaning |
| ---- | ------- | ------- |
| `--train` / `--no-train` | train on | Run finetune jobs |
| `--eval-only` | off | Skip train; eval existing Volume checkpoints (still runs missing base-model baselines) |
| `--parallel` | off | `finetune_one.spawn()` per job instead of sequential |
| `--job` | all jobs | Run one job name from `experiments.yaml` |
| `--category` | all categories | Run all jobs with this `category` |
| `--max-steps` | from YAML | Override training steps |
| `--publish` / `--no-publish` | publish on | Push to `publish.hub_repo` if the gate passes |
| `--pull` / `--no-pull` | pull on | `modal volume get` the adapter + lm-eval results after each job |
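
The `--job` and `--category` filters can be pictured as a plain selection over the job list. This is a minimal sketch, assuming each job is a dict with `name` and `category` keys; the `select_jobs` helper and the inline `JOBS` list are hypothetical and are not the code in `finetune_app.py`.

```python
from typing import Optional

# Hypothetical in-memory view of experiments.yaml (only the fields used here).
JOBS = [
    {"name": "math-lora", "category": "math"},
    {"name": "science-lora", "category": "science"},
    {"name": "alpaca-lora", "category": "instructions"},
]


def select_jobs(jobs, job: Optional[str] = None, category: Optional[str] = None):
    """Return jobs matching --job and/or --category; no filter selects all jobs."""
    selected = [
        j
        for j in jobs
        if (job is None or j["name"] == job)
        and (category is None or j["category"] == category)
    ]
    if not selected:
        # Fail loudly instead of silently running an empty sweep.
        raise ValueError(f"no job matches job={job!r}, category={category!r}")
    return selected


print([j["name"] for j in select_jobs(JOBS, category="science")])
```

Raising on an empty selection is a design choice for the sketch: a typo in a job name then stops the run before any GPU time is spent.
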
`publish_only` (separate entrypoint — `::publish_only`):
| Flag | Default | Meaning |
| ---- | ------- | ------- |
| `--job` | required | Re-check the gate against existing results and publish if it passes |
`pull` (separate entrypoint — `::pull`):
| Flag | Default | Meaning |
| ---- | ------- | ------- |
| `--job` | — | Pull one job's adapter + results |
| `--category` | — | Pull all jobs in a category |
| `--dest` | `models/finetuned` | Local destination directory |
---
## GPU worker (`server_app.py`) — human + AI agent loops
Use this when you want **one warm A10G container** for several hours and many train/eval commands **without** reinstalling deps or re-downloading HF weights each time.
**Quick runbook:** see [`SERVER.md`](SERVER.md) (copy-paste commands for humans and coding agents).
### Deploy once
```bash
modal deploy research/modal/server_app.py
```
App name: **`slm-gpu-worker`**. Dashboard: `modal app list` or the URL printed after deploy.
`GpuWorker` keeps `min_containers=1` while deployed, mounts `hf-cache` + `slm-finetune`, and reuses the same container for sequential `.remote()` calls when possible.
### Two-terminal loop (recommended)
**Terminal 1 — keep worker alive** (default 4h; blocks unless detached):
```bash
modal run research/modal/server_app.py
# or free your terminal:
modal run -d research/modal/server_app.py --hours 6
```
**Terminal 2 — run experiments on the warm GPU** (repeat as often as you like):
```bash
# Full skill-matrix pipeline for one job on the warm container:
# per-profile baseline → train → eval → gate → publish → pull
modal run research/modal/server_app.py --job math-lora --max-steps 20
# All jobs in a category
modal run research/modal/server_app.py --category science
# Whole matrix, but skip the Hub push
modal run research/modal/server_app.py --pipeline --no-publish
# Re-eval (+ gate + publish) an existing adapter on Volume
modal run research/modal/server_app.py --eval-only --job math-lora
# Re-check the gate and publish using already-computed results
modal run research/modal/server_app.py --publish-only --job math-lora
# Arbitrary command in /repo (same env as finetune.py)
modal run research/modal/server_app.py --cmd "uv run python research/finetune.py --help"
# Health check
modal run research/modal/server_app.py --ping
```
Task flags (`--job`, `--category`, `--cmd`, `--pipeline`, `--eval-only`, `--publish-only`, `--ping`) automatically disable the default keep-alive mode.
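
The rule that task flags switch off keep-alive can be sketched as a single boolean expression. The `should_serve` function below is hypothetical; its parameter names mirror the CLI flags, but the actual entrypoint in `server_app.py` may be structured differently.

```python
def should_serve(
    serve: bool = True,
    job=None,
    category=None,
    cmd=None,
    pipeline: bool = False,
    eval_only: bool = False,
    publish_only: bool = False,
    ping: bool = False,
) -> bool:
    """Keep-alive runs only when serve is on and no task flag was passed."""
    task_requested = any(
        [job, category, cmd, pipeline, eval_only, publish_only, ping]
    )
    return serve and not task_requested


print(should_serve())                 # no flags: keep-alive mode
print(should_serve(job="math-lora"))  # task flag present: run the task instead
```
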
### CLI flags (`server_app.py`)
| Flag | Default | Meaning |
| ---- | ------- | ------- |
| *(none)* | `serve=True` | Keep `GpuWorker` alive (`keep_alive`) |
| `--hours` | `4` | Keep-alive duration |
| `--no-serve` | — | Skip keep-alive (auto when any task flag is set) |
| `--job` | — | Run the skill-matrix pipeline for one job |
| `--category` | — | Run the skill-matrix pipeline for all jobs in a category |
| `--pipeline` | off | Run the skill-matrix pipeline for all jobs |
| `--max-steps` | from YAML | Override training steps |
| `--eval-only` | off | Pipeline eval/gate/publish only (skip train; still runs missing base-model baselines) |
| `--publish` / `--no-publish` | publish on | Push to `publish.hub_repo` if the gate passes |
| `--publish-only` | off | Re-check the gate against existing results and publish (requires `--job`) |
| `--pull` / `--no-pull` | pull on | `modal volume get` adapter + results after the pipeline |
| `--cmd` | — | Shell command (parsed with `shlex`) |
| `--ping` | off | Return worker status JSON |
### `GpuWorker` methods (for notebooks / Python callers)
After `modal deploy`, call from Python:
```python
import modal
Worker = modal.Cls.from_name("slm-gpu-worker", "GpuWorker")
w = Worker()
w.ping.remote()
w.finetune.remote({"name": "math-lora", "dataset": "...", "format": "alpaca", "max_steps": 20})
w.lm_eval.remote(experiment_name="math-lora__math", config="research/evals/configs/lm_eval_math.yaml", adapter_path="/vol/finetuned/math-lora")
w.exec_cmd.remote(["uv", "run", "python", "research/finetune.py", "--help"])
w.run_pipeline.remote(job_names=["math-lora"], max_steps=20)
# Gate + publish (only pushes to the Hub if gate_result["passed"])
gate = w.check_gate.remote(
    candidate_results_path="/vol/finetuned/results/lm_eval/math-lora__math/results.json",
    baseline_results_path="/vol/finetuned/results/lm_eval/minicpm5-1b__baseline__math/results.json",
    goals={"task": "gsm8k", "min_score": 0.05, "min_improve": 0.02},
)
w.publish_adapter.remote(
    job=...,  # elided
    adapter_dir="/vol/finetuned/math-lora",
    gate_result=gate,
    # ... further arguments elided
)
```
Inside the class, `run_pipeline` chains `lm_eval` (baselines) → `finetune` → `lm_eval` (candidate) → `check_gate` → `publish_adapter` via `.local()`, so everything runs in the **same** container without extra cold starts.
### Persistence (what survives between commands)
| Layer | Survives | Notes |
| ----- | -------- | ----- |
| **Image** (`uv sync` baked in) | Across all runs | Rebuilds only when image definition changes |
| **`hf-cache` Volume** | Across runs | Base weights + datasets; committed after each job |
| **`slm-finetune` Volume** | Across runs | Adapters + lm-eval results |
| **Warm container** | While deployed + idle < `scaledown_window` | `min_containers=1`; max idle grace **3600s** (Modal limit) |
| **`keep_alive` loop** | Up to `--hours` | Container stays active; no scale-down during loop |
### Stop / logs
```bash
modal app logs slm-gpu-worker -f # stream logs
modal app stop slm-gpu-worker # stop deployed app + warm pool
modal app stop slm-gpu-worker -y # no confirmation prompt
```
Refs: [`modal app`](https://modal.com/docs/reference/cli/app) · [`modal run`](https://modal.com/docs/reference/cli/run) · [`modal shell`](https://modal.com/docs/reference/cli/shell)
### Agent loop pattern
For an AI agent iterating on finetune hyperparameters or eval configs:
1. Ensure worker is up: `modal run research/modal/server_app.py --ping` → `{"status": "ok"}`.
2. If ping fails, human or agent runs `modal deploy research/modal/server_app.py` then `modal run -d research/modal/server_app.py --hours 6`.
3. Agent runs smoke train+eval+gate (no publish yet): `--job math-lora --max-steps 5 --no-publish`.
4. Agent re-evals without retraining: `--eval-only --job math-lora`.
5. Agent reads results: `modal volume get slm-finetune results/lm_eval/math-lora__math ./results/lm_eval/math-lora__math` or `modal volume ls slm-finetune`.
6. Agent adjusts `experiments.yaml`'s `goals`/`max_steps`/`max_samples`, repeats from step 3.
7. Once the gate passes and `hub_org`/`hub_repo` are real: `--publish-only --job math-lora`, or just drop `--no-publish`.
8. When done: `modal app stop slm-gpu-worker` (optional, stops GPU billing from warm pool).
See [`SERVER.md`](SERVER.md) for a structured checklist and error recovery table.
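
For step 5, an agent usually needs one number out of a downloaded `results.json`. The sketch below assumes the file follows the lm-evaluation-harness layout (a top-level `results` object keyed by task, with metric keys such as `exact_match,strict-match`); the `read_score` helper is hypothetical, and the metric key is passed explicitly because the repo's `primary_metric()` selection logic is not reproduced here.

```python
import json
import tempfile
from pathlib import Path


def read_score(results_path, task: str, metric: str) -> float:
    """Return results[task][metric] from an lm-eval style results.json."""
    data = json.loads(Path(results_path).read_text())
    try:
        return float(data["results"][task][metric])
    except KeyError as exc:
        raise KeyError(f"{task}/{metric} not found in {results_path}") from exc


# Demo on a synthetic file; the score is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.json"
    payload = {"results": {"gsm8k": {"exact_match,strict-match": 0.07}}}
    path.write_text(json.dumps(payload))
    demo_score = read_score(path, "gsm8k", "exact_match,strict-match")

print(demo_score)
```
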
---
## What gets saved on Modal
Modal persists artifacts on [**Volumes**](https://modal.com/docs/guide/volumes) — a distributed filesystem optimized for write-once, read-many workloads like model checkpoints. Files written only to the container disk (outside the mount path) are **not** saved.
| Volume | Mount in container | Contents |
| ------ | ------------------ | -------- |
| `slm-finetune` | `/vol/finetuned` | LoRA adapters, `training_results.json`, lm-eval `results/` |
| `hf-cache` | `/root/.cache/huggingface` | Cached base weights + datasets |
Volumes are created lazily on first run (`create_if_missing=True` in [`finetune_app.py`](finetune_app.py)).
### Commits and visibility
Per the [Volumes guide](https://modal.com/docs/guide/volumes):
- **`volume.commit()`** — persist writes so other containers and `modal volume get` can see them. Our workers call this after each train/eval job.
- **Background commits** — Modal also snapshots attached Volumes every few seconds and on container shutdown, but explicit `commit()` is safest before download.
- **`volume.reload()`** — needed only if the *same* container must see writes from another container without restarting. Each `finetune_one.remote()` / `run_lm_eval.remote()` starts fresh and mounts the latest committed state.
Training writes under `/vol/finetuned/...` (the mount), not `/repo/models/...`. That matches Modal’s [model checkpointing](https://modal.com/docs/guide/volumes#model-checkpointing) pattern: point `finetune.py --out` at the Volume path.
### Per-job adapter layout
Each finetune job writes to a Volume path named after the job (e.g. `math-lora/`).
lm-eval results live under `results/lm_eval/`, named
`<job_name>__<eval_profile>` for candidates and `<preset>__baseline__<eval_profile>`
for the shared per-profile baselines:
```text
slm-finetune (Volume)
├── math-lora/
│ ├── adapter_config.json
│ ├── adapter_model.safetensors # or adapter_model.bin
│ ├── tokenizer files…
│ ├── training_results.json
│ └── README.md # model card, written by publish_adapter
├── science-lora/
├── coding-lora/
├── reasoning-lora/
├── teaching-lora/
├── alpaca-lora/
└── results/lm_eval/
├── minicpm5-1b__baseline__math/ # shared by all "math" profile jobs
├── minicpm5-1b__baseline__science/
├── minicpm5-1b__baseline__instructions/
├── math-lora__math/
├── science-lora__science/
└── ...
```
Because `eval_profile` is shared across jobs (e.g. `teaching-lora` and
`alpaca-lora` both use `instructions`), the `instructions` baseline is computed
once per pipeline run and reused for both jobs' gates.
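
The naming scheme and the "one baseline per profile" rule can be sketched as follows. The helper names (`candidate_dir`, `baseline_dir`, `baselines_needed`) are hypothetical, and the preset name `minicpm5-1b` is taken from the directory listing above.

```python
def candidate_dir(job_name: str, eval_profile: str) -> str:
    """Results directory for a fine-tuned adapter: <job_name>__<eval_profile>."""
    return f"results/lm_eval/{job_name}__{eval_profile}"


def baseline_dir(preset: str, eval_profile: str) -> str:
    """Results directory for the shared baseline: <preset>__baseline__<eval_profile>."""
    return f"results/lm_eval/{preset}__baseline__{eval_profile}"


def baselines_needed(jobs, preset: str = "minicpm5-1b"):
    """One baseline directory per distinct eval_profile, in first-seen order."""
    seen = {}
    for job in jobs:
        profile = job["eval_profile"]
        seen.setdefault(profile, baseline_dir(preset, profile))
    return list(seen.values())


jobs = [
    {"name": "teaching-lora", "eval_profile": "instructions"},
    {"name": "alpaca-lora", "eval_profile": "instructions"},
    {"name": "math-lora", "eval_profile": "math"},
]
print(baselines_needed(jobs))  # two baselines for three jobs
```
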
---
## Volume CLI (browse, download, upload)
Official reference: [Modal Volumes guide](https://modal.com/docs/guide/volumes) · [CLI reference](https://modal.com/docs/reference/cli/volume)
### Create or list volumes
```bash
modal volume list
modal volume create slm-finetune # optional; app creates on first run
modal volume ls slm-finetune
modal volume ls slm-finetune lesson-lora
```
### Browse in a shell
Volumes are mounted under `/mnt` in an interactive shell:
```bash
modal shell --volume slm-finetune
# inside shell:
ls /mnt/slm-finetune
ls /mnt/slm-finetune/lesson-lora
du -sh /mnt/slm-finetune/lesson-lora
```
Use `du` for size — Volumes do not report accurate `df` / `disk_usage()` values ([docs](https://modal.com/docs/guide/volumes#disk-usage-reporting)).
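
From Python (for example in a notebook cell), summing file sizes gives a comparable answer. This is a minimal sketch using only the standard library; `dir_size_bytes` is a hypothetical helper, and it reports apparent file sizes, so the total can differ slightly from what `du` prints.

```python
import os
import tempfile


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


# Demo on a temporary directory standing in for a job folder on the Volume.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "adapter_model.safetensors"), "wb") as fh:
        fh.write(b"\0" * 2048)
    demo_size = dir_size_bytes(tmp)

print(demo_size)
```
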
### Download LoRA to your machine
**Use the CLI for adapter weights.** The Modal web UI only supports downloads up to **16 MB** per file; `adapter_model.safetensors` is usually larger ([docs](https://modal.com/docs/guide/volumes#downloading-a-file-from-a-volume)).
```bash
mkdir -p ./models/finetuned
# One job folder → local path expected by models.yaml
modal volume get slm-finetune lesson-lora ./models/finetuned/minicpm5-1b-lora
# lm-eval artifacts
mkdir -p ./results
modal volume get slm-finetune results/lm_eval ./results/lm_eval
# Entire volume (large)
modal volume get slm-finetune / ./modal-artifacts
```
Job folders use the **job name** from `experiments.yaml` (`lesson-lora`), not `minicpm5-1b-lora`. Root [`models.yaml`](../../models.yaml) preset `minicpm5-1b-lesson-lora` expects `./models/finetuned/minicpm5-1b-lora`.
If you downloaded to a different folder name:
```bash
modal volume get slm-finetune lesson-lora ./models/finetuned/lesson-lora
cp -r ./models/finetuned/lesson-lora ./models/finetuned/minicpm5-1b-lora
```
### Upload to a Volume from local
Push a local adapter or merged checkpoint back to Modal ([`modal volume put`](https://modal.com/docs/reference/cli/volume)):
```bash
modal volume put slm-finetune ./models/finetuned/minicpm5-1b-lora lesson-lora
```
Or from Python ([`batch_upload`](https://modal.com/docs/guide/volumes#using-a-volume-from-local-code)):
```python
import modal
vol = modal.Volume.from_name("slm-finetune")
with vol.batch_upload() as batch:
    batch.put_directory(
        "./models/finetuned/minicpm5-1b-lora",
        "/lesson-lora",
    )
```
### Copy within a Volume
```bash
modal volume cp slm-finetune lesson-lora lesson-lora-backup
```
### Parallel training note
With `--parallel`, multiple jobs write to **different** folders on the same Volume. On Volumes v1, avoid more than ~5 concurrent writers/commits ([docs](https://modal.com/docs/guide/volumes#volume-commits-and-reloads)). Prefer sequential runs unless you use Volumes v2 (`modal volume create --version=2`).
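
One way to stay under that limit is to cap how many jobs are in flight at once. The sketch below uses a thread pool from the standard library; `run_bounded` and `fake_job` are hypothetical, and in real use the callable would wrap whatever launches a job (for example a call to `finetune_one.remote(...)`). The cap of 5 is an assumption taken from the guidance above, not a value read from the repo.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WRITERS = 5  # assumption: mirrors the "~5 concurrent writers" guidance


def run_bounded(job_names, run_job, max_writers: int = MAX_WRITERS):
    """Run run_job(name) for every job with at most max_writers in flight."""
    with ThreadPoolExecutor(max_workers=max_writers) as pool:
        # pool.map preserves input order in the returned results.
        return list(pool.map(run_job, job_names))


# Stand-in job that records peak concurrency instead of touching a Volume.
_lock = threading.Lock()
_active = 0
peak = 0


def fake_job(name: str) -> str:
    global _active, peak
    with _lock:
        _active += 1
        peak = max(peak, _active)
    time.sleep(0.01)  # pretend to train and commit
    with _lock:
        _active -= 1
    return f"{name}: done"


results = run_bounded([f"job-{i}" for i in range(12)], fake_job)
print(f"peak concurrency: {peak}, first result: {results[0]}")
```
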
---
## Use downloaded weights locally
```bash
# Gradio / inference preset
export ACTIVE_MODEL=minicpm5-1b-lesson-lora
uv run --package gradio-space python -m gradio_space.app
# lm-eval on downloaded adapter
uv run --package slm-evals slm-lm-eval \
--config research/evals/configs/lm_eval_smoke.yaml \
--preset minicpm5-1b-lesson-lora \
--experiment-name minicpm5-1b-lora__local-check
```
### Optional: merge LoRA into full weights locally
Adapters are small; merged weights are easier for some deploy targets.
```bash
uv run python research/finetune.py \
--merge ./models/finetuned/minicpm5-1b-lora \
--out ./models/finetuned/minicpm5-1b-lora-merged
```
Then use preset `minicpm5-1b-lesson-merged` or `--model ./models/finetuned/minicpm5-1b-lora-merged`.
---
## Benchmark gate & Hugging Face Hub publish
`finetune_app.py` / `server_app.py` publish adapters to the Hub **automatically**,
but only when a job's lm-eval results pass its `goals`. This is the
"only ship it if it's actually better" gate.
### `goals` schema (per job in `experiments.yaml`)
```yaml
goals:
  task: gsm8k # lm-eval task name, scored via primary_metric() (same as summary.md)
  min_score: 0.05 # candidate score must be >= this
  min_improve: 0.02 # candidate - baseline must be >= this (baseline = per-profile baseline run)
  guard_tasks: # optional regression guards; must NOT regress more than max_regress
    - task: arc_challenge
      max_regress: 0.03
```
Publishable jobs also run a **general** eval (`defaults.general_eval_profile`, default
`compare_study`: arc_easy, arc_challenge, hellaswag, piqa, boolq, gsm8k) and must pass
`defaults.general_goals` regression guards so skill tuning does not wash out general
capability. The publish gate requires **both** skill `goals` and `general_goals` to pass.
A job with no `goals` (e.g. `alpaca-lora`) is never gated and never published —
it's local-only (still trained, evaluated, and pulled to your laptop).
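
The skill `goals` checks reduce to three comparisons. This is a minimal sketch, assuming scores have already been extracted into plain `task -> score` dicts; the `evaluate_goals_sketch` name and the per-check field names are illustrative. The repo's own gate lives in `_common.py`, also handles `general_goals`, and may differ in detail.

```python
def evaluate_goals_sketch(candidate: dict, baseline: dict, goals: dict) -> dict:
    """Apply min_score, min_improve, and guard_tasks checks to task scores."""
    checks = []
    task = goals["task"]
    score, base = candidate[task], baseline[task]

    if "min_score" in goals:
        checks.append(
            {"check": "min_score", "task": task, "passed": score >= goals["min_score"]}
        )
    if "min_improve" in goals:
        checks.append(
            {
                "check": "min_improve",
                "task": task,
                "passed": (score - base) >= goals["min_improve"],
            }
        )
    for guard in goals.get("guard_tasks", []):
        name = guard["task"]
        drop = baseline[name] - candidate[name]  # positive means a regression
        checks.append(
            {"check": "max_regress", "task": name, "passed": drop <= guard["max_regress"]}
        )

    return {"passed": all(c["passed"] for c in checks), "checks": checks}


# Illustrative scores only; they are not measured results.
goals = {
    "task": "gsm8k",
    "min_score": 0.05,
    "min_improve": 0.02,
    "guard_tasks": [{"task": "arc_challenge", "max_regress": 0.03}],
}
baseline = {"gsm8k": 0.05, "arc_challenge": 0.30}
candidate = {"gsm8k": 0.09, "arc_challenge": 0.29}
gate = evaluate_goals_sketch(candidate, baseline, goals)
print(gate["passed"], [c["check"] for c in gate["checks"]])
```

Returning every check, not just the overall verdict, makes it easier to see which threshold to revisit when the gate fails.
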
### `publish` schema (per job)
```yaml
publish:
  hub_repo: your-hf-username/minicpm5-1b-math-lora
  private: false # public so judges can verify the Well-Tuned badge; set true to keep it hidden
```
### What happens on a passing gate
1. `run_lm_eval` writes skill results to `results/lm_eval/<job>__<profile>/results.json`.
2. For publishable jobs, a second run writes general results to
`results/lm_eval/<job>__<general_eval_profile>/results.json`.
3. `check_gate` compares skill results against `results/lm_eval/<preset>__baseline__<profile>/results.json`
and general results against `results/lm_eval/<preset>__baseline__<general_eval_profile>/results.json`
using `goals` + `general_goals` → `{"passed": bool, "skill": {...}, "general": {...}, "checks": [...]}`.
4. If `passed` and `publish` is set, `publish_adapter`:
- renders a model card (`README.md`) into the adapter directory — base model,
gate checks table, full lm-eval baseline-vs-candidate-vs-delta table,
training stats, and a PEFT load snippet
- `huggingface_hub.HfApi().create_repo(..., exist_ok=True)` +
`upload_folder(...)` to `publish.hub_repo`
If the gate fails, nothing is pushed — rerun with different `max_steps` /
dataset / `goals`, then `modal run research/modal/finetune_app.py::publish_only --job <name>`
once it passes (re-checks the gate against the latest results before publishing).
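
The baseline-vs-candidate-vs-delta table in the model card can be pictured as simple string formatting. This is a sketch only: `render_delta_table` is a hypothetical helper, the column names are assumed, and the repo's `render_model_card` in `_common.py` may lay the table out differently.

```python
def render_delta_table(baseline: dict, candidate: dict) -> str:
    """Render a Markdown table of baseline, candidate, and delta per task."""
    header = [
        "| Task | Baseline | Candidate | Delta |",
        "| ---- | -------- | --------- | ----- |",
    ]
    rows = []
    # Only tasks present in both runs can be compared.
    for task in sorted(set(baseline) & set(candidate)):
        delta = candidate[task] - baseline[task]
        rows.append(
            f"| `{task}` | {baseline[task]:.4f} | {candidate[task]:.4f} | {delta:+.4f} |"
        )
    return "\n".join(header + rows)


# Illustrative scores only; they are not measured results.
table = render_delta_table(
    {"gsm8k": 0.05, "arc_challenge": 0.30},
    {"gsm8k": 0.09, "arc_challenge": 0.29},
)
print(table)
```
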
### Setup
```bash
huggingface-cli login
# or: export HF_TOKEN=hf_... (needs write access; same token as `modal secret create huggingface`)
```
Set real values for `defaults.hub_org` and each job's `publish.hub_repo` in
`experiments.yaml` before running with `--publish` (the default). Repos are
created automatically (`exist_ok=True`) — no need to pre-create them on huggingface.co.
---
## Manual Hugging Face Hub publish (fallback)
Use this if you'd rather download an adapter and push it yourself — e.g. for
**merged full weights**, or adapters trained before the gate/publish pipeline
existed.
### Prerequisites
```bash
huggingface-cli login
# or: export HF_TOKEN=hf_...
```
Create an empty model repo on Hugging Face (e.g. `your-user/minicpm5-1b-lesson-lora`).
### Option A — Upload LoRA adapter (recommended)
After `modal volume get`:
```bash
ADAPTER=./models/finetuned/minicpm5-1b-lora
REPO=your-user/minicpm5-1b-lesson-lora
huggingface-cli upload "$REPO" "$ADAPTER" . \
--repo-type model \
--commit-message "Lesson LoRA from Modal finetune"
```
Add a minimal `README.md` in the adapter folder before upload (or edit on the Hub) documenting the base model:
```markdown
# MiniCPM5-1B lesson LoRA
- Base model: [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B)
- Dataset: education lesson chat (Build Small hackathon)
- Load with PEFT: `PeftModel.from_pretrained(base, "your-user/minicpm5-1b-lesson-lora")`
```
**Load from Hub in Python:**
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = "openbmb/MiniCPM5-1B"
adapter = "your-user/minicpm5-1b-lesson-lora"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter)
```
### Option B — Upload merged weights
```bash
uv run python research/finetune.py \
--merge ./models/finetuned/minicpm5-1b-lora \
--out ./models/finetuned/minicpm5-1b-lora-merged
huggingface-cli upload your-user/minicpm5-1b-lesson-merged \
./models/finetuned/minicpm5-1b-lora-merged . \
--repo-type model
```
Consumers set `MODEL_ID=your-user/minicpm5-1b-lesson-merged` with no adapter.
### Option C — Upload from Modal shell (no local download)
Browse the Volume in a shell ([docs](https://modal.com/docs/guide/volumes#using-a-volume-from-outside-of-modal)):
```bash
modal shell --volume slm-finetune
```
Inside the shell (volume at `/mnt/slm-finetune`):
```bash
pip install huggingface_hub
export HF_TOKEN=... # write token
huggingface-cli upload your-user/minicpm5-1b-lesson-lora \
/mnt/slm-finetune/lesson-lora . --repo-type model
```
Downloading to your laptop first (Option A) is usually easier to review before publish.
### Use on Hugging Face Space
**LoRA on Space (Gradio SDK):**
1. Upload adapter repo (Option A).
2. In Space **Settings → Repository secrets**, set `HF_TOKEN` if the base model needs it.
3. In Space env vars:
```bash
ACTIVE_MODEL=minicpm5-1b
# Override adapter via custom preset or env — e.g. add to models.yaml on Space:
# adapter_path: your-user/minicpm5-1b-lesson-lora # Hub id works if peft resolves it
```
For the shipped Space, the reliable path is: download adapter → commit into repo under `models/finetuned/` → `ACTIVE_MODEL=minicpm5-1b-lesson-lora`, or upload **merged** weights and point `MODEL_ID` at your Hub repo.
**Merged on Space:**
```bash
ACTIVE_MODEL=custom
MODEL_ID=your-user/minicpm5-1b-lesson-merged
TRUST_REMOTE_CODE=true
```
---
## Modal Notebooks (interactive GPU)
Official guide: [Modal Notebooks](https://modal.com/docs/guide/notebooks)
Use a hosted Jupyter kernel on Modal for demos, pair programming, and quick experiments. For reproducible sweeps and CI-style runs, prefer `modal run research/modal/finetune_app.py`.
### Getting started
1. Open [modal.com/notebooks](https://modal.com/notebooks) and **upload** [`research/notebook/minicpm5-modal-finetune.ipynb`](../notebook/minicpm5-modal-finetune.ipynb) (or create a notebook and copy the cells).
2. In the **sidebar → Compute profile**, enable a **GPU** (e.g. A10G). Notebooks are serverless: you pay only while the kernel runs; idle shutdown defaults to 10 minutes.
3. Attach resources in the sidebar **Files** panel:
- **Volume** `slm-finetune` → appears under `/mnt/slm-finetune` (share checkpoints with `modal run` jobs)
- **Secret** `huggingface` → injects `HF_TOKEN` for Hub downloads
4. Run cells top to bottom.
The default notebook image includes PyTorch, Transformers, and NumPy. Install extras with:
```python
%uv pip install uv peft bitsandbytes datasets
```
### Persist checkpoints on a Volume
The container filesystem is **ephemeral**. Anything under `/root` is lost when the kernel stops. Write adapters to an attached Volume:
```python
OUT = "/mnt/slm-finetune/lesson-lora-notebook" # survives kernel restarts
```
After training, download from the **Files** panel (⬇) or locally:
```bash
modal volume get slm-finetune lesson-lora-notebook ./models/finetuned/minicpm5-1b-lora
```
### Custom image (optional, full repo deps)
To match the `modal run` environment exactly, deploy the app image once:
```bash
modal deploy research/modal/finetune_app.py
```
Then in the notebook sidebar, search for function `finetune_one` from app `slm-finetune-benchmark` and select that image as the kernel.
Or call deployed functions from a cell with [`%modal` magic](https://modal.com/docs/guide/notebooks#cell-magic):
```python
%modal from slm-finetune-benchmark import finetune_one
finetune_one.remote({
    "name": "lesson-lora",
    "dataset": "research/data/education-lesson-chat.jsonl",
    "format": "chat",
    "max_steps": 20,
})
```
(Requires `modal deploy` and the repo baked into the image.)
### Share for hackathon judges
Use **Share** in the notebook editor → **public unlisted link** → **Can view and run** so reviewers can fork and execute without a Modal account ([docs](https://modal.com/docs/guide/notebooks#access-and-sharing)).
### Notebook vs `modal run`
| | Modal Notebook | `modal run finetune_app.py` |
| --- | --- | --- |
| Best for | Demo video, exploration | Reproducible sweep, Volume + lm-eval pipeline |
| GPU | Sidebar compute profile | `gpu="A10G"` on functions |
| Persistence | Attach Volume in sidebar | `slm-finetune` Volume auto-mounted |
| Cost | Per kernel uptime | Per function invocation |
---
## Architecture
```mermaid
flowchart LR
subgraph batch [finetune_app.py — batch]
laptop1["modal run finetune_app\n--job/--category"] --> base["run_lm_eval\n(per-profile baseline)"]
laptop1 --> train["finetune_one"]
train --> eval["run_lm_eval\n(candidate)"]
eval --> gate["check_gate\n(goals)"]
gate -- passed --> pub["publish_adapter"]
end
subgraph worker [server_app.py — warm loop]
laptop2["modal run server_app\n--job/--category/--pipeline"] --> gpu["GpuWorker A10G"]
gpu --> rp["run_pipeline\n(baseline -> train -> eval -> gate -> publish)"]
end
base --> vol["Volume slm-finetune"]
train --> vol
eval --> vol
gate --> vol
rp --> vol
gpu --> hfc["Volume hf-cache"]
pub --> hub["Hugging Face Hub\n(publish.hub_repo)"]
rp --> hub
vol --> get["modal volume get\n(pull)"]
get --> local["models/finetuned/<job>"]
local --> space["HF Space ACTIVE_MODEL"]
```
| Resource | Role |
| -------- | ---- |
| App `slm-finetune-benchmark` | One-shot batch pipeline (`finetune_app.py`): `main`, `publish_only`, `pull` |
| App `slm-gpu-worker` | Long-lived GPU worker (`server_app.py`): `GpuWorker.run_pipeline` |
| GPU `A10G` (or per-job `gpu:` override) | Default for train + eval |
| Secret `huggingface` | `HF_TOKEN` for HF downloads + Hub publish |
| [`_common.py`](_common.py) | Shared image, volumes, command builders, gate (`evaluate_gate`/`check_gate_files`), publish (`publish_adapter_files`, `render_model_card`) |
| [`experiments.yaml`](experiments.yaml) | Skill matrix: jobs, `eval_profile`, `goals`, `publish` |
| [`eval_profiles.yaml`](../evals/configs/eval_profiles.yaml) | Maps `eval_profile` → lm-eval config + task list |
| [`finetune.py`](../finetune.py) | Training logic (unchanged) |
| `slm-lm-eval` | Academic benchmarks |
---
## Troubleshooting
| Symptom | Fix |
| ------- | --- |
| `Secret huggingface not found` | `modal secret create huggingface HF_TOKEN=...` |
| Volume empty after run | Job may have failed; `modal volume ls slm-finetune`; ensure writes went to `/vol/finetuned` not `/repo` |
| `modal volume get` missing files | Confirm the job's `commit()` completed before downloading; for same-container reads use `volume.reload()` |
| Large file won't download in UI | Use `modal volume get` CLI (16 MB UI limit) |
| `modal volume get` path wrong | Job name = top-level folder (e.g. `math-lora`, not `minicpm5-1b-lora`) |
| Gate fails / `published: false, reason: "gate failed"` | Check `gate.checks` in the output; adjust `goals` (`min_score`/`min_improve`/`guard_tasks`), `max_steps`, or dataset, then rerun |
| `published: false, reason: "no publish config..."` | Job has no `publish:` block in `experiments.yaml` (intentional for local-only jobs like `alpaca-lora`) |
| `Unknown eval_profile ...` | Check `eval_profile` in `experiments.yaml` matches a key in `research/evals/configs/eval_profiles.yaml` |
| Hub upload 403 | Use a write `HF_TOKEN`; repos are created automatically (`exist_ok=True`), no need to pre-create |
| Still publishing to `your-hf-username/...` | Edit `defaults.hub_org` and each job's `publish.hub_repo` in `experiments.yaml` |
| Space cannot find adapter | Use merged weights or copy adapter into repo `models/finetuned/` |
| Image build slow | `hf-cache` Volume caches weights across runs |
| OOM on GPU | `--mode qlora` in `experiments.yaml`; lower `max_len` in finetune; or set a per-job `gpu:` with more VRAM |
| `scaledown_window` deploy error | Must be 2–3600s (we use 3600); see `_common.py` |
| `server_app` ping fails | `modal deploy research/modal/server_app.py`; start keep-alive: `modal run -d research/modal/server_app.py` |
| Jobs hit different containers | Deploy first; use `server_app.py` not `finetune_app.py` for warm loop |
| Worker still billing after done | `modal app stop slm-gpu-worker` |
---
## Hackathon checklist
1. Link or screenshot of Modal app run (`slm-finetune-benchmark` or `slm-gpu-worker`), including the `--- summary ---` table (skill, category, gate, published, hub_repo).
2. `results/lm_eval/<job>__<profile>/comparison.md` — baseline vs candidate per skill.
3. At least one adapter with `goals` that passed the gate and published to the Hub (model card auto-generated).
4. Adapter on Volume or Hub + `ACTIVE_MODEL=minicpm5-1b-<skill>-lora` on Space.
5. Optional: Notebook recording of smoke train cell.
See also: [`SERVER.md`](SERVER.md) · [research/USAGE.md](../USAGE.md) · [Modal Volumes](https://modal.com/docs/guide/volumes) · [Modal Notebooks](https://modal.com/docs/guide/notebooks) · [Modal CUDA](https://modal.com/docs/guide/cuda)