# Modal finetune + benchmark

GPU fine-tuning + benchmarking + Hub publishing on [Modal](https://modal.com/docs/guide) for `openbmb/MiniCPM5-1B`, wrapping the existing [`research/finetune.py`](../finetune.py) and `slm-lm-eval`.

Use this when you have no local CUDA but want a hackathon-quality
**train → eval → gate → publish** loop for a whole **skill matrix** of QLoRA
adapters (math, science, coding, reasoning, teaching, language, French, instructions).

| Track | What you ship |
| ----- | ------------- |
| **Modal** | `modal run` skill-matrix pipeline, Volume artifacts, optional Modal Notebook |
| **Well-Tuned** | Per-skill before/after `lm-eval` + gated Hub publish for each LoRA |

---

## Layout

```text
research/modal/
├── _common.py         # Shared image, volumes, command builders, gate + publish helpers
├── finetune_app.py    # One-shot batch pipeline (slm-finetune-benchmark): main, publish_only, pull
├── server_app.py      # Long-lived GPU worker (slm-gpu-worker): GpuWorker.run_pipeline
├── experiments.yaml   # Skill matrix: jobs, eval_profile, goals, publish
├── README.md          # Full Modal docs (this file)
└── SERVER.md          # Human + AI agent loop runbook (quick reference)
```

Interactive path: [`research/notebook/minicpm5-modal-finetune.ipynb`](../notebook/minicpm5-modal-finetune.ipynb) (Modal GPU Notebook).

### Which app to use

| App | CLI | Best for |
| --- | --- | --- |
| **`finetune_app.py`** | `modal run research/modal/finetune_app.py` | Full sweep, CI-style batch, parallel jobs |
| **`server_app.py`** | `modal deploy` + `modal run research/modal/server_app.py` | Multi-hour session, iterative human/AI loops on **one warm GPU** |

Both apps share [`_common.py`](_common.py): same image, `hf-cache` / `slm-finetune` volumes, and wrappers around [`research/finetune.py`](../finetune.py) + `slm-lm-eval`.

---

## One-time setup

```bash
# Modal CLI + auth
pip install modal
modal setup

# HF token (downloads + Hub upload). Same token as huggingface-cli login.
modal secret create huggingface HF_TOKEN=<your-hf-token>

# Optional: validate deps before first image build
uv sync --group finetune --group lm-eval --package slm-evals
uv sync --group modal   # local orchestration only
```

`HF_TOKEN` must be a [write token](https://huggingface.co/settings/tokens) if you plan to push adapters to the Hub.

---

## Run training + benchmarks

All commands from **repo root**. `finetune_app.py` runs the full **skill-matrix
pipeline**: per-profile **base-model** baseline lm-eval (no adapter) → finetune each job's QLoRA adapter →
post-train lm-eval vs. that baseline → check `goals` (gate) → publish to the
Hugging Face Hub if the gate passes → pull adapter + results to your laptop.

```bash
# Full sweep: every job in experiments.yaml
modal run research/modal/finetune_app.py

# One skill (cheap smoke run)
modal run research/modal/finetune_app.py --job math-lora --max-steps 20

# One category (e.g. all "science" jobs)
modal run research/modal/finetune_app.py --category science

# Re-run lm-eval (+ gate + publish) only — adapter already on Volume
modal run research/modal/finetune_app.py --eval-only --job math-lora

# Train + eval but skip the Hub push and the local download
modal run research/modal/finetune_app.py --no-publish --no-pull

# Train/eval jobs in parallel (one GPU per job — higher cost)
modal run research/modal/finetune_app.py --parallel

# Re-run just the gate + Hub publish for an already-evaluated job
modal run research/modal/finetune_app.py::publish_only --job math-lora

# Pull adapters + lm-eval results for a category without re-running anything
modal run research/modal/finetune_app.py::pull --category math
```

Jobs live in [`experiments.yaml`](experiments.yaml) — a **skill matrix**, one
QLoRA adapter per category, each evaluated against the matching
`eval_profile` from [`research/evals/configs/eval_profiles.yaml`](../evals/configs/eval_profiles.yaml):

| Job | Category | Dataset (format) | Eval profile | `goals` task | Publish |
| --- | -------- | ----------------- | ------------ | ------------- | ------- |
| `teaching-lora` | teaching | `research/data/education-lesson-chat.jsonl` (`chat`) | `instructions` | `ifeval` | ✅ |
| `science-lora` | science | `research/data/science-tutor-chat.jsonl` (`chat`) | `science` | `sciq` (+ `arc_challenge` guard) | ✅ |
| `math-lora` | math | `TIGER-Lab/MathInstruct` (`alpaca`) | `math` | `gsm8k` (+ `arc_challenge` guard) | ✅ |
| `coding-lora` | coding | `iamtarun/python_code_instructions_18k_alpaca` (`alpaca`) | `code` | `mbpp` | ✅ |
| `reasoning-lora` | reasoning | `HuggingFaceTB/smoltalk` (`chat`) | `reasoning` | `gsm8k` (+ `hellaswag` guard) | ✅ |
| `language-lesson-lora` | language | `language-lesson-fr/ar.jsonl` (`chat`) | `multilingual` | `xnli` (+ `hellaswag` guard) | ✅ |
| `french-lora` | french | `FrancophonIA/english_french` (`prompt`) + FR chat | `french` | `french_bench_xnli` (+ `hellaswag` guard) | ✅ |
| `alpaca-lora` | instructions | `tatsu-lab/alpaca` (`alpaca`) | `instructions` | — (no `goals`) | local-only |

Before publishing, replace `defaults.hub_org` and each job's `publish.hub_repo`
in `experiments.yaml` with your Hugging Face username/org (defaults to the
placeholder `your-hf-username`).

Edit `defaults.max_steps`, per-job `gpu`, or per-job `max_samples` /
`dataset_split` in `experiments.yaml` to balance cost vs quality. See
[Benchmark gate & Hugging Face Hub publish](#benchmark-gate--hugging-face-hub-publish)
for the `goals`/`publish` schema.
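
The `--job` and `--category` filters can be pictured as a plain selection over the job list. The sketch below is illustrative only: `select_jobs` is a hypothetical helper and `JOBS` is a trimmed stand-in for `experiments.yaml` (names and categories copied from the table above), not the actual loader in `finetune_app.py`.

```python
from typing import Optional

# Trimmed stand-in for the jobs in experiments.yaml (assumed shape: one dict per job).
JOBS = [
    {"name": "teaching-lora", "category": "teaching", "eval_profile": "instructions"},
    {"name": "science-lora", "category": "science", "eval_profile": "science"},
    {"name": "math-lora", "category": "math", "eval_profile": "math"},
    {"name": "alpaca-lora", "category": "instructions", "eval_profile": "instructions"},
]


def select_jobs(jobs, job: Optional[str] = None, category: Optional[str] = None):
    """Return the jobs matching --job and/or --category; no filter means the full sweep."""
    selected = [
        j for j in jobs
        if (job is None or j["name"] == job)
        and (category is None or j["category"] == category)
    ]
    if not selected:
        # Fail loudly instead of silently running nothing.
        raise ValueError(f"no job matches job={job!r} category={category!r}")
    return selected
```

With this shape, `select_jobs(JOBS, category="science")` returns only the `science-lora` entry, and calling it with no filter returns every job.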

### CLI flags (`finetune_app.py`)

`main` (default entrypoint — full pipeline):

| Flag | Default | Meaning |
| ---- | ------- | ------- |
| `--train` / `--no-train` | train on | Run finetune jobs |
| `--eval-only` | off | Skip train; eval existing Volume checkpoints (still runs missing base-model baselines) |
| `--parallel` | off | `finetune_one.spawn()` per job instead of sequential |
| `--job` | all jobs | Run one job name from `experiments.yaml` |
| `--category` | all categories | Run all jobs with this `category` |
| `--max-steps` | from YAML | Override training steps |
| `--publish` / `--no-publish` | publish on | Push to `publish.hub_repo` if the gate passes |
| `--pull` / `--no-pull` | pull on | `modal volume get` the adapter + lm-eval results after each job |

`publish_only` (separate entrypoint — `::publish_only`):

| Flag | Default | Meaning |
| ---- | ------- | ------- |
| `--job` | required | Re-check the gate against existing results and publish if it passes |

`pull` (separate entrypoint — `::pull`):

| Flag | Default | Meaning |
| ---- | ------- | ------- |
| `--job` | — | Pull one job's adapter + results |
| `--category` | — | Pull all jobs in a category |
| `--dest` | `models/finetuned` | Local destination directory |
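
As a rough picture of what a pull amounts to, the sketch below builds the `modal volume get` commands for one job. `pull_commands` is a hypothetical helper, and the local results path is an assumption; the real `pull` entrypoint may lay files out differently.

```python
from pathlib import PurePosixPath

VOLUME = "slm-finetune"


def pull_commands(job: str, eval_profile: str, dest: str = "models/finetuned"):
    """Build argv lists that download one job's adapter and its lm-eval results."""
    run_name = f"{job}__{eval_profile}"  # candidate results folder on the Volume
    return [
        # Adapter folder: <volume>/<job> -> <dest>/<job>
        ["modal", "volume", "get", VOLUME, job, str(PurePosixPath(dest) / job)],
        # lm-eval results (local target path is assumed, not taken from the repo)
        ["modal", "volume", "get", VOLUME,
         f"results/lm_eval/{run_name}", f"results/lm_eval/{run_name}"],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repo root.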

---

## GPU worker (`server_app.py`) — human + AI agent loops

Use this when you want **one warm A10G container** for several hours and many train/eval commands **without** reinstalling deps or re-downloading HF weights each time.

**Quick runbook:** see [`SERVER.md`](SERVER.md) (copy-paste commands for humans and coding agents).

### Deploy once

```bash
modal deploy research/modal/server_app.py
```

App name: **`slm-gpu-worker`**. Dashboard: `modal app list` or the URL printed after deploy.

`GpuWorker` keeps `min_containers=1` while deployed, mounts `hf-cache` + `slm-finetune`, and reuses the same container for sequential `.remote()` calls when possible.

### Two-terminal loop (recommended)

**Terminal 1 — keep worker alive** (default 4h; blocks unless detached):

```bash
modal run research/modal/server_app.py
# or free your terminal:
modal run -d research/modal/server_app.py --hours 6
```

**Terminal 2 — run experiments on the warm GPU** (repeat as often as you like):

```bash
# Full skill-matrix pipeline for one job on the warm container:
# per-profile baseline → train → eval → gate → publish → pull
modal run research/modal/server_app.py --job math-lora --max-steps 20

# All jobs in a category
modal run research/modal/server_app.py --category science

# Whole matrix, but skip the Hub push
modal run research/modal/server_app.py --pipeline --no-publish

# Re-eval (+ gate + publish) an existing adapter on Volume
modal run research/modal/server_app.py --eval-only --job math-lora

# Re-check the gate and publish using already-computed results
modal run research/modal/server_app.py --publish-only --job math-lora

# Arbitrary command in /repo (same env as finetune.py)
modal run research/modal/server_app.py --cmd "uv run python research/finetune.py --help"

# Health check
modal run research/modal/server_app.py --ping
```

Task flags (`--job`, `--category`, `--cmd`, `--pipeline`, `--eval-only`, `--publish-only`, `--ping`) automatically disable the default keep-alive mode.
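
The rule above is small enough to state as code. This is a sketch of the described behavior, not the entrypoint's actual source; `resolve_serve` is a hypothetical name, and the parameters simply mirror the CLI flags.

```python
def resolve_serve(serve: bool = True, *, job=None, category=None, cmd=None,
                  pipeline=False, eval_only=False, publish_only=False,
                  ping=False) -> bool:
    """Keep-alive runs only when it is requested and no task flag is set."""
    task_requested = any([job, category, cmd, pipeline, eval_only, publish_only, ping])
    return serve and not task_requested
```

So a bare `modal run research/modal/server_app.py` corresponds to `resolve_serve()` being true, while adding `--job math-lora` or `--ping` turns keep-alive off.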

### CLI flags (`server_app.py`)

| Flag | Default | Meaning |
| ---- | ------- | ------- |
| *(none)* | `serve=True` | Keep `GpuWorker` alive (`keep_alive`) |
| `--hours` | `4` | Keep-alive duration |
| `--no-serve` | — | Skip keep-alive (auto when any task flag is set) |
| `--job` | — | Run the skill-matrix pipeline for one job |
| `--category` | — | Run the skill-matrix pipeline for all jobs in a category |
| `--pipeline` | off | Run the skill-matrix pipeline for all jobs |
| `--max-steps` | from YAML | Override training steps |
| `--eval-only` | off | Pipeline eval/gate/publish only (skip train; still runs missing base-model baselines) |
| `--publish` / `--no-publish` | publish on | Push to `publish.hub_repo` if the gate passes |
| `--publish-only` | off | Re-check the gate against existing results and publish (requires `--job`) |
| `--pull` / `--no-pull` | pull on | `modal volume get` adapter + results after the pipeline |
| `--cmd` | — | Shell command (parsed with `shlex`) |
| `--ping` | off | Return worker status JSON |

### `GpuWorker` methods (for notebooks / Python callers)

After `modal deploy`, call from Python:

```python
import modal

Worker = modal.Cls.from_name("slm-gpu-worker", "GpuWorker")
w = Worker()

w.ping.remote()
w.finetune.remote({"name": "math-lora", "dataset": "...", "format": "alpaca", "max_steps": 20})
w.lm_eval.remote(experiment_name="math-lora__math", config="research/evals/configs/lm_eval_math.yaml", adapter_path="/vol/finetuned/math-lora")
w.exec_cmd.remote(["uv", "run", "python", "research/finetune.py", "--help"])
w.run_pipeline.remote(job_names=["math-lora"], max_steps=20)

# Gate + publish (only pushes to the Hub if gate_result["passed"])
gate = w.check_gate.remote(
    candidate_results_path="/vol/finetuned/results/lm_eval/math-lora__math/results.json",
    baseline_results_path="/vol/finetuned/results/lm_eval/minicpm5-1b__baseline__math/results.json",
    goals={"task": "gsm8k", "min_score": 0.05, "min_improve": 0.02},
)
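# publish_adapter's remaining arguments are elided in this README, so the call is shown as a comment:
# w.publish_adapter.remote(job=..., adapter_dir="/vol/finetuned/math-lora", gate_result=gate, ...)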
```

Inside the class, `run_pipeline` chains `lm_eval` (baselines) → `finetune` → `lm_eval` (candidate) → `check_gate` → `publish_adapter` via `.local()`, so everything runs in the **same** container without extra cold starts.
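
The ordering of that chain can be sketched with plain callables. Everything here is an assumption made for illustration: `run_pipeline_sketch` and its callback signatures are hypothetical, and the real `run_pipeline` also handles Volume commits, the general eval, and pulls.

```python
def run_pipeline_sketch(jobs, *, lm_eval, finetune, check_gate, publish):
    """Baselines once per profile, then train -> eval -> gate -> publish per job."""
    # 1. One baseline run per distinct eval_profile (adapter=None means base model).
    profiles = list(dict.fromkeys(job["eval_profile"] for job in jobs))
    baselines = {profile: lm_eval(None, profile) for profile in profiles}

    summary = []
    for job in jobs:
        profile = job["eval_profile"]
        adapter = finetune(job)                # 2. train the adapter
        candidate = lm_eval(adapter, profile)  # 3. post-train eval
        goals = job.get("goals")
        # 4. Jobs without goals are never gated.
        gate = check_gate(candidate, baselines[profile], goals) if goals else None
        # 5. Publish only on a passing gate and an explicit publish block.
        published = bool(gate and gate["passed"] and job.get("publish"))
        if published:
            publish(job, adapter, gate)
        summary.append({
            "job": job["name"],
            "gate": None if gate is None else gate["passed"],
            "published": published,
        })
    return summary
```

Passing the steps in as callables keeps the sketch testable without a GPU; in the worker they would be the class's own methods invoked via `.local()`.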

### Persistence (what survives between commands)

| Layer | Survives | Notes |
| ----- | -------- | ----- |
| **Image** (`uv sync` baked in) | Across all runs | Rebuilds only when image definition changes |
| **`hf-cache` Volume** | Across runs | Base weights + datasets; committed after each job |
| **`slm-finetune` Volume** | Across runs | Adapters + lm-eval results |
| **Warm container** | While deployed + idle < `scaledown_window` | `min_containers=1`; max idle grace **3600s** (Modal limit) |
| **`keep_alive` loop** | Up to `--hours` | Container stays active; no scale-down during loop |

### Stop / logs

```bash
modal app logs slm-gpu-worker -f          # stream logs
modal app stop slm-gpu-worker             # stop deployed app + warm pool
modal app stop slm-gpu-worker -y          # no confirmation prompt
```

Refs: [`modal app`](https://modal.com/docs/reference/cli/app) · [`modal run`](https://modal.com/docs/reference/cli/run) · [`modal shell`](https://modal.com/docs/reference/cli/shell)

### Agent loop pattern

For an AI agent iterating on finetune hyperparameters or eval configs:

1. Ensure worker is up: `modal run research/modal/server_app.py --ping` → `{"status": "ok"}`.
2. If ping fails, human or agent runs `modal deploy research/modal/server_app.py` then `modal run -d research/modal/server_app.py --hours 6`.
3. Agent runs smoke train+eval+gate (no publish yet): `--job math-lora --max-steps 5 --no-publish`.
4. Agent re-evals without retraining: `--eval-only --job math-lora`.
5. Agent reads results: `modal volume get slm-finetune results/lm_eval/math-lora__math ./results/lm_eval/math-lora__math` or `modal volume ls slm-finetune`.
6. Agent adjusts `experiments.yaml`'s `goals`/`max_steps`/`max_samples`, repeats from step 3.
7. Once the gate passes and `hub_org`/`hub_repo` are real: `--publish-only --job math-lora`, or just drop `--no-publish`.
8. When done: `modal app stop slm-gpu-worker` (optional, stops GPU billing from warm pool).

See [`SERVER.md`](SERVER.md) for a structured checklist and error recovery table.

---

## What gets saved on Modal

Modal persists artifacts on [**Volumes**](https://modal.com/docs/guide/volumes) — a distributed filesystem optimized for write-once, read-many workloads like model checkpoints. Files written only to the container disk (outside the mount path) are **not** saved.

| Volume | Mount in container | Contents |
| ------ | ------------------ | -------- |
| `slm-finetune` | `/vol/finetuned` | LoRA adapters, `training_results.json`, lm-eval `results/` |
| `hf-cache` | `/root/.cache/huggingface` | Cached base weights + datasets |

Volumes are created lazily on first run (`create_if_missing=True` in [`finetune_app.py`](finetune_app.py)).

### Commits and visibility

Per the [Volumes guide](https://modal.com/docs/guide/volumes):

- **`volume.commit()`** — persist writes so other containers and `modal volume get` can see them. Our workers call this after each train/eval job.
- **Background commits** — Modal also snapshots attached Volumes every few seconds and on container shutdown, but explicit `commit()` is safest before download.
- **`volume.reload()`** — needed only if the *same* container must see writes from another container without restarting. Each `finetune_one.remote()` / `run_lm_eval.remote()` starts fresh and mounts the latest committed state.

Training writes under `/vol/finetuned/...` (the mount), not `/repo/models/...`. That matches Modal’s [model checkpointing](https://modal.com/docs/guide/volumes#model-checkpointing) pattern: point `finetune.py --out` at the Volume path.

### Per-job adapter layout

Each finetune job writes to a Volume path named after the job (e.g. `math-lora/`).
lm-eval results live under `results/lm_eval/`, named
`<job_name>__<eval_profile>` for candidates and `<preset>__baseline__<eval_profile>`
for the shared per-profile baselines:

```text
slm-finetune (Volume)
├── math-lora/
│   ├── adapter_config.json
│   ├── adapter_model.safetensors   # or adapter_model.bin
│   ├── tokenizer files…
│   ├── training_results.json
│   └── README.md                   # model card, written by publish_adapter
├── science-lora/
├── coding-lora/
├── reasoning-lora/
├── teaching-lora/
├── alpaca-lora/
└── results/lm_eval/
    ├── minicpm5-1b__baseline__math/        # shared by all "math" profile jobs
    ├── minicpm5-1b__baseline__science/
    ├── minicpm5-1b__baseline__instructions/
    ├── math-lora__math/
    ├── science-lora__science/
    └── ...
```

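Because `eval_profile` is shared across jobs (e.g. `teaching-lora` and
`alpaca-lora` both use `instructions`), the `instructions` baseline is computed
once per pipeline run and reused for both jobs: it feeds the `teaching-lora` gate
and the baseline-vs-candidate comparison for `alpaca-lora`, which has no `goals`.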
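
The naming convention and the baseline reuse can be sketched in a few lines. The helper names are hypothetical, and `minicpm5-1b` is taken from the layout above as the assumed preset prefix.

```python
PRESET = "minicpm5-1b"  # preset prefix seen in the Volume layout above


def candidate_run(job_name: str, eval_profile: str) -> str:
    """Results folder name for a fine-tuned adapter."""
    return f"{job_name}__{eval_profile}"


def baseline_run(eval_profile: str, preset: str = PRESET) -> str:
    """Results folder name for the shared base-model baseline."""
    return f"{preset}__baseline__{eval_profile}"


def baselines_needed(jobs) -> list:
    """Unique baseline run names in first-seen order, one per eval_profile."""
    seen = {}
    for job in jobs:
        seen.setdefault(job["eval_profile"], baseline_run(job["eval_profile"]))
    return list(seen.values())
```

Two jobs that share the `instructions` profile therefore map to a single `minicpm5-1b__baseline__instructions` run.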

---

## Volume CLI (browse, download, upload)

Official reference: [Modal Volumes guide](https://modal.com/docs/guide/volumes) · [CLI reference](https://modal.com/docs/reference/cli/volume)

### Create or list volumes

```bash
modal volume list
modal volume create slm-finetune    # optional; app creates on first run
modal volume ls slm-finetune
modal volume ls slm-finetune lesson-lora
```

### Browse in a shell

Volumes are mounted under `/mnt` in an interactive shell:

```bash
modal shell --volume slm-finetune
# inside shell:
ls /mnt/slm-finetune
ls /mnt/slm-finetune/lesson-lora
du -sh /mnt/slm-finetune/lesson-lora
```

Use `du` for size — Volumes do not report accurate `df` / `disk_usage()` values ([docs](https://modal.com/docs/guide/volumes#disk-usage-reporting)).

### Download LoRA to your machine

**Use the CLI for adapter weights.** The Modal web UI only supports downloads up to **16 MB** per file; `adapter_model.safetensors` is usually larger ([docs](https://modal.com/docs/guide/volumes#downloading-a-file-from-a-volume)).

```bash
mkdir -p ./models/finetuned

# One job folder → local path expected by models.yaml
modal volume get slm-finetune lesson-lora ./models/finetuned/minicpm5-1b-lora

# lm-eval artifacts
mkdir -p ./results
modal volume get slm-finetune results/lm_eval ./results/lm_eval

# Entire volume (large)
modal volume get slm-finetune / ./modal-artifacts
```

Job folders use the **job name** from `experiments.yaml` (`lesson-lora`), not `minicpm5-1b-lora`. Root [`models.yaml`](../../models.yaml) preset `minicpm5-1b-lesson-lora` expects `./models/finetuned/minicpm5-1b-lora`.

If you downloaded to a different folder name:

```bash
modal volume get slm-finetune lesson-lora ./models/finetuned/lesson-lora
cp -r ./models/finetuned/lesson-lora ./models/finetuned/minicpm5-1b-lora
```

### Upload to a Volume from local

Push a local adapter or merged checkpoint back to Modal ([`modal volume put`](https://modal.com/docs/reference/cli/volume)):

```bash
modal volume put slm-finetune ./models/finetuned/minicpm5-1b-lora lesson-lora
```

Or from Python ([`batch_upload`](https://modal.com/docs/guide/volumes#using-a-volume-from-local-code)):

```python
import modal

vol = modal.Volume.from_name("slm-finetune")
with vol.batch_upload() as batch:
    batch.put_directory(
        "./models/finetuned/minicpm5-1b-lora",
        "/lesson-lora",
    )
```

### Copy within a Volume

```bash
modal volume cp slm-finetune lesson-lora lesson-lora-backup
```

### Parallel training note

With `--parallel`, multiple jobs write to **different** folders on the same Volume. On Volumes v1, avoid more than ~5 concurrent writers/commits ([docs](https://modal.com/docs/guide/volumes#volume-commits-and-reloads)). Prefer sequential runs unless you use Volumes v2 (`modal volume create --version=2`).
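
One way to stay under that limit on Volumes v1 is to launch jobs in waves. The sketch below is a hypothetical helper, not something `--parallel` is documented to do; the default of 5 comes from the guidance above.

```python
def batched(jobs, max_writers: int = 5):
    """Yield successive groups of at most max_writers jobs."""
    if max_writers < 1:
        raise ValueError("max_writers must be at least 1")
    for start in range(0, len(jobs), max_writers):
        yield jobs[start:start + max_writers]
```

A caller would spawn every job in one group, wait for all of them to finish and commit, and only then start the next group.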

---

## Use downloaded weights locally

```bash
# Gradio / inference preset
export ACTIVE_MODEL=minicpm5-1b-lesson-lora

uv run --package gradio-space python -m gradio_space.app

# lm-eval on downloaded adapter
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_smoke.yaml \
  --preset minicpm5-1b-lesson-lora \
  --experiment-name minicpm5-1b-lora__local-check
```

### Optional: merge LoRA into full weights locally

Adapters are small; merged weights are easier for some deploy targets.

```bash
uv run python research/finetune.py \
  --merge ./models/finetuned/minicpm5-1b-lora \
  --out ./models/finetuned/minicpm5-1b-lora-merged
```

Then use preset `minicpm5-1b-lesson-merged` or `--model ./models/finetuned/minicpm5-1b-lora-merged`.

---

## Benchmark gate & Hugging Face Hub publish

`finetune_app.py` / `server_app.py` publish adapters to the Hub **automatically**,
but only when a job's lm-eval results pass its `goals`. This is the
"only ship it if it's actually better" gate.

### `goals` schema (per job in `experiments.yaml`)

```yaml
goals:
  task: gsm8k          # lm-eval task name, scored via primary_metric() (same as summary.md)
  min_score: 0.05      # candidate score must be >= this
  min_improve: 0.02    # candidate - baseline must be >= this (baseline = per-profile baseline run)
  guard_tasks:          # optional regression guards — must NOT regress more than max_regress
    - task: arc_challenge
      max_regress: 0.03
```

Publishable jobs also run a **general** eval (`defaults.general_eval_profile`, default
`compare_study`: arc_easy, arc_challenge, hellaswag, piqa, boolq, gsm8k) and must pass
`defaults.general_goals` regression guards so skill tuning does not wash out general
capability. The publish gate requires **both** skill `goals` and `general_goals` to pass.

A job with no `goals` (e.g. `alpaca-lora`) is never gated and never published —
it's local-only (still trained, evaluated, and pulled to your laptop).
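
To make the gate semantics concrete, here is a minimal sketch of how `goals` could be evaluated. It is not the repo's `evaluate_gate`: the function names are hypothetical, scores are assumed to be already reduced to one primary metric per task, and the general result is taken as already evaluated because the shape of `general_goals` is not shown here.

```python
def evaluate_goals(candidate: dict, baseline: dict, goals: dict) -> dict:
    """candidate/baseline map lm-eval task name -> primary-metric score."""
    task = goals["task"]
    score = candidate[task]
    improve = score - baseline[task]
    checks = []
    if "min_score" in goals:
        checks.append({"check": "min_score", "task": task, "value": score,
                       "threshold": goals["min_score"],
                       "passed": score >= goals["min_score"]})
    if "min_improve" in goals:
        checks.append({"check": "min_improve", "task": task, "value": improve,
                       "threshold": goals["min_improve"],
                       "passed": improve >= goals["min_improve"]})
    for guard in goals.get("guard_tasks", []):
        # Regression is measured as baseline minus candidate on the guard task.
        regress = baseline[guard["task"]] - candidate[guard["task"]]
        checks.append({"check": "max_regress", "task": guard["task"], "value": regress,
                       "threshold": guard["max_regress"],
                       "passed": regress <= guard["max_regress"]})
    return {"passed": all(c["passed"] for c in checks), "checks": checks}


def combine_gates(skill: dict, general: dict) -> dict:
    """Publishing requires both the skill gate and the general gate to pass."""
    return {"passed": skill["passed"] and general["passed"],
            "skill": skill, "general": general,
            "checks": skill["checks"] + general["checks"]}
```

With the example `goals` above, a candidate that scores 0.08 on `gsm8k` against a 0.04 baseline and drops `arc_challenge` from 0.40 to 0.38 passes all three checks; dropping `arc_challenge` to 0.30 fails the guard.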

### `publish` schema (per job)

```yaml
publish:
  hub_repo: your-hf-username/minicpm5-1b-math-lora
  private: false  # public so judges can verify the Well-Tuned badge; set true to keep it hidden
```

### What happens on a passing gate

1. `run_lm_eval` writes skill results to `results/lm_eval/<job>__<profile>/results.json`.
2. For publishable jobs, a second run writes general results to
   `results/lm_eval/<job>__<general_eval_profile>/results.json`.
3. `check_gate` compares skill results against `results/lm_eval/<preset>__baseline__<profile>/results.json`
   and general results against `results/lm_eval/<preset>__baseline__<general_eval_profile>/results.json`
   using `goals` + `general_goals` → `{"passed": bool, "skill": {...}, "general": {...}, "checks": [...]}`.
4. If `passed` and `publish` is set, `publish_adapter`:
   - renders a model card (`README.md`) into the adapter directory — base model,
     gate checks table, full lm-eval baseline-vs-candidate-vs-delta table,
     training stats, and a PEFT load snippet
   - `huggingface_hub.HfApi().create_repo(..., exist_ok=True)` +
     `upload_folder(...)` to `publish.hub_repo`

If the gate fails, nothing is pushed — rerun with different `max_steps` /
dataset / `goals`, then `modal run research/modal/finetune_app.py::publish_only --job <name>`
once it passes (re-checks the gate against the latest results before publishing).

### Setup

```bash
huggingface-cli login
# or: export HF_TOKEN=hf_...   (needs write access; same token as `modal secret create huggingface`)
```

Set real values for `defaults.hub_org` and each job's `publish.hub_repo` in
`experiments.yaml` before running with `--publish` (the default). Repos are
created automatically (`exist_ok=True`) — no need to pre-create them on huggingface.co.
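
A quick pre-flight check can catch leftover placeholders before a run tries to publish. This sketch assumes the parsed `experiments.yaml` has a top-level `defaults` mapping and a `jobs` list; that layout and the helper name `unconfigured_repos` are assumptions, so adjust the keys to the real file.

```python
PLACEHOLDER = "your-hf-username"


def unconfigured_repos(config: dict) -> list:
    """Return the settings that still contain the placeholder username."""
    problems = []
    if PLACEHOLDER in str(config.get("defaults", {}).get("hub_org", "")):
        problems.append("defaults.hub_org")
    for job in config.get("jobs", []):
        # Jobs without a publish block (local-only) are skipped.
        repo = (job.get("publish") or {}).get("hub_repo", "")
        if PLACEHOLDER in repo:
            problems.append(f"{job['name']}.publish.hub_repo")
    return problems
```

An empty list means every publishable job points at a real namespace; anything else is worth fixing before dropping `--no-publish`.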

---

## Manual Hugging Face Hub publish (fallback)

Use this if you'd rather download an adapter and push it yourself — e.g. for
**merged full weights**, or adapters trained before the gate/publish pipeline
existed.

### Prerequisites

```bash
huggingface-cli login
# or: export HF_TOKEN=hf_...
```

Create an empty model repo on Hugging Face (e.g. `your-user/minicpm5-1b-lesson-lora`).

### Option A — Upload LoRA adapter (recommended)

After `modal volume get`:

```bash
ADAPTER=./models/finetuned/minicpm5-1b-lora
REPO=your-user/minicpm5-1b-lesson-lora

huggingface-cli upload "$REPO" "$ADAPTER" . \
  --repo-type model \
  --commit-message "Lesson LoRA from Modal finetune"
```

Add a minimal `README.md` in the adapter folder before upload (or edit on the Hub) documenting the base model:

```markdown
# MiniCPM5-1B lesson LoRA

- Base model: [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B)
- Dataset: education lesson chat (Build Small hackathon)
- Load with PEFT: `PeftModel.from_pretrained(base, "your-user/minicpm5-1b-lesson-lora")`
```

**Load from Hub in Python:**

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "openbmb/MiniCPM5-1B"
adapter = "your-user/minicpm5-1b-lesson-lora"

tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter)
```

### Option B — Upload merged weights

```bash
uv run python research/finetune.py \
  --merge ./models/finetuned/minicpm5-1b-lora \
  --out ./models/finetuned/minicpm5-1b-lora-merged

huggingface-cli upload your-user/minicpm5-1b-lesson-merged \
  ./models/finetuned/minicpm5-1b-lora-merged . \
  --repo-type model
```

Consumers set `MODEL_ID=your-user/minicpm5-1b-lesson-merged` with no adapter.

### Option C — Upload from Modal shell (no local download)

Browse the Volume in a shell ([docs](https://modal.com/docs/guide/volumes#using-a-volume-from-outside-of-modal)):

```bash
modal shell --volume slm-finetune
```

Inside the shell (volume at `/mnt/slm-finetune`):

```bash
pip install huggingface_hub
export HF_TOKEN=...   # write token
huggingface-cli upload your-user/minicpm5-1b-lesson-lora \
  /mnt/slm-finetune/lesson-lora . --repo-type model
```

Downloading to your laptop first (Option A) usually makes the adapter easier to review before publishing.

### Use on Hugging Face Space

**LoRA on Space (Gradio SDK):**

1. Upload adapter repo (Option A).
2. In Space **Settings → Repository secrets**, set `HF_TOKEN` if the base model needs it.
3. In Space env vars:

```bash
ACTIVE_MODEL=minicpm5-1b
# Override adapter via custom preset or env — e.g. add to models.yaml on Space:
# adapter_path: your-user/minicpm5-1b-lesson-lora  # Hub id works if peft resolves it
```

For the shipped Space, the reliable path is: download adapter → commit into repo under `models/finetuned/` → `ACTIVE_MODEL=minicpm5-1b-lesson-lora`, or upload **merged** weights and point `MODEL_ID` at your Hub repo.

**Merged on Space:**

```bash
ACTIVE_MODEL=custom
MODEL_ID=your-user/minicpm5-1b-lesson-merged
TRUST_REMOTE_CODE=true
```

---

## Modal Notebooks (interactive GPU)

Official guide: [Modal Notebooks](https://modal.com/docs/guide/notebooks)

Use a hosted Jupyter kernel on Modal for demos, pair programming, and quick experiments. For reproducible sweeps and CI-style runs, prefer `modal run research/modal/finetune_app.py`.

### Getting started

1. Open [modal.com/notebooks](https://modal.com/notebooks) and **upload** [`research/notebook/minicpm5-modal-finetune.ipynb`](../notebook/minicpm5-modal-finetune.ipynb) (or create a notebook and copy the cells).
2. In the **sidebar → Compute profile**, enable a **GPU** (e.g. A10G). Notebooks are serverless: you pay only while the kernel runs; idle shutdown defaults to 10 minutes.
3. Attach resources in the sidebar **Files** panel:
   - **Volume** `slm-finetune` → appears under `/mnt/slm-finetune` (share checkpoints with `modal run` jobs)
   - **Secret** `huggingface` → injects `HF_TOKEN` for Hub downloads
4. Run cells top to bottom.

The default notebook image includes PyTorch, Transformers, and NumPy. Install extras with:

```python
%uv pip install uv peft bitsandbytes datasets
```

### Persist checkpoints on a Volume

The container filesystem is **ephemeral**. Anything under `/root` is lost when the kernel stops. Write adapters to an attached Volume:

```python
OUT = "/mnt/slm-finetune/lesson-lora-notebook"  # survives kernel restarts
```

After training, download from the **Files** panel (⬇) or locally:

```bash
modal volume get slm-finetune lesson-lora-notebook ./models/finetuned/minicpm5-1b-lora
```

### Custom image (optional, full repo deps)

To match the `modal run` environment exactly, deploy the app image once:

```bash
modal deploy research/modal/finetune_app.py
```

Then in the notebook sidebar, search for function `finetune_one` from app `slm-finetune-benchmark` and select that image as the kernel.

Or call deployed functions from a cell with [`%modal` magic](https://modal.com/docs/guide/notebooks#cell-magic):

```python
%modal from slm-finetune-benchmark import finetune_one

finetune_one.remote({
    "name": "lesson-lora",
    "dataset": "research/data/education-lesson-chat.jsonl",
    "format": "chat",
    "max_steps": 20,
})
```

(Requires `modal deploy` and the repo baked into the image.)

### Share for hackathon judges

Use **Share** in the notebook editor → **public unlisted link** → **Can view and run** so reviewers can fork and execute without a Modal account ([docs](https://modal.com/docs/guide/notebooks#access-and-sharing)).

### Notebook vs `modal run`

| | Modal Notebook | `modal run finetune_app.py` |
| --- | --- | --- |
| Best for | Demo video, exploration | Reproducible sweep, Volume + lm-eval pipeline |
| GPU | Sidebar compute profile | `gpu="A10G"` on functions |
| Persistence | Attach Volume in sidebar | `slm-finetune` Volume auto-mounted |
| Cost | Per kernel uptime | Per function invocation |

---

## Architecture

```mermaid
flowchart LR
  subgraph batch [finetune_app.py — batch]
    laptop1["modal run finetune_app\n--job/--category"] --> base["run_lm_eval\n(per-profile baseline)"]
    laptop1 --> train["finetune_one"]
    train --> eval["run_lm_eval\n(candidate)"]
    eval --> gate["check_gate\n(goals)"]
    gate -- passed --> pub["publish_adapter"]
  end
  subgraph worker [server_app.py — warm loop]
    laptop2["modal run server_app\n--job/--category/--pipeline"] --> gpu["GpuWorker A10G"]
    gpu --> rp["run_pipeline\n(baseline -> train -> eval -> gate -> publish)"]
  end
  base --> vol["Volume slm-finetune"]
  train --> vol
  eval --> vol
  gate --> vol
  rp --> vol
  gpu --> hfc["Volume hf-cache"]
  pub --> hub["Hugging Face Hub\n(publish.hub_repo)"]
  rp --> hub
  vol --> get["modal volume get\n(pull)"]
  get --> local["models/finetuned/<job>"]
  local --> space["HF Space ACTIVE_MODEL"]
```

| Resource | Role |
| -------- | ---- |
| App `slm-finetune-benchmark` | One-shot batch pipeline (`finetune_app.py`): `main`, `publish_only`, `pull` |
| App `slm-gpu-worker` | Long-lived GPU worker (`server_app.py`): `GpuWorker.run_pipeline` |
| GPU `A10G` (or per-job `gpu:` override) | Default for train + eval |
| Secret `huggingface` | `HF_TOKEN` for HF downloads + Hub publish |
| [`_common.py`](_common.py) | Shared image, volumes, command builders, gate (`evaluate_gate`/`check_gate_files`), publish (`publish_adapter_files`, `render_model_card`) |
| [`experiments.yaml`](experiments.yaml) | Skill matrix: jobs, `eval_profile`, `goals`, `publish` |
| [`eval_profiles.yaml`](../evals/configs/eval_profiles.yaml) | Maps `eval_profile` → lm-eval config + task list |
| [`finetune.py`](../finetune.py) | Training logic (unchanged) |
| `slm-lm-eval` | Academic benchmarks |

---

## Troubleshooting

| Symptom | Fix |
| ------- | --- |
| `Secret huggingface not found` | `modal secret create huggingface HF_TOKEN=...` |
| Volume empty after run | Job may have failed; check with `modal volume ls slm-finetune` and ensure writes went to `/vol/finetuned`, not `/repo` |
| `modal volume get` missing files | Make sure the job's `commit()` completed before downloading; for same-container reads use `volume.reload()` |
| Large file won't download in UI | Use `modal volume get` CLI (16 MB UI limit) |
| `modal volume get` path wrong | Job name = top-level folder (e.g. `math-lora`, not `minicpm5-1b-lora`) |
| Gate fails / `published: false, reason: "gate failed"` | Check `gate.checks` in the output; adjust `goals` (`min_score`/`min_improve`/`guard_tasks`), `max_steps`, or dataset, then rerun |
| `published: false, reason: "no publish config..."` | Job has no `publish:` block in `experiments.yaml` (intentional for local-only jobs like `alpaca-lora`) |
| `Unknown eval_profile ...` | Check `eval_profile` in `experiments.yaml` matches a key in `research/evals/configs/eval_profiles.yaml` |
| Hub upload 403 | Use a write `HF_TOKEN`; repos are created automatically (`exist_ok=True`), no need to pre-create |
| Still publishing to `your-hf-username/...` | Edit `defaults.hub_org` and each job's `publish.hub_repo` in `experiments.yaml` |
| Space cannot find adapter | Use merged weights or copy adapter into repo `models/finetuned/` |
| Image build slow | The image rebuilds only when its definition changes; the `hf-cache` Volume caches weights across runs |
| OOM on GPU | `--mode qlora` in `experiments.yaml`; lower `max_len` in finetune; or set a per-job `gpu:` with more VRAM |
| `scaledown_window` deploy error | Must be 2–3600s (we use 3600); see `_common.py` |
| `server_app` ping fails | `modal deploy research/modal/server_app.py`; start keep-alive: `modal run -d research/modal/server_app.py` |
| Jobs hit different containers | Deploy first; use `server_app.py` not `finetune_app.py` for warm loop |
| Worker still billing after done | `modal app stop slm-gpu-worker` |

---

## Hackathon checklist

1. Link or screenshot of Modal app run (`slm-finetune-benchmark` or `slm-gpu-worker`), including the `--- summary ---` table (skill, category, gate, published, hub_repo).
2. `results/lm_eval/<job>__<profile>/comparison.md` — baseline vs candidate per skill.
3. At least one adapter with `goals` that passed the gate and published to the Hub (model card auto-generated).
4. Adapter on Volume or Hub + `ACTIVE_MODEL=minicpm5-1b-<skill>-lora` on Space.
5. Optional: Notebook recording of smoke train cell.

See also: [`SERVER.md`](SERVER.md) · [research/USAGE.md](../USAGE.md) · [Modal Volumes](https://modal.com/docs/guide/volumes) · [Modal Notebooks](https://modal.com/docs/guide/notebooks) · [Modal CUDA](https://modal.com/docs/guide/cuda)