Modal finetune + benchmark

GPU fine-tuning + benchmarking + Hub publishing on Modal for openbmb/MiniCPM5-1B, wrapping existing research/finetune.py and slm-lm-eval.

Use this when you have no local CUDA but want a hackathon-quality train → eval → gate → publish loop for a whole skill matrix of QLoRA adapters (math, science, coding, reasoning, teaching, instructions).

Track	What you ship
Modal	`modal run` skill-matrix pipeline, Volume artifacts, optional Modal Notebook
Well-Tuned	Per-skill before/after `lm-eval` + gated Hub publish for each LoRA

Layout

research/modal/
├── _common.py         # Shared image, volumes, command builders, gate + publish helpers
├── finetune_app.py    # One-shot batch pipeline (slm-finetune-benchmark): main, publish_only, pull
├── server_app.py      # Long-lived GPU worker (slm-gpu-worker): GpuWorker.run_pipeline
├── experiments.yaml   # Skill matrix: jobs, eval_profile, goals, publish
├── README.md          # Full Modal docs (this file)
└── SERVER.md          # Human + AI agent loop runbook (quick reference)

Interactive path: research/notebook/minicpm5-modal-finetune.ipynb (Modal GPU Notebook).

Which app to use

App	CLI	Best for
`finetune_app.py`	`modal run research/modal/finetune_app.py`	Full sweep, CI-style batch, parallel jobs
`server_app.py`	`modal deploy` + `modal run research/modal/server_app.py`	Multi-hour session, iterative human/AI loops on one warm GPU

Both apps share _common.py: same image, hf-cache / slm-finetune volumes, and wrappers around research/finetune.py + slm-lm-eval.

One-time setup

# Modal CLI + auth
pip install modal
modal setup

# HF token (downloads + Hub upload). Same token as huggingface-cli login.
modal secret create huggingface HF_TOKEN=<your-hf-token>

# Optional: validate deps before first image build
uv sync --group finetune --group lm-eval --package slm-evals
uv sync --group modal   # local orchestration only

HF_TOKEN must be a write token if you plan to push adapters to the Hub.

Run training + benchmarks

Run all commands from the repo root. finetune_app.py runs the full skill-matrix pipeline: per-profile base-model baseline lm-eval (no adapter) → finetune each job's QLoRA adapter → post-train lm-eval vs. that baseline → check goals (gate) → publish to the Hugging Face Hub if the gate passes → pull adapter + results to your laptop.

# Full sweep: every job in experiments.yaml
modal run research/modal/finetune_app.py

# One skill (cheap smoke run)
modal run research/modal/finetune_app.py --job math-lora --max-steps 20

# One category (e.g. all "science" jobs)
modal run research/modal/finetune_app.py --category science

# Re-run lm-eval (+ gate + publish) only — adapter already on Volume
modal run research/modal/finetune_app.py --eval-only --job math-lora

# Train + eval but skip the Hub push and the local download
modal run research/modal/finetune_app.py --no-publish --no-pull

# Train/eval jobs in parallel (one GPU per job — higher cost)
modal run research/modal/finetune_app.py --parallel

# Re-run just the gate + Hub publish for an already-evaluated job
modal run research/modal/finetune_app.py::publish_only --job math-lora

# Pull adapters + lm-eval results for a category without re-running anything
modal run research/modal/finetune_app.py::pull --category math

Jobs live in experiments.yaml — a skill matrix, one QLoRA adapter per category, each evaluated against the matching eval_profile from research/evals/configs/eval_profiles.yaml:

Job	Category	Dataset (format)	Eval profile	`goals` task	Publish
`teaching-lora`	teaching	`research/data/education-lesson-chat.jsonl` (`chat`)	`instructions`	`ifeval`	✅
`science-lora`	science	`research/data/science-tutor-chat.jsonl` (`chat`)	`science`	`sciq` (+ `arc_challenge` guard)	✅
`math-lora`	math	`TIGER-Lab/MathInstruct` (`alpaca`)	`math`	`gsm8k` (+ `arc_challenge` guard)	✅
`coding-lora`	coding	`iamtarun/python_code_instructions_18k_alpaca` (`alpaca`)	`code`	`mbpp`	✅
`reasoning-lora`	reasoning	`HuggingFaceTB/smoltalk` (`chat`)	`reasoning`	`gsm8k` (+ `hellaswag` guard)	✅
`language-lesson-lora`	language	`language-lesson-fr/ar.jsonl` (`chat`)	`multilingual`	`xnli` (+ `hellaswag` guard)	✅
`french-lora`	french	`FrancophonIA/english_french` (`prompt`) + FR chat	`french`	`french_bench_xnli` (+ `hellaswag` guard)	✅
`alpaca-lora`	instructions	`tatsu-lab/alpaca` (`alpaca`)	`instructions`	— (no `goals`)	local-only

Before publishing, replace defaults.hub_org and each job's publish.hub_repo in experiments.yaml with your Hugging Face username/org (defaults to the placeholder your-hf-username).

Edit defaults.max_steps, per-job gpu, or per-job max_samples / dataset_split in experiments.yaml to balance cost vs quality. See Benchmark gate & Hugging Face Hub publish for the goals/publish schema.
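The `--job` and `--category` filters amount to a lookup over the job list. The following is a minimal sketch of that selection, assuming `experiments.yaml` has already been parsed into a list of dicts with `name` and `category` keys. `select_jobs` and the inline `JOBS` sample are hypothetical illustrations, not the actual helper in `_common.py`.

```python
from typing import Optional

# Illustrative subset of the skill matrix (names and categories from the table above).
JOBS = [
    {"name": "teaching-lora", "category": "teaching"},
    {"name": "science-lora", "category": "science"},
    {"name": "math-lora", "category": "math"},
    {"name": "alpaca-lora", "category": "instructions"},
]


def select_jobs(jobs: list[dict], job: Optional[str] = None,
                category: Optional[str] = None) -> list[dict]:
    """Return jobs matching --job and/or --category (all jobs if neither is set)."""
    selected = [
        j for j in jobs
        if (job is None or j["name"] == job)
        and (category is None or j["category"] == category)
    ]
    if not selected:
        # Failing early is an assumption; the real CLI may report this differently.
        raise ValueError(f"no job matches job={job!r} category={category!r}")
    return selected


if __name__ == "__main__":
    print([j["name"] for j in select_jobs(JOBS, category="science")])
```

Raising on an empty selection is a design choice for the sketch: a typo in `--job` then fails before any GPU time is spent.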

CLI flags (`finetune_app.py`)

main (default entrypoint — full pipeline):

Flag	Default	Meaning
`--train` / `--no-train`	train on	Run finetune jobs
`--eval-only`	off	Skip train; eval existing Volume checkpoints (still runs missing base-model baselines)
`--parallel`	off	`finetune_one.spawn()` per job instead of sequential
`--job`	all jobs	Run one job name from `experiments.yaml`
`--category`	all categories	Run all jobs with this `category`
`--max-steps`	from YAML	Override training steps
`--publish` / `--no-publish`	publish on	Push to `publish.hub_repo` if the gate passes
`--pull` / `--no-pull`	pull on	`modal volume get` the adapter + lm-eval results after each job

publish_only (separate entrypoint — ::publish_only):

Flag	Default	Meaning
`--job`	required	Re-check the gate against existing results and publish if it passes

pull (separate entrypoint — ::pull):

Flag	Default	Meaning
`--job`	—	Pull one job's adapter + results
`--category`	—	Pull all jobs in a category
`--dest`	`models/finetuned`	Local destination directory
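As a rough illustration of what the pull step does, the sketch below builds one `modal volume get` command per job, copying the job folder on the `slm-finetune` Volume to `<dest>/<job>`. `pull_commands` is a hypothetical helper, and the real `pull` entrypoint may also fetch lm-eval results or lay out files differently.

```python
import shlex

VOLUME = "slm-finetune"  # Volume name used throughout this README


def pull_commands(job_names: list[str], dest: str = "models/finetuned") -> list[list[str]]:
    """Build one `modal volume get` argv per job: <job>/ on the Volume -> <dest>/<job>."""
    return [
        ["modal", "volume", "get", VOLUME, name, f"{dest}/{name}"]
        for name in job_names
    ]


if __name__ == "__main__":
    for argv in pull_commands(["math-lora", "science-lora"]):
        # To execute instead of print: subprocess.run(argv, check=True)
        print(shlex.join(argv))
```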

GPU worker (`server_app.py`) — human + AI agent loops

Use this when you want one warm A10G container for several hours and many train/eval commands without reinstalling deps or re-downloading HF weights each time.

Quick runbook: see SERVER.md (copy-paste commands for humans and coding agents).

Deploy once

modal deploy research/modal/server_app.py

App name: slm-gpu-worker. Dashboard: modal app list or the URL printed after deploy.

GpuWorker keeps min_containers=1 while deployed, mounts hf-cache + slm-finetune, and reuses the same container for sequential .remote() calls when possible.

Two-terminal loop (recommended)

Terminal 1 — keep worker alive (default 4h; blocks unless detached):

modal run research/modal/server_app.py
# or free your terminal:
modal run -d research/modal/server_app.py --hours 6

Terminal 2 — run experiments on the warm GPU (repeat as often as you like):

# Full skill-matrix pipeline for one job on the warm container:
# per-profile baseline → train → eval → gate → publish → pull
modal run research/modal/server_app.py --job math-lora --max-steps 20

# All jobs in a category
modal run research/modal/server_app.py --category science

# Whole matrix, but skip the Hub push
modal run research/modal/server_app.py --pipeline --no-publish

# Re-eval (+ gate + publish) an existing adapter on Volume
modal run research/modal/server_app.py --eval-only --job math-lora

# Re-check the gate and publish using already-computed results
modal run research/modal/server_app.py --publish-only --job math-lora

# Arbitrary command in /repo (same env as finetune.py)
modal run research/modal/server_app.py --cmd "uv run python research/finetune.py --help"

# Health check
modal run research/modal/server_app.py --ping

Task flags (--job, --category, --cmd, --pipeline, --eval-only, --publish-only, --ping) automatically disable the default keep-alive mode.
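The rule above can be written as a single predicate. This is a sketch under the assumption that each CLI flag maps to one keyword argument; `should_serve` is a hypothetical name, not necessarily how `server_app.py` implements it.

```python
def should_serve(*, no_serve: bool = False, job: str = "", category: str = "",
                 cmd: str = "", pipeline: bool = False, eval_only: bool = False,
                 publish_only: bool = False, ping: bool = False) -> bool:
    """Keep-alive runs only when no task flag is set and --no-serve was not passed."""
    task_requested = any([job, category, cmd, pipeline, eval_only, publish_only, ping])
    return not (no_serve or task_requested)


if __name__ == "__main__":
    print(should_serve())                 # bare `modal run server_app.py`
    print(should_serve(job="math-lora"))  # a task flag is set
```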

CLI flags (`server_app.py`)

Flag	Default	Meaning
(none)	`serve=True`	Keep `GpuWorker` alive (`keep_alive`)
`--hours`	`4`	Keep-alive duration
`--no-serve`	—	Skip keep-alive (auto when any task flag is set)
`--job`	—	Run the skill-matrix pipeline for one job
`--category`	—	Run the skill-matrix pipeline for all jobs in a category
`--pipeline`	off	Run the skill-matrix pipeline for all jobs
`--max-steps`	from YAML	Override training steps
`--eval-only`	off	Pipeline eval/gate/publish only (skip train; still runs missing base-model baselines)
`--publish` / `--no-publish`	publish on	Push to `publish.hub_repo` if the gate passes
`--publish-only`	off	Re-check the gate against existing results and publish (requires `--job`)
`--pull` / `--no-pull`	pull on	`modal volume get` adapter + results after the pipeline
`--cmd`	—	Shell command (parsed with `shlex`)
`--ping`	off	Return worker status JSON

`GpuWorker` methods (for notebooks / Python callers)

After modal deploy, call from Python:

import modal

Worker = modal.Cls.from_name("slm-gpu-worker", "GpuWorker")
w = Worker()

w.ping.remote()
w.finetune.remote({"name": "math-lora", "dataset": "...", "format": "alpaca", "max_steps": 20})
w.lm_eval.remote(experiment_name="math-lora__math", config="research/evals/configs/lm_eval_math.yaml", adapter_path="/vol/finetuned/math-lora")
w.exec_cmd.remote(["uv", "run", "python", "research/finetune.py", "--help"])
w.run_pipeline.remote(job_names=["math-lora"], max_steps=20)

# Gate + publish (only pushes to the Hub if gate_result["passed"])
gate = w.check_gate.remote(
    candidate_results_path="/vol/finetuned/results/lm_eval/math-lora__math/results.json",
    baseline_results_path="/vol/finetuned/results/lm_eval/minicpm5-1b__baseline__math/results.json",
    goals={"task": "gsm8k", "min_score": 0.05, "min_improve": 0.02},
)
w.publish_adapter.remote(job=..., adapter_dir="/vol/finetuned/math-lora", gate_result=gate, ...)

Inside the class, run_pipeline chains lm_eval (baselines) → finetune → lm_eval (candidate) → check_gate → publish_adapter via .local(), so everything runs in the same container without extra cold starts.
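To make the call order concrete, here is a simplified sketch of that chain with the four steps injected as plain callables. It is not the real `run_pipeline`: the function name, the signatures, and the summary shape are assumptions, and the general-eval pass for publishable jobs is omitted for brevity.

```python
from typing import Callable, Optional


def run_pipeline_sketch(
    jobs: list[dict],
    lm_eval: Callable[[str], dict],                  # experiment name -> scores
    finetune: Callable[[dict], str],                 # job -> adapter directory
    check_gate: Callable[[dict, dict, dict], dict],  # candidate, baseline, goals -> gate
    publish: Callable[[dict, str, dict], None],
    preset: str = "minicpm5-1b",
) -> list[dict]:
    """Baseline -> train -> eval -> gate -> publish, one job at a time."""
    baselines: dict[str, dict] = {}
    summary = []
    for job in jobs:
        profile = job["eval_profile"]
        if profile not in baselines:  # one baseline per eval profile, reused across jobs
            baselines[profile] = lm_eval(f"{preset}__baseline__{profile}")
        adapter_dir = finetune(job)
        candidate = lm_eval(f"{job['name']}__{profile}")
        goals = job.get("goals")
        gate: Optional[dict] = (
            check_gate(candidate, baselines[profile], goals) if goals else None
        )
        published = bool(gate and gate["passed"] and job.get("publish"))
        if published:
            publish(job, adapter_dir, gate)
        summary.append({"skill": job["name"], "gate": gate, "published": published})
    return summary
```

Passing the steps in as callables keeps the sketch testable with stubs; in the worker they would be the `.local()` method calls described above.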

Persistence (what survives between commands)

Layer	Survives	Notes
Image (`uv sync` baked in)	Across all runs	Rebuilds only when image definition changes
`hf-cache` Volume	Across runs	Base weights + datasets; committed after each job
`slm-finetune` Volume	Across runs	Adapters + lm-eval results
Warm container	While deployed + idle < `scaledown_window`	`min_containers=1`; max idle grace 3600s (Modal limit)
`keep_alive` loop	Up to `--hours`	Container stays active; no scale-down during loop

Stop / logs

modal app logs slm-gpu-worker -f          # stream logs
modal app stop slm-gpu-worker             # stop deployed app + warm pool
modal app stop slm-gpu-worker -y          # no confirmation prompt

Refs: modal app · modal run · modal shell

Agent loop pattern

For an AI agent iterating on finetune hyperparameters or eval configs:

1. Ensure worker is up: modal run research/modal/server_app.py --ping → {"status": "ok"}.
2. If ping fails, human or agent runs modal deploy research/modal/server_app.py then modal run -d research/modal/server_app.py --hours 6.
3. Agent runs smoke train+eval+gate (no publish yet): --job math-lora --max-steps 5 --no-publish.
4. Agent re-evals without retraining: --eval-only --job math-lora.
5. Agent reads results: modal volume get slm-finetune results/lm_eval/math-lora__math ./results/lm_eval/math-lora__math or modal volume ls slm-finetune.
6. Agent adjusts experiments.yaml's goals/max_steps/max_samples, repeats from step 3.
7. Once the gate passes and hub_org/hub_repo are real: --publish-only --job math-lora, or just drop --no-publish.
8. When done: modal app stop slm-gpu-worker (optional, stops GPU billing from warm pool).

See SERVER.md for a structured checklist and error recovery table.

What gets saved on Modal

Modal persists artifacts on Volumes — a distributed filesystem optimized for write-once, read-many workloads like model checkpoints. Files written only to the container disk (outside the mount path) are not saved.

Volume	Mount in container	Contents
`slm-finetune`	`/vol/finetuned`	LoRA adapters, `training_results.json`, lm-eval `results/`
`hf-cache`	`/root/.cache/huggingface`	Cached base weights + datasets

Volumes are created lazily on first run (create_if_missing=True in finetune_app.py).

Commits and visibility

Per the Volumes guide:

- volume.commit(): persist writes so other containers and modal volume get can see them. Our workers call this after each train/eval job.
- Background commits: Modal also snapshots attached Volumes every few seconds and on container shutdown, but explicit commit() is safest before download.
- volume.reload(): needed only if the same container must see writes from another container without restarting. Each finetune_one.remote() / run_lm_eval.remote() starts fresh and mounts the latest committed state.

Training writes under /vol/finetuned/... (the mount), not /repo/models/.... That matches Modal’s model checkpointing pattern: point finetune.py --out at the Volume path.

Per-job adapter layout

Each finetune job writes to a Volume path named after the job (e.g. math-lora/). lm-eval results live under results/lm_eval/, named <job_name>__<eval_profile> for candidates and <preset>__baseline__<eval_profile> for the shared per-profile baselines:

slm-finetune (Volume)
├── math-lora/
│   ├── adapter_config.json
│   ├── adapter_model.safetensors   # or adapter_model.bin
│   ├── tokenizer files…
│   ├── training_results.json
│   └── README.md                   # model card, written by publish_adapter
├── science-lora/
├── coding-lora/
├── reasoning-lora/
├── teaching-lora/
├── alpaca-lora/
└── results/lm_eval/
    ├── minicpm5-1b__baseline__math/        # shared by all "math" profile jobs
    ├── minicpm5-1b__baseline__science/
    ├── minicpm5-1b__baseline__instructions/
    ├── math-lora__math/
    ├── science-lora__science/
    └── ...

Because eval_profile is shared across jobs (e.g. teaching-lora and alpaca-lora both use instructions), the instructions baseline is computed once per pipeline run and reused for both jobs' gates.
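The naming scheme and the baseline reuse can be sketched in a few lines. The helper names below are hypothetical, and the default preset `minicpm5-1b` is taken from the directory names in the tree above.

```python
RESULTS_ROOT = "results/lm_eval"  # relative to the slm-finetune Volume root


def candidate_dir(job_name: str, eval_profile: str) -> str:
    """Results directory for a finetuned adapter: <job_name>__<eval_profile>."""
    return f"{RESULTS_ROOT}/{job_name}__{eval_profile}"


def baseline_dir(preset: str, eval_profile: str) -> str:
    """Results directory for the shared baseline: <preset>__baseline__<eval_profile>."""
    return f"{RESULTS_ROOT}/{preset}__baseline__{eval_profile}"


def baselines_needed(jobs: list[dict], preset: str = "minicpm5-1b") -> list[str]:
    """Unique baseline directories in first-seen order, one per eval profile."""
    profiles = dict.fromkeys(job["eval_profile"] for job in jobs)
    return [baseline_dir(preset, profile) for profile in profiles]


if __name__ == "__main__":
    jobs = [
        {"name": "teaching-lora", "eval_profile": "instructions"},
        {"name": "alpaca-lora", "eval_profile": "instructions"},
        {"name": "math-lora", "eval_profile": "math"},
    ]
    for path in baselines_needed(jobs):
        print(path)
```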

Volume CLI (browse, download, upload)

Official reference: Modal Volumes guide · CLI reference

Create or list volumes

modal volume list
modal volume create slm-finetune    # optional; app creates on first run
modal volume ls slm-finetune
modal volume ls slm-finetune lesson-lora

Browse in a shell

Volumes are mounted under /mnt in an interactive shell:

modal shell --volume slm-finetune
# inside shell:
ls /mnt/slm-finetune
ls /mnt/slm-finetune/lesson-lora
du -sh /mnt/slm-finetune/lesson-lora

Use du for size — Volumes do not report accurate df / disk_usage() values (docs).

Download LoRA to your machine

Use the CLI for adapter weights. The Modal web UI only supports downloads up to 16 MB per file; adapter_model.safetensors is usually larger (docs).

mkdir -p ./models/finetuned

# One job folder → local path expected by models.yaml
modal volume get slm-finetune lesson-lora ./models/finetuned/minicpm5-1b-lora

# lm-eval artifacts
mkdir -p ./results
modal volume get slm-finetune results/lm_eval ./results/lm_eval

# Entire volume (large)
modal volume get slm-finetune / ./modal-artifacts

Job folders use the job name from experiments.yaml (lesson-lora), not minicpm5-1b-lora. Root models.yaml preset minicpm5-1b-lesson-lora expects ./models/finetuned/minicpm5-1b-lora.

If you downloaded to a different folder name:

modal volume get slm-finetune lesson-lora ./models/finetuned/lesson-lora
cp -r ./models/finetuned/lesson-lora ./models/finetuned/minicpm5-1b-lora

Upload to a Volume from local

Push a local adapter or merged checkpoint back to Modal (modal volume put):

modal volume put slm-finetune ./models/finetuned/minicpm5-1b-lora lesson-lora

Or from Python (batch_upload):

import modal

vol = modal.Volume.from_name("slm-finetune")
with vol.batch_upload() as batch:
    batch.put_directory(
        "./models/finetuned/minicpm5-1b-lora",
        "/lesson-lora",
    )

Copy within a Volume

modal volume cp slm-finetune lesson-lora lesson-lora-backup

Parallel training note

With --parallel, multiple jobs write to different folders on the same Volume. On Volumes v1, avoid more than ~5 concurrent writers/commits (docs). Prefer sequential runs unless you use Volumes v2 (modal volume create --version=2).
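One way to stay under that guidance on Volumes v1 is to launch parallel jobs in waves. The sketch below only shows the batching; it is a hypothetical helper, not something `--parallel` is documented to do, and the limit of 5 is the approximate figure quoted above rather than a hard constant.

```python
from typing import Iterator

MAX_CONCURRENT_WRITERS = 5  # approximate Volumes v1 guidance quoted above


def batches(job_names: list[str], size: int = MAX_CONCURRENT_WRITERS) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` job names."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(job_names), size):
        yield job_names[start:start + size]


if __name__ == "__main__":
    names = [f"job-{i}" for i in range(8)]
    for wave in batches(names):
        # Launch every job in `wave`, wait for all of them, then start the next wave.
        print(wave)
```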

Use downloaded weights locally

# Gradio / inference preset
export ACTIVE_MODEL=minicpm5-1b-lesson-lora

uv run --package gradio-space python -m gradio_space.app

# lm-eval on downloaded adapter
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_smoke.yaml \
  --preset minicpm5-1b-lesson-lora \
  --experiment-name minicpm5-1b-lora__local-check

Optional: merge LoRA into full weights locally

Adapters are small; merged weights are easier for some deploy targets.

uv run python research/finetune.py \
  --merge ./models/finetuned/minicpm5-1b-lora \
  --out ./models/finetuned/minicpm5-1b-lora-merged

Then use preset minicpm5-1b-lesson-merged or --model ./models/finetuned/minicpm5-1b-lora-merged.

Benchmark gate & Hugging Face Hub publish

finetune_app.py / server_app.py publish adapters to the Hub automatically, but only when a job's lm-eval results pass its goals. This is the "only ship it if it's actually better" gate.

`goals` schema (per job in `experiments.yaml`)

goals:
  task: gsm8k          # lm-eval task name, scored via primary_metric() (same as summary.md)
  min_score: 0.05      # candidate score must be >= this
  min_improve: 0.02    # candidate - baseline must be >= this (baseline = per-profile baseline run)
  guard_tasks:          # optional regression guards — must NOT regress more than max_regress
    - task: arc_challenge
      max_regress: 0.03

Publishable jobs also run a general eval (defaults.general_eval_profile, default compare_study: arc_easy, arc_challenge, hellaswag, piqa, boolq, gsm8k) and must pass defaults.general_goals regression guards so skill tuning does not wash out general capability. The publish gate requires both skill goals and general_goals to pass.

A job with no goals (e.g. alpaca-lora) is never gated and never published — it's local-only (still trained, evaluated, and pulled to your laptop).
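The skill-gate rules above can be sketched as a pure function. This is an illustration, not the real `evaluate_gate` in `_common.py`: it assumes scores have already been reduced to a `task -> primary metric` mapping, it ignores `general_goals`, and the shape of each entry in `checks` is made up for the example.

```python
from typing import Optional


def evaluate_gate_sketch(candidate: dict[str, float], baseline: dict[str, float],
                         goals: Optional[dict]) -> Optional[dict]:
    """Return None for an ungated job, else {"passed": bool, "checks": [...]}."""
    if not goals:
        return None  # no goals: never gated, never published
    task = goals["task"]
    score = candidate[task]
    checks = []
    if "min_score" in goals:
        checks.append({"check": "min_score", "task": task, "value": score,
                       "passed": score >= goals["min_score"]})
    if "min_improve" in goals:
        delta = score - baseline[task]
        checks.append({"check": "min_improve", "task": task, "value": delta,
                       "passed": delta >= goals["min_improve"]})
    for guard in goals.get("guard_tasks", []):
        # Regression is how far the candidate fell below the baseline on a guard task.
        regress = baseline[guard["task"]] - candidate[guard["task"]]
        checks.append({"check": "max_regress", "task": guard["task"], "value": regress,
                       "passed": regress <= guard["max_regress"]})
    return {"passed": all(c["passed"] for c in checks), "checks": checks}


if __name__ == "__main__":
    goals = {"task": "gsm8k", "min_score": 0.05, "min_improve": 0.02,
             "guard_tasks": [{"task": "arc_challenge", "max_regress": 0.03}]}
    result = evaluate_gate_sketch({"gsm8k": 0.10, "arc_challenge": 0.30},
                                  {"gsm8k": 0.07, "arc_challenge": 0.31}, goals)
    print(result["passed"])
```

Returning the individual checks alongside the verdict is what makes a failed gate debuggable: the failing row shows which threshold to revisit.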

`publish` schema (per job)

publish:
  hub_repo: your-hf-username/minicpm5-1b-math-lora
  private: false  # public so judges can verify the Well-Tuned badge; set true to keep it hidden

What happens on a passing gate

1. run_lm_eval writes skill results to results/lm_eval/<job>__<profile>/results.json.
2. For publishable jobs, a second run writes general results to results/lm_eval/<job>__<general_eval_profile>/results.json.
3. check_gate compares skill results against results/lm_eval/<preset>__baseline__<profile>/results.json and general results against results/lm_eval/<preset>__baseline__<general_eval_profile>/results.json using goals + general_goals → {"passed": bool, "skill": {...}, "general": {...}, "checks": [...]}.
4. If passed and publish is set, publish_adapter:
   - renders a model card (README.md) into the adapter directory: base model, gate checks table, full lm-eval baseline-vs-candidate-vs-delta table, training stats, and a PEFT load snippet
   - huggingface_hub.HfApi().create_repo(..., exist_ok=True) + upload_folder(...) to publish.hub_repo

If the gate fails, nothing is pushed — rerun with different max_steps / dataset / goals, then modal run research/modal/finetune_app.py::publish_only --job <name> once it passes (re-checks the gate against the latest results before publishing).
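The publish decision can be sketched as follows. `maybe_publish` and `hub_upload` are hypothetical names, the order of the two early returns is an assumption, and the `reason` strings are illustrative and may not match the real output verbatim. The upload itself is injected so the decision logic can be exercised without network access.

```python
from typing import Callable, Optional


def hub_upload(repo_id: str, folder: str, private: bool) -> None:
    """Create the repo if needed, then upload the adapter folder."""
    from huggingface_hub import HfApi  # imported lazily so the sketch runs without it

    api = HfApi()
    api.create_repo(repo_id, private=private, exist_ok=True)
    api.upload_folder(repo_id=repo_id, folder_path=folder)


def maybe_publish(job: dict, adapter_dir: str, gate: Optional[dict],
                  upload: Callable[[str, str, bool], None] = hub_upload) -> dict:
    """Push the adapter only when the job has a publish block and the gate passed."""
    publish = job.get("publish")
    if not publish:
        return {"published": False, "reason": "no publish config"}
    if not gate or not gate.get("passed"):
        return {"published": False, "reason": "gate failed"}
    upload(publish["hub_repo"], adapter_dir, publish.get("private", False))
    return {"published": True, "hub_repo": publish["hub_repo"]}


if __name__ == "__main__":
    job = {"name": "math-lora",
           "publish": {"hub_repo": "your-hf-username/minicpm5-1b-math-lora"}}
    # Dry run: record the upload instead of contacting the Hub.
    print(maybe_publish(job, "/vol/finetuned/math-lora", {"passed": False},
                        upload=lambda repo, folder, private: None))
```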

Setup

huggingface-cli login
# or: export HF_TOKEN=hf_...   (needs write access; same token as `modal secret create huggingface`)

Set real values for defaults.hub_org and each job's publish.hub_repo in experiments.yaml before running with --publish (the default). Repos are created automatically (exist_ok=True) — no need to pre-create them on huggingface.co.

Manual Hugging Face Hub publish (fallback)

Use this if you'd rather download an adapter and push it yourself — e.g. for merged full weights, or adapters trained before the gate/publish pipeline existed.

Prerequisites

huggingface-cli login
# or: export HF_TOKEN=hf_...

Create an empty model repo on Hugging Face (e.g. your-user/minicpm5-1b-lesson-lora).

Option A — Upload LoRA adapter (recommended)

After modal volume get:

ADAPTER=./models/finetuned/minicpm5-1b-lora
REPO=your-user/minicpm5-1b-lesson-lora

huggingface-cli upload "$REPO" "$ADAPTER" . \
  --repo-type model \
  --commit-message "Lesson LoRA from Modal finetune"

Add a minimal README.md in the adapter folder before upload (or edit on the Hub) documenting the base model:

# MiniCPM5-1B lesson LoRA

- Base model: [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B)
- Dataset: education lesson chat (Build Small hackathon)
- Load with PEFT: `PeftModel.from_pretrained(base, "your-user/minicpm5-1b-lesson-lora")`

Load from Hub in Python:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "openbmb/MiniCPM5-1B"
adapter = "your-user/minicpm5-1b-lesson-lora"

tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter)

Option B — Upload merged weights

uv run python research/finetune.py \
  --merge ./models/finetuned/minicpm5-1b-lora \
  --out ./models/finetuned/minicpm5-1b-lora-merged

huggingface-cli upload your-user/minicpm5-1b-lesson-merged \
  ./models/finetuned/minicpm5-1b-lora-merged . \
  --repo-type model

Consumers set MODEL_ID=your-user/minicpm5-1b-lesson-merged with no adapter.

Option C — Upload from Modal shell (no local download)

Browse the Volume in a shell (docs):

modal shell --volume slm-finetune

Inside the shell (volume at /mnt/slm-finetune):

pip install huggingface_hub
export HF_TOKEN=...   # write token
huggingface-cli upload your-user/minicpm5-1b-lesson-lora \
  /mnt/slm-finetune/lesson-lora . --repo-type model

Downloading to your laptop first (Option A) usually makes the adapter easier to review before publishing.

Use on Hugging Face Space

LoRA on Space (Gradio SDK):

Upload adapter repo (Option A).
In Space Settings → Repository secrets, set HF_TOKEN if the base model needs it.
In Space env vars:

ACTIVE_MODEL=minicpm5-1b
# Override adapter via custom preset or env — e.g. add to models.yaml on Space:
# adapter_path: your-user/minicpm5-1b-lesson-lora  # Hub id works if peft resolves it

For the shipped Space, the reliable path is: download adapter → commit into repo under models/finetuned/ → ACTIVE_MODEL=minicpm5-1b-lesson-lora, or upload merged weights and point MODEL_ID at your Hub repo.

Merged on Space:

ACTIVE_MODEL=custom
MODEL_ID=your-user/minicpm5-1b-lesson-merged
TRUST_REMOTE_CODE=true

Modal Notebooks (interactive GPU)

Official guide: Modal Notebooks

Use a hosted Jupyter kernel on Modal for demos, pair programming, and quick experiments. For reproducible sweeps and CI-style runs, prefer modal run research/modal/finetune_app.py.

Getting started

1. Open modal.com/notebooks and upload research/notebook/minicpm5-modal-finetune.ipynb (or create a notebook and copy the cells).
2. In the sidebar → Compute profile, enable a GPU (e.g. A10G). Notebooks are serverless: you pay only while the kernel runs; idle shutdown defaults to 10 minutes.
3. Attach resources in the sidebar Files panel:
   - Volume slm-finetune → appears under /mnt/slm-finetune (share checkpoints with modal run jobs)
   - Secret huggingface → injects HF_TOKEN for Hub downloads
4. Run cells top to bottom.

The default notebook image includes PyTorch, Transformers, and NumPy. Install extras with:

%uv pip install uv peft bitsandbytes datasets

Persist checkpoints on a Volume

The container filesystem is ephemeral. Anything under /root is lost when the kernel stops. Write adapters to an attached Volume:

OUT = "/mnt/slm-finetune/lesson-lora-notebook"  # survives kernel restarts

After training, download from the Files panel (⬇) or locally:

modal volume get slm-finetune lesson-lora-notebook ./models/finetuned/minicpm5-1b-lora

Custom image (optional, full repo deps)

To match the modal run environment exactly, deploy the app image once:

modal deploy research/modal/finetune_app.py

Then in the notebook sidebar, search for function finetune_one from app slm-finetune-benchmark and select that image as the kernel.

Or call deployed functions from a cell with %modal magic:

%modal from slm-finetune-benchmark import finetune_one

finetune_one.remote({
    "name": "lesson-lora",
    "dataset": "research/data/education-lesson-chat.jsonl",
    "format": "chat",
    "max_steps": 20,
})

(Requires modal deploy and the repo baked into the image.)

Share for hackathon judges

Use Share in the notebook editor → public unlisted link → Can view and run so reviewers can fork and execute without a Modal account (docs).

Notebook vs `modal run`

	Modal Notebook	`modal run finetune_app.py`
Best for	Demo video, exploration	Reproducible sweep, Volume + lm-eval pipeline
GPU	Sidebar compute profile	`gpu="A10G"` on functions
Persistence	Attach Volume in sidebar	`slm-finetune` Volume auto-mounted
Cost	Per kernel uptime	Per function invocation

Architecture

flowchart LR
  subgraph batch [finetune_app.py — batch]
    laptop1["modal run finetune_app\n--job/--category"] --> base["run_lm_eval\n(per-profile baseline)"]
    laptop1 --> train["finetune_one"]
    train --> eval["run_lm_eval\n(candidate)"]
    eval --> gate["check_gate\n(goals)"]
    gate -- passed --> pub["publish_adapter"]
  end
  subgraph worker [server_app.py — warm loop]
    laptop2["modal run server_app\n--job/--category/--pipeline"] --> gpu["GpuWorker A10G"]
    gpu --> rp["run_pipeline\n(baseline -> train -> eval -> gate -> publish)"]
  end
  base --> vol["Volume slm-finetune"]
  train --> vol
  eval --> vol
  gate --> vol
  rp --> vol
  gpu --> hfc["Volume hf-cache"]
  pub --> hub["Hugging Face Hub\n(publish.hub_repo)"]
  rp --> hub
  vol --> get["modal volume get\n(pull)"]
  get --> local["models/finetuned/<job>"]
  local --> space["HF Space ACTIVE_MODEL"]

Resource	Role
App `slm-finetune-benchmark`	One-shot batch pipeline (`finetune_app.py`): `main`, `publish_only`, `pull`
App `slm-gpu-worker`	Long-lived GPU worker (`server_app.py`): `GpuWorker.run_pipeline`
GPU `A10G` (or per-job `gpu:` override)	Default for train + eval
Secret `huggingface`	`HF_TOKEN` for HF downloads + Hub publish
`_common.py`	Shared image, volumes, command builders, gate (`evaluate_gate`/`check_gate_files`), publish (`publish_adapter_files`, `render_model_card`)
`experiments.yaml`	Skill matrix: jobs, `eval_profile`, `goals`, `publish`
`eval_profiles.yaml`	Maps `eval_profile` → lm-eval config + task list
`finetune.py`	Training logic (unchanged)
`slm-lm-eval`	Academic benchmarks

Troubleshooting

Symptom	Fix
`Secret huggingface not found`	`modal secret create huggingface HF_TOKEN=...`
Volume empty after run	Job may have failed; list contents with `modal volume ls slm-finetune` and ensure writes went to `/vol/finetuned`, not `/repo`
`modal volume get` missing files	Confirm the job's `commit()` completed before downloading; `volume.reload()` is needed only when a running container must see another container's writes
Large file won't download in UI	Use `modal volume get` CLI (16 MB UI limit)
`modal volume get` path wrong	Job name = top-level folder (e.g. `math-lora`, not `minicpm5-1b-lora`)
Gate fails / `published: false, reason: "gate failed"`	Check `gate.checks` in the output; adjust `goals` (`min_score`/`min_improve`/`guard_tasks`), `max_steps`, or dataset, then rerun
`published: false, reason: "no publish config..."`	Job has no `publish:` block in `experiments.yaml` (intentional for local-only jobs like `alpaca-lora`)
`Unknown eval_profile ...`	Check `eval_profile` in `experiments.yaml` matches a key in `research/evals/configs/eval_profiles.yaml`
Hub upload 403	Use a write `HF_TOKEN`; repos are created automatically (`exist_ok=True`), no need to pre-create
Still publishing to `your-hf-username/...`	Edit `defaults.hub_org` and each job's `publish.hub_repo` in `experiments.yaml`
Space cannot find adapter	Use merged weights or copy adapter into repo `models/finetuned/`
Image build slow	Expected on the first build; the image rebuilds only when its definition changes, and the `hf-cache` Volume caches weights across runs
OOM on GPU	`--mode qlora` in `experiments.yaml`; lower `max_len` in finetune; or set a per-job `gpu:` with more VRAM
`scaledown_window` deploy error	Must be 2–3600s (we use 3600); see `_common.py`
`server_app` ping fails	`modal deploy research/modal/server_app.py`; start keep-alive: `modal run -d research/modal/server_app.py`
Jobs hit different containers	Deploy first; use `server_app.py` not `finetune_app.py` for warm loop
Worker still billing after done	`modal app stop slm-gpu-worker`

Hackathon checklist

Link or screenshot of Modal app run (slm-finetune-benchmark or slm-gpu-worker), including the --- summary --- table (skill, category, gate, published, hub_repo).
results/lm_eval/<job>__<profile>/comparison.md — baseline vs candidate per skill.
At least one adapter with goals that passed the gate and published to the Hub (model card auto-generated).
Adapter on Volume or Hub + ACTIVE_MODEL=minicpm5-1b-<skill>-lora on Space.
Optional: Notebook recording of smoke train cell.