Modal finetune + benchmark
GPU fine-tuning + benchmarking + Hub publishing on Modal for openbmb/MiniCPM5-1B, wrapping existing research/finetune.py and slm-lm-eval.
Use this when you have no local CUDA but want a hackathon-quality train → eval → gate → publish loop for a whole skill matrix of QLoRA adapters (math, science, coding, reasoning, teaching, instructions).
| Track | What you ship |
|---|---|
| Modal | modal run skill-matrix pipeline, Volume artifacts, optional Modal Notebook |
| Well-Tuned | Per-skill before/after lm-eval + gated Hub publish for each LoRA |
Layout
research/modal/
├── _common.py # Shared image, volumes, command builders, gate + publish helpers
├── finetune_app.py # One-shot batch pipeline (slm-finetune-benchmark): main, publish_only, pull
├── server_app.py # Long-lived GPU worker (slm-gpu-worker): GpuWorker.run_pipeline
├── experiments.yaml # Skill matrix: jobs, eval_profile, goals, publish
├── README.md # Full Modal docs (this file)
└── SERVER.md # Human + AI agent loop runbook (quick reference)
Interactive path: research/notebook/minicpm5-modal-finetune.ipynb (Modal GPU Notebook).
Which app to use
| App | CLI | Best for |
|---|---|---|
| finetune_app.py | modal run research/modal/finetune_app.py | Full sweep, CI-style batch, parallel jobs |
| server_app.py | modal deploy + modal run research/modal/server_app.py | Multi-hour session, iterative human/AI loops on one warm GPU |
Both apps share _common.py: same image, hf-cache / slm-finetune volumes, and wrappers around research/finetune.py + slm-lm-eval.
One-time setup
# Modal CLI + auth
pip install modal
modal setup
# HF token (downloads + Hub upload). Same token as huggingface-cli login.
modal secret create huggingface HF_TOKEN=<your-hf-token>
# Optional: validate deps before first image build
uv sync --group finetune --group lm-eval --package slm-evals
uv sync --group modal # local orchestration only
HF_TOKEN must be a write token if you plan to push adapters to the Hub.
Run training + benchmarks
All commands from repo root. finetune_app.py runs the full skill-matrix
pipeline: per-profile base-model baseline lm-eval (no adapter) → finetune each job's QLoRA adapter →
post-train lm-eval vs. that baseline → check goals (gate) → publish to the
Hugging Face Hub if the gate passes → pull adapter + results to your laptop.
# Full sweep: every job in experiments.yaml
modal run research/modal/finetune_app.py
# One skill (cheap smoke run)
modal run research/modal/finetune_app.py --job math-lora --max-steps 20
# One category (e.g. all "science" jobs)
modal run research/modal/finetune_app.py --category science
# Re-run lm-eval (+ gate + publish) only; adapter already on Volume
modal run research/modal/finetune_app.py --eval-only --job math-lora
# Train + eval but skip the Hub push and the local download
modal run research/modal/finetune_app.py --no-publish --no-pull
# Train/eval jobs in parallel (one GPU per job, so higher cost)
modal run research/modal/finetune_app.py --parallel
# Re-run just the gate + Hub publish for an already-evaluated job
modal run research/modal/finetune_app.py::publish_only --job math-lora
# Pull adapters + lm-eval results for a category without re-running anything
modal run research/modal/finetune_app.py::pull --category math
Jobs live in experiments.yaml as a skill matrix: one
QLoRA adapter per category, each evaluated against the matching
eval_profile from research/evals/configs/eval_profiles.yaml:
| Job | Category | Dataset (format) | Eval profile | goals task | Publish |
|---|---|---|---|---|---|
| teaching-lora | teaching | research/data/education-lesson-chat.jsonl (chat) | instructions | ifeval | ✓ |
| science-lora | science | research/data/science-tutor-chat.jsonl (chat) | science | sciq (+ arc_challenge guard) | ✓ |
| math-lora | math | TIGER-Lab/MathInstruct (alpaca) | math | gsm8k (+ arc_challenge guard) | ✓ |
| coding-lora | coding | iamtarun/python_code_instructions_18k_alpaca (alpaca) | code | mbpp | ✓ |
| reasoning-lora | reasoning | HuggingFaceTB/smoltalk (chat) | reasoning | gsm8k (+ hellaswag guard) | ✓ |
| language-lesson-lora | language | language-lesson-fr/ar.jsonl (chat) | multilingual | xnli (+ hellaswag guard) | ✓ |
| french-lora | french | FrancophonIA/english_french (prompt) + FR chat | french | french_bench_xnli (+ hellaswag guard) | ✓ |
| alpaca-lora | instructions | tatsu-lab/alpaca (alpaca) | instructions | none (no goals) | local-only |
Before publishing, replace defaults.hub_org and each job's publish.hub_repo
in experiments.yaml with your Hugging Face username/org (defaults to the
placeholder your-hf-username).
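A quick pre-flight check can catch a forgotten placeholder before a run tries to publish. The sketch below is illustrative only: it assumes experiments.yaml has already been parsed into a dict with a `defaults` mapping and a `jobs` list of dicts carrying `name` and an optional `publish.hub_repo` (that layout is an assumption, and `find_placeholder_repos` is a hypothetical helper, not part of the repo).

```python
PLACEHOLDER = "your-hf-username"  # placeholder value named in this README


def find_placeholder_repos(config: dict) -> list[str]:
    """Return dotted paths of settings that still contain the placeholder."""
    hits = []
    hub_org = str(config.get("defaults", {}).get("hub_org", ""))
    if PLACEHOLDER in hub_org:
        hits.append("defaults.hub_org")
    # Assumed layout: "jobs" is a list of dicts with a "name" key.
    for job in config.get("jobs", []):
        repo = (job.get("publish") or {}).get("hub_repo", "")
        if PLACEHOLDER in repo:
            hits.append(f"jobs.{job.get('name')}.publish.hub_repo")
    return hits


example = {
    "defaults": {"hub_org": "your-hf-username"},
    "jobs": [
        {"name": "math-lora",
         "publish": {"hub_repo": "your-hf-username/minicpm5-1b-math-lora"}},
        {"name": "alpaca-lora"},  # local-only job, no publish block
    ],
}
print(find_placeholder_repos(example))
```

An empty list means every publish target has been pointed at a real account.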
Edit defaults.max_steps, per-job gpu, or per-job max_samples /
dataset_split in experiments.yaml to balance cost vs quality. See
Benchmark gate & Hugging Face Hub publish
for the goals/publish schema.
CLI flags (finetune_app.py)
main (default entrypoint, full pipeline):
| Flag | Default | Meaning |
|---|---|---|
| --train / --no-train | train on | Run finetune jobs |
| --eval-only | off | Skip train; eval existing Volume checkpoints (still runs missing base-model baselines) |
| --parallel | off | finetune_one.spawn() per job instead of sequential |
| --job | all jobs | Run one job name from experiments.yaml |
| --category | all categories | Run all jobs with this category |
| --max-steps | from YAML | Override training steps |
| --publish / --no-publish | publish on | Push to publish.hub_repo if the gate passes |
| --pull / --no-pull | pull on | modal volume get the adapter + lm-eval results after each job |
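How --job and --category narrow the matrix can be pictured with a small filter. This is a sketch under assumptions, not the app's actual selection code: `select_jobs` is a hypothetical helper, and each job is assumed to be a dict with `name` and `category` keys.

```python
from typing import Optional


def select_jobs(jobs: list[dict], job: Optional[str] = None,
                category: Optional[str] = None) -> list[dict]:
    """Keep jobs matching the given name and/or category; no filter keeps all."""
    selected = [
        j for j in jobs
        if (job is None or j["name"] == job)
        and (category is None or j["category"] == category)
    ]
    if not selected:
        raise ValueError(f"no job matches job={job!r}, category={category!r}")
    return selected


matrix = [
    {"name": "math-lora", "category": "math"},
    {"name": "science-lora", "category": "science"},
    {"name": "alpaca-lora", "category": "instructions"},
]
print([j["name"] for j in select_jobs(matrix, category="science")])
```

Raising on an empty selection is a design choice for the sketch: a typo in a job name fails fast instead of silently running nothing.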
publish_only (separate entrypoint, ::publish_only):
| Flag | Default | Meaning |
|---|---|---|
| --job | required | Re-check the gate against existing results and publish if it passes |
pull (separate entrypoint, ::pull):
| Flag | Default | Meaning |
|---|---|---|
| --job | none | Pull one job's adapter + results |
| --category | none | Pull all jobs in a category |
| --dest | models/finetuned | Local destination directory |
GPU worker (server_app.py): human + AI agent loops
Use this when you want one warm A10G container for several hours and many train/eval commands without reinstalling deps or re-downloading HF weights each time.
Quick runbook: see SERVER.md (copy-paste commands for humans and coding agents).
Deploy once
modal deploy research/modal/server_app.py
App name: slm-gpu-worker. Dashboard: modal app list or the URL printed after deploy.
GpuWorker keeps min_containers=1 while deployed, mounts hf-cache + slm-finetune, and reuses the same container for sequential .remote() calls when possible.
Two-terminal loop (recommended)
Terminal 1: keep worker alive (default 4h; blocks unless detached):
modal run research/modal/server_app.py
# or free your terminal:
modal run -d research/modal/server_app.py --hours 6
Terminal 2: run experiments on the warm GPU (repeat as often as you like):
# Full skill-matrix pipeline for one job on the warm container:
# per-profile baseline → train → eval → gate → publish → pull
modal run research/modal/server_app.py --job math-lora --max-steps 20
# All jobs in a category
modal run research/modal/server_app.py --category science
# Whole matrix, but skip the Hub push
modal run research/modal/server_app.py --pipeline --no-publish
# Re-eval (+ gate + publish) an existing adapter on Volume
modal run research/modal/server_app.py --eval-only --job math-lora
# Re-check the gate and publish using already-computed results
modal run research/modal/server_app.py --publish-only --job math-lora
# Arbitrary command in /repo (same env as finetune.py)
modal run research/modal/server_app.py --cmd "uv run python research/finetune.py --help"
# Health check
modal run research/modal/server_app.py --ping
Task flags (--job, --category, --cmd, --pipeline, --eval-only, --publish-only, --ping) automatically disable the default keep-alive mode.
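The rule above (any task flag turns keep-alive off) can be expressed as a single boolean. The following is a minimal sketch of that rule only; `resolve_serve` is a hypothetical name and the real entrypoint in server_app.py may structure this differently.

```python
def resolve_serve(*, job=None, category=None, cmd=None, pipeline=False,
                  eval_only=False, publish_only=False, ping=False,
                  no_serve=False) -> bool:
    """Return True when the run should enter the keep-alive loop."""
    # Any task flag counts as a request to do work instead of serving.
    task_requested = any(
        [job, category, cmd, pipeline, eval_only, publish_only, ping]
    )
    return not (task_requested or no_serve)


print(resolve_serve())                 # no flags: keep the worker alive
print(resolve_serve(job="math-lora"))  # task flag: skip keep-alive
```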
CLI flags (server_app.py)
| Flag | Default | Meaning |
|---|---|---|
| (none) | serve=True | Keep GpuWorker alive (keep_alive) |
| --hours | 4 | Keep-alive duration |
| --no-serve | none | Skip keep-alive (auto when any task flag is set) |
| --job | none | Run the skill-matrix pipeline for one job |
| --category | none | Run the skill-matrix pipeline for all jobs in a category |
| --pipeline | off | Run the skill-matrix pipeline for all jobs |
| --max-steps | from YAML | Override training steps |
| --eval-only | off | Pipeline eval/gate/publish only (skip train; still runs missing base-model baselines) |
| --publish / --no-publish | publish on | Push to publish.hub_repo if the gate passes |
| --publish-only | off | Re-check the gate against existing results and publish (requires --job) |
| --pull / --no-pull | pull on | modal volume get adapter + results after the pipeline |
| --cmd | none | Shell command (parsed with shlex) |
| --ping | off | Return worker status JSON |
GpuWorker methods (for notebooks / Python callers)
After modal deploy, call from Python:
import modal
Worker = modal.Cls.from_name("slm-gpu-worker", "GpuWorker")
w = Worker()
w.ping.remote()
w.finetune.remote({"name": "math-lora", "dataset": "...", "format": "alpaca", "max_steps": 20})
w.lm_eval.remote(experiment_name="math-lora__math", config="research/evals/configs/lm_eval_math.yaml", adapter_path="/vol/finetuned/math-lora")
w.exec_cmd.remote(["uv", "run", "python", "research/finetune.py", "--help"])
w.run_pipeline.remote(job_names=["math-lora"], max_steps=20)
# Gate + publish (only pushes to the Hub if gate_result["passed"])
gate = w.check_gate.remote(
    candidate_results_path="/vol/finetuned/results/lm_eval/math-lora__math/results.json",
    baseline_results_path="/vol/finetuned/results/lm_eval/minicpm5-1b__baseline__math/results.json",
    goals={"task": "gsm8k", "min_score": 0.05, "min_improve": 0.02},
)
w.publish_adapter.remote(job=..., adapter_dir="/vol/finetuned/math-lora", gate_result=gate, ...)
Inside the class, run_pipeline chains lm_eval (baselines) → finetune → lm_eval (candidate) → check_gate → publish_adapter via .local(), so everything runs in the same container without extra cold starts.
Persistence (what survives between commands)
| Layer | Survives | Notes |
|---|---|---|
| Image (uv sync baked in) | Across all runs | Rebuilds only when image definition changes |
| hf-cache Volume | Across runs | Base weights + datasets; committed after each job |
| slm-finetune Volume | Across runs | Adapters + lm-eval results |
| Warm container | While deployed + idle < scaledown_window | min_containers=1; max idle grace 3600s (Modal limit) |
| keep_alive loop | Up to --hours | Container stays active; no scale-down during loop |
Stop / logs
modal app logs slm-gpu-worker -f # stream logs
modal app stop slm-gpu-worker # stop deployed app + warm pool
modal app stop slm-gpu-worker -y # no confirmation prompt
Refs: modal app · modal run · modal shell
Agent loop pattern
For an AI agent iterating on finetune hyperparameters or eval configs:
1. Ensure worker is up: modal run research/modal/server_app.py --ping → {"status": "ok"}.
2. If ping fails, human or agent runs modal deploy research/modal/server_app.py then modal run -d research/modal/server_app.py --hours 6.
3. Agent runs smoke train+eval+gate (no publish yet): --job math-lora --max-steps 5 --no-publish.
4. Agent re-evals without retraining: --eval-only --job math-lora.
5. Agent reads results: modal volume get slm-finetune results/lm_eval/math-lora__math ./results/lm_eval/math-lora__math or modal volume ls slm-finetune.
6. Agent adjusts experiments.yaml's goals / max_steps / max_samples, repeats from step 3.
7. Once the gate passes and hub_org / hub_repo are real: --publish-only --job math-lora, or just drop --no-publish.
8. When done: modal app stop slm-gpu-worker (optional, stops GPU billing from warm pool).
See SERVER.md for a structured checklist and error recovery table.
What gets saved on Modal
Modal persists artifacts on Volumes, a distributed filesystem optimized for write-once, read-many workloads like model checkpoints. Files written only to the container disk (outside the mount path) are not saved.
| Volume | Mount in container | Contents |
|---|---|---|
| slm-finetune | /vol/finetuned | LoRA adapters, training_results.json, lm-eval results/ |
| hf-cache | /root/.cache/huggingface | Cached base weights + datasets |
Volumes are created lazily on first run (create_if_missing=True in finetune_app.py).
Commits and visibility
Per the Volumes guide:
- volume.commit(): persist writes so other containers and modal volume get can see them. Our workers call this after each train/eval job.
- Background commits: Modal also snapshots attached Volumes every few seconds and on container shutdown, but explicit commit() is safest before download.
- volume.reload(): needed only if the same container must see writes from another container without restarting. Each finetune_one.remote() / run_lm_eval.remote() starts fresh and mounts the latest committed state.
Training writes under /vol/finetuned/... (the mount), not /repo/models/.... That matches Modal's model checkpointing pattern: point finetune.py --out at the Volume path.
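Since anything written outside the mount is lost, a guard on the output directory can catch a wrong --out before a long training run. This is a sketch only: `is_persisted` is a hypothetical helper, and it does a purely lexical check (it does not resolve `..` or symlinks).

```python
from pathlib import PurePosixPath

VOLUME_MOUNT = PurePosixPath("/vol/finetuned")  # slm-finetune mount path


def is_persisted(out_dir: str) -> bool:
    """True when out_dir is the Volume mount or lies underneath it."""
    path = PurePosixPath(out_dir)
    return path == VOLUME_MOUNT or VOLUME_MOUNT in path.parents


print(is_persisted("/vol/finetuned/math-lora"))          # on the Volume
print(is_persisted("/repo/models/finetuned/math-lora"))  # container disk only
```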
Per-job adapter layout
Each finetune job writes to a Volume path named after the job (e.g. math-lora/).
lm-eval results live under results/lm_eval/, named
<job_name>__<eval_profile> for candidates and <preset>__baseline__<eval_profile>
for the shared per-profile baselines:
slm-finetune (Volume)
├── math-lora/
│   ├── adapter_config.json
│   ├── adapter_model.safetensors # or adapter_model.bin
│   ├── tokenizer files…
│   ├── training_results.json
│   └── README.md # model card, written by publish_adapter
├── science-lora/
├── coding-lora/
├── reasoning-lora/
├── teaching-lora/
├── alpaca-lora/
└── results/lm_eval/
    ├── minicpm5-1b__baseline__math/ # shared by all "math" profile jobs
    ├── minicpm5-1b__baseline__science/
    ├── minicpm5-1b__baseline__instructions/
    ├── math-lora__math/
    ├── science-lora__science/
    └── ...
Because eval_profile is shared across jobs (e.g. teaching-lora and
alpaca-lora both use instructions), the instructions baseline is computed
once per pipeline run and reused for both jobs' gates.
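The naming scheme and the once-per-profile baseline reuse can be illustrated in a few lines. The helper names below are hypothetical, and jobs are assumed to be dicts with `name` and `eval_profile` keys; only the directory patterns come from this README.

```python
def candidate_dir(job_name: str, eval_profile: str) -> str:
    """Results folder for a fine-tuned adapter: <job_name>__<eval_profile>."""
    return f"results/lm_eval/{job_name}__{eval_profile}"


def baseline_dir(preset: str, eval_profile: str) -> str:
    """Results folder for the shared base-model baseline of one profile."""
    return f"results/lm_eval/{preset}__baseline__{eval_profile}"


def baselines_needed(jobs: list[dict], preset: str) -> list[str]:
    """One baseline folder per distinct eval_profile, in first-seen order."""
    seen: list[str] = []
    for job in jobs:
        folder = baseline_dir(preset, job["eval_profile"])
        if folder not in seen:
            seen.append(folder)
    return seen


jobs = [
    {"name": "teaching-lora", "eval_profile": "instructions"},
    {"name": "alpaca-lora", "eval_profile": "instructions"},
    {"name": "math-lora", "eval_profile": "math"},
]
print(baselines_needed(jobs, "minicpm5-1b"))
```

Three jobs yield two baseline folders here, because the two instructions-profile jobs share one.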
Volume CLI (browse, download, upload)
Official reference: Modal Volumes guide · CLI reference
Create or list volumes
modal volume list
modal volume create slm-finetune # optional; app creates on first run
modal volume ls slm-finetune
modal volume ls slm-finetune lesson-lora
Browse in a shell
Volumes are mounted under /mnt in an interactive shell:
modal shell --volume slm-finetune
# inside shell:
ls /mnt/slm-finetune
ls /mnt/slm-finetune/lesson-lora
du -sh /mnt/slm-finetune/lesson-lora
Use du for size; Volumes do not report accurate df / disk_usage() values (docs).
Download LoRA to your machine
Use the CLI for adapter weights. The Modal web UI only supports downloads up to 16 MB per file; adapter_model.safetensors is usually larger (docs).
mkdir -p ./models/finetuned
# One job folder → local path expected by models.yaml
modal volume get slm-finetune lesson-lora ./models/finetuned/minicpm5-1b-lora
# lm-eval artifacts
mkdir -p ./results
modal volume get slm-finetune results/lm_eval ./results/lm_eval
# Entire volume (large)
modal volume get slm-finetune / ./modal-artifacts
Job folders use the job name from experiments.yaml (lesson-lora), not minicpm5-1b-lora. Root models.yaml preset minicpm5-1b-lesson-lora expects ./models/finetuned/minicpm5-1b-lora.
If you downloaded to a different folder name:
modal volume get slm-finetune lesson-lora ./models/finetuned/lesson-lora
cp -r ./models/finetuned/lesson-lora ./models/finetuned/minicpm5-1b-lora
Upload to a Volume from local
Push a local adapter or merged checkpoint back to Modal (modal volume put):
modal volume put slm-finetune ./models/finetuned/minicpm5-1b-lora lesson-lora
Or from Python (batch_upload):
import modal
vol = modal.Volume.from_name("slm-finetune")
with vol.batch_upload() as batch:
    # Upload the local adapter folder to /lesson-lora on the Volume.
    batch.put_directory(
        "./models/finetuned/minicpm5-1b-lora",
        "/lesson-lora",
    )
Copy within a Volume
modal volume cp slm-finetune lesson-lora lesson-lora-backup
Parallel training note
With --parallel, multiple jobs write to different folders on the same Volume. On Volumes v1, avoid more than ~5 concurrent writers/commits (docs). Prefer sequential runs unless you use Volumes v2 (modal volume create --version=2).
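If you launch jobs yourself on a v1 Volume, one way to stay under the roughly five concurrent writers mentioned above is to run them in waves. This is a sketch of a caller-side throttle, not what --parallel does internally; `waves` is a hypothetical helper.

```python
def waves(job_names: list[str], max_writers: int = 5) -> list[list[str]]:
    """Split jobs into consecutive groups of at most max_writers each."""
    if max_writers < 1:
        raise ValueError("max_writers must be at least 1")
    return [job_names[i:i + max_writers]
            for i in range(0, len(job_names), max_writers)]


all_jobs = [
    "teaching-lora", "science-lora", "math-lora", "coding-lora",
    "reasoning-lora", "language-lesson-lora", "french-lora", "alpaca-lora",
]
for index, group in enumerate(waves(all_jobs), start=1):
    print(f"wave {index}: {group}")
```

Each wave would be started only after the previous one has finished and committed.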
Use downloaded weights locally
# Gradio / inference preset
export ACTIVE_MODEL=minicpm5-1b-lesson-lora
uv run --package gradio-space python -m gradio_space.app
# lm-eval on downloaded adapter
uv run --package slm-evals slm-lm-eval \
--config research/evals/configs/lm_eval_smoke.yaml \
--preset minicpm5-1b-lesson-lora \
--experiment-name minicpm5-1b-lora__local-check
Optional: merge LoRA into full weights locally
Adapters are small; merged weights are easier for some deploy targets.
uv run python research/finetune.py \
--merge ./models/finetuned/minicpm5-1b-lora \
--out ./models/finetuned/minicpm5-1b-lora-merged
Then use preset minicpm5-1b-lesson-merged or --model ./models/finetuned/minicpm5-1b-lora-merged.
Benchmark gate & Hugging Face Hub publish
finetune_app.py / server_app.py publish adapters to the Hub automatically,
but only when a job's lm-eval results pass its goals. This is the
"only ship it if it's actually better" gate.
goals schema (per job in experiments.yaml)
goals:
  task: gsm8k # lm-eval task name, scored via primary_metric() (same as summary.md)
  min_score: 0.05 # candidate score must be >= this
  min_improve: 0.02 # candidate - baseline must be >= this (baseline = per-profile baseline run)
  guard_tasks: # optional regression guards; must NOT regress more than max_regress
    - task: arc_challenge
      max_regress: 0.03
Publishable jobs also run a general eval (defaults.general_eval_profile, default
compare_study: arc_easy, arc_challenge, hellaswag, piqa, boolq, gsm8k) and must pass
defaults.general_goals regression guards so skill tuning does not wash out general
capability. The publish gate requires both skill goals and general_goals to pass.
A job with no goals (e.g. alpaca-lora) is never gated and never published:
it's local-only (still trained, evaluated, and pulled to your laptop).
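To make the goals semantics concrete, here is a minimal sketch of the three comparisons described in the schema comments. It is not the repo's evaluate_gate: `evaluate_goals` is a hypothetical function, and it assumes per-task scores have already been reduced to one float each (the README says the real code does that via primary_metric()).

```python
def evaluate_goals(candidate: dict[str, float], baseline: dict[str, float],
                   goals: dict) -> dict:
    """Apply min_score, min_improve and guard_tasks; pass only if all hold."""
    checks = []
    task = goals["task"]
    cand, base = candidate[task], baseline[task]
    if "min_score" in goals:
        checks.append({"check": f"{task} min_score",
                       "passed": cand >= goals["min_score"]})
    if "min_improve" in goals:
        checks.append({"check": f"{task} min_improve",
                       "passed": cand - base >= goals["min_improve"]})
    for guard in goals.get("guard_tasks", []):
        name = guard["task"]
        regress = baseline[name] - candidate[name]  # positive means worse
        checks.append({"check": f"{name} max_regress",
                       "passed": regress <= guard["max_regress"]})
    return {"passed": all(c["passed"] for c in checks), "checks": checks}


# Made-up scores, for illustration only.
goals = {"task": "gsm8k", "min_score": 0.05, "min_improve": 0.02,
         "guard_tasks": [{"task": "arc_challenge", "max_regress": 0.03}]}
baseline = {"gsm8k": 0.05, "arc_challenge": 0.40}
candidate = {"gsm8k": 0.10, "arc_challenge": 0.39}
print(evaluate_goals(candidate, baseline, goals)["passed"])
```

Scores that sit exactly on a threshold are sensitive to floating-point rounding, so a real gate may want a small tolerance.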
publish schema (per job)
publish:
  hub_repo: your-hf-username/minicpm5-1b-math-lora
  private: false # public so judges can verify the Well-Tuned badge; set true to keep it hidden
What happens on a passing gate
- run_lm_eval writes skill results to results/lm_eval/<job>__<profile>/results.json.
- For publishable jobs, a second run writes general results to results/lm_eval/<job>__<general_eval_profile>/results.json.
- check_gate compares skill results against results/lm_eval/<preset>__baseline__<profile>/results.json and general results against results/lm_eval/<preset>__baseline__<general_eval_profile>/results.json using goals + general_goals → {"passed": bool, "skill": {...}, "general": {...}, "checks": [...]}.
- If passed and publish is set, publish_adapter:
  - renders a model card (README.md) into the adapter directory: base model, gate checks table, full lm-eval baseline-vs-candidate-vs-delta table, training stats, and a PEFT load snippet
  - huggingface_hub.HfApi().create_repo(..., exist_ok=True) + upload_folder(...) to publish.hub_repo
If the gate fails, nothing is pushed. Rerun with different max_steps /
dataset / goals, then modal run research/modal/finetune_app.py::publish_only --job <name>
once it passes (re-checks the gate against the latest results before publishing).
Setup
huggingface-cli login
# or: export HF_TOKEN=hf_... (needs write access; same token as `modal secret create huggingface`)
Set real values for defaults.hub_org and each job's publish.hub_repo in
experiments.yaml before running with --publish (the default). Repos are
created automatically (exist_ok=True), so there is no need to pre-create them on huggingface.co.
Manual Hugging Face Hub publish (fallback)
Use this if you'd rather download an adapter and push it yourself, e.g. for merged full weights, or adapters trained before the gate/publish pipeline existed.
Prerequisites
huggingface-cli login
# or: export HF_TOKEN=hf_...
Create an empty model repo on Hugging Face (e.g. your-user/minicpm5-1b-lesson-lora).
Option A: Upload LoRA adapter (recommended)
After modal volume get:
ADAPTER=./models/finetuned/minicpm5-1b-lora
REPO=your-user/minicpm5-1b-lesson-lora
huggingface-cli upload "$REPO" "$ADAPTER" . \
--repo-type model \
--commit-message "Lesson LoRA from Modal finetune"
Add a minimal README.md in the adapter folder before upload (or edit on the Hub) documenting the base model:
# MiniCPM5-1B lesson LoRA
- Base model: [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B)
- Dataset: education lesson chat (Build Small hackathon)
- Load with PEFT: `PeftModel.from_pretrained(base, "your-user/minicpm5-1b-lesson-lora")`
Load from Hub in Python:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = "openbmb/MiniCPM5-1B"
adapter = "your-user/minicpm5-1b-lesson-lora"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter)
Option B: Upload merged weights
uv run python research/finetune.py \
--merge ./models/finetuned/minicpm5-1b-lora \
--out ./models/finetuned/minicpm5-1b-lora-merged
huggingface-cli upload your-user/minicpm5-1b-lesson-merged \
./models/finetuned/minicpm5-1b-lora-merged . \
--repo-type model
Consumers set MODEL_ID=your-user/minicpm5-1b-lesson-merged with no adapter.
Option C: Upload from Modal shell (no local download)
Browse the Volume in a shell (docs):
modal shell --volume slm-finetune
Inside the shell (volume at /mnt/slm-finetune):
pip install huggingface_hub
export HF_TOKEN=... # write token
huggingface-cli upload your-user/minicpm5-1b-lesson-lora \
/mnt/slm-finetune/lesson-lora . --repo-type model
Downloading to your laptop first (Option A) is usually easier to review before publish.
Use on Hugging Face Space
LoRA on Space (Gradio SDK):
- Upload adapter repo (Option A).
- In Space Settings → Repository secrets, set HF_TOKEN if the base model needs it.
- In Space env vars:
ACTIVE_MODEL=minicpm5-1b
# Override adapter via custom preset or env β e.g. add to models.yaml on Space:
# adapter_path: your-user/minicpm5-1b-lesson-lora # Hub id works if peft resolves it
For the shipped Space, the reliable path is: download adapter → commit into repo under models/finetuned/ → ACTIVE_MODEL=minicpm5-1b-lesson-lora, or upload merged weights and point MODEL_ID at your Hub repo.
Merged on Space:
ACTIVE_MODEL=custom
MODEL_ID=your-user/minicpm5-1b-lesson-merged
TRUST_REMOTE_CODE=true
Modal Notebooks (interactive GPU)
Official guide: Modal Notebooks
Use a hosted Jupyter kernel on Modal for demos, pair programming, and quick experiments. For reproducible sweeps and CI-style runs, prefer modal run research/modal/finetune_app.py.
Getting started
- Open modal.com/notebooks and upload research/notebook/minicpm5-modal-finetune.ipynb (or create a notebook and copy the cells).
- In the sidebar → Compute profile, enable a GPU (e.g. A10G). Notebooks are serverless: you pay only while the kernel runs; idle shutdown defaults to 10 minutes.
- Attach resources in the sidebar Files panel:
  - Volume slm-finetune → appears under /mnt/slm-finetune (share checkpoints with modal run jobs)
  - Secret huggingface → injects HF_TOKEN for Hub downloads
- Run cells top to bottom.
The default notebook image includes PyTorch, Transformers, and NumPy. Install extras with:
%uv pip install uv peft bitsandbytes datasets
Persist checkpoints on a Volume
The container filesystem is ephemeral. Anything under /root is lost when the kernel stops. Write adapters to an attached Volume:
OUT = "/mnt/slm-finetune/lesson-lora-notebook" # survives kernel restarts
After training, download from the Files panel (⬇) or locally:
modal volume get slm-finetune lesson-lora-notebook ./models/finetuned/minicpm5-1b-lora
Custom image (optional, full repo deps)
To match the modal run environment exactly, deploy the app image once:
modal deploy research/modal/finetune_app.py
Then in the notebook sidebar, search for function finetune_one from app slm-finetune-benchmark and select that image as the kernel.
Or call deployed functions from a cell with %modal magic:
%modal from slm-finetune-benchmark import finetune_one
finetune_one.remote({
"name": "lesson-lora",
"dataset": "research/data/education-lesson-chat.jsonl",
"format": "chat",
"max_steps": 20,
})
(Requires modal deploy and the repo baked into the image.)
Share for hackathon judges
Use Share in the notebook editor → public unlisted link → Can view and run so reviewers can fork and execute without a Modal account (docs).
Notebook vs modal run
| | Modal Notebook | modal run finetune_app.py |
|---|---|---|
| Best for | Demo video, exploration | Reproducible sweep, Volume + lm-eval pipeline |
| GPU | Sidebar compute profile | gpu="A10G" on functions |
| Persistence | Attach Volume in sidebar | slm-finetune Volume auto-mounted |
| Cost | Per kernel uptime | Per function invocation |
Architecture
flowchart LR
  subgraph batch [finetune_app.py - batch]
    laptop1["modal run finetune_app\n--job/--category"] --> base["run_lm_eval\n(per-profile baseline)"]
    laptop1 --> train["finetune_one"]
    train --> eval["run_lm_eval\n(candidate)"]
    eval --> gate["check_gate\n(goals)"]
    gate -- passed --> pub["publish_adapter"]
  end
  subgraph worker [server_app.py - warm loop]
    laptop2["modal run server_app\n--job/--category/--pipeline"] --> gpu["GpuWorker A10G"]
    gpu --> rp["run_pipeline\n(baseline -> train -> eval -> gate -> publish)"]
  end
  base --> vol["Volume slm-finetune"]
  train --> vol
  eval --> vol
  gate --> vol
  rp --> vol
  gpu --> hfc["Volume hf-cache"]
  pub --> hub["Hugging Face Hub\n(publish.hub_repo)"]
  rp --> hub
  vol --> get["modal volume get\n(pull)"]
  get --> local["models/finetuned/<job>"]
  local --> space["HF Space ACTIVE_MODEL"]
| Resource | Role |
|---|---|
| App slm-finetune-benchmark | One-shot batch pipeline (finetune_app.py): main, publish_only, pull |
| App slm-gpu-worker | Long-lived GPU worker (server_app.py): GpuWorker.run_pipeline |
| GPU A10G (or per-job gpu: override) | Default for train + eval |
| Secret huggingface | HF_TOKEN for HF downloads + Hub publish |
| _common.py | Shared image, volumes, command builders, gate (evaluate_gate / check_gate_files), publish (publish_adapter_files, render_model_card) |
| experiments.yaml | Skill matrix: jobs, eval_profile, goals, publish |
| eval_profiles.yaml | Maps eval_profile → lm-eval config + task list |
| finetune.py | Training logic (unchanged) |
| slm-lm-eval | Academic benchmarks |
Troubleshooting
| Symptom | Fix |
|---|---|
| Secret huggingface not found | modal secret create huggingface HF_TOKEN=... |
| Volume empty after run | Job may have failed; modal volume ls slm-finetune; ensure writes went to /vol/finetuned not /repo |
| modal volume get missing files | Make sure commit() completed; for same-container reads use volume.reload() |
| Large file won't download in UI | Use modal volume get CLI (16 MB UI limit) |
| modal volume get path wrong | Job name = top-level folder (e.g. math-lora, not minicpm5-1b-lora) |
| Gate fails / published: false, reason: "gate failed" | Check gate.checks in the output; adjust goals (min_score / min_improve / guard_tasks), max_steps, or dataset, then rerun |
| published: false, reason: "no publish config..." | Job has no publish: block in experiments.yaml (intentional for local-only jobs like alpaca-lora) |
| Unknown eval_profile ... | Check eval_profile in experiments.yaml matches a key in research/evals/configs/eval_profiles.yaml |
| Hub upload 403 | Use a write HF_TOKEN; repos are created automatically (exist_ok=True), no need to pre-create |
| Still publishing to your-hf-username/... | Edit defaults.hub_org and each job's publish.hub_repo in experiments.yaml |
| Space cannot find adapter | Use merged weights or copy adapter into repo models/finetuned/ |
| Image build slow | hf-cache Volume caches weights across runs |
| OOM on GPU | --mode qlora in experiments.yaml; lower max_len in finetune; or set a per-job gpu: with more VRAM |
| scaledown_window deploy error | Must be 2-3600s (we use 3600); see _common.py |
| server_app ping fails | modal deploy research/modal/server_app.py; start keep-alive: modal run -d research/modal/server_app.py |
| Jobs hit different containers | Deploy first; use server_app.py not finetune_app.py for warm loop |
| Worker still billing after done | modal app stop slm-gpu-worker |
Hackathon checklist
- Link or screenshot of Modal app run (slm-finetune-benchmark or slm-gpu-worker), including the --- summary --- table (skill, category, gate, published, hub_repo).
- results/lm_eval/<job>__<profile>/comparison.md: baseline vs candidate per skill.
- At least one adapter with goals that passed the gate and published to the Hub (model card auto-generated).
- Adapter on Volume or Hub + ACTIVE_MODEL=minicpm5-1b-<skill>-lora on Space.
- Optional: Notebook recording of smoke train cell.
See also: SERVER.md · research/USAGE.md · Modal Volumes · Modal Notebooks · Modal CUDA