MSG
Feat/finetuning model (#18)
6cea344
|
Raw
History Blame Contribute Delete
10.2 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

Research usage

How to run fine-tuning and agentic benchmarks under research/. All commands assume the repo root as the working directory unless noted.

The Lesson Agent app lives in apps/gradio-space/ — see root USAGE.md. Research code is optional and isolated here.

Prerequisites

  • uv and Python 3.12
  • GPU recommended for real-model runs (CPU works for smoke tests)
  • Hugging Face Hub access for model downloads and some benchmark datasets

Install dependency groups

# All research tooling
uv sync --group finetune --group evals --group lm-eval

# Or one at a time
uv sync --group finetune
uv sync --group evals
uv sync --group lm-eval
Group Package / script What it adds
finetune research/finetune.py peft, datasets, bitsandbytes (QLoRA)
evals slm-evals workspace member slm-benchmark CLI
lm-eval slm-evals[lm-eval] slm-lm-eval CLI (GSM8K, ARC, HellaSwag, …)
modal research/modal/finetune_app.py Cloud GPU train + eval via Modal
modal research/modal/server_app.py Long-lived warm GPU worker for human/AI iteration loops

0. Modal cloud GPU (research/modal/)

Run a skill-matrix of QLoRA fine-tunes without local CUDA: each job in modal/experiments.yaml trains one adapter for a category (math, science, coding, reasoning, teaching, instructions), evaluates it against a matching slm-lm-eval profile vs. a per-profile baseline, checks the result against goals, and — only if the gate passes — publishes the adapter to the Hugging Face Hub. Adapters + results are saved to Modal Volume slm-finetune.

uv sync --group modal
modal setup
modal secret create huggingface HF_TOKEN=<token>   # needs write access for Hub publish

# Smoke run for one skill: baseline -> train -> eval -> gate -> publish -> pull
modal run research/modal/finetune_app.py --job math-lora --max-steps 20

# Whole skill matrix
modal run research/modal/finetune_app.py

# One category, train+eval only (no Hub push)
modal run research/modal/finetune_app.py --category science --no-publish

# Re-check the gate and publish an already-evaluated job
modal run research/modal/finetune_app.py::publish_only --job math-lora

# Pull adapters + lm-eval results without re-running anything
modal run research/modal/finetune_app.py::pull --category math

Set real values for defaults.hub_org and each job's publish.hub_repo in experiments.yaml (placeholder: your-hf-username) before publishing — repos are created automatically. Jobs with no goals (e.g. alpaca-lora) are trained/evaluated but never gated or published (local-only).

For a multi-hour session on one warm GPU (iterative human/AI loop without re-downloading weights each run), use research/modal/server_app.py instead — same skill-matrix pipeline (--job/--category/--pipeline/--publish-only) on a deployed GpuWorker.

Full guide: modal/README.md · Agent loop: modal/SERVER.md · Modal Volumes · Modal Notebooks

Iterative loop (one warm GPU, many runs):

modal deploy research/modal/server_app.py
modal run -d research/modal/server_app.py --hours 6          # keep worker alive
modal run research/modal/server_app.py --ping                # verify
modal run research/modal/server_app.py --job lesson-lora --max-steps 20
modal app stop slm-gpu-worker -y                             # when done

Interactive notebook: upload research/notebook/minicpm5-modal-finetune.ipynb at modal.com/notebooks, attach GPU + Volume slm-finetune + Secret huggingface.


1. Fine-tuning (research/finetune.py)

Single script for full, LoRA, and QLoRA training. Defaults to the lesson-agent chat dataset at research/data/education-lesson-chat.jsonl and writes checkpoints under models/finetuned/.

Model resolution (first match wins)

  1. --model <hf-id-or-path>
  2. --preset <key> from root models.yaml
  3. Env: FINETUNE_MODEL, MODEL_ID, or BASE
  4. ACTIVE_MODEL preset from .env

Quick start

# LoRA on default lesson chat data, 1 epoch
uv run python research/finetune.py --preset minicpm5-1b --mode lora --epochs 1

# Smoke run (50 steps)
uv run python research/finetune.py --mode lora --max_steps 50

# QLoRA on a Hub instruction dataset
uv run python research/finetune.py \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --dataset tatsu-lab/alpaca --format alpaca \
  --mode qlora --epochs 1

# Merge LoRA adapter into standalone weights
uv run python research/finetune.py \
  --merge ./models/finetuned/minicpm5-1b-lora \
  --out ./models/finetuned/minicpm5-1b-merged

Dataset formats (--format)

Format Expected columns
chat messages: [{"role": "...", "content": "..."}]
alpaca instruction, optional input, output
prompt prompt / completion (or response)
text text, or a plain .txt file

Local files: .json, .jsonl, .csv, .txt. Hub ids: any datasets repo id.

Outputs

Training writes to <out>/ (default ./models/finetuned/<preset>-<mode>/):

  • Adapter or full weights
  • training_results.json — train/eval loss, perplexity, result_score (0–100)

Env vars

Variable Description
FINETUNE_PRESET Preset key from models.yaml
FINETUNE_DATASET Override dataset path or Hub id
FINETUNE_DATASET_CONFIG Hub config name
FINETUNE_DATASET_SPLIT Hub split (e.g. train[:500])
ACTIVE_MODEL Fallback preset when --preset omitted

2. Agentic benchmarks (research/evals/)

Evaluate local HuggingFace checkpoints on BFCL, τ-bench, GAIA, and SWE-bench Verified.

Install: uv sync --group evals

# Smoke test (20 samples, two benchmarks)
uv run --package slm-evals slm-benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# Full config-driven run
uv run --package slm-evals slm-benchmark \
  --config research/evals/configs/experiment_001.yaml

Full reference: evals/USAGE.md.


3. Academic benchmarks (slm-lm-eval)

Standard lm-evaluation-harness tasks (ARC, HellaSwag, GSM8K, …) for base presets, LoRA adapters, and merged checkpoints.

Install: uv sync --group lm-eval

Profile guide: evals/docs/eval_profiles.md

# List claim-matched profiles (reasoning, code, understanding, …)
uv run --package slm-evals slm-lm-eval --list-profiles

# Run by profile name
uv run --package slm-evals slm-lm-eval \
  --profile reasoning \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__reasoning-baseline

# Smoke (25 samples, arc_easy + hellaswag)
uv run --package slm-evals slm-lm-eval \
  --profile smoke \
  --preset minicpm5-1b \
  --experiment-name minicpm5-1b__smoke

# Full profile
uv run --package slm-evals slm-lm-eval \
  --config research/evals/configs/lm_eval_minicpm5.yaml \
  --preset minicpm5-1b-lesson-lora \
  --experiment-name minicpm5-1b-lora__v1 \
  --compare-to results/lm_eval/minicpm5-1b__baseline/results.json

Post-training hook:

uv run python research/finetune.py \
  --preset minicpm5-1b --mode lora --max_steps 50 \
  --lm-eval-after \
  --lm-eval-baseline minicpm5-1b

Full reference: evals/USAGE.md.


Shared data (research/data/)

File Used by Format
education-lesson-chat.jsonl finetune.py default Chat messages for lesson agent
benchmark-qa.jsonl Optional domain QA evals question, answer, domain
benchmark-kb.jsonl Optional retrieval snippets KB entries for domain QA

Suggested end-to-end pipeline

  1. Baseline lm-eval — academic benchmarks on the base preset (pinned seed):

    uv run --package slm-evals slm-lm-eval \
      --config research/evals/configs/lm_eval_compare_study.yaml \
      --preset minicpm5-1b \
      --experiment-name minicpm5-1b__baseline
    
  2. Baseline agentic eval (optional):

    uv run --package slm-evals slm-benchmark \
      --model openbmb/MiniCPM5-1B --benchmarks bfcl --max-samples 50
    
  3. Fine-tune on lesson data:

    uv run python research/finetune.py --preset minicpm5-1b --mode lora --epochs 1
    
  4. Re-eval candidate with the same lm-eval config:

    uv run --package slm-evals slm-lm-eval \
      --config research/evals/configs/lm_eval_compare_study.yaml \
      --preset minicpm5-1b-lesson-lora \
      --experiment-name minicpm5-1b-lora__v1 \
      --compare-to results/lm_eval/minicpm5-1b__baseline/results.json
    

Verification checklist

  • Use the same lm-eval YAML (tasks, num_fewshot, limit, seed) for baseline and candidate runs.
  • Compare lm-eval results.json files with --compare-to; do not compare training_results.json result_score to lm-eval accuracy.
  • For LoRA checkpoints, prefer --preset minicpm5-1b-lesson-lora (base + adapter) over passing the adapter dir alone to --model.
  • Report mean ± std only after multiple training seeds; single-seed deltas are indicative, not conclusive.

Troubleshooting

Symptom Fix
slm-benchmark: command not found uv sync --group evals
slm-lm-eval: command not found uv sync --group lm-eval
CUDA OOM during finetune Use --mode qlora or reduce batch size in script args
BFCL / GAIA download slow Set max_samples low first; cache HF datasets under ~/.cache/huggingface
SWE-bench Docker errors Keep full_eval: false in YAML unless swebench + Docker are installed
τ-bench API costs Keep use_llm_user: false (rule-based user simulator)