| # Research usage | |
| How to run fine-tuning and agentic benchmarks under `research/`. All commands assume the **repo root** as the working directory unless noted. | |
| The Lesson Agent app lives in `apps/gradio-space/` — see root [USAGE.md](../USAGE.md). Research code is optional and isolated here. | |
| ## Prerequisites | |
| - [uv](https://docs.astral.sh/uv/) and Python 3.12 | |
| - GPU recommended for real-model runs (CPU works for smoke tests) | |
| - Hugging Face Hub access for model downloads and some benchmark datasets | |
| ## Install dependency groups | |
| ```bash | |
| # All local research tooling (the `modal` group is installed separately, see section 0) | |
| uv sync --group finetune --group evals --group lm-eval | |
| # Or one at a time | |
| uv sync --group finetune | |
| uv sync --group evals | |
| uv sync --group lm-eval | |
| ``` | |
| | Group | Package / script | What it adds | | |
| | ----- | ---------------- | ------------ | | |
| | `finetune` | `research/finetune.py` | `peft`, `datasets`, `bitsandbytes` (QLoRA) | | |
| | `evals` | `slm-evals` workspace member | `slm-benchmark` CLI | | |
| | `lm-eval` | `slm-evals[lm-eval]` | `slm-lm-eval` CLI (GSM8K, ARC, HellaSwag, …) | | |
| | `modal` | `research/modal/finetune_app.py` | Cloud GPU train + eval via [Modal](https://modal.com/docs/guide) | | |
| | `modal` | `research/modal/server_app.py` | Long-lived warm GPU worker for human/AI iteration loops | | |
| --- | |
| ## 0. Modal cloud GPU (`research/modal/`) | |
| Run a **skill-matrix** of QLoRA fine-tunes **without local CUDA**: each job in | |
| [`modal/experiments.yaml`](modal/experiments.yaml) trains one adapter for a | |
| category (math, science, coding, reasoning, teaching, instructions), evaluates | |
| it against a matching `slm-lm-eval` profile vs. a per-profile baseline, checks | |
| the result against `goals`, and — only if the gate passes — publishes the | |
| adapter to the Hugging Face Hub. Adapters + results are saved to Modal Volume | |
| `slm-finetune`. | |
| ```bash | |
| uv sync --group modal | |
| modal setup | |
| modal secret create huggingface HF_TOKEN=<token> # needs write access for Hub publish | |
| # Smoke run for one skill: baseline -> train -> eval -> gate -> publish -> pull | |
| modal run research/modal/finetune_app.py --job math-lora --max-steps 20 | |
| # Whole skill matrix | |
| modal run research/modal/finetune_app.py | |
| # One category, train+eval only (no Hub push) | |
| modal run research/modal/finetune_app.py --category science --no-publish | |
| # Re-check the gate and publish an already-evaluated job | |
| modal run research/modal/finetune_app.py::publish_only --job math-lora | |
| # Pull adapters + lm-eval results without re-running anything | |
| modal run research/modal/finetune_app.py::pull --category math | |
| ``` | |
| Set real values for `defaults.hub_org` and each job's `publish.hub_repo` in | |
| `experiments.yaml` (placeholder: `your-hf-username`) before publishing — repos | |
| are created automatically. Jobs with no `goals` (e.g. `alpaca-lora`) are | |
| trained/evaluated but never gated or published (local-only). | |
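The gate-then-publish rule described above can be sketched as follows. This is a minimal illustration, not the code in `finetune_app.py`: the function name `gate_passes`, the metric name `gsm8k`, and the idea that `goals` maps each metric to a minimum improvement over the baseline are all assumptions made for the example. The actual `goals` schema is defined in `experiments.yaml`.

```python
from typing import Dict, Optional


def gate_passes(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    goals: Optional[Dict[str, float]],
) -> bool:
    """Return True only if every goal metric beats the baseline by its margin.

    `goals` is assumed (hypothetically) to map metric name -> minimum delta.
    """
    if not goals:
        # Jobs without goals are trained and evaluated but never published.
        return False
    return all(
        candidate[metric] - baseline[metric] >= min_delta
        for metric, min_delta in goals.items()
    )


# Hypothetical scores for one job in the math category.
baseline_scores = {"gsm8k": 0.25}
candidate_scores = {"gsm8k": 0.30}
should_publish = gate_passes(baseline_scores, candidate_scores, {"gsm8k": 0.02})
```

Returning `False` when `goals` is missing mirrors the local-only behaviour of jobs such as `alpaca-lora`: no gate means no Hub push.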
| For a multi-hour session on **one warm GPU** (iterative human/AI loop without | |
| re-downloading weights each run), use `research/modal/server_app.py` instead — | |
| same skill-matrix pipeline (`--job`/`--category`/`--pipeline`/`--publish-only`) | |
| on a deployed `GpuWorker`. | |
| Full guide: **[modal/README.md](modal/README.md)** · **Agent loop:** **[modal/SERVER.md](modal/SERVER.md)** · [Modal Volumes](https://modal.com/docs/guide/volumes) · [Modal Notebooks](https://modal.com/docs/guide/notebooks) | |
| **Iterative loop (one warm GPU, many runs):** | |
| ```bash | |
| modal deploy research/modal/server_app.py | |
| modal run -d research/modal/server_app.py --hours 6 # keep worker alive | |
| modal run research/modal/server_app.py --ping # verify | |
| modal run research/modal/server_app.py --job lesson-lora --max-steps 20 | |
| modal app stop slm-gpu-worker -y # when done | |
| ``` | |
| Interactive notebook: upload [`research/notebook/minicpm5-modal-finetune.ipynb`](notebook/minicpm5-modal-finetune.ipynb) at [modal.com/notebooks](https://modal.com/notebooks), attach GPU + Volume `slm-finetune` + Secret `huggingface`. | |
| --- | |
| ## 1. Fine-tuning (`research/finetune.py`) | |
| Single script for **full**, **LoRA**, and **QLoRA** training. Defaults to the lesson-agent chat dataset at `research/data/education-lesson-chat.jsonl` and writes checkpoints under `models/finetuned/`. | |
| ### Model resolution (first match wins) | |
| 1. `--model <hf-id-or-path>` | |
| 2. `--preset <key>` from root `models.yaml` | |
| 3. Env: `FINETUNE_MODEL`, `MODEL_ID`, or `BASE` | |
| 4. `ACTIVE_MODEL` preset from `.env` | |
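The first-match-wins order above can be sketched like this. It is an illustration under stated assumptions rather than the logic in `finetune.py`: the function `resolve_model`, the `PRESETS` table, and the mapping of `minicpm5-1b` to `openbmb/MiniCPM5-1B` are hypothetical, and the three environment variables are assumed to be checked in the order they are listed.

```python
import os
from typing import Mapping, Optional


def resolve_model(
    cli_model: Optional[str],
    cli_preset: Optional[str],
    presets: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return a model id or path, trying each source in the documented order."""
    # 1. An explicit --model always wins.
    if cli_model:
        return cli_model
    # 2. --preset is looked up in the preset table (models.yaml in the repo).
    if cli_preset:
        return presets[cli_preset]
    # 3. Environment variables, checked in the listed order (assumed).
    for name in ("FINETUNE_MODEL", "MODEL_ID", "BASE"):
        if env.get(name):
            return env[name]
    # 4. ACTIVE_MODEL names a preset, so it goes through the same table.
    active = env.get("ACTIVE_MODEL")
    if active:
        return presets[active]
    raise ValueError("no model given: pass --model/--preset or set an env var")


# Hypothetical preset table; the real keys and values live in models.yaml.
PRESETS = {"minicpm5-1b": "openbmb/MiniCPM5-1B"}
```

Passing `env` as a parameter keeps the function testable without touching the real process environment.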
| ### Quick start | |
| ```bash | |
| # LoRA on default lesson chat data, 1 epoch | |
| uv run python research/finetune.py --preset minicpm5-1b --mode lora --epochs 1 | |
| # Smoke run (50 steps) | |
| uv run python research/finetune.py --mode lora --max_steps 50 | |
| # QLoRA on a Hub instruction dataset | |
| uv run python research/finetune.py \ | |
| --model Qwen/Qwen2.5-0.5B-Instruct \ | |
| --dataset tatsu-lab/alpaca --format alpaca \ | |
| --mode qlora --epochs 1 | |
| # Merge LoRA adapter into standalone weights | |
| uv run python research/finetune.py \ | |
| --merge ./models/finetuned/minicpm5-1b-lora \ | |
| --out ./models/finetuned/minicpm5-1b-merged | |
| ``` | |
| ### Dataset formats (`--format`) | |
| | Format | Expected columns | | |
| | ------ | ---------------- | | |
| | `chat` | `messages`: `[{"role": "...", "content": "..."}]` | | |
| | `alpaca` | `instruction`, optional `input`, `output` | | |
| | `prompt` | `prompt` / `completion` (or `response`) | | |
| | `text` | `text`, or a plain `.txt` file | | |
| Local files: `.json`, `.jsonl`, `.csv`, `.txt`. Hub ids: any `datasets` repo id. | |
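Before a long run, it can help to check that a few records carry the columns the chosen `--format` expects. The sketch below encodes the table above as data. The helper name `missing_columns` and the `REQUIRED` table are hypothetical and not part of `finetune.py`.

```python
from typing import Any, Dict, List

# Required columns per --format, taken from the table above.
# A tuple with several names means "any one of these is enough".
REQUIRED = {
    "chat": [("messages",)],
    "alpaca": [("instruction",), ("output",)],  # `input` is optional
    "prompt": [("prompt",), ("completion", "response")],
    "text": [("text",)],
}


def missing_columns(record: Dict[str, Any], fmt: str) -> List[str]:
    """Describe each required column that the record lacks for this format."""
    if fmt not in REQUIRED:
        raise ValueError(f"unknown format: {fmt!r}")
    missing = []
    for alternatives in REQUIRED[fmt]:
        if not any(name in record for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing


# An alpaca record without the optional `input` column is still complete.
problems = missing_columns({"instruction": "Add 2 and 3.", "output": "5"}, "alpaca")
```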
| ### Outputs | |
| Training writes to `<out>/` (default `./models/finetuned/<preset>-<mode>/`): | |
| - Adapter or full weights | |
| - `training_results.json` — train/eval loss, perplexity, `result_score` (0–100) | |
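For reading the numbers in `training_results.json`, the usual relationship between loss and perplexity is sketched below. It assumes the reported loss is a mean per-token cross-entropy in nats, which is the common convention. How `finetune.py` computes its own `perplexity` and `result_score` fields is not specified here.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss given in nats."""
    return math.exp(mean_loss)


# A loss of ln(8) corresponds to a perplexity of about 8: the model is as
# uncertain as a uniform choice among eight tokens.
example = perplexity(math.log(8.0))
```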
| ### Env vars | |
| | Variable | Description | | |
| | -------- | ----------- | | |
| | `FINETUNE_PRESET` | Preset key from `models.yaml` | | |
| | `FINETUNE_DATASET` | Override dataset path or Hub id | | |
| | `FINETUNE_DATASET_CONFIG` | Hub config name | | |
| | `FINETUNE_DATASET_SPLIT` | Hub split (e.g. `train[:500]`) | | |
| | `ACTIVE_MODEL` | Fallback preset when `--preset` omitted | | |
| --- | |
| ## 2. Agentic benchmarks (`research/evals/`) | |
| Evaluate local Hugging Face checkpoints on BFCL, τ-bench, GAIA, and SWE-bench Verified. | |
| Install: `uv sync --group evals` | |
| ```bash | |
| # Smoke test (20 samples, two benchmarks) | |
| uv run --package slm-evals slm-benchmark \ | |
| --model openbmb/MiniCPM5-1B \ | |
| --benchmarks bfcl tau_bench \ | |
| --max-samples 20 | |
| # Full config-driven run | |
| uv run --package slm-evals slm-benchmark \ | |
| --config research/evals/configs/experiment_001.yaml | |
| ``` | |
| Full reference: [evals/USAGE.md](evals/USAGE.md). | |
| --- | |
| ## 3. Academic benchmarks (`slm-lm-eval`) | |
| Standard lm-evaluation-harness tasks (ARC, HellaSwag, GSM8K, …) for base presets, LoRA adapters, and merged checkpoints. | |
| Install: `uv sync --group lm-eval` | |
| Profile guide: [evals/docs/eval_profiles.md](evals/docs/eval_profiles.md) | |
| ```bash | |
| # List claim-matched profiles (reasoning, code, understanding, …) | |
| uv run --package slm-evals slm-lm-eval --list-profiles | |
| # Run by profile name | |
| uv run --package slm-evals slm-lm-eval \ | |
| --profile reasoning \ | |
| --preset minicpm5-1b \ | |
| --experiment-name minicpm5-1b__reasoning-baseline | |
| # Smoke (25 samples, arc_easy + hellaswag) | |
| uv run --package slm-evals slm-lm-eval \ | |
| --profile smoke \ | |
| --preset minicpm5-1b \ | |
| --experiment-name minicpm5-1b__smoke | |
| # Full config-driven run, compared against the baseline results | |
| uv run --package slm-evals slm-lm-eval \ | |
| --config research/evals/configs/lm_eval_minicpm5.yaml \ | |
| --preset minicpm5-1b-lesson-lora \ | |
| --experiment-name minicpm5-1b-lora__v1 \ | |
| --compare-to results/lm_eval/minicpm5-1b__baseline/results.json | |
| ``` | |
| Post-training hook: | |
| ```bash | |
| uv run python research/finetune.py \ | |
| --preset minicpm5-1b --mode lora --max_steps 50 \ | |
| --lm-eval-after \ | |
| --lm-eval-baseline minicpm5-1b | |
| ``` | |
| Full reference: [evals/USAGE.md](evals/USAGE.md#lm-evaluation-harness-slm-lm-eval). | |
| --- | |
| ## Shared data (`research/data/`) | |
| | File | Used by | Format | | |
| | ---- | ------- | ------ | | |
| | `education-lesson-chat.jsonl` | `finetune.py` default | Chat messages for lesson agent | | |
| | `benchmark-qa.jsonl` | Optional domain QA evals | `question`, `answer`, `domain` | | |
| | `benchmark-kb.jsonl` | Optional retrieval snippets | KB entries for domain QA | | |
| --- | |
| ## Suggested end-to-end pipeline | |
| 1. **Baseline lm-eval** — academic benchmarks on the base preset (pinned seed): | |
| ```bash | |
| uv run --package slm-evals slm-lm-eval \ | |
| --config research/evals/configs/lm_eval_compare_study.yaml \ | |
| --preset minicpm5-1b \ | |
| --experiment-name minicpm5-1b__baseline | |
| ``` | |
| 2. **Baseline agentic eval** (optional): | |
| ```bash | |
| uv run --package slm-evals slm-benchmark \ | |
| --model openbmb/MiniCPM5-1B --benchmarks bfcl --max-samples 50 | |
| ``` | |
| 3. **Fine-tune** on lesson data: | |
| ```bash | |
| uv run python research/finetune.py --preset minicpm5-1b --mode lora --epochs 1 | |
| ``` | |
| 4. **Re-eval candidate** with the same lm-eval config: | |
| ```bash | |
| uv run --package slm-evals slm-lm-eval \ | |
| --config research/evals/configs/lm_eval_compare_study.yaml \ | |
| --preset minicpm5-1b-lesson-lora \ | |
| --experiment-name minicpm5-1b-lora__v1 \ | |
| --compare-to results/lm_eval/minicpm5-1b__baseline/results.json | |
| ``` | |
| ### Verification checklist | |
| - Use the **same** lm-eval YAML (`tasks`, `num_fewshot`, `limit`, `seed`) for baseline and candidate runs. | |
| - Compare lm-eval `results.json` files with `--compare-to`; do not compare `training_results.json` `result_score` to lm-eval accuracy. | |
| - For LoRA checkpoints, prefer `--preset minicpm5-1b-lesson-lora` (base + adapter) over passing the adapter dir alone to `--model`. | |
| - Report mean ± std only after multiple training seeds; single-seed deltas are indicative, not conclusive. | |
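The last checklist item can be sketched as a small helper that reports the mean and sample standard deviation of the candidate-minus-baseline delta across training seeds. The function name `summarize_deltas` and the scores are hypothetical. In practice the numbers would be read from each run's lm-eval `results.json`.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_deltas(baseline: float, candidates: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample std of (candidate - baseline) over training seeds."""
    if len(candidates) < 2:
        # One seed gives a delta but no spread, so refuse to report a std.
        raise ValueError("need at least two seeds to report a spread")
    deltas = [score - baseline for score in candidates]
    return mean(deltas), stdev(deltas)


# Hypothetical accuracy on one task: one baseline run, three fine-tune seeds.
avg_delta, spread = summarize_deltas(0.50, [0.52, 0.54, 0.56])
print(f"delta = {avg_delta:+.3f} ± {spread:.3f}")
```

Raising on fewer than two seeds keeps a single-seed delta from being reported with a misleading spread.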
| --- | |
| ## Troubleshooting | |
| | Symptom | Fix | | |
| | ------- | --- | | |
| | `slm-benchmark: command not found` | `uv sync --group evals` | | |
| | `slm-lm-eval: command not found` | `uv sync --group lm-eval` | | |
| | CUDA OOM during finetune | Use `--mode qlora` or reduce batch size in script args | | |
| | BFCL / GAIA download slow | Set `max_samples` low first; cache HF datasets under `~/.cache/huggingface` | | |
| | SWE-bench Docker errors | Keep `full_eval: false` in YAML unless `swebench` + Docker are installed | | |
| | τ-bench API costs | Keep `use_llm_user: false` (rule-based user simulator) | | |