# Research Experimental code for **fine-tuning** and **agentic benchmarks**. Nothing here is wired into the Gradio Lesson Agent by default — use it to train models and score checkpoints against public benchmarks. | Path | Purpose | | ---- | ------- | | [`finetune.py`](finetune.py) | LoRA / QLoRA / full fine-tune on chat or instruction data | | [`evals/`](evals/) | SLM agentic benchmark suite — BFCL, τ-bench, GAIA, SWE-bench (uv package `slm-evals`) | | [`data/`](data/) | Shared JSONL datasets for finetune and evals | ## Quick links - **[USAGE.md](USAGE.md)** — install groups, commands, and typical workflows - **[docs/overview.md](docs/overview.md)** — how the pieces fit together - **[evals/USAGE.md](evals/USAGE.md)** — benchmark CLI, configs, and results - **[evals/docs/benchmarks.md](evals/docs/benchmarks.md)** — what each benchmark measures ## Install (from repo root) ```bash # All research tooling uv sync --group finetune --group evals --group lm-eval ``` Individual groups: | Group | Command | Enables | | ----- | ------- | ------- | | `finetune` | `uv sync --group finetune` | `research/finetune.py` (LoRA, QLoRA, merge) | | `evals` | `uv sync --group evals` | `research/evals/` package (`slm-benchmark`) | | `lm-eval` | `uv sync --group lm-eval` | `slm-lm-eval` CLI (GSM8K, ARC, HellaSwag, …) | ## Typical workflow ```text research/data/education-lesson-chat.jsonl │ ▼ research/finetune.py ──► models/finetuned/-lora/ │ └──► research/evals/ (BFCL, τ-bench, GAIA, SWE-bench, lm-eval) ``` See [USAGE.md](USAGE.md) for copy-paste commands.