# Research overview
How `research/` relates to the main hackathon repo and what each component does.
## Position in the repo
```text
small-model-hackathon/
├── apps/gradio-space/   ← shipped Lesson Agent UI
├── libs/agent/          ← skill loop, tools, traces
├── libs/inference/      ← transformers + llama.cpp backends
├── models.yaml          ← model presets (shared with finetune)
└── research/            ← experiments (this tree)
    ├── finetune.py
    ├── data/
    └── evals/           ← uv workspace package
```
Research code is a **uv workspace sibling** of `apps/*` and `libs/*`. Root `pyproject.toml` declares optional dependency groups (`finetune`, `evals`, `lm-eval`) so the Docker Space image does not need to install torch-heavy extras unless you opt in locally.
## Two tracks
### Fine-tuning
`research/finetune.py` adapts a small HF causal LM on instruction or chat data. It reuses root `models.yaml` presets and the shared inference config loader, so the same `minicpm5-1b` preset used in the Gradio app can be fine-tuned without duplicating model metadata.
Outputs land in `models/finetuned/`. You can register a new preset in `models.yaml` that points at the merged weights for the **Well-Tuned** hackathon badge.
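Registering a preset for fine-tuned weights might look like the sketch below. The schema of `models.yaml` is not shown in this document, so the `model_id` and `backend` keys, the `example-org/...` identifier, and the `minicpm5-1b-lesson` name are all assumptions for illustration. A plain dict stands in for the parsed YAML.

```python
from copy import deepcopy


def register_finetuned_preset(presets, base_name, new_name, weights_dir):
    """Return a copy of `presets` with a new entry derived from `base_name`.

    Assumed (hypothetical) schema: each preset is a mapping with a `model_id` key.
    """
    if base_name not in presets:
        raise KeyError(f"unknown base preset: {base_name}")
    if new_name in presets:
        raise ValueError(f"preset already exists: {new_name}")
    updated = deepcopy(presets)
    entry = deepcopy(presets[base_name])
    entry["model_id"] = weights_dir  # point at the merged weights on disk
    updated[new_name] = entry
    return updated


# Stand-in for the parsed contents of models.yaml (hypothetical values).
presets = {
    "minicpm5-1b": {"model_id": "example-org/minicpm5-1b", "backend": "transformers"},
}
updated = register_finetuned_preset(
    presets,
    base_name="minicpm5-1b",
    new_name="minicpm5-1b-lesson",
    weights_dir="models/finetuned/minicpm5-1b-lesson",
)
```

Copying the base entry keeps any shared settings in one place, so only the weights location differs between the base and fine-tuned presets.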
### Agentic and academic evals
`research/evals/` (`slm-evals` package) scores **whole models** on:
- **Agentic benchmarks**: BFCL, τ-bench, GAIA, SWE-bench (`slm-benchmark`)
- **Academic benchmarks**: GSM8K, ARC, HellaSwag, etc. via lm-evaluation-harness (`slm-lm-eval`)
## Data flow
```mermaid
flowchart LR
subgraph data [research/data]
lesson[education-lesson-chat.jsonl]
qa[benchmark-qa.jsonl]
kb[benchmark-kb.jsonl]
end
subgraph train [finetune.py]
ckpt[models/finetuned/]
end
subgraph evals [slm-evals]
bfcl[BFCL]
tau[tau-bench]
gaia[GAIA]
swe[SWE-bench]
lmeval[lm-eval tasks]
end
lesson --> train
train --> ckpt
ckpt --> evals
```
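Before feeding a file such as `education-lesson-chat.jsonl` into training, it can help to check that every line parses. The sketch below is a minimal validator, not the loader `finetune.py` actually uses. The record layout (a `messages` list per line) is an assumption, since this document does not specify the dataset schema.

```python
import json


def parse_chat_lines(lines, source="<input>"):
    """Parse JSONL lines and return records that carry a non-empty `messages` list.

    The `messages` field is an assumed (hypothetical) schema, not one confirmed
    by the repo.
    """
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"{source}:{line_no}: invalid JSON") from exc
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            raise ValueError(f"{source}:{line_no}: missing 'messages' list")
        records.append(record)
    return records
```

To check a file, pass the open handle directly, for example `parse_chat_lines(open(path, encoding="utf-8"), source=path)`.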
## When to use which tool
| Goal | Tool |
| ---- | ---- |
| Improve lesson slide quality on your data | `finetune.py` + optional eval before/after |
| Compare base vs LoRA on public agent tasks | `slm-benchmark` |
| Compare base vs LoRA on academic tasks | `slm-lm-eval` |
| Ship in Gradio Space | `apps/gradio-space` only; wire new weights via `models.yaml` |
## Workspace package
`research/evals` is listed in the root `[tool.uv.workspace] members`. Its import name is `slm_evals`, and it provides the `slm-benchmark` and `slm-lm-eval` CLIs.
Run with `uv run --package slm-evals ...` from the repo root so uv resolves workspace paths and shared lockfile versions.
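If you script eval runs, the invocation above can be wrapped as in the sketch below. The helper names `build_uv_command` and `run_eval_cli` are hypothetical. CLI arguments are passed through unchanged because this document does not list the flags either CLI accepts.

```python
import subprocess

WORKSPACE_PACKAGE = "slm-evals"
KNOWN_CLIS = ("slm-benchmark", "slm-lm-eval")


def build_uv_command(cli, extra_args=()):
    """Build the `uv run --package slm-evals <cli> ...` argument list."""
    if cli not in KNOWN_CLIS:
        raise ValueError(f"unknown CLI: {cli}")
    return ["uv", "run", "--package", WORKSPACE_PACKAGE, cli, *extra_args]


def run_eval_cli(cli, extra_args=(), repo_root="."):
    """Run the CLI with `repo_root` as the working directory.

    Running from the repo root lets uv resolve workspace paths and the
    shared lockfile. Raises CalledProcessError on a non-zero exit code.
    """
    return subprocess.run(build_uv_command(cli, extra_args), cwd=repo_root, check=True)
```

Passing the command as a list avoids shell quoting problems when arguments contain spaces.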