# Research overview

How `research/` relates to the main hackathon repo and what each component does.

## Position in the repo

```text
small-model-hackathon/
├── apps/gradio-space/     ← shipped Lesson Agent UI
├── libs/agent/            ← skill loop, tools, traces
├── libs/inference/        ← transformers + llama.cpp backends
├── models.yaml            ← model presets (shared with finetune)
└── research/              ← experiments (this tree)
    ├── finetune.py
    ├── data/
    └── evals/             ← uv workspace package
```

Research code is a **uv workspace sibling** of `apps/*` and `libs/*`. The root `pyproject.toml` declares optional dependency groups (`finetune`, `evals`, `lm-eval`), so the Docker Space image skips the torch-heavy extras and you install them only when you opt in locally.

## Two tracks

### Fine-tuning

`research/finetune.py` adapts a small HF causal LM on instruction or chat data. It reuses root `models.yaml` presets and the shared inference config loader, so the same `minicpm5-1b` preset used in the Gradio app can be fine-tuned without duplicating model metadata.

Outputs land in `models/finetuned/`. You can then register a new preset in `models.yaml` that points at the merged weights for the **Well-Tuned** hackathon badge.
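
Before pointing a preset at a run, it can help to check whether the directory holds full weights or only an adapter. The sketch below is a heuristic, not part of `finetune.py`: `classify_checkpoints` is a hypothetical helper, and it assumes each run is a subdirectory of `models/finetuned/` saved in the usual Hugging Face layout (`config.json` for a full model, `adapter_config.json` for a PEFT adapter).

```python
from pathlib import Path


def classify_checkpoints(root: Path) -> dict[str, str]:
    """Label each subdirectory of `root` as 'full', 'adapter', or 'unknown'.

    Assumption: runs are saved in the standard Hugging Face layout.
    """
    labels: dict[str, str] = {}
    if not root.is_dir():
        return labels  # nothing trained yet
    for child in sorted(root.iterdir()):
        if not child.is_dir():
            continue
        if (child / "config.json").is_file():
            labels[child.name] = "full"      # full model config present
        elif (child / "adapter_config.json").is_file():
            labels[child.name] = "adapter"   # adapter weights only
        else:
            labels[child.name] = "unknown"
    return labels
```

Calling `classify_checkpoints(Path("models/finetuned"))` from the repo root returns a name-to-label mapping. A directory labelled `adapter` would likely need merging before a preset can load it as a standalone model.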

### Agentic and academic evals

`research/evals/` (`slm-evals` package) scores **whole models** on:

- **Agentic benchmarks** — BFCL, τ-bench, GAIA, SWE-bench (`slm-benchmark`)
- **Academic benchmarks** — GSM8K, ARC, HellaSwag, etc. via lm-evaluation-harness (`slm-lm-eval`)

## Data flow

```mermaid
flowchart LR
  subgraph data [research/data]
    lesson[education-lesson-chat.jsonl]
    qa[benchmark-qa.jsonl]
    kb[benchmark-kb.jsonl]
  end

  subgraph train [finetune.py]
    ckpt[models/finetuned/]
  end

  subgraph evals [slm-evals]
    bfcl[BFCL]
    tau[tau-bench]
    gaia[GAIA]
    swe[SWE-bench]
    lmeval[lm-eval tasks]
  end

  lesson --> train
  train --> ckpt
  ckpt --> evals
```
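
Since the lesson data feeds training directly, a quick sanity check of the JSONL file can catch malformed lines early. This is a minimal sketch that makes no assumption about the record schema: `summarize_jsonl` is a hypothetical helper that only counts records and tallies top-level keys.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path: Path) -> tuple[int, Counter]:
    """Return (record count, counter of top-level keys) for a JSONL file."""
    count = 0
    keys: Counter = Counter()
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises json.JSONDecodeError if malformed
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number}: expected a JSON object")
            count += 1
            keys.update(record.keys())
    return count, keys
```

Run from the repo root, `summarize_jsonl(Path("research/data/education-lesson-chat.jsonl"))` reports how many records exist and which keys appear in them. A key whose count is below the record count is missing from some records.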

## When to use which tool

| Goal | Tool |
| ---- | ---- |
| Improve lesson slide quality on your data | `finetune.py` + optional eval before/after |
| Compare base vs LoRA on public agent tasks | `slm-benchmark` |
| Compare base vs LoRA on academic tasks | `slm-lm-eval` |
| Ship in Gradio Space | `apps/gradio-space` only — wire new weights via `models.yaml` |
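
For the base vs LoRA comparisons in the table, the bookkeeping amounts to a per-task difference. The sketch below assumes you have already collected each run's scores into a plain `dict` of task name to score; the output format of `slm-benchmark` and `slm-lm-eval` is not specified here, and `score_deltas` is a hypothetical helper.

```python
def score_deltas(base: dict[str, float], tuned: dict[str, float]) -> dict[str, float]:
    """Return tuned minus base for every task present in both runs."""
    shared = sorted(base.keys() & tuned.keys())
    return {task: tuned[task] - base[task] for task in shared}


def unmatched_tasks(base: dict[str, float], tuned: dict[str, float]) -> list[str]:
    """List tasks that appear in only one run, so gaps are not silently dropped."""
    return sorted(base.keys() ^ tuned.keys())
```

A positive delta means the fine-tuned model scored higher on that task. Checking `unmatched_tasks` first avoids comparing runs that covered different task sets.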

## Workspace package

`research/evals` is listed under `members` in the root `[tool.uv.workspace]` table. It provides the `slm-evals` package (import name `slm_evals`) and two CLIs: `slm-benchmark` and `slm-lm-eval`.

Run with `uv run --package slm-evals ...` from the repo root so uv resolves workspace paths and shared lockfile versions.
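
If you launch evals from scripts, a thin wrapper can keep the invocation consistent. This is a sketch only: `uv_eval_command` and `run_eval` are hypothetical helpers, not part of `slm-evals`, and no CLI flags are assumed; any arguments are passed through unchanged.

```python
import subprocess

# The two CLIs exposed by the slm-evals package.
EVAL_CLIS = ("slm-benchmark", "slm-lm-eval")


def uv_eval_command(cli: str, *cli_args: str) -> list[str]:
    """Build the argv for running an slm-evals CLI through uv."""
    if cli not in EVAL_CLIS:
        raise ValueError(f"unknown CLI {cli!r}; expected one of {EVAL_CLIS}")
    return ["uv", "run", "--package", "slm-evals", cli, *cli_args]


def run_eval(cli: str, *cli_args: str, repo_root: str = ".") -> subprocess.CompletedProcess:
    """Run the command from the repo root so uv resolves workspace paths."""
    return subprocess.run(uv_eval_command(cli, *cli_args), cwd=repo_root, check=True)
```

Building the argv as a list avoids shell quoting issues, and `check=True` makes a failed eval raise `subprocess.CalledProcessError` instead of passing silently.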