File size: 2,661 Bytes
59e2c8a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
871f869
59e2c8a
871f869
59e2c8a
 
 
 
 
 
 
871f869
59e2c8a
871f869
59e2c8a
871f869
 
59e2c8a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
871f869
59e2c8a
 
 
 
 
 
 
 
 
 
 
 
 
871f869
59e2c8a
 
871f869
59e2c8a
871f869
59e2c8a
871f869
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
# Research overview

How `research/` relates to the main hackathon repo and what each component does.

## Position in the repo

```text
small-model-hackathon/
β”œβ”€β”€ apps/gradio-space/     ← shipped Lesson Agent UI
β”œβ”€β”€ libs/agent/            ← skill loop, tools, traces
β”œβ”€β”€ libs/inference/        ← transformers + llama.cpp backends
β”œβ”€β”€ models.yaml            ← model presets (shared with finetune)
└── research/              ← experiments (this tree)
    β”œβ”€β”€ finetune.py
    β”œβ”€β”€ data/
    └── evals/             ← uv workspace package
```

Research code is a **uv workspace sibling** of `apps/*` and `libs/*`. Root `pyproject.toml` declares optional dependency groups (`finetune`, `evals`, `lm-eval`) so the Docker Space image does not need to install torch-heavy extras unless you opt in locally.

## Two tracks

### Fine-tuning

`research/finetune.py` adapts a small HF causal LM on instruction or chat data. It reuses root `models.yaml` presets and the shared inference config loader, so the same `minicpm5-1b` preset used in the Gradio app can be fine-tuned without duplicating model metadata.

Outputs land in `models/finetuned/` β€” you can register a new preset in `models.yaml` pointing at merged weights for the **Well-Tuned** hackathon badge.

### Agentic and academic evals

`research/evals/` (`slm-evals` package) scores **whole models** on:

- **Agentic benchmarks** β€” BFCL, Ο„-bench, GAIA, SWE-bench (`slm-benchmark`)
- **Academic benchmarks** β€” GSM8K, ARC, HellaSwag, etc. via lm-evaluation-harness (`slm-lm-eval`)

## Data flow

```mermaid
flowchart LR
  subgraph data [research/data]
    lesson[education-lesson-chat.jsonl]
    qa[benchmark-qa.jsonl]
    kb[benchmark-kb.jsonl]
  end

  subgraph train [finetune.py]
    ckpt[models/finetuned/]
  end

  subgraph evals [slm-evals]
    bfcl[BFCL]
    tau[tau-bench]
    gaia[GAIA]
    swe[SWE-bench]
    lmeval[lm-eval tasks]
  end

  lesson --> train
  train --> ckpt
  ckpt --> evals
```

## When to use which tool

| Goal | Tool |
| ---- | ---- |
| Improve lesson slide quality on your data | `finetune.py` + optional eval before/after |
| Compare base vs LoRA on public agent tasks | `slm-benchmark` |
| Compare base vs LoRA on academic tasks | `slm-lm-eval` |
| Ship in Gradio Space | `apps/gradio-space` only β€” wire new weights via `models.yaml` |

## Workspace package

`research/evals` is listed in root `[tool.uv.workspace] members` as import name `slm_evals`, CLI `slm-benchmark` and `slm-lm-eval`.

Run with `uv run --package slm-evals ...` from the repo root so uv resolves workspace paths and shared lockfile versions.