# Trainer‑Kit : Config‑Driven CPT (LoRA / QLoRA) with Packing, Logging, Resume, and Merge

Trainer‑Kit is a small, config‑driven training runner for **continued pretraining (CPT)** on causal LMs.
It supports **LoRA** and **QLoRA**, data **packing** (strict or padding‑masked), **checkpointing + resume**, **JSONL logging**, periodic **eval with perplexity**, and an optional **merge** step to export a final merged model.

---

## What we built

### ✅ Core goals implemented

* **CPT training loop** controlled entirely via a **YAML config**
* **Local model support** (load from filesystem) and optional **HF download** (if `repo_id` is a hub id)
* **JSONL datasets** for train (+ optional eval split)
* **CPT‑style token stream packing** into fixed‑length blocks
* **Two packing modes**

  * `drop`: strict CPT, drop remainder tokens (preferred for real CPT)
  * `pad`: pad the remainder to `block_size` and **mask loss** on padding (useful for small datasets / debugging)
* **Checkpointing + resume**

  * `resume_from_checkpoint: "auto"` resumes from the latest checkpoint under `run_dir/checkpoints`
* **JSONL logs** written locally

  * training logs: `run_dir/logs/train.jsonl`
  * eval logs: `run_dir/logs/eval.jsonl`
* **Evaluation**

  * logs `eval_loss` and `perplexity = exp(eval_loss)` (with an overflow guard)
* **Adapter output**

  * saves the final/best adapter to `run_dir/best_adapter`
* **Merge workflow**

  * `--merge-only` merges an existing adapter later
  * merge is done **on CPU** to avoid GPU OOM
  * merged model is stored under the configured merge output directory (relative to `run_dir` if a relative path)
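
For orientation, a minimal config covering the features above might look like the following. Key names and nesting are assumptions inferred from the options mentioned in this README; check them against `run_cpt.py` and the run's `config_resolved.yaml`:

```yaml
# Illustrative only -- key names/nesting are assumptions, not the exact schema.
model:
  repo_id: ./models/base-7b      # local path, or an HF hub id to download
  use_4bit: true                 # true = QLoRA, false = LoRA
data:
  train_path: data/train.jsonl   # one JSON object per line with a "text" field
  eval_path: data/eval.jsonl     # optional eval split
  block_size: 1024
  packing_mode: drop             # "drop" (strict CPT) or "pad" (debug/small data)
run:
  run_dir: runs/demo
resume_from_checkpoint: auto     # auto | /path/to/checkpoint | null
merge:
  enabled: true
  output_dir: ./merged_model     # relative paths resolve under run_dir
```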

---

## Repository layout (outputs)

A run produces the following structure under `run.run_dir`:

```
runs/<run_name>/
├─ checkpoints/            # trainer checkpoints (for resume)
├─ best_adapter/           # saved LoRA adapter
├─ logs/
│  ├─ train.jsonl          # step-wise training logs
│  └─ eval.jsonl           # eval logs (eval_loss + perplexity)
├─ eval_final.json         # final eval metrics summary (if eval is enabled)
└─ config_resolved.yaml    # exact config used for the run
```

If merge is used, the merged model is written to:

* `run_dir/<merge.output_dir>` if `merge.output_dir` is relative (e.g. `./merged_model`)
* or the absolute path if it is absolute.
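
The path rule can be sketched in a few lines (`resolve_merge_output` is a hypothetical helper name, not the runner's actual API):

```python
from pathlib import Path

def resolve_merge_output(run_dir: str, output_dir: str) -> Path:
    """Resolve merge.output_dir relative to run_dir unless it is absolute."""
    out = Path(output_dir)
    return out if out.is_absolute() else Path(run_dir) / out

# Relative paths land under the run directory; absolute paths are kept as-is.
print(resolve_merge_output("runs/demo", "./merged_model"))  # runs/demo/merged_model
print(resolve_merge_output("runs/demo", "/models/final"))   # /models/final
```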

---

## Supported training modes

### 1) LoRA vs QLoRA (same script)

* **QLoRA** happens when `model.use_4bit: true`

  * base weights are loaded in 4‑bit using bitsandbytes
  * training updates only LoRA parameters
* **LoRA** happens when `model.use_4bit: false`

  * base weights are loaded in fp16/bf16 (as configured)
  * training updates only LoRA parameters

This runner does not enable a “full finetune” mode by default.
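
The branch on `model.use_4bit` can be sketched in plain Python. This is illustrative only: in the real runner these kwargs would feed `AutoModelForCausalLM.from_pretrained` via a bitsandbytes `BitsAndBytesConfig`, and the exact names here are assumptions:

```python
def build_load_kwargs(use_4bit: bool, dtype: str = "bfloat16") -> dict:
    """Illustrative sketch of how model.use_4bit selects QLoRA vs LoRA loading."""
    if use_4bit:
        # QLoRA: base weights quantized to 4-bit, only LoRA params trained
        return {"load_in_4bit": True, "bnb_4bit_compute_dtype": dtype}
    # LoRA: base weights in fp16/bf16, only LoRA params trained
    return {"torch_dtype": dtype}

print(build_load_kwargs(True))
```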

---

## Data pipeline (CPT behavior)

### Input format

* JSONL file where each line contains a text field (default `"text"`).
* Example:

  * `{"text": "some training text..."}`

### Packing (token stream → fixed blocks)

* Each sample is tokenized without truncation.
* An **EOS token is appended** per document to preserve boundaries.
* Token lists are concatenated and converted into **fixed‑length blocks** of `data.block_size`.

Two modes:

* **`drop` (strict CPT):** remainder tokens that don’t fill a full block are discarded.
* **`pad` (debug/small data):** the remainder is padded to `block_size`:

  * `attention_mask = 0` for padded positions
  * `labels = -100` for padded positions (loss masking)

The `pad` mode is what allowed training to proceed even with tiny dummy datasets at `block_size=1024`.
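
The packing logic described above can be sketched as follows. The function name and signature are illustrative, not the actual `run_cpt.py` API:

```python
def pack_blocks(docs, block_size, eos_id, mode="drop", pad_id=0):
    """Sketch of CPT packing: concatenate per-doc token ids (+EOS) into
    fixed-length blocks, then either drop or pad the remainder."""
    stream = []
    for ids in docs:                   # each doc already tokenized, no truncation
        stream.extend(ids + [eos_id])  # EOS preserves document boundaries
    blocks = []
    for i in range(0, len(stream), block_size):
        chunk = stream[i:i + block_size]
        if len(chunk) < block_size:
            if mode == "drop":         # strict CPT: discard the remainder
                break
            pad = block_size - len(chunk)
            blocks.append({            # pad mode: mask attention and loss
                "input_ids": chunk + [pad_id] * pad,
                "attention_mask": [1] * len(chunk) + [0] * pad,
                "labels": chunk + [-100] * pad,
            })
        else:
            blocks.append({
                "input_ids": chunk,
                "attention_mask": [1] * block_size,
                "labels": list(chunk),
            })
    return blocks
```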

---

## Logging

Trainer‑Kit writes **machine‑readable logs** in JSONL.

### Training logs (`logs/train.jsonl`)

Includes entries with:

* `step`
* `loss`
* `grad_norm`
* `learning_rate`
* `progress_pct` (step progress when `max_steps` is active)
* ETA estimation

### Eval logs (`logs/eval.jsonl`)

Includes:

* `eval_loss`
* `perplexity`
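
The overflow guard mentioned earlier can be sketched like this; the clamp threshold is an assumption, and the runner may guard differently:

```python
import math

def safe_perplexity(eval_loss: float, cap: float = 20.0) -> float:
    """perplexity = exp(eval_loss), but exp() overflows float for large
    losses, so clamp the loss before exponentiating."""
    return math.exp(min(eval_loss, cap))

print(safe_perplexity(0.0))  # 1.0
```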

Notes:

* When using `max_steps`, the Trainer’s internal `epoch` counter can grow unexpectedly on tiny datasets (because steps/epoch becomes ~1).
  **Use `progress_pct` as the reliable indicator** for step‑based runs.
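
Since the logs are JSONL (one JSON object per line), monitoring is a few lines of stdlib code. `last_progress` is a hypothetical helper, not part of the runner:

```python
import json

def last_progress(log_path: str):
    """Return the most recent progress_pct from a train.jsonl log."""
    last = None
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            if "progress_pct" in entry:
                last = entry["progress_pct"]
    return last
```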

---

## Checkpointing and resume

The trainer saves checkpoints under:

* `run_dir/checkpoints/`

Resume options:

* `resume_from_checkpoint: "auto"` → picks the latest checkpoint automatically
* `resume_from_checkpoint: "/path/to/checkpoint"` → resumes from a specific checkpoint
* `resume_from_checkpoint: null` → fresh run
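
The `"auto"` behavior amounts to picking the highest-numbered checkpoint directory. A minimal sketch, assuming the usual HF Trainer `checkpoint-<step>` naming convention:

```python
from pathlib import Path

def latest_checkpoint(run_dir: str):
    """Hypothetical sketch of resume_from_checkpoint: "auto" -- return the
    checkpoint-<step> directory with the highest step, or None for a fresh run."""
    ckpts = [p for p in Path(run_dir, "checkpoints").glob("checkpoint-*") if p.is_dir()]
    if not ckpts:
        return None  # nothing to resume from
    return max(ckpts, key=lambda p: int(p.name.split("-")[-1]))
```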

---

## Merging adapters into a final model

Trainer‑Kit supports exporting a merged model:

### Merge after training

* Enable merge in config (`merge.enabled: true`)
* The script will:

  1. save the adapter
  2. free GPU memory
  3. reload base model on **CPU**
  4. load adapter
  5. `merge_and_unload()`
  6. save final merged model

### Merge later

Run:

```
python run_cpt.py --config config.yaml --merge-only
```

This skips training and merges `run_dir/best_adapter` into the base model.

---

## How to run

### Train

```
python run_cpt.py --config config.yaml
```

### Merge only

```
python run_cpt.py --config config.yaml --merge-only
```