# Trainer‑Kit: Config‑Driven CPT (LoRA / QLoRA) with Packing, Logging, Resume, and Merge
Trainer‑Kit is a small, config‑driven training runner for **continued pretraining (CPT)** on causal LMs.
It supports **LoRA** and **QLoRA**, data **packing** (strict or padding‑masked), **checkpointing + resume**, **JSONL logging**, periodic **eval with perplexity**, and an optional **merge** step to export a final merged model.
---
## What we built
### ✅ Core goals implemented
* **CPT training loop** controlled entirely via a **YAML config**
* **Local model support** (load from filesystem) and optional **HF download** (if `repo_id` is a hub id)
* **JSONL datasets** for train (+ optional eval split)
* **CPT‑style token stream packing** into fixed‑length blocks
* **Two packing modes**
  * `drop`: strict CPT, drop remainder tokens (preferred for real CPT)
  * `pad`: pad the remainder to `block_size` and **mask loss** on padding (useful for small datasets / debugging)
* **Checkpointing + resume**
  * `resume_from_checkpoint: "auto"` resumes from the latest checkpoint under `run_dir/checkpoints`
* **JSONL logs** written locally
  * training logs: `run_dir/logs/train.jsonl`
  * eval logs: `run_dir/logs/eval.jsonl`
* **Evaluation**
  * logs `eval_loss` and computed `perplexity = exp(eval_loss)` (with safe overflow guard)
* **Adapter output**
  * saves the final/best adapter to `run_dir/best_adapter`
* **Merge workflow**
  * `--merge-only` merges an existing adapter later
  * merge is done **on CPU** to avoid GPU OOM
  * merged model is stored under the configured merge output directory (relative to `run_dir` if a relative path)
---
## Repository layout (outputs)
A run produces the following structure under `run.run_dir`:
```
runs/<run_name>/
├─ checkpoints/ # trainer checkpoints (for resume)
├─ best_adapter/ # saved LoRA adapter
├─ logs/
│ ├─ train.jsonl # step-wise training logs
│ └─ eval.jsonl # eval logs (eval_loss + perplexity)
├─ eval_final.json # final eval metrics summary (if eval is enabled)
└─ config_resolved.yaml # exact config used for the run
```
If merge is used, the merged model is written to:
* `run_dir/<merge.output_dir>` if `merge.output_dir` is relative (e.g. `./merged_model`)
* or the absolute path if it is absolute.
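
The path rule above can be sketched in a few lines. This is a minimal illustration, not the runner's actual code; the helper name `resolve_merge_dir` is hypothetical.

```python
from pathlib import Path


def resolve_merge_dir(run_dir: str, output_dir: str) -> Path:
    """Return the directory the merged model would be written to.

    Hypothetical helper: absolute paths are used as-is, relative
    paths are anchored at run_dir.
    """
    out = Path(output_dir)
    return out if out.is_absolute() else Path(run_dir) / out
```

For example, `resolve_merge_dir("runs/demo", "./merged_model")` yields `runs/demo/merged_model`, while an absolute `output_dir` is returned unchanged.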
---
## Supported training modes
### 1) LoRA vs QLoRA (same script)
* **QLoRA** happens when `model.use_4bit: true`
  * base weights are loaded in 4‑bit using bitsandbytes
  * training updates only LoRA parameters
* **LoRA** happens when `model.use_4bit: false`
  * base weights are loaded in fp16/bf16 (as configured)
  * training updates only LoRA parameters
No “full finetune” mode is enabled by default in this runner.
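
As a rough sketch of how a `use_4bit` switch could translate into model-loading arguments: the helper name `build_load_kwargs` and the NF4 settings below are assumptions for illustration, not confirmed details of this runner. `BitsAndBytesConfig` is the real `transformers` class for bitsandbytes quantization.

```python
def build_load_kwargs(use_4bit: bool, dtype_name: str = "bfloat16") -> dict:
    """Build kwargs for AutoModelForCausalLM.from_pretrained (illustrative only)."""
    # Heavy dependencies are imported lazily so the helper can be defined anywhere.
    import torch

    dtype = getattr(torch, dtype_name)  # e.g. torch.bfloat16 or torch.float16
    kwargs = {"torch_dtype": dtype}
    if use_4bit:
        from transformers import BitsAndBytesConfig

        # QLoRA path: 4-bit base weights. The NF4 quant type is an assumption.
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=dtype,
        )
    return kwargs
```

In both branches the LoRA adapter is attached afterwards, so only adapter parameters receive gradients.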
---
## Data pipeline (CPT behavior)
### Input format
* JSONL file where each line contains a text field (default `"text"`).
* Example:
  * `{"text": "some training text..."}`
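
A minimal reader for this format might look like the following. The function name `iter_texts` and the choice to skip blank lines are assumptions made for the sketch.

```python
import json
from typing import Iterator


def iter_texts(path: str, text_field: str = "text") -> Iterator[str]:
    """Yield the text field of every non-empty line in a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines (an assumption of this sketch)
            record = json.loads(line)
            if text_field not in record:
                raise KeyError(f"line {line_no}: missing field {text_field!r}")
            yield record[text_field]
```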
### Packing (token stream → fixed blocks)
* Each sample is tokenized without truncation.
* An **EOS token is appended** per document to preserve boundaries.
* Token lists are concatenated and converted into **fixed‑length blocks** of `data.block_size`.
Two modes:
* **`drop` (strict CPT):** remainder tokens that don’t fill a full block are discarded.
* **`pad` (debug/small data):** remainder is padded to `block_size`:
  * `attention_mask = 0` for padded positions
  * `labels = -100` for padded positions (loss masking)
`pad` mode is what lets training proceed even on tiny dummy datasets at `block_size=1024`, where `drop` would discard any token stream shorter than one full block.
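
The packing behavior can be sketched on plain token-id lists. This is an illustrative reimplementation under stated assumptions, not the runner's code: `pack_blocks` is a hypothetical name, and `pad_id=0` is a placeholder for whatever pad token the tokenizer defines.

```python
from typing import Dict, List


def pack_blocks(
    docs: List[List[int]],
    eos_id: int,
    block_size: int,
    mode: str = "drop",
    pad_id: int = 0,
) -> List[Dict[str, List[int]]]:
    """Concatenate tokenized documents and cut them into fixed-length blocks."""
    if mode not in ("drop", "pad"):
        raise ValueError(f"unknown packing mode: {mode!r}")

    # Build one token stream, appending EOS after every document.
    stream: List[int] = []
    for ids in docs:
        stream.extend(ids)
        stream.append(eos_id)

    blocks = []
    full = len(stream) // block_size * block_size
    for i in range(0, full, block_size):
        chunk = stream[i : i + block_size]
        blocks.append(
            {"input_ids": chunk, "attention_mask": [1] * block_size, "labels": list(chunk)}
        )

    rest = stream[full:]
    if rest and mode == "pad":
        n_pad = block_size - len(rest)
        blocks.append(
            {
                "input_ids": rest + [pad_id] * n_pad,
                "attention_mask": [1] * len(rest) + [0] * n_pad,
                "labels": rest + [-100] * n_pad,  # -100 masks padding out of the loss
            }
        )
    # In "drop" mode the remainder is discarded.
    return blocks
```

With two short documents and `block_size=4`, `drop` keeps only the first full block, while `pad` also emits a second block whose trailing positions carry `attention_mask = 0` and `labels = -100`.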
---
## Logging
Trainer‑Kit writes **machine‑readable logs** in JSONL.
### Training logs (`logs/train.jsonl`)
Includes entries with:
* `step`
* `loss`
* `grad_norm`
* `learning_rate`
* `progress_pct` (step progress when `max_steps` is active)
* ETA estimation
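
One way such an entry could be assembled is sketched below. The field name `eta_seconds`, the helper names, and the linear ETA estimate are assumptions; the source only states that progress and an ETA are logged.

```python
import json
import time
from typing import Optional


def make_log_entry(
    step: int,
    max_steps: int,
    start_time: float,
    metrics: dict,
    now: Optional[float] = None,
) -> dict:
    """Combine trainer metrics with step progress and a linear ETA estimate."""
    now = time.time() if now is None else now
    entry = {"step": step, **metrics}
    if max_steps > 0:
        entry["progress_pct"] = round(100.0 * step / max_steps, 2)
        if step > 0:
            # Assume the remaining steps take as long on average as the past ones.
            seconds_per_step = (now - start_time) / step
            entry["eta_seconds"] = round(seconds_per_step * (max_steps - step), 1)
    return entry


def append_jsonl(path: str, entry: dict) -> None:
    """Append one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```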
### Eval logs (`logs/eval.jsonl`)
Includes:
* `eval_loss`
* `perplexity`
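
Perplexity is `exp(eval_loss)`, which overflows a float for very large losses. A minimal guard, assuming that returning infinity is acceptable (the runner's actual guard is not shown in this document), could be:

```python
import math


def safe_perplexity(eval_loss: float) -> float:
    """Return exp(eval_loss), or infinity if the exponential overflows."""
    try:
        return math.exp(eval_loss)
    except OverflowError:
        return float("inf")
```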
Notes:
* When using `max_steps`, the Trainer’s internal `epoch` counter can climb quickly on tiny datasets, because each epoch then contains only about one step.
  **Use `progress_pct` as the reliable indicator** for step‑based runs.
---
## Checkpointing and resume
The trainer saves checkpoints under:
* `run_dir/checkpoints/`
Resume options:
* `resume_from_checkpoint: "auto"` → picks the latest checkpoint automatically
* `resume_from_checkpoint: "/path/to/checkpoint"` → resumes from a specific checkpoint
* `resume_from_checkpoint: null` → fresh run
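
The three options could be resolved roughly as follows. This sketch assumes checkpoint folders follow the `checkpoint-<step>` naming used by the Hugging Face `Trainer`; the helper names are hypothetical. (`transformers.trainer_utils.get_last_checkpoint` offers similar lookup logic.)

```python
import re
from pathlib import Path
from typing import Optional

_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(checkpoints_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step, if any."""
    root = Path(checkpoints_dir)
    if not root.is_dir():
        return None
    best, best_step = None, -1
    for child in root.iterdir():
        match = _CKPT_RE.match(child.name)
        # Compare steps numerically so checkpoint-200 beats checkpoint-90.
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best, best_step = child, int(match.group(1))
    return best


def resolve_resume(value: Optional[str], checkpoints_dir: str) -> Optional[str]:
    """Map the config value to a checkpoint path, or None for a fresh run."""
    if value is None:
        return None
    if value == "auto":
        latest = find_latest_checkpoint(checkpoints_dir)
        return str(latest) if latest else None
    return value  # explicit path, used as given
```

In this sketch, `"auto"` with no checkpoints present falls back to a fresh run instead of raising.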
---
## Merging adapters into a final model
Trainer‑Kit supports exporting a merged model:
### Merge after training
* Enable merge in config (`merge.enabled: true`)
* The script will:
  1. save the adapter
  2. free GPU memory
  3. reload base model on **CPU**
  4. load adapter
  5. `merge_and_unload()`
  6. save final merged model
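
Steps 3 to 6 can be sketched with the standard `transformers` and `peft` calls. The function name, the default dtype, and the choice to save the tokenizer alongside the model are assumptions of this sketch, not confirmed behavior of `run_cpt.py`.

```python
def merge_adapter_on_cpu(
    base_model: str,
    adapter_dir: str,
    output_dir: str,
    dtype_name: str = "bfloat16",
) -> None:
    """Merge a saved LoRA adapter into its base model on CPU and save the result."""
    # Imported lazily: these packages are only needed when a merge actually runs.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Step 3: reload the base model; without a device_map it stays on CPU.
    base = AutoModelForCausalLM.from_pretrained(
        base_model, torch_dtype=getattr(torch, dtype_name)
    )
    # Step 4: attach the trained adapter.
    model = PeftModel.from_pretrained(base, adapter_dir)
    # Step 5: fold the LoRA weights into the base weights and drop the wrappers.
    merged = model.merge_and_unload()
    # Step 6: save the merged model (plus the tokenizer, for a self-contained export).
    merged.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(base_model).save_pretrained(output_dir)
```

Steps 1 and 2 (saving the adapter and freeing GPU memory) happen before this function would be called.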
### Merge later
Run:
```
python run_cpt.py --config config.yaml --merge-only
```
This skips training and merges `run_dir/best_adapter` into the base model.
---
## How to run
### Train
```
python run_cpt.py --config config.yaml
```
### Merge only
```
python run_cpt.py --config config.yaml --merge-only
```