	# Trainer‑Kit: Config‑Driven CPT (LoRA / QLoRA) with Packing, Logging, Resume, and Merge

	Trainer‑Kit is a small, config‑driven training runner for continued pretraining (CPT) on causal LMs.
	It supports LoRA and QLoRA, data packing (strict or padding‑masked), checkpointing + resume, JSONL logging, periodic eval with perplexity, and an optional merge step to export a final merged model.

	---

	## What we built

	### ✅ Core goals implemented

	* CPT training loop controlled entirely via a YAML config
	* Local model support (load from filesystem) and optional HF download (if `repo_id` is a hub id)
	* JSONL datasets for train (+ optional eval split)
	* CPT‑style token stream packing into fixed‑length blocks
	* Two packing modes
	  * `drop`: strict CPT, drop remainder tokens (preferred for real CPT)
	  * `pad`: pad the remainder to `block_size` and mask loss on padding (useful for small datasets / debugging)
	* Checkpointing + resume
	  * `resume_from_checkpoint: "auto"` resumes from the latest checkpoint under `run_dir/checkpoints`
	* JSONL logs written locally
	  * training logs: `run_dir/logs/train.jsonl`
	  * eval logs: `run_dir/logs/eval.jsonl`
	* Evaluation
	  * logs `eval_loss` and computed `perplexity = exp(eval_loss)` (with safe overflow guard)
	* Adapter output
	  * saves the final/best adapter to `run_dir/best_adapter`
	* Merge workflow
	  * `--merge-only` merges an existing adapter later
	  * merge is done on CPU to avoid GPU OOM
	  * merged model is stored under the configured merge output directory (relative to `run_dir` if a relative path)

	---

	## Repository layout (outputs)

	A run produces the following structure under `run.run_dir`:

	```
	runs/<run_name>/
	├─ checkpoints/ # trainer checkpoints (for resume)
	├─ best_adapter/ # saved LoRA adapter
	├─ logs/
	│ ├─ train.jsonl # step-wise training logs
	│ └─ eval.jsonl # eval logs (eval_loss + perplexity)
	├─ eval_final.json # final eval metrics summary (if eval is enabled)
	└─ config_resolved.yaml # exact config used for the run
	```

	If merge is used, the merged model is written to:

	* `run_dir/<merge.output_dir>` if `merge.output_dir` is relative (e.g. `./merged_model`)
	* or the absolute path if it is absolute.
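
The path rule above can be sketched in a few lines. This is a minimal illustration, not the runner's actual code; the helper name `resolve_merge_dir` is hypothetical.

```python
from pathlib import Path


def resolve_merge_dir(run_dir: str, output_dir: str) -> Path:
    """Return output_dir unchanged if it is absolute, else place it under run_dir.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    out = Path(output_dir)
    return out if out.is_absolute() else Path(run_dir) / out


if __name__ == "__main__":
    # Relative path: resolved under the run directory.
    print(resolve_merge_dir("runs/demo", "./merged_model"))
    # Absolute path: used as-is.
    print(resolve_merge_dir("runs/demo", str(Path.cwd() / "merged_model")))
```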

	---

	## Supported training modes

	### 1) LoRA vs QLoRA (same script)

	* QLoRA happens when `model.use_4bit: true`
	  * base weights are loaded in 4‑bit using bitsandbytes
	  * training updates only LoRA parameters
	* LoRA happens when `model.use_4bit: false`
	  * base weights are loaded in fp16/bf16 (as configured)
	  * training updates only LoRA parameters

	No “full finetune” mode is enabled by default in this runner.
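
As a rough illustration of how a single `use_4bit` flag can switch between the two modes, the sketch below builds the keyword arguments one might pass to `AutoModelForCausalLM.from_pretrained`. The function name `model_load_kwargs`, the `compute_dtype` parameter, and the NF4 quantization type are assumptions; the runner's real settings may differ.

```python
def model_load_kwargs(use_4bit: bool, compute_dtype: str = "bfloat16") -> dict:
    """Build from_pretrained kwargs for QLoRA (4-bit) or plain LoRA (fp16/bf16).

    Hypothetical helper; heavy libraries are imported lazily inside the function.
    """
    import torch

    dtype = getattr(torch, compute_dtype)  # e.g. torch.bfloat16 or torch.float16
    if not use_4bit:
        # LoRA: base weights stay in half precision.
        return {"torch_dtype": dtype}

    from transformers import BitsAndBytesConfig

    # QLoRA: base weights are quantized to 4-bit via bitsandbytes.
    # NF4 is an assumed choice here, not something the README specifies.
    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=dtype,
    )
    return {"quantization_config": quant}
```

In both cases the LoRA adapter would then be attached on top of the loaded base model, so only adapter parameters receive gradients.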

	---

	## Data pipeline (CPT behavior)

	### Input format

	* JSONL file where each line contains a text field (default `"text"`).
	* Example:
	  * `{"text": "some training text..."}`
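
A minimal sketch of reading this format, assuming one JSON object per line. The function name `iter_texts` and its error handling are illustrative, not taken from the runner.

```python
import json
from typing import Iterator


def iter_texts(path: str, text_field: str = "text") -> Iterator[str]:
    """Yield the text field of every non-empty line in a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if text_field not in record:
                raise KeyError(f"line {line_no}: missing field {text_field!r}")
            yield record[text_field]
```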

	### Packing (token stream → fixed blocks)

	* Each sample is tokenized without truncation.
	* An EOS token is appended per document to preserve boundaries.
	* Token lists are concatenated and converted into fixed‑length blocks of `data.block_size`.

	Two modes:

	* `drop` (strict CPT): remainder tokens that don’t fill a full block are discarded.
	* `pad` (debug/small data): remainder is padded to `block_size`:
	  * `attention_mask = 0` for padded positions
	  * `labels = -100` for padded positions (loss masking)

	The `pad` mode is what allows training to proceed even with tiny dummy datasets at `block_size=1024`: a dataset with fewer tokens than one block would yield zero blocks under `drop`.
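
The packing behavior described in this section can be sketched on plain lists of token ids. This is a simplified illustration under stated assumptions: `pack_blocks`, `eos_id`, and `pad_id` are hypothetical names, and the real runner works on tokenizer output rather than hand-written lists.

```python
from typing import Dict, Iterable, List


def pack_blocks(
    docs: Iterable[List[int]],
    block_size: int,
    eos_id: int,
    pad_id: int,
    mode: str = "drop",
) -> List[Dict[str, List[int]]]:
    """Concatenate tokenized documents and cut the stream into fixed-length blocks."""
    if mode not in ("drop", "pad"):
        raise ValueError(f"unknown packing mode: {mode!r}")

    # Build one token stream, appending EOS after each document.
    stream: List[int] = []
    for ids in docs:
        stream.extend(ids)
        stream.append(eos_id)

    blocks: List[Dict[str, List[int]]] = []
    full = len(stream) - len(stream) % block_size
    for start in range(0, full, block_size):
        chunk = stream[start:start + block_size]
        # Labels mirror input_ids; causal LM heads typically shift them internally.
        blocks.append({
            "input_ids": chunk,
            "attention_mask": [1] * block_size,
            "labels": list(chunk),
        })

    rest = stream[full:]
    if rest and mode == "pad":
        n_pad = block_size - len(rest)
        blocks.append({
            "input_ids": rest + [pad_id] * n_pad,
            "attention_mask": [1] * len(rest) + [0] * n_pad,
            "labels": rest + [-100] * n_pad,  # -100 masks padding from the loss
        })
    # In "drop" mode the remainder is simply discarded.
    return blocks
```

With two short documents and `block_size=4`, `drop` keeps only the single full block, while `pad` adds a second block whose trailing position is masked.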

	---

	## Logging

	Trainer‑Kit writes machine‑readable logs in JSONL.

	### Training logs (`logs/train.jsonl`)

	Includes entries with:

	* `step`
	* `loss`
	* `grad_norm`
	* `learning_rate`
	* `progress_pct` (step progress when `max_steps` is active)
	* ETA estimation

	### Eval logs (`logs/eval.jsonl`)

	Includes:

	* `eval_loss`
	* `perplexity`
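
The perplexity value is `exp(eval_loss)`. One way to guard against overflow on very large losses is sketched below; returning infinity is an assumption, and the runner's actual guard may behave differently.

```python
import math


def safe_perplexity(eval_loss: float) -> float:
    """Return exp(eval_loss), or infinity if the exponential overflows a float."""
    try:
        return math.exp(eval_loss)
    except OverflowError:
        # math.exp raises for very large inputs; report infinity instead of crashing.
        return float("inf")
```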

	Notes:

	* When using `max_steps`, the Trainer’s internal `epoch` counter can grow unexpectedly on tiny datasets (because steps/epoch becomes ~1). Use `progress_pct` as the reliable indicator for step‑based runs.
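
For step-based runs, progress and ETA can be derived from the step counter alone. The sketch below shows one plausible calculation; the function name, the `eta_s` field, and the linear extrapolation are assumptions, not the runner's documented behavior.

```python
from typing import Dict, Optional


def progress_and_eta(step: int, max_steps: int, elapsed_s: float) -> Dict[str, Optional[float]]:
    """Compute percent progress and a linear ETA estimate for a step-based run."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    progress_pct = round(100.0 * step / max_steps, 2)
    if step <= 0:
        eta_s = None  # no completed steps yet, so no rate to extrapolate from
    else:
        # Average seconds per step so far, times the steps that remain.
        eta_s = elapsed_s / step * max(max_steps - step, 0)
    return {"progress_pct": progress_pct, "eta_s": eta_s}
```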

	---

	## Checkpointing and resume

	The trainer saves checkpoints under:

	* `run_dir/checkpoints/`

	Resume options:

	* `resume_from_checkpoint: "auto"` → picks the latest checkpoint automatically
	* `resume_from_checkpoint: "/path/to/checkpoint"` → resumes from a specific checkpoint
	* `resume_from_checkpoint: null` → fresh run
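
The three options can be resolved with a small helper. This sketch assumes checkpoint directories are named `checkpoint-<step>` (the naming used by the Hugging Face `Trainer`); the function name `resolve_resume` is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def resolve_resume(run_dir: str, resume_from_checkpoint: Optional[str]) -> Optional[str]:
    """Map the config value to a checkpoint path, or None for a fresh run."""
    if resume_from_checkpoint is None:
        return None  # fresh run
    if resume_from_checkpoint != "auto":
        return resume_from_checkpoint  # explicit path, used as given

    ckpt_root = Path(run_dir) / "checkpoints"
    if not ckpt_root.is_dir():
        return None  # nothing to resume from yet

    candidates = []
    for path in ckpt_root.iterdir():
        match = _CKPT_RE.match(path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare step numbers numerically so checkpoint-1000 beats checkpoint-200.
    return str(max(candidates, key=lambda item: item[0])[1])
```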

	---

	## Merging adapters into a final model

	Trainer‑Kit supports exporting a merged model:

	### Merge after training

	* Enable merge in config (`merge.enabled: true`)
	* The script will:
	  1. save the adapter
	  2. free GPU memory
	  3. reload base model on CPU
	  4. load adapter
	  5. `merge_and_unload()`
	  6. save final merged model
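
Steps 3 to 6 can be sketched with `transformers` and `peft` as follows. This is an illustration under assumptions, not the script's code: the function name, the bf16 dtype, and saving the tokenizer alongside the model are choices made for the example, and the dtype keyword may vary with the installed `transformers` version.

```python
def merge_adapter_on_cpu(base_model: str, adapter_dir: str, output_dir: str) -> None:
    """Reload the base model on CPU, apply a LoRA adapter, merge, and save."""
    # Heavy dependencies are imported lazily so this module stays importable.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Step 3: without a device_map, weights are loaded onto the CPU.
    base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

    # Step 4: attach the saved adapter to the base model.
    model = PeftModel.from_pretrained(base, adapter_dir)

    # Step 5: fold the LoRA weights into the base weights and drop the wrappers.
    merged = model.merge_and_unload()

    # Step 6: write the standalone merged model, plus the tokenizer for convenience.
    merged.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(base_model).save_pretrained(output_dir)
```

Reloading the base model in half precision rather than 4-bit keeps the merged weights in a standard format, at the cost of needing enough CPU RAM for the full model.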

	### Merge later

	Run:

	```
	python run_cpt.py --config config.yaml --merge-only
	```

	This skips training and merges `run_dir/best_adapter` into the base model.

	---

	## How to run

	### Train

	```
	python run_cpt.py --config config.yaml
	```

	### Merge only

	```
	python run_cpt.py --config config.yaml --merge-only