# Trainer‑Kit: Config‑Driven CPT (LoRA / QLoRA) with Packing, Logging, Resume, and Merge

Trainer‑Kit is a small, config‑driven training runner for **continued pretraining (CPT)** on causal LMs. It supports **LoRA** and **QLoRA**, data **packing** (strict or padding‑masked), **checkpointing + resume**, **JSONL logging**, periodic **eval with perplexity**, and an optional **merge** step to export a final merged model.

---

## What we built

### ✅ Core goals implemented

* **CPT training loop** controlled entirely via a **YAML config**
* **Local model support** (load from filesystem) and optional **HF download** (if `repo_id` is a hub id)
* **JSONL datasets** for train (+ optional eval split)
* **CPT‑style token stream packing** into fixed‑length blocks
* **Two packing modes**
  * `drop`: strict CPT, drop remainder tokens (preferred for real CPT)
  * `pad`: pad the remainder to `block_size` and **mask loss** on padding (useful for small datasets / debugging)
* **Checkpointing + resume**
  * `resume_from_checkpoint: "auto"` resumes from the latest checkpoint under `run_dir/checkpoints`
* **JSONL logs** written locally
  * training logs: `run_dir/logs/train.jsonl`
  * eval logs: `run_dir/logs/eval.jsonl`
* **Evaluation**
  * logs `eval_loss` and computed `perplexity = exp(eval_loss)` (with safe overflow guard)
* **Adapter output**
  * saves the final/best adapter to `run_dir/best_adapter`
* **Merge workflow**
  * `--merge-only` merges an existing adapter later
  * merge is done **on CPU** to avoid GPU OOM
  * merged model is stored under the configured merge output directory (relative to `run_dir` if a relative path)

---

## Repository layout (outputs)

A run produces the following structure under `run.run_dir`:

```
runs//
├─ checkpoints/          # trainer checkpoints (for resume)
├─ best_adapter/         # saved LoRA adapter
├─ logs/
│  ├─ train.jsonl        # step-wise training logs
│  └─ eval.jsonl         # eval logs (eval_loss + perplexity)
├─ eval_final.json       # final eval metrics summary (if eval is enabled)
└─ config_resolved.yaml  # exact config used for the run
```

If merge is used, the merged model is written to:

* `run_dir/` if `merge.output_dir` is relative (e.g. `./merged_model`)
* or the absolute path if it is absolute.

---

## Supported training modes

### 1) LoRA vs QLoRA (same script)

* **QLoRA** happens when `model.use_4bit: true`
  * base weights are loaded in 4‑bit using bitsandbytes
  * training updates only LoRA parameters
* **LoRA** happens when `model.use_4bit: false`
  * base weights are loaded in fp16/bf16 (as configured)
  * training updates only LoRA parameters

No “full finetune” mode is enabled by default in this runner.

---

## Data pipeline (CPT behavior)

### Input format

* JSONL file where each line contains a text field (default `"text"`).
* Example:
  * `{"text": "some training text..."}`

### Packing (token stream → fixed blocks)

* Each sample is tokenized without truncation.
* An **EOS token is appended** per document to preserve boundaries.
* Token lists are concatenated and converted into **fixed‑length blocks** of `data.block_size`.

Two modes:

* **`drop` (strict CPT):** remainder tokens that don’t fill a full block are discarded.
* **`pad` (debug/small data):** remainder is padded to `block_size`:
  * `attention_mask = 0` for padded positions
  * `labels = -100` for padded positions (loss masking)

This is what allowed training to proceed even with tiny dummy datasets at `block_size=1024`.

---

## Logging

Trainer‑Kit writes **machine‑readable logs** in JSONL.

### Training logs (`logs/train.jsonl`)

Includes entries with:

* `step`
* `loss`
* `grad_norm`
* `learning_rate`
* `progress_pct` (step progress when `max_steps` is active)
* ETA estimation

### Eval logs (`logs/eval.jsonl`)

Includes:

* `eval_loss`
* `perplexity`

Notes:

* When using `max_steps`, the Trainer’s internal `epoch` counter can grow unexpectedly on tiny datasets (because steps/epoch becomes ~1). **Use `progress_pct` as the reliable indicator** for step‑based runs.
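
The eval log fields described above (`eval_loss` plus `perplexity = exp(eval_loss)` with an overflow guard) can be sketched roughly as follows. This is a minimal illustration and not the runner's actual code: the helper names `safe_perplexity`, `append_jsonl`, and `read_jsonl`, as well as the `700.0` cutoff, are assumptions made for this example.

```python
import json
import math
from pathlib import Path


def safe_perplexity(eval_loss: float, max_exponent: float = 700.0) -> float:
    """Return exp(eval_loss), or infinity if the exponent would overflow.

    math.exp raises OverflowError for arguments a little above 709, so the
    loss is checked first. The 700.0 cutoff is an assumption for this sketch.
    """
    if eval_loss > max_exponent:
        return float("inf")
    return math.exp(eval_loss)


def append_jsonl(path: Path, record: dict) -> None:
    """Append one JSON object as a single line, creating parent dirs if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_jsonl(path: Path) -> list:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Hypothetical eval record; the real runner may log additional fields.
    loss = 2.0
    print({"eval_loss": loss, "perplexity": safe_perplexity(loss)})
```

Because each record is one self-contained line, a log written this way can be tailed or parsed incrementally while a run is still in progress.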

---

## Checkpointing and resume

The trainer saves checkpoints under:

* `run_dir/checkpoints/`

Resume options:

* `resume_from_checkpoint: "auto"` → picks the latest checkpoint automatically
* `resume_from_checkpoint: "/path/to/checkpoint"` → resumes from a specific checkpoint
* `resume_from_checkpoint: null` → fresh run

---

## Merging adapters into a final model

Trainer‑Kit supports exporting a merged model:

### Merge after training

* Enable merge in config (`merge.enabled: true`)
* The script will:
  1. save the adapter
  2. free GPU memory
  3. reload base model on **CPU**
  4. load adapter
  5. `merge_and_unload()`
  6. save final merged model

### Merge later

Run:

```
python run_cpt.py --config config.yaml --merge-only
```

This skips training and merges `run_dir/best_adapter` into the base model.

---

## How to run

### Train

```
python run_cpt.py --config config.yaml
```

### Merge only

```
python run_cpt.py --config config.yaml --merge-only
```