# Trainer‑Kit: Config‑Driven CPT (LoRA / QLoRA) with Packing, Logging, Resume, and Merge

Trainer‑Kit is a small, config‑driven training runner for **continued pretraining (CPT)** on causal LMs.
It supports **LoRA** and **QLoRA**, data **packing** (strict or padding‑masked), **checkpointing + resume**, **JSONL logging**, periodic **eval with perplexity**, and an optional **merge** step to export a final merged model.

---
## What we built

### ✅ Core goals implemented

* **CPT training loop** controlled entirely via a **YAML config**
* **Local model support** (load from filesystem) and optional **HF download** (if `repo_id` is a hub id)
* **JSONL datasets** for train (+ optional eval split)
* **CPT‑style token stream packing** into fixed‑length blocks
* **Two packing modes**
  * `drop`: strict CPT, drop remainder tokens (preferred for real CPT)
  * `pad`: pad the remainder to `block_size` and **mask loss** on padding (useful for small datasets / debugging)
* **Checkpointing + resume**
  * `resume_from_checkpoint: "auto"` resumes from the latest checkpoint under `run_dir/checkpoints`
* **JSONL logs** written locally
  * training logs: `run_dir/logs/train.jsonl`
  * eval logs: `run_dir/logs/eval.jsonl`
* **Evaluation**
  * logs `eval_loss` and computed `perplexity = exp(eval_loss)` (with safe overflow guard)
* **Adapter output**
  * saves the final/best adapter to `run_dir/best_adapter`
* **Merge workflow**
  * `--merge-only` merges an existing adapter later
  * merge is done **on CPU** to avoid GPU OOM
  * merged model is stored under the configured merge output directory (relative to `run_dir` if a relative path)

---
## Repository layout (outputs)

A run produces the following structure under `run.run_dir`:

```
runs/<run_name>/
├─ checkpoints/           # trainer checkpoints (for resume)
├─ best_adapter/          # saved LoRA adapter
├─ logs/
│  ├─ train.jsonl         # step-wise training logs
│  └─ eval.jsonl          # eval logs (eval_loss + perplexity)
├─ eval_final.json        # final eval metrics summary (if eval is enabled)
└─ config_resolved.yaml   # exact config used for the run
```

If merge is used, the merged model is written to:

* `run_dir/<merge.output_dir>` if `merge.output_dir` is relative (e.g. `./merged_model`)
* or the absolute path, if it is absolute

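A minimal sketch of that path resolution (the helper name `resolve_merge_output_dir` is illustrative, not an actual function of the script):

```
from pathlib import Path

def resolve_merge_output_dir(run_dir: str, output_dir: str) -> Path:
    """Resolve merge.output_dir against run_dir unless it is already absolute."""
    out = Path(output_dir)
    # A relative value such as "./merged_model" lands inside the run directory.
    return out if out.is_absolute() else Path(run_dir) / out

# resolve_merge_output_dir("runs/my_run", "./merged_model") -> runs/my_run/merged_model
```
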
---

## Supported training modes

### 1) LoRA vs QLoRA (same script)

* **QLoRA** is used when `model.use_4bit: true`
  * base weights are loaded in 4‑bit using bitsandbytes
  * training updates only the LoRA parameters
* **LoRA** is used when `model.use_4bit: false`
  * base weights are loaded in fp16/bf16 (as configured)
  * training updates only the LoRA parameters

No “full finetune” mode is enabled by default in this runner.

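For orientation, the two modes map onto standard `transformers` / `peft` / `bitsandbytes` loading code roughly as in the sketch below. This is not the exact code in `run_cpt.py`; the LoRA hyperparameters and dtypes are placeholder assumptions.

```
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

def load_base_and_lora(model_path: str, use_4bit: bool):
    if use_4bit:
        # QLoRA: 4-bit base weights via bitsandbytes, LoRA adapters on top
        quant_cfg = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        base = AutoModelForCausalLM.from_pretrained(
            model_path, quantization_config=quant_cfg, device_map="auto"
        )
        base = prepare_model_for_kbit_training(base)
    else:
        # LoRA: fp16/bf16 base weights, LoRA adapters on top
        base = AutoModelForCausalLM.from_pretrained(
            model_path, torch_dtype=torch.bfloat16, device_map="auto"
        )

    # r / lora_alpha here are illustrative; some architectures also need target_modules.
    lora_cfg = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
    model = get_peft_model(base, lora_cfg)  # only LoRA parameters are trainable
    return model
```
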
---

## Data pipeline (CPT behavior)

### Input format

* JSONL file where each line contains a text field (default `"text"`).
* Example:
  * `{"text": "some training text..."}`

### Packing (token stream → fixed blocks)

* Each sample is tokenized without truncation.
* An **EOS token is appended** per document to preserve boundaries.
* Token lists are concatenated and converted into **fixed‑length blocks** of `data.block_size`.

Two modes (see the sketch after this list):

* **`drop` (strict CPT):** remainder tokens that don’t fill a full block are discarded.
* **`pad` (debug/small data):** the remainder is padded to `block_size`:
  * `attention_mask = 0` for padded positions
  * `labels = -100` for padded positions (loss masking)

This is what allowed training to proceed even with tiny dummy datasets at `block_size=1024`.

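A minimal sketch of this packing logic under the rules above (names are illustrative, not Trainer‑Kit internals):

```
def pack_token_stream(docs_token_ids, block_size, eos_id, mode="drop"):
    """Concatenate per-document token ids (EOS appended per doc) into fixed blocks."""
    stream = []
    for ids in docs_token_ids:
        stream.extend(ids + [eos_id])  # EOS preserves document boundaries

    blocks = []
    for start in range(0, len(stream), block_size):
        chunk = stream[start:start + block_size]
        if len(chunk) < block_size:
            if mode == "drop":
                break  # strict CPT: discard the remainder
            # pad mode: fill up to block_size and mask out the padding
            pad_len = block_size - len(chunk)
            input_ids = chunk + [eos_id] * pad_len           # pad token choice is an assumption
            attention_mask = [1] * len(chunk) + [0] * pad_len
            labels = chunk + [-100] * pad_len                # -100 => ignored by the loss
        else:
            input_ids = chunk
            attention_mask = [1] * block_size
            labels = list(chunk)
        blocks.append({
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
        })
    return blocks
```
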
---

## Logging

Trainer‑Kit writes **machine‑readable logs** in JSONL.

### Training logs (`logs/train.jsonl`)

Entries include:

* `step`
* `loss`
* `grad_norm`
* `learning_rate`
* `progress_pct` (step progress when `max_steps` is active)
* an ETA estimate

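Because the logs are plain JSONL, they can be consumed with a few lines of Python, for example (the run path is an example):

```
import json

with open("runs/my_run/logs/train.jsonl") as f:  # example run path
    for line in f:
        entry = json.loads(line)
        if "loss" in entry:
            print(f'step {entry["step"]}: loss={entry["loss"]:.4f} lr={entry["learning_rate"]:.2e}')
```
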
### Eval logs (`logs/eval.jsonl`)

Entries include:

* `eval_loss`
* `perplexity`

Notes:

* When using `max_steps`, the Trainer’s internal `epoch` counter can grow unexpectedly on tiny datasets (an epoch then covers only ~1 step, so the counter climbs quickly).
  **Use `progress_pct` as the reliable indicator** for step‑based runs.

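The `perplexity` field is `exp(eval_loss)`; the overflow guard mentioned earlier can be as simple as the following sketch (the exact guard in Trainer‑Kit may differ):

```
import math

def safe_perplexity(eval_loss: float) -> float:
    """exp(eval_loss), clamped so a very large loss doesn't raise OverflowError."""
    try:
        return math.exp(eval_loss)
    except OverflowError:
        return float("inf")
```
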
---

## Checkpointing and resume

The trainer saves checkpoints under:

* `run_dir/checkpoints/`

Resume options:

* `resume_from_checkpoint: "auto"` → picks the latest checkpoint automatically
* `resume_from_checkpoint: "/path/to/checkpoint"` → resumes from a specific checkpoint
* `resume_from_checkpoint: null` → fresh run

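Assuming the standard `checkpoint-<step>` folder naming used by the Hugging Face `Trainer`, the `"auto"` option can be implemented along these lines (the helper name is illustrative):

```
from pathlib import Path

def find_latest_checkpoint(checkpoints_dir: str) -> str | None:
    """Return the checkpoint-<step> folder with the highest step, or None."""
    candidates = [
        p for p in Path(checkpoints_dir).glob("checkpoint-*")
        if p.is_dir() and p.name.split("-")[-1].isdigit()
    ]
    if not candidates:
        return None
    latest = max(candidates, key=lambda p: int(p.name.split("-")[-1]))
    return str(latest)

# resume = find_latest_checkpoint("runs/my_run/checkpoints")
# trainer.train(resume_from_checkpoint=resume)
```
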
---

## Merging adapters into a final model

Trainer‑Kit supports exporting a merged model:

### Merge after training

* Enable merge in the config (`merge.enabled: true`)
* The script will:
  1. save the adapter
  2. free GPU memory
  3. reload the base model on **CPU**
  4. load the adapter
  5. `merge_and_unload()`
  6. save the final merged model

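In `peft` terms, the CPU merge corresponds roughly to the sketch below (it assumes the adapter sits in `run_dir/best_adapter`; the actual script may differ in details):

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

def merge_adapter_on_cpu(base_model_path: str, adapter_dir: str, output_dir: str):
    # Without a device_map, from_pretrained keeps the model on CPU,
    # which avoids GPU OOM during the merge.
    base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16)
    # Attach the trained LoRA adapter, then fold its weights into the base model.
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)
    # Save the tokenizer too so the merged model directory is self-contained.
    AutoTokenizer.from_pretrained(base_model_path).save_pretrained(output_dir)
```
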
### Merge later

Run:

```
python run_cpt.py --config config.yaml --merge-only
```

This skips training and merges `run_dir/best_adapter` into the base model.

---
## How to run

### Train

```
python run_cpt.py --config config.yaml
```

### Merge only

```
python run_cpt.py --config config.yaml --merge-only
```