
Trainer‑Kit: Config‑Driven CPT (LoRA / QLoRA) with Packing, Logging, Resume, and Merge

Trainer‑Kit is a small, config‑driven training runner for continued pretraining (CPT) on causal LMs. It supports LoRA and QLoRA, data packing (strict or padding‑masked), checkpointing + resume, JSONL logging, periodic eval with perplexity, and an optional merge step to export a final merged model.


What we built

✅ Core goals implemented

  • CPT training loop controlled entirely via a YAML config

  • Local model support (load from filesystem) and optional HF download (if repo_id is a hub id)

  • JSONL datasets for train (+ optional eval split)

  • CPT‑style token stream packing into fixed‑length blocks

  • Two packing modes

    • drop: strict CPT, drop remainder tokens (preferred for real CPT)
    • pad: pad the remainder to block_size and mask loss on padding (useful for small datasets / debugging)
  • Checkpointing + resume

    • resume_from_checkpoint: "auto" resumes from the latest checkpoint under run_dir/checkpoints
  • JSONL logs written locally

    • training logs: run_dir/logs/train.jsonl
    • eval logs: run_dir/logs/eval.jsonl
  • Evaluation

    • logs eval_loss and the computed perplexity = exp(eval_loss) (with a safe overflow guard)
  • Adapter output

    • saves the final/best adapter to run_dir/best_adapter
  • Merge workflow

    • --merge-only merges an existing adapter later
    • merge is done on CPU to avoid GPU OOM
    • merged model is stored under the configured merge output directory (relative to run_dir if a relative path)
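
The features above map onto config keys. A minimal illustrative config might look like the sketch below; keys not explicitly documented in this README (e.g. the exact names `train_file`, `text_field`, `packing_mode`) are assumptions about the schema, not guaranteed names.

```yaml
# Illustrative config sketch; undocumented key names are assumptions.
run:
  run_name: demo-cpt
  run_dir: runs/demo-cpt

model:
  repo_id: ./models/base-7b       # local path, or an HF hub id to download
  use_4bit: true                  # true = QLoRA, false = LoRA

data:
  train_file: data/train.jsonl
  eval_file: data/eval.jsonl      # optional eval split
  text_field: text                # default "text"
  block_size: 1024
  packing_mode: drop              # drop (strict CPT) or pad (debug / small data)

train:
  resume_from_checkpoint: auto    # auto | /path/to/checkpoint | null

merge:
  enabled: true
  output_dir: ./merged_model      # relative paths resolve under run_dir
```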

Repository layout (outputs)

A run produces the following structure under run.run_dir:

runs/<run_name>/
├─ checkpoints/            # trainer checkpoints (for resume)
├─ best_adapter/           # saved LoRA adapter
├─ logs/
│  ├─ train.jsonl          # step-wise training logs
│  └─ eval.jsonl           # eval logs (eval_loss + perplexity)
├─ eval_final.json         # final eval metrics summary (if eval is enabled)
└─ config_resolved.yaml    # exact config used for the run

If merge is used, the merged model is written to:

  • run_dir/<merge.output_dir> if merge.output_dir is relative (e.g. ./merged_model)
  • or the absolute path if it is absolute.
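
That resolution rule is straightforward; a minimal sketch (the helper name is hypothetical, not the script's actual function):

```python
from pathlib import Path

def resolve_merge_dir(run_dir: str, output_dir: str) -> Path:
    """Resolve merge.output_dir under run_dir unless it is an absolute path.

    Hypothetical helper mirroring the rule described above.
    """
    out = Path(output_dir)
    # Absolute paths are used as-is; relative ones are joined onto run_dir.
    return out if out.is_absolute() else Path(run_dir) / out
```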

Supported training modes

1) LoRA vs QLoRA (same script)

  • QLoRA happens when model.use_4bit: true

    • base weights are loaded in 4‑bit using bitsandbytes
    • training updates only LoRA parameters
  • LoRA happens when model.use_4bit: false

    • base weights are loaded in fp16/bf16 (as configured)
    • training updates only LoRA parameters

No “full finetune” mode is enabled by default in this runner.


Data pipeline (CPT behavior)

Input format

  • JSONL file where each line contains a text field (default "text").

  • Example:

    • {"text": "some training text..."}
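
Reading that format is a one-liner per record; a minimal sketch (the helper name is illustrative):

```python
import json

def iter_texts(path: str, text_field: str = "text"):
    """Yield the text field from each non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)[text_field]
```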

Packing (token stream → fixed blocks)

  • Each sample is tokenized without truncation.
  • An EOS token is appended per document to preserve boundaries.
  • Token lists are concatenated and converted into fixed‑length blocks of data.block_size.

Two modes:

  • drop (strict CPT): remainder tokens that don’t fill a full block are discarded.

  • pad (debug/small data): remainder is padded to block_size:

    • attention_mask = 0 for padded positions
    • labels = -100 for padded positions (loss masking)

The pad mode is what allows training to proceed even on tiny dummy datasets at block_size=1024.
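
The packing logic above can be sketched as follows. This is a simplified stand-in for the script's actual implementation: it takes pre-tokenized documents and assumes `eos_id` and `pad_id` are supplied by the tokenizer.

```python
# Sketch of CPT packing: concatenate tokenized docs (+ EOS) into fixed blocks.
def pack_blocks(docs, block_size, eos_id, mode="drop", pad_id=0):
    stream = []
    for ids in docs:                    # each `ids` is one tokenized document
        stream.extend(ids + [eos_id])   # EOS preserves document boundaries
    blocks = []
    for i in range(0, len(stream), block_size):
        chunk = stream[i:i + block_size]
        if len(chunk) < block_size:
            if mode == "drop":          # strict CPT: discard the remainder
                break
            n_pad = block_size - len(chunk)
            blocks.append({             # pad mode: pad and mask the loss
                "input_ids": chunk + [pad_id] * n_pad,
                "attention_mask": [1] * len(chunk) + [0] * n_pad,
                "labels": chunk + [-100] * n_pad,
            })
        else:
            blocks.append({
                "input_ids": chunk,
                "attention_mask": [1] * block_size,
                "labels": list(chunk),
            })
    return blocks
```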


Logging

Trainer‑Kit writes machine‑readable logs in JSONL.

Training logs (logs/train.jsonl)

Includes entries with:

  • step
  • loss
  • grad_norm
  • learning_rate
  • progress_pct (step progress when max_steps is active)
  • ETA estimation
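
One way the progress_pct and ETA fields can be derived for a step-based run is sketched below; the exact field set and formula in the real logger may differ.

```python
import json
import time

def train_log_record(step, max_steps, loss, grad_norm, lr, start_time):
    """Build one hypothetical train.jsonl record for a max_steps-based run."""
    elapsed = time.time() - start_time
    progress = step / max_steps
    # Linear extrapolation: projected total time minus elapsed time.
    eta_s = elapsed / progress - elapsed if progress > 0 else None
    return json.dumps({
        "step": step,
        "loss": loss,
        "grad_norm": grad_norm,
        "learning_rate": lr,
        "progress_pct": round(100 * progress, 2),
        "eta_seconds": None if eta_s is None else round(eta_s, 1),
    })
```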

Eval logs (logs/eval.jsonl)

Includes:

  • eval_loss
  • perplexity

Notes:

  • When using max_steps, the Trainer’s internal epoch counter can grow unexpectedly on tiny datasets (because steps/epoch becomes ~1). Use progress_pct as the reliable indicator for step‑based runs.
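
The perplexity computation with its overflow guard amounts to the following sketch (the fallback value for overflow is an assumption):

```python
import math

def perplexity(eval_loss: float) -> float:
    """Perplexity = exp(eval_loss), guarded against float overflow."""
    try:
        return math.exp(eval_loss)
    except OverflowError:
        return float("inf")  # assumed fallback for very large losses
```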

Checkpointing and resume

The trainer saves checkpoints under:

  • run_dir/checkpoints/

Resume options:

  • resume_from_checkpoint: "auto" → picks the latest checkpoint automatically
  • resume_from_checkpoint: "/path/to/checkpoint" → resumes from a specific checkpoint
  • resume_from_checkpoint: null → fresh run
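
The "auto" behavior can be implemented by picking the highest-numbered checkpoint directory; a sketch assuming the usual Trainer naming scheme `checkpoint-<step>`:

```python
from pathlib import Path

def latest_checkpoint(run_dir: str):
    """Return the path of the newest checkpoint-<step> dir, or None."""
    ckpts = [p for p in Path(run_dir, "checkpoints").glob("checkpoint-*")
             if p.is_dir()]
    if not ckpts:
        return None  # fresh run: nothing to resume from
    # Highest global step wins (numeric sort, not lexicographic).
    return str(max(ckpts, key=lambda p: int(p.name.split("-")[-1])))
```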

Merging adapters into a final model

Trainer‑Kit supports exporting a merged model:

Merge after training

  • Enable merge in config (merge.enabled: true)

  • The script will:

    1. save the adapter
    2. free GPU memory
    3. reload base model on CPU
    4. load adapter
    5. merge_and_unload()
    6. save final merged model

Merge later

Run:

python run_cpt.py --config config.yaml --merge-only

This skips training and merges run_dir/best_adapter into the base model.


How to run

Train

python run_cpt.py --config config.yaml

Merge only

python run_cpt.py --config config.yaml --merge-only