
Trainer‑Kit: Config‑Driven CPT (LoRA / QLoRA) with Packing, Logging, Resume, and Merge

Trainer‑Kit is a small, config‑driven training runner for continued pretraining (CPT) on causal LMs. It supports LoRA and QLoRA, data packing (strict or padding‑masked), checkpointing + resume, JSONL logging, periodic eval with perplexity, and an optional merge step to export a final merged model.


What we built

✅ Core goals implemented

  • CPT training loop controlled entirely via a YAML config

  • Local model support (load from filesystem) and optional HF download (if repo_id is a hub id)

  • JSONL datasets for train (+ optional eval split)

  • CPT‑style token stream packing into fixed‑length blocks

  • Two packing modes

    • drop: strict CPT, drop remainder tokens (preferred for real CPT)
    • pad: pad the remainder to block_size and mask loss on padding (useful for small datasets / debugging)
  • Checkpointing + resume

    • resume_from_checkpoint: "auto" resumes from the latest checkpoint under run_dir/checkpoints
  • JSONL logs written locally

    • training logs: run_dir/logs/train.jsonl
    • eval logs: run_dir/logs/eval.jsonl
  • Evaluation

    • logs eval_loss and the computed perplexity = exp(eval_loss) (with a safe overflow guard)
  • Adapter output

    • saves the final/best adapter to run_dir/best_adapter
  • Merge workflow

    • --merge-only merges an existing adapter later
    • merge is done on CPU to avoid GPU OOM
    • merged model is stored under the configured merge output directory (relative to run_dir if a relative path)
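
The features above map onto config keys. A minimal illustrative config might look like the sketch below; keys not explicitly documented in this README (e.g. the exact names `train_file`, `text_field`, `packing_mode`) are assumptions about the schema, not guaranteed names.

```yaml
# Illustrative config sketch; undocumented key names are assumptions.
run:
  run_name: demo-cpt
  run_dir: runs/demo-cpt

model:
  repo_id: ./models/base-7b       # local path, or an HF hub id to download
  use_4bit: true                  # true = QLoRA, false = LoRA

data:
  train_file: data/train.jsonl
  eval_file: data/eval.jsonl      # optional eval split
  text_field: text                # default "text"
  block_size: 1024
  packing_mode: drop              # drop (strict CPT) or pad (debug / small data)

train:
  resume_from_checkpoint: auto    # auto | /path/to/checkpoint | null

merge:
  enabled: true
  output_dir: ./merged_model      # relative paths resolve under run_dir
```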

Repository layout (outputs)

A run produces the following structure under run.run_dir:

runs/<run_name>/
├─ checkpoints/            # trainer checkpoints (for resume)
├─ best_adapter/           # saved LoRA adapter
├─ logs/
│  ├─ train.jsonl          # step-wise training logs
│  └─ eval.jsonl           # eval logs (eval_loss + perplexity)
├─ eval_final.json         # final eval metrics summary (if eval is enabled)
└─ config_resolved.yaml    # exact config used for the run

If merge is used, the merged model is written to:

  • run_dir/<merge.output_dir> if merge.output_dir is relative (e.g. ./merged_model)
  • or the absolute path if it is absolute.
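
That resolution rule is straightforward; a minimal sketch (the helper name is hypothetical, not the script's actual function):

```python
from pathlib import Path

def resolve_merge_dir(run_dir: str, output_dir: str) -> Path:
    """Resolve merge.output_dir under run_dir unless it is an absolute path.

    Hypothetical helper mirroring the rule described above.
    """
    out = Path(output_dir)
    # Absolute paths are used as-is; relative ones are joined onto run_dir.
    return out if out.is_absolute() else Path(run_dir) / out
```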

Supported training modes

1) LoRA vs QLoRA (same script)

  • QLoRA happens when model.use_4bit: true

    • base weights are loaded in 4‑bit using bitsandbytes
    • training updates only LoRA parameters
  • LoRA happens when model.use_4bit: false

    • base weights are loaded in fp16/bf16 (as configured)
    • training updates only LoRA parameters

No “full finetune” mode is enabled by default in this runner.


Data pipeline (CPT behavior)

Input format

  • JSONL file where each line contains a text field (default "text").

  • Example:

    • {"text": "some training text..."}
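
Reading that format is a one-liner per record; a minimal sketch (the helper name is illustrative):

```python
import json

def iter_texts(path: str, text_field: str = "text"):
    """Yield the text field from each non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)[text_field]
```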

Packing (token stream → fixed blocks)

  • Each sample is tokenized without truncation.
  • An EOS token is appended per document to preserve boundaries.
  • Token lists are concatenated and converted into fixed‑length blocks of data.block_size.

Two modes:

  • drop (strict CPT): remainder tokens that don’t fill a full block are discarded.

  • pad (debug/small data): remainder is padded to block_size:

    • attention_mask = 0 for padded positions
    • labels = -100 for padded positions (loss masking)

The pad mode is what allows training to proceed even on tiny dummy datasets at block_size=1024.
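
The packing logic above can be sketched as follows. This is a simplified stand-in for the script's actual implementation: it takes pre-tokenized documents and assumes `eos_id` and `pad_id` are supplied by the tokenizer.

```python
# Sketch of CPT packing: concatenate tokenized docs (+ EOS) into fixed blocks.
def pack_blocks(docs, block_size, eos_id, mode="drop", pad_id=0):
    stream = []
    for ids in docs:                    # each `ids` is one tokenized document
        stream.extend(ids + [eos_id])   # EOS preserves document boundaries
    blocks = []
    for i in range(0, len(stream), block_size):
        chunk = stream[i:i + block_size]
        if len(chunk) < block_size:
            if mode == "drop":          # strict CPT: discard the remainder
                break
            n_pad = block_size - len(chunk)
            blocks.append({             # pad mode: pad and mask the loss
                "input_ids": chunk + [pad_id] * n_pad,
                "attention_mask": [1] * len(chunk) + [0] * n_pad,
                "labels": chunk + [-100] * n_pad,
            })
        else:
            blocks.append({
                "input_ids": chunk,
                "attention_mask": [1] * block_size,
                "labels": list(chunk),
            })
    return blocks
```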


Logging

Trainer‑Kit writes machine‑readable logs in JSONL.

Training logs (logs/train.jsonl)

Includes entries with:

  • step
  • loss
  • grad_norm
  • learning_rate
  • progress_pct (step progress when max_steps is active)
  • ETA estimation
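
One way the progress_pct and ETA fields can be derived for a step-based run is sketched below; the exact field set and formula in the real logger may differ.

```python
import json
import time

def train_log_record(step, max_steps, loss, grad_norm, lr, start_time):
    """Build one hypothetical train.jsonl record for a max_steps-based run."""
    elapsed = time.time() - start_time
    progress = step / max_steps
    # Linear extrapolation: projected total time minus elapsed time.
    eta_s = elapsed / progress - elapsed if progress > 0 else None
    return json.dumps({
        "step": step,
        "loss": loss,
        "grad_norm": grad_norm,
        "learning_rate": lr,
        "progress_pct": round(100 * progress, 2),
        "eta_seconds": None if eta_s is None else round(eta_s, 1),
    })
```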

Eval logs (logs/eval.jsonl)

Includes:

  • eval_loss
  • perplexity

Notes:

  • When using max_steps, the Trainer’s internal epoch counter can grow unexpectedly on tiny datasets (because steps/epoch becomes ~1). Use progress_pct as the reliable indicator for step‑based runs.
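
The perplexity computation with its overflow guard amounts to the following sketch (the fallback value for overflow is an assumption):

```python
import math

def perplexity(eval_loss: float) -> float:
    """Perplexity = exp(eval_loss), guarded against float overflow."""
    try:
        return math.exp(eval_loss)
    except OverflowError:
        return float("inf")  # assumed fallback for very large losses
```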

Checkpointing and resume

The trainer saves checkpoints under:

  • run_dir/checkpoints/

Resume options:

  • resume_from_checkpoint: "auto" → picks the latest checkpoint automatically
  • resume_from_checkpoint: "/path/to/checkpoint" → resumes from a specific checkpoint
  • resume_from_checkpoint: null → fresh run
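
The "auto" behavior can be implemented by picking the highest-numbered checkpoint directory; a sketch assuming the usual Trainer naming scheme `checkpoint-<step>`:

```python
from pathlib import Path

def latest_checkpoint(run_dir: str):
    """Return the path of the newest checkpoint-<step> dir, or None."""
    ckpts = [p for p in Path(run_dir, "checkpoints").glob("checkpoint-*")
             if p.is_dir()]
    if not ckpts:
        return None  # fresh run: nothing to resume from
    # Highest global step wins (numeric sort, not lexicographic).
    return str(max(ckpts, key=lambda p: int(p.name.split("-")[-1])))
```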

Merging adapters into a final model

Trainer‑Kit supports exporting a merged model:

Merge after training

  • Enable merge in config (merge.enabled: true)

  • The script will:

    1. save the adapter
    2. free GPU memory
    3. reload base model on CPU
    4. load adapter
    5. merge_and_unload()
    6. save final merged model

Merge later

Run:

python run_cpt.py --config config.yaml --merge-only

This skips training and merges run_dir/best_adapter into the base model.


How to run

Train

python run_cpt.py --config config.yaml

Merge only

python run_cpt.py --config config.yaml --merge-only