CPT Configuration Parameters: Detailed Guide

This document provides a comprehensive explanation of all configuration parameters in config.yaml and how they're implemented in run_cpt.py.

Table of Contents

  • Run Parameters
  • Model Parameters
  • Data Parameters
  • PEFT Parameters
  • Training Parameters
  • Merge Parameters
  • Parameter Dependencies and Interactions

Run Parameters

run.run_dir

  • Type: String (path)
  • Required: Yes
  • Default: No default
  • Description: Directory where training outputs will be saved
  • Used in: Line ~480 in run_cpt.py
  • Implementation:
    run_dir = _ensure_dir(Path(cfg["run"]["run_dir"]))
    
  • Example Values:
    • ./runs/cpt_run_v1
    • /workspace/outputs/my_experiment
    • ./checkpoints/cpt_experiment

run.seed

  • Type: Integer
  • Required: No
  • Default: 42
  • Description: Random seed for reproducibility
  • Used in: Lines ~460, ~240 in run_cpt.py
  • Implementation:
    set_seed(int(cfg["run"].get("seed", 42)))
    # Used in data shuffling and train/test split
    
  • Example Values: 42, 123, 2023
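
A minimal stdlib sketch of why a fixed seed matters, using Python's `random` module as a stand-in for the dataset shuffle in run_cpt.py (the helper name `shuffled` is illustrative):

```python
import random

def shuffled(items, seed):
    """Return a shuffled copy of items, deterministic for a given seed."""
    rng = random.Random(seed)  # independent RNG; global state untouched
    out = list(items)
    rng.shuffle(out)
    return out

# The same seed always reproduces the same data order,
# which is what makes runs comparable and resumable.
a = shuffled(range(10), seed=42)
b = shuffled(range(10), seed=42)
```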

Model Parameters

model.repo_id

  • Type: String (path or HuggingFace repo)
  • Required: Yes
  • Default: No default
  • Description: Model identifier - can be local path or HuggingFace repository
  • Used in: Lines ~480-500 in run_cpt.py
  • Implementation:
    repo_id = str(model_cfg["repo_id"]).strip()
    repo_path = Path(repo_id)
    if repo_path.exists() and repo_path.is_dir():
        base_dir = repo_path  # Local path
    else:
        # Download from HuggingFace
        snapshot_download(repo_id=repo_id, ...)
    
  • Example Values:
    • Local: /workspace/Models/Devstral-Small-2-24B-Instruct-2512
    • HF Repo: meta-llama/Llama-2-7b-hf
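
The branch above can be sketched with `pathlib` alone; `resolve_model_source` is a hypothetical helper, and the Hub branch returns a marker instead of calling `snapshot_download` so the sketch stays dependency-free:

```python
from pathlib import Path
import tempfile

def resolve_model_source(repo_id: str):
    """("local", path) if repo_id is an existing directory on disk,
    otherwise ("hub", repo_id), signalling a snapshot_download."""
    p = Path(repo_id.strip())
    if p.exists() and p.is_dir():
        return ("local", p)
    return ("hub", repo_id)

with tempfile.TemporaryDirectory() as d:
    kind_local, _ = resolve_model_source(d)  # existing directory -> "local"
kind_hub, _ = resolve_model_source("meta-llama/Llama-2-7b-hf")  # -> "hub"
```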

model.revision

  • Type: String or null
  • Required: No
  • Default: null
  • Description: Specific model revision/branch/tag from HuggingFace
  • Used in: Line ~495 in run_cpt.py
  • Implementation:
    snapshot_download(
        repo_id=repo_id,
        revision=model_cfg.get("revision", None),
        ...
    )
    
  • Example Values:
    • "main" - Main branch
    • "v1.0" - Specific tag
    • "abc123def" - Specific commit hash
    • null - Latest version

model.base_local_dir

  • Type: String (path)
  • Required: No
  • Default: "base_model"
  • Description: Directory name for downloaded model when using HF repo
  • Used in: Line ~495 in run_cpt.py
  • Implementation:
    base_dir = _ensure_dir(run_dir / model_cfg.get("base_local_dir", "base_model"))
    
  • Example Values: "base_model", "downloaded_model", "model_files"

model.trust_remote_code

  • Type: Boolean
  • Required: No
  • Default: true
  • Description: Allow loading models with custom code
  • Used in: Lines ~320, ~340, ~450 in run_cpt.py
  • Implementation:
    tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
    model = AutoModelForCausalLM.from_pretrained(..., trust_remote_code=trust_remote_code, ...)
    
  • Example Values: true, false

model.tokenizer_use_fast

  • Type: Boolean
  • Required: No
  • Default: true
  • Description: Use fast tokenizer implementation
  • Used in: Lines ~320, ~450 in run_cpt.py
  • Implementation:
    tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
    
  • Example Values: true, false

model.device_map

  • Type: String
  • Required: No
  • Default: "auto"
  • Description: How to distribute model across devices
  • Used in: Lines ~350, ~370 in run_cpt.py
  • Implementation:
    model = AutoModelForCausalLM.from_pretrained(..., device_map=device_map, ...)
    
  • Example Values:
    • "auto" - Automatic distribution
    • "cpu" - CPU only
    • "cuda:0" - Single GPU
    • {"": 0} - Manual mapping

model.torch_dtype

  • Type: String
  • Required: No
  • Default: "bfloat16"
  • Description: Data type for model tensors
  • Used in: Lines ~45, ~350 in run_cpt.py
  • Implementation:
    def _dtype_from_str(s: str) -> torch.dtype:
        if s in ("float16", "fp16"): return torch.float16
        if s in ("bfloat16", "bf16"): return torch.bfloat16
        if s in ("float32", "fp32"): return torch.float32
        raise ValueError(f"Unknown dtype string: {s}")  # sensible fallback; exact handling may differ
    
  • Example Values:
    • "float16" - 16-bit floats (faster, less memory, less stable)
    • "bfloat16" - Brain float16 (stable, good for training)
    • "float32" - 32-bit floats (slowest, most memory)

model.use_4bit

  • Type: Boolean
  • Required: No
  • Default: false
  • Description: Use 4-bit quantization for memory efficiency
  • Used in: Lines ~325, ~395 in run_cpt.py
  • Implementation:
    use_4bit = bool(model_cfg.get("use_4bit", False))
    if use_4bit:
        quant_cfg = BitsAndBytesConfig(load_in_4bit=True, ...)
    
  • Example Values: true, false

model.bnb_4bit_quant_type

  • Type: String
  • Required: No
  • Default: "nf4"
  • Description: 4-bit quantization type
  • Used in: Lines ~328 in run_cpt.py
  • Implementation:
    bnb_4bit_quant_type=str(model_cfg.get("bnb_4bit_quant_type", "nf4"))
    
  • Example Values:
    • "nf4" - NormalFloat4 (recommended)
    • "fp4" - FloatingPoint4
  • Note: bitsandbytes supports only "nf4" and "fp4" for 4-bit quantization

model.bnb_4bit_use_double_quant

  • Type: Boolean
  • Required: No
  • Default: true
  • Description: Also quantize the quantization constants, saving additional memory
  • Used in: Lines ~329 in run_cpt.py
  • Implementation:
    bnb_4bit_use_double_quant=bool(model_cfg.get("bnb_4bit_use_double_quant", True))
    
  • Example Values: true, false

model.bnb_4bit_compute_dtype

  • Type: String
  • Required: No
  • Default: "bfloat16"
  • Description: Compute dtype for 4-bit quantization
  • Used in: Lines ~330 in run_cpt.py
  • Implementation:
    bnb_4bit_compute_dtype=_dtype_from_str(model_cfg.get("bnb_4bit_compute_dtype", "bfloat16"))
    
  • Example Values: "float16", "bfloat16", "float32"

model.attn_implementation

  • Type: String or null
  • Required: No
  • Default: null
  • Description: Attention implementation to use
  • Used in: Lines ~155, ~350 in run_cpt.py
  • Implementation:
    def _choose_attn_impl(cfg: Dict[str, Any]) -> Optional[str]:
        return cfg.get("model", {}).get("attn_implementation", None)
    # Used in model.from_pretrained(..., attn_implementation=attn_impl, ...)
    
  • Example Values:
    • "flash_attention_2" - Flash Attention 2 (fastest)
    • "sdpa" - Scaled Dot-Product Attention
    • null - Default implementation

Data Parameters

data.train_jsonl

  • Type: String (path)
  • Required: Yes
  • Default: No default
  • Description: Path to training data in JSONL format
  • Used in: Lines ~170 in run_cpt.py
  • Implementation:
    train_path = data_cfg["train_jsonl"]
    ds = load_dataset("json", data_files={"train": train_path})
    
  • Example Values: "/workspace/all_data_with_descriptions.jsonl"

data.eval_jsonl

  • Type: String (path) or null
  • Required: No
  • Default: null
  • Description: Path to evaluation data in JSONL format
  • Used in: Lines ~175 in run_cpt.py
  • Implementation:
    eval_path = data_cfg.get("eval_jsonl", None)
    if eval_path:
        ds_eval = load_dataset("json", data_files={"eval": eval_path})
    
  • Example Values: null (no separate eval file), "/workspace/eval_data.jsonl"

data.eval_split_ratio

  • Type: Float
  • Required: No
  • Default: 0.0 (no automatic split)
  • Description: Ratio of training data to use for evaluation split
  • Used in: Lines ~177 in run_cpt.py
  • Implementation:
    split_ratio = float(data_cfg.get("eval_split_ratio", 0.0))
    if 0.0 < split_ratio < 1.0:
        split = ds["train"].train_test_split(test_size=split_ratio, seed=seed)
    
  • Example Values: 0.1 (10%), 0.2 (20%), 0.05 (5%)
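
The resulting split sizes are simple arithmetic; a quick sketch (exact rounding may differ slightly from `datasets.train_test_split`):

```python
def approx_split_sizes(n_examples: int, eval_split_ratio: float):
    """Approximate train/eval sizes produced by a fractional test_size."""
    n_eval = int(n_examples * eval_split_ratio)
    return n_examples - n_eval, n_eval

# 10% of 5000 examples held out for evaluation:
train_n, eval_n = approx_split_sizes(5000, 0.1)  # -> (4500, 500)
```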

data.text_field

  • Type: String
  • Required: No
  • Default: "text"
  • Description: Field name in JSONL containing the text data
  • Used in: Lines ~185 in run_cpt.py
  • Implementation:
    text_field = data_cfg.get("text_field", "text")
    # tokenize_fn reads examples[text_field] for each batch
    tokenized = dsd["train"].map(
        tokenize_fn,
        batched=True,
        remove_columns=dsd["train"].column_names,
        desc="Tokenizing train",
    )
    
  • Example Values: "text", "content", "prompt", "input"

data.block_size

  • Type: Integer
  • Required: No
  • Default: 2048
  • Description: Maximum sequence length for training
  • Used in: Lines ~180 in run_cpt.py
  • Implementation:
    block_size = int(data_cfg.get("block_size", 2048))
    # Used in grouping texts into blocks
    for i in range(0, full_len, block_size):
        chunk = concatenated["input_ids"][i:i + block_size]
    
  • Example Values: 2048, 4096, 8192

data.shuffle

  • Type: Boolean
  • Required: No
  • Default: true
  • Description: Whether to shuffle training data
  • Used in: Lines ~235 in run_cpt.py
  • Implementation:
    if shuffle:
        tokenized_train = tokenized_train.shuffle(seed=int(cfg["run"].get("seed", 42)))
    
  • Example Values: true, false

data.num_proc

  • Type: Integer
  • Required: No
  • Default: 4
  • Description: Number of worker processes for dataset map operations (tokenization and packing)
  • Used in: Lines ~200, ~210 in run_cpt.py
  • Implementation:
    num_proc = int(data_cfg.get("num_proc", 4))
    tokenized_train = dsd["train"].map(
        tokenize_fn,
        batched=True,
        num_proc=num_proc,
        ...
    )
    
  • Example Values: 1, 4, 8, 16

data.pack_mode

  • Type: String
  • Required: No
  • Default: "drop"
  • Description: How to handle remainder tokens in final block
  • Used in: Lines ~150-230 in run_cpt.py
  • Implementation:
    pack_mode = str(data_cfg.get("pack_mode", "drop")).lower().strip()
    if pack_mode == "pad":
        # Pad remainder and mask loss
        labels[-pad_len:] = [-100] * pad_len
    # If "drop": ignore remainder entirely
    
  • Example Values:
    • "drop" - Drop incomplete blocks (strict CPT)
    • "pad" - Pad incomplete blocks with masked loss

PEFT Parameters

peft.enabled

  • Type: Boolean
  • Required: No
  • Default: true
  • Description: Whether to use PEFT (Parameter-Efficient Fine-Tuning)
  • Used in: Lines ~395 in run_cpt.py
  • Implementation:
    if not bool(peft_cfg.get("enabled", True)):
        return model, None
    # Otherwise proceed with LoRA configuration
    
  • Example Values: true, false

peft.r

  • Type: Integer
  • Required: No
  • Default: 16
  • Description: LoRA rank - dimension of low-rank matrices
  • Used in: Lines ~415 in run_cpt.py
  • Implementation:
    lora_config = LoraConfig(
        r=int(peft_cfg.get("r", 16)),
        ...
    )
    
  • Example Values: 8, 16, 32, 64, 128
  • Note: Higher values = more parameters but potentially better performance

peft.lora_alpha

  • Type: Integer
  • Required: No
  • Default: 32
  • Description: LoRA alpha scaling parameter
  • Used in: Lines ~416 in run_cpt.py
  • Implementation:
    lora_config = LoraConfig(
        lora_alpha=int(peft_cfg.get("lora_alpha", 32)),
        ...
    )
    
  • Example Values: 16, 32, 64, 128, 256
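
`lora_alpha` only matters relative to `r`: in standard LoRA the low-rank update is scaled by `alpha / r` before being added to the frozen weights, so raising the rank without raising alpha shrinks the effective update. Quick arithmetic:

```python
def lora_scaling(lora_alpha: int, r: int) -> float:
    """Scaling factor applied to the LoRA update BA."""
    return lora_alpha / r

# Different ranks, same effective scaling of 2.0:
s1 = lora_scaling(32, 16)
s2 = lora_scaling(128, 64)
```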

peft.lora_dropout

  • Type: Float
  • Required: No
  • Default: 0.05
  • Description: Dropout rate for LoRA layers
  • Used in: Lines ~417 in run_cpt.py
  • Implementation:
    lora_config = LoraConfig(
        lora_dropout=float(peft_cfg.get("lora_dropout", 0.05)),
        ...
    )
    
  • Example Values: 0.0, 0.05, 0.1, 0.2

peft.bias

  • Type: String
  • Required: No
  • Default: "none"
  • Description: Bias training strategy
  • Used in: Lines ~418 in run_cpt.py
  • Implementation:
    lora_config = LoraConfig(
        bias=str(peft_cfg.get("bias", "none")),
        ...
    )
    
  • Example Values:
    • "none" - No bias training
    • "all" - Train all biases
    • "lora_only" - Only LoRA bias

peft.target_modules

  • Type: String or List
  • Required: No
  • Default: "auto"
  • Description: Which modules to apply LoRA to
  • Used in: Lines ~405, ~140-170 in run_cpt.py
  • Implementation:
    target_modules = peft_cfg.get("target_modules", "auto")
    if target_modules == "auto":
        target_modules = _infer_target_modules(model)
    
  • Example Values:
    • "auto" - Automatic detection
    • ["q_proj", "k_proj", "v_proj", "o_proj"] - Explicit list
    • ["mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"] - MLP only
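
`_infer_target_modules` is internal to run_cpt.py; a common approach, sketched here over mock module names, is to collect the unique leaf names of the model's Linear layers (in a real model the names would come from `model.named_modules()` filtered to `torch.nn.Linear`):

```python
def infer_target_modules(linear_module_names):
    """Guess LoRA targets by collecting unique leaf names of Linear layers."""
    leaves = {name.rsplit(".", 1)[-1] for name in linear_module_names}
    leaves.discard("lm_head")  # the output head is usually excluded
    return sorted(leaves)

# Mock fully-qualified names standing in for a scan of the model:
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.1.self_attn.q_proj",
    "lm_head",
]
targets = infer_target_modules(names)  # -> ["q_proj", "v_proj"]
```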

Training Parameters

train.num_train_epochs

  • Type: Float
  • Required: No
  • Default: 1
  • Description: Number of epochs to train
  • Used in: Lines ~470 in run_cpt.py
  • Implementation:
    num_train_epochs = float(tr_cfg.get("num_train_epochs", 1))
    # Used in TrainingArguments
    
  • Example Values: 1.0, 2.0, 3.5

train.per_device_train_batch_size

  • Type: Integer
  • Required: No
  • Default: 1
  • Description: Training batch size per device
  • Used in: Lines ~475 in run_cpt.py
  • Implementation:
    per_device_train_batch_size=int(tr_cfg.get("per_device_train_batch_size", 1))
    
  • Example Values: 1, 2, 4, 8

train.per_device_eval_batch_size

  • Type: Integer
  • Required: No
  • Default: Same as train batch size
  • Description: Evaluation batch size per device
  • Used in: Lines ~476 in run_cpt.py
  • Implementation:
    per_device_eval_batch_size=int(tr_cfg.get("per_device_eval_batch_size", tr_cfg.get("per_device_train_batch_size", 1)))
    
  • Example Values: 1, 2, 4, 8

train.gradient_accumulation_steps

  • Type: Integer
  • Required: No
  • Default: 1
  • Description: Number of steps to accumulate gradients
  • Used in: Lines ~477 in run_cpt.py
  • Implementation:
    gradient_accumulation_steps=int(tr_cfg.get("gradient_accumulation_steps", 1))
    
  • Example Values: 1, 4, 8, 16, 32

train.learning_rate

  • Type: Float
  • Required: No
  • Default: 2e-5
  • Description: Learning rate for optimizer
  • Used in: Lines ~478 in run_cpt.py
  • Implementation:
    learning_rate=float(tr_cfg.get("learning_rate", 2e-5))
    
  • Example Values: 1e-5, 2e-5, 5e-5, 1e-4

train.weight_decay

  • Type: Float
  • Required: No
  • Default: 0.0
  • Description: Weight decay for regularization
  • Used in: Lines ~479 in run_cpt.py
  • Implementation:
    weight_decay=float(tr_cfg.get("weight_decay", 0.0))
    
  • Example Values: 0.0, 0.01, 0.1

train.warmup_ratio

  • Type: Float
  • Required: No
  • Default: 0.0
  • Description: Ratio of steps for learning rate warmup
  • Used in: Lines ~480 in run_cpt.py
  • Implementation:
    warmup_ratio=float(tr_cfg.get("warmup_ratio", 0.0))
    
  • Example Values: 0.0, 0.1, 0.2

train.lr_scheduler_type

  • Type: String
  • Required: No
  • Default: "cosine"
  • Description: Learning rate scheduler type
  • Used in: Lines ~481 in run_cpt.py
  • Implementation:
    lr_scheduler_type=str(tr_cfg.get("lr_scheduler_type", "cosine"))
    
  • Example Values:
    • "cosine" - Cosine annealing
    • "linear" - Linear decay
    • "constant" - Constant rate
    • "polynomial" - Polynomial decay

train.optim

  • Type: String
  • Required: No
  • Default: "paged_adamw_8bit" (if 4-bit), "adamw_torch" (otherwise)
  • Description: Optimizer type
  • Used in: Lines ~482 in run_cpt.py
  • Implementation:
    optim=str(tr_cfg.get("optim", "paged_adamw_8bit" if bool(model_cfg.get("use_4bit", False)) else "adamw_torch"))
    
  • Example Values:
    • "adamw_torch" - AdamW (standard)
    • "paged_adamw_8bit" - Paged AdamW for 8-bit training
    • "sgd" - SGD
    • "adafactor" - Adafactor

train.max_grad_norm

  • Type: Float
  • Required: No
  • Default: 1.0
  • Description: Maximum gradient norm for clipping
  • Used in: Lines ~483 in run_cpt.py
  • Implementation:
    max_grad_norm=float(tr_cfg.get("max_grad_norm", 1.0))
    
  • Example Values: 0.5, 1.0, 2.0

train.gradient_checkpointing

  • Type: Boolean
  • Required: No
  • Default: true
  • Description: Use gradient checkpointing to save memory
  • Used in: Lines ~396-400 in run_cpt.py
  • Implementation:
    gradient_checkpointing = bool(tr_cfg.get("gradient_checkpointing", True))
    if gradient_checkpointing:
        model.gradient_checkpointing_enable()
    
  • Example Values: true, false

train.logging_steps

  • Type: Integer
  • Required: No
  • Default: 10
  • Description: Log training progress every N steps
  • Used in: Lines ~485 in run_cpt.py
  • Implementation:
    logging_steps=int(tr_cfg.get("logging_steps", 10))
    
  • Example Values: 1, 10, 50, 100

train.save_strategy

  • Type: String
  • Required: No
  • Default: "steps"
  • Description: When to save model checkpoints
  • Used in: Lines ~487 in run_cpt.py
  • Implementation:
    save_strategy=str(tr_cfg.get("save_strategy", "steps"))
    
  • Example Values:
    • "steps" - Save every save_steps steps
    • "epoch" - Save at the end of each epoch
    • "no" - Don't save

train.save_steps

  • Type: Integer
  • Required: No
  • Default: 200
  • Description: Save checkpoint every N steps
  • Used in: Lines ~488 in run_cpt.py
  • Implementation:
    save_steps=int(tr_cfg.get("save_steps", 200))
    
  • Example Values: 50, 100, 200, 500

train.save_total_limit

  • Type: Integer
  • Required: No
  • Default: 3
  • Description: Maximum number of checkpoints to keep
  • Used in: Lines ~489 in run_cpt.py
  • Implementation:
    save_total_limit=int(tr_cfg.get("save_total_limit", 3))
    
  • Example Values: 1, 2, 3, 5

train.evaluation_strategy

  • Type: String
  • Required: No
  • Default: "steps" (if eval data), "no" (otherwise)
  • Description: When to evaluate model
  • Used in: Lines ~494 in run_cpt.py
  • Implementation:
    evaluation_strategy=str(tr_cfg.get("evaluation_strategy", "steps" if eval_ds is not None else "no"))
    
  • Example Values:
    • "steps" - Evaluate every eval_steps steps
    • "epoch" - Evaluate at the end of each epoch
    • "no" - Don't evaluate

train.eval_steps

  • Type: Integer
  • Required: No
  • Default: 200
  • Description: Evaluate every N steps
  • Used in: Lines ~491 in run_cpt.py
  • Implementation:
    eval_steps=int(tr_cfg.get("eval_steps", 200))
    
  • Example Values: 25, 50, 100, 200

train.load_best_model_at_end

  • Type: Boolean
  • Required: No
  • Default: true (if eval data), false (otherwise)
  • Description: Load best model at end of training
  • Used in: Lines ~492-493 in run_cpt.py
  • Implementation:
    load_best_model_at_end=bool(tr_cfg.get("load_best_model_at_end", True)) if eval_ds is not None else False
    
  • Example Values: true, false

train.resume_from_checkpoint

  • Type: String or null
  • Required: No
  • Default: null
  • Description: Resume training from checkpoint
  • Used in: Lines ~510-520 in run_cpt.py
  • Implementation:
    resume_from = tr_cfg.get("resume_from_checkpoint", None)
    if resume_from == "auto":
        last = get_last_checkpoint(str(ckpt_dir))
        resume_from = last if last else None
    
  • Example Values:
    • "auto" - Auto-detect latest checkpoint
    • "checkpoint-100" - Specific checkpoint
    • null - Start from scratch
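
The "auto" branch relies on `transformers.trainer_utils.get_last_checkpoint`; a hedged stdlib approximation that scans for `checkpoint-N` directories (the real function also validates directory contents):

```python
import re
import tempfile
from pathlib import Path

def last_checkpoint(ckpt_dir):
    """Return the checkpoint-N subdirectory with the highest step, or None."""
    best, best_step = None, -1
    for p in Path(ckpt_dir).iterdir():
        m = re.fullmatch(r"checkpoint-(\d+)", p.name)
        if p.is_dir() and m and int(m.group(1)) > best_step:
            best, best_step = p, int(m.group(1))
    return best

with tempfile.TemporaryDirectory() as d:
    for step in (100, 200, 50):
        (Path(d) / f"checkpoint-{step}").mkdir()
    latest_name = last_checkpoint(d).name  # -> "checkpoint-200"
```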

Merge Parameters

merge.enabled

  • Type: Boolean
  • Required: No
  • Default: false
  • Description: Whether to merge LoRA adapters with base model
  • Used in: Lines ~545 in run_cpt.py
  • Implementation:
    if bool(cfg.get("merge", {}).get("enabled", False)):
        merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
    
  • Example Values: true, false

merge.merged_dtype

  • Type: String
  • Required: No
  • Default: "float16"
  • Description: Data type for merged model
  • Used in: Lines ~430 in run_cpt.py
  • Implementation:
    merged_dtype = _dtype_from_str(merge_cfg.get("merged_dtype", "float16"))
    
  • Example Values: "float16", "bfloat16", "float32"

merge.max_shard_size

  • Type: String
  • Required: No
  • Default: "2GB"
  • Description: Maximum size per shard when saving
  • Used in: Lines ~445 in run_cpt.py
  • Implementation:
    merged.save_pretrained(str(final_dir), safe_serialization=True, max_shard_size=max_shard_size)
    
  • Example Values: "1GB", "2GB", "5GB"

merge.output_dir

  • Type: String (path)
  • Required: No
  • Default: null (falls back to run_dir/final_model)
  • Description: Directory for merged model output
  • Used in: Lines ~505-510 in run_cpt.py
  • Implementation:
    if merge_cfg.get("output_dir"):
        od = Path(str(merge_cfg["output_dir"]))
        final_dir = od if od.is_absolute() else (run_dir / od)
    else:
        final_dir = run_dir / "final_model"
    
  • Example Values: "./merged_model", "/workspace/final_model", "./models/merged"
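
The path resolution above, as a standalone `pathlib` sketch (`resolve_merge_dir` is an illustrative name):

```python
from pathlib import Path

def resolve_merge_dir(run_dir: Path, output_dir):
    """Absolute paths win; relative paths are anchored under run_dir;
    an unset value falls back to run_dir / 'final_model'."""
    if output_dir:
        od = Path(str(output_dir))
        return od if od.is_absolute() else run_dir / od
    return run_dir / "final_model"

run_dir = Path("runs/cpt_run_v1")
a = resolve_merge_dir(run_dir, "merged_model")            # anchored under run_dir
b = resolve_merge_dir(run_dir, "/workspace/final_model")  # absolute path kept
c = resolve_merge_dir(run_dir, None)                      # fallback
```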

Parameter Dependencies and Interactions

Memory-Related Dependencies

  • per_device_train_batch_size × gradient_accumulation_steps (× number of devices) = effective batch size
  • block_size affects memory usage significantly
  • use_4bit + bnb_4bit_* parameters work together for quantization
  • gradient_checkpointing can enable larger block_size or batch_size
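
The first item above is worth making concrete; a small sketch of the arithmetic (`num_devices` is an assumption for multi-GPU setups, not a config key):

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Examples contributing to each optimizer update."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices

# batch size 1 with 16 accumulation steps on one GPU:
ebs = effective_batch_size(1, 16)  # -> 16
```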

Training Strategy Dependencies

  • evaluation_strategy requires either eval_jsonl or eval_split_ratio > 0
  • load_best_model_at_end requires evaluation_strategy to be enabled
  • save_strategy should be compatible with evaluation_strategy
  • lr_scheduler_type affects warmup calculations

Model-Specific Dependencies

  • target_modules must match the actual module names in your model
  • torch_dtype should be compatible with your GPU hardware
  • device_map affects whether you can use certain optimizations

Data Processing Dependencies

  • text_field must exist in your JSONL data
  • pack_mode: "pad" requires block_size to be set appropriately
  • eval_split_ratio is ignored if eval_jsonl is provided

This comprehensive documentation should help you understand and configure all parameters in the CPT training system according to your specific needs and constraints.