CPT Configuration Parameters: Detailed Guide
This document provides a comprehensive explanation of all configuration parameters in config.yaml and how they're implemented in run_cpt.py.
Table of Contents
- Run Parameters
- Model Parameters
- Data Parameters
- PEFT Parameters
- Training Parameters
- Merge Parameters
- Parameter Dependencies and Interactions
Run Parameters
run.run_dir
- Type: String (path)
- Required: Yes
- Default: No default
- Description: Directory where training outputs will be saved
- Used in: Line ~480 in run_cpt.py
- Implementation:
  run_dir = _ensure_dir(Path(cfg["run"]["run_dir"]))
- Example Values: ./runs/cpt_run_v1, /workspace/outputs/my_experiment, ./checkpoints/cpt_experiment
run.seed
- Type: Integer
- Required: No
- Default: 42 (fallback in run_cpt.py when the key is omitted)
- Description: Random seed for reproducibility
- Used in: Lines ~460, ~240 in run_cpt.py
- Implementation:
  set_seed(int(cfg["run"].get("seed", 42)))  # Also used in data shuffling and the train/test split
- Example Values: 42, 123, 2023
Model Parameters
model.repo_id
- Type: String (path or HuggingFace repo)
- Required: Yes
- Default: No default
- Description: Model identifier - can be local path or HuggingFace repository
- Used in: Lines ~480-500 in run_cpt.py
- Implementation:
  repo_id = str(model_cfg["repo_id"]).strip()
  repo_path = Path(repo_id)
  if repo_path.exists() and repo_path.is_dir():
      base_dir = repo_path  # Local path
  else:
      # Download from HuggingFace
      snapshot_download(repo_id=repo_id, ...)
- Example Values:
  - Local: /workspace/Models/Devstral-Small-2-24B-Instruct-2512
  - HF Repo: meta-llama/Llama-2-7b-hf
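The local-versus-Hub decision described above can be tried in isolation. The sketch below is a minimal illustration that uses only the standard library; `classify_repo_id` is a hypothetical helper name, not a function from run_cpt.py, and it only reports which branch would be taken rather than downloading anything.

```python
from pathlib import Path


def classify_repo_id(repo_id: str) -> str:
    """Return "local" if repo_id is an existing directory, otherwise "hub".

    Hypothetical helper: mirrors the exists()/is_dir() check shown in the
    implementation excerpt, without calling snapshot_download.
    """
    repo_path = Path(repo_id.strip())
    if repo_path.exists() and repo_path.is_dir():
        return "local"
    return "hub"


if __name__ == "__main__":
    print(classify_repo_id("meta-llama/Llama-2-7b-hf"))
```

One consequence of this check is that a mistyped local path does not fail immediately: it is treated as a repository name and passed to the download branch.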
model.revision
- Type: String or null
- Required: No
- Default: null
- Description: Specific model revision/branch/tag from HuggingFace
- Used in: Line ~495 in run_cpt.py
- Implementation:
  snapshot_download(
      repo_id=repo_id,
      revision=model_cfg.get("revision", None),
      ...
  )
- Example Values:
  - "main" - Main branch
  - "v1.0" - Specific tag
  - "abc123def" - Specific commit hash
  - null - Latest commit on the repository's default branch
model.base_local_dir
- Type: String (path)
- Required: No
- Default: "base_model"
- Description: Directory name (created under run_dir) for the downloaded model when using an HF repo
- Used in: Line ~495 in run_cpt.py
- Implementation:
  base_dir = _ensure_dir(run_dir / model_cfg.get("base_local_dir", "base_model"))
- Example Values: "base_model", "downloaded_model", "model_files"
model.trust_remote_code
- Type: Boolean
- Required: No
- Default: true
- Description: Allow loading models that ship custom code. Enabling this executes code from the model repository, so use it only with sources you trust.
- Used in: Lines ~320, ~340, ~450 in run_cpt.py
- Implementation:
  tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
  model = AutoModelForCausalLM.from_pretrained(..., trust_remote_code=trust_remote_code, ...)
- Example Values: true, false
model.tokenizer_use_fast
- Type: Boolean
- Required: No
- Default: true
- Description: Use fast tokenizer implementation
- Used in: Lines ~320, ~450 in run_cpt.py
- Implementation:
  tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
- Example Values: true, false
model.device_map
- Type: String or mapping
- Required: No
- Default: "auto"
- Description: How to distribute the model across devices
- Used in: Lines ~350, ~370 in run_cpt.py
- Implementation:
  model = AutoModelForCausalLM.from_pretrained(..., device_map=device_map, ...)
- Example Values:
  - "auto" - Automatic distribution
  - "cpu" - CPU only
  - "cuda:0" - Single GPU
  - {"": 0} - Manual mapping (entire model on device 0)
model.torch_dtype
- Type: String
- Required: No
- Default: "bfloat16"
- Description: Data type for model tensors
- Used in: Lines ~45, ~350 in run_cpt.py
- Implementation:
  def _dtype_from_str(s: str) -> torch.dtype:
      if s in ("float16", "fp16"):
          return torch.float16
      if s in ("bfloat16", "bf16"):
          return torch.bfloat16
      if s in ("float32", "fp32"):
          return torch.float32
- Example Values:
  - "float16" (or "fp16") - 16-bit floats (faster, less memory, less stable)
  - "bfloat16" (or "bf16") - Brain float16 (stable, good for training)
  - "float32" (or "fp32") - 32-bit floats (slowest, most memory)
model.use_4bit
- Type: Boolean
- Required: No
- Default: false
- Description: Use 4-bit quantization for memory efficiency
- Used in: Lines ~325, ~395 in run_cpt.py
- Implementation:
  use_4bit = bool(model_cfg.get("use_4bit", False))
  if use_4bit:
      quant_cfg = BitsAndBytesConfig(load_in_4bit=True, ...)
- Example Values: true, false
model.bnb_4bit_quant_type
- Type: String
- Required: No
- Default: "nf4"
- Description: 4-bit quantization type
- Used in: Line ~328 in run_cpt.py
- Implementation:
  bnb_4bit_quant_type=str(model_cfg.get("bnb_4bit_quant_type", "nf4"))
- Example Values:
  - "nf4" - NormalFloat4 (recommended)
  - "fp4" - FloatingPoint4
model.bnb_4bit_use_double_quant
- Type: Boolean
- Required: No
- Default: true (fallback in run_cpt.py when the key is omitted)
- Description: Also quantize the quantization constants (double quantization) for additional memory savings
- Used in: Line ~329 in run_cpt.py
- Implementation:
  bnb_4bit_use_double_quant=bool(model_cfg.get("bnb_4bit_use_double_quant", True))
- Example Values: true, false
model.bnb_4bit_compute_dtype
- Type: String
- Required: No
- Default: "bfloat16"
- Description: Compute dtype for 4-bit quantization
- Used in: Line ~330 in run_cpt.py
- Implementation:
  bnb_4bit_compute_dtype=_dtype_from_str(model_cfg.get("bnb_4bit_compute_dtype", "bfloat16"))
- Example Values: "float16", "bfloat16", "float32"
model.attn_implementation
- Type: String or null
- Required: No
- Default: null
- Description: Attention implementation to use
- Used in: Lines ~155, ~350 in run_cpt.py
- Implementation:
  def _choose_attn_impl(cfg: Dict[str, Any]) -> Optional[str]:
      return cfg.get("model", {}).get("attn_implementation", None)

  # Used in model.from_pretrained(..., attn_implementation=attn_impl, ...)
- Example Values:
  - "flash_attention_2" - Flash Attention 2 (fastest)
  - "sdpa" - Scaled Dot-Product Attention
  - null - Default implementation
Data Parameters
data.train_jsonl
- Type: String (path)
- Required: Yes
- Default: No default
- Description: Path to training data in JSONL format
- Used in: Line ~170 in run_cpt.py
- Implementation:
  train_path = data_cfg["train_jsonl"]
  ds = load_dataset("json", data_files={"train": train_path})
- Example Values: "/workspace/all_data_with_descriptions.jsonl"
data.eval_jsonl
- Type: String (path) or null
- Required: No
- Default: null
- Description: Path to evaluation data in JSONL format
- Used in: Line ~175 in run_cpt.py
- Implementation:
  eval_path = data_cfg.get("eval_jsonl", None)
  if eval_path:
      ds_eval = load_dataset("json", data_files={"eval": eval_path})
- Example Values: null (no separate eval file), "/workspace/eval_data.jsonl"
data.eval_split_ratio
- Type: Float
- Required: No
- Default: 0.0 (fallback in run_cpt.py when the key is omitted; no split is made)
- Description: Fraction of the training data held out for evaluation. Applied only when the value is strictly between 0 and 1.
- Used in: Line ~177 in run_cpt.py
- Implementation:
  split_ratio = float(data_cfg.get("eval_split_ratio", 0.0))
  if 0.0 < split_ratio < 1.0:
      split = ds["train"].train_test_split(test_size=split_ratio, seed=seed)
- Example Values: 0.1 (10%), 0.2 (20%), 0.05 (5%)
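To make the interaction between data.eval_jsonl and data.eval_split_ratio concrete, the sketch below reports where evaluation data would come from for a given data section. It is an illustration only: `eval_source` is a hypothetical name, and the precedence of eval_jsonl over the split ratio follows the rule stated in this guide's dependency notes rather than a verified reading of run_cpt.py.

```python
from typing import Any, Dict


def eval_source(data_cfg: Dict[str, Any]) -> str:
    """Return "file", "split", or "none" for a data config section.

    Hypothetical helper. Assumes a separate eval file takes precedence
    and that the ratio only applies when strictly between 0 and 1.
    """
    if data_cfg.get("eval_jsonl"):
        return "file"
    ratio = float(data_cfg.get("eval_split_ratio", 0.0))
    if 0.0 < ratio < 1.0:
        return "split"
    return "none"


if __name__ == "__main__":
    print(eval_source({"eval_split_ratio": 0.1}))
```

Note that a ratio of exactly 0.0 or 1.0 falls outside the accepted range, so no evaluation set is produced in those cases.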
data.text_field
- Type: String
- Required: No
- Default: "text"
- Description: Field name in JSONL containing the text data
- Used in: Line ~185 in run_cpt.py
- Implementation:
  text_field = data_cfg.get("text_field", "text")
  # Used in tokenization
  tokenized = dsd["train"].map(
      tokenize_fn,
      batched=True,
      remove_columns=dsd["train"].column_names,
      desc="Tokenizing train",
  )
- Example Values: "text", "content", "prompt", "input"
data.block_size
- Type: Integer
- Required: No
- Default: 2048 (fallback in run_cpt.py when the key is omitted)
- Description: Length, in tokens, of each training block (the maximum sequence length seen during training)
- Used in: Line ~180 in run_cpt.py
- Implementation:
  block_size = int(data_cfg.get("block_size", 2048))
  # Used in grouping texts into blocks
  for i in range(0, full_len, block_size):
      chunk = concatenated["input_ids"][i:i + block_size]
- Example Values: 2048, 4096, 8192
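The grouping loop above, together with the remainder handling controlled by data.pack_mode, can be sketched on a plain list of token ids. This is a simplified illustration, not the code in run_cpt.py: `pack_blocks` and its `pad_id` argument are hypothetical, and the real script operates on tokenized dataset batches.

```python
from typing import Dict, List


def pack_blocks(
    token_ids: List[int],
    block_size: int,
    pack_mode: str = "drop",
    pad_id: int = 0,  # assumed pad token id, for illustration only
) -> List[Dict[str, List[int]]]:
    """Split a token stream into fixed-size blocks.

    "drop" discards a final incomplete block; "pad" fills it with pad_id
    and sets the padded label positions to -100 so they are excluded
    from the loss.
    """
    blocks: List[Dict[str, List[int]]] = []
    for start in range(0, len(token_ids), block_size):
        chunk = token_ids[start:start + block_size]
        labels = list(chunk)
        if len(chunk) < block_size:
            if pack_mode == "drop":
                break
            pad_len = block_size - len(chunk)
            chunk = chunk + [pad_id] * pad_len
            labels = labels + [-100] * pad_len
        blocks.append({"input_ids": chunk, "labels": labels})
    return blocks


if __name__ == "__main__":
    for block in pack_blocks(list(range(10)), block_size=4, pack_mode="pad"):
        print(block)
```

With ten tokens and a block size of four, "drop" yields two full blocks, while "pad" yields a third block whose last two label positions are masked.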
data.shuffle
- Type: Boolean
- Required: No
- Default: true
- Description: Whether to shuffle training data
- Used in: Line ~235 in run_cpt.py
- Implementation:
  if shuffle:
      tokenized_train = tokenized_train.shuffle(seed=int(cfg["run"].get("seed", 42)))
- Example Values: true, false
data.num_proc
- Type: Integer
- Required: No
- Default: 4
- Description: Number of worker processes used when tokenizing the dataset
- Used in: Lines ~200, ~210 in run_cpt.py
- Implementation:
  num_proc = int(data_cfg.get("num_proc", 4))
  tokenized_train = dsd["train"].map(
      tokenize_fn,
      batched=True,
      num_proc=num_proc,
      ...
  )
- Example Values: 1, 4, 8, 16
data.pack_mode
- Type: String
- Required: No
- Default: "drop" (fallback in run_cpt.py when the key is omitted)
- Description: How to handle remainder tokens in the final block
- Used in: Lines ~150-230 in run_cpt.py
- Implementation:
  pack_mode = str(data_cfg.get("pack_mode", "drop")).lower().strip()
  if pack_mode == "pad":
      # Pad remainder and mask loss
      labels[-pad_len:] = [-100] * pad_len
  # If "drop": ignore remainder entirely
- Example Values:
  - "drop" - Drop incomplete blocks (strict CPT)
  - "pad" - Pad incomplete blocks with masked loss
PEFT Parameters
peft.enabled
- Type: Boolean
- Required: No
- Default: true
- Description: Whether to use PEFT (Parameter-Efficient Fine-Tuning)
- Used in: Line ~395 in run_cpt.py
- Implementation:
  if not bool(peft_cfg.get("enabled", True)):
      return model, None
  # Otherwise proceed with LoRA configuration
- Example Values: true, false
peft.r
- Type: Integer
- Required: No
- Default: 16 (fallback in run_cpt.py when the key is omitted)
- Description: LoRA rank - dimension of low-rank matrices
- Used in: Line ~415 in run_cpt.py
- Implementation:
  lora_config = LoraConfig(
      r=int(peft_cfg.get("r", 16)),
      ...
  )
- Example Values: 8, 16, 32, 64, 128
- Note: Higher values add more trainable parameters and may improve quality
peft.lora_alpha
- Type: Integer
- Required: No
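- Default: 32 (fallback in run_cpt.py when the key is omitted)
- Description: LoRA alpha scaling parameter; in standard LoRA the adapter update is scaled by lora_alpha / r
- Used in: Line ~416 in run_cpt.py
- Implementation:
  lora_config = LoraConfig(
      lora_alpha=int(peft_cfg.get("lora_alpha", 32)),
      ...
  )
- Example Values: 16, 32, 64, 128, 256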
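The relationship between peft.r and peft.lora_alpha is easier to see with numbers. The sketch below assumes standard LoRA, where an adapted linear layer of shape d_out x d_in gains two matrices (r x d_in and d_out x r) and the update is scaled by lora_alpha / r. Both function names are hypothetical, and variants such as rank-stabilized LoRA use a different scaling.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


def lora_params_per_layer(r: int, d_in: int, d_out: int) -> int:
    """Trainable parameters added to one adapted linear layer.

    Matrix A has r * d_in entries and matrix B has d_out * r entries.
    """
    return r * (d_in + d_out)


if __name__ == "__main__":
    # Example: a 4096 x 4096 projection with r=16 and lora_alpha=32.
    print(lora_scaling(16, 32))
    print(lora_params_per_layer(16, 4096, 4096))
```

Keeping lora_alpha at twice the rank, as in the 16/32 pairing, keeps the scaling factor at 2.0 when the rank is changed.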
peft.lora_dropout
- Type: Float
- Required: No
- Default: 0.05
- Description: Dropout rate for LoRA layers
- Used in: Line ~417 in run_cpt.py
- Implementation:
  lora_config = LoraConfig(
      lora_dropout=float(peft_cfg.get("lora_dropout", 0.05)),
      ...
  )
- Example Values: 0.0, 0.05, 0.1, 0.2
peft.bias
- Type: String
- Required: No
- Default: "none"
- Description: Bias training strategy
- Used in: Line ~418 in run_cpt.py
- Implementation:
  lora_config = LoraConfig(
      bias=str(peft_cfg.get("bias", "none")),
      ...
  )
- Example Values:
  - "none" - No bias training
  - "all" - Train all biases
  - "lora_only" - Train only the biases of LoRA-adapted layers
peft.target_modules
- Type: String or List
- Required: No
- Default: "auto"
- Description: Which modules to apply LoRA to
- Used in: Lines ~405, ~140-170 in run_cpt.py
- Implementation:
  target_modules = peft_cfg.get("target_modules", "auto")
  if target_modules == "auto":
      target_modules = _infer_target_modules(model)
- Example Values:
  - "auto" - Automatic detection
  - ["q_proj", "k_proj", "v_proj", "o_proj"] - Explicit list
  - ["mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"] - MLP only
Training Parameters
train.num_train_epochs
- Type: Float
- Required: No
- Default: 1 (fallback in run_cpt.py when the key is omitted)
- Description: Number of epochs to train; fractional values are allowed
- Used in: Line ~470 in run_cpt.py
- Implementation:
  num_train_epochs = float(tr_cfg.get("num_train_epochs", 1))
  # Used in TrainingArguments
- Example Values: 1.0, 2.0, 3.5
train.per_device_train_batch_size
- Type: Integer
- Required: No
- Default: 1
- Description: Training batch size per device
- Used in: Line ~475 in run_cpt.py
- Implementation:
  per_device_train_batch_size=int(tr_cfg.get("per_device_train_batch_size", 1))
- Example Values: 1, 2, 4, 8
train.per_device_eval_batch_size
- Type: Integer
- Required: No
- Default: Same as train batch size
- Description: Evaluation batch size per device
- Used in: Line ~476 in run_cpt.py
- Implementation:
  per_device_eval_batch_size=int(tr_cfg.get("per_device_eval_batch_size", tr_cfg.get("per_device_train_batch_size", 1)))
- Example Values: 1, 2, 4, 8
train.gradient_accumulation_steps
- Type: Integer
- Required: No
- Default: 1 (fallback in run_cpt.py when the key is omitted)
- Description: Number of micro-batches whose gradients are accumulated before each optimizer step
- Used in: Line ~477 in run_cpt.py
- Implementation:
  gradient_accumulation_steps=int(tr_cfg.get("gradient_accumulation_steps", 1))
- Example Values: 1, 4, 8, 16, 32
train.learning_rate
- Type: Float
- Required: No
- Default: 2e-5
- Description: Learning rate for optimizer
- Used in: Line ~478 in run_cpt.py
- Implementation:
  learning_rate=float(tr_cfg.get("learning_rate", 2e-5))
- Example Values: 1e-5, 2e-5, 5e-5, 1e-4
train.weight_decay
- Type: Float
- Required: No
- Default: 0.0
- Description: Weight decay for regularization
- Used in: Line ~479 in run_cpt.py
- Implementation:
  weight_decay=float(tr_cfg.get("weight_decay", 0.0))
- Example Values: 0.0, 0.01, 0.1
train.warmup_ratio
- Type: Float
- Required: No
- Default: 0.0 (fallback in run_cpt.py when the key is omitted; no warmup)
- Description: Fraction of total training steps used for learning rate warmup
- Used in: Line ~480 in run_cpt.py
- Implementation:
  warmup_ratio=float(tr_cfg.get("warmup_ratio", 0.0))
- Example Values: 0.0, 0.1, 0.2
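To illustrate how train.warmup_ratio combines with a "cosine" train.lr_scheduler_type, the sketch below computes the learning rate at a given optimizer step. It is a standalone approximation of the usual linear-warmup plus half-cosine-decay shape, written from scratch; the function names are hypothetical, and the scheduler used by the actual training run may differ in details such as rounding.

```python
import math


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps, rounded up (assumed rounding rule)."""
    return math.ceil(total_steps * warmup_ratio)


def cosine_lr(step: int, total_steps: int, base_lr: float, warmup_ratio: float = 0.0) -> float:
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    n_warmup = warmup_steps(total_steps, warmup_ratio)
    if step < n_warmup:
        return base_lr * step / max(1, n_warmup)
    progress = (step - n_warmup) / max(1, total_steps - n_warmup)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


if __name__ == "__main__":
    for s in (0, 25, 50, 125, 200):
        print(s, cosine_lr(s, total_steps=200, base_lr=2e-5, warmup_ratio=0.25))
```

The peak learning rate is reached exactly when warmup ends, so a larger warmup_ratio shortens the decay phase for a fixed number of total steps.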
train.lr_scheduler_type
- Type: String
- Required: No
- Default: "cosine"
- Description: Learning rate scheduler type
- Used in: Line ~481 in run_cpt.py
- Implementation:
  lr_scheduler_type=str(tr_cfg.get("lr_scheduler_type", "cosine"))
- Example Values:
  - "cosine" - Cosine annealing
  - "linear" - Linear decay
  - "constant" - Constant rate
  - "polynomial" - Polynomial decay
train.optim
- Type: String
- Required: No
- Default: "paged_adamw_8bit" (if use_4bit is true), "adamw_torch" (otherwise)
- Description: Optimizer type
- Used in: Line ~482 in run_cpt.py
- Implementation:
  optim=str(tr_cfg.get("optim", "paged_adamw_8bit" if bool(model_cfg.get("use_4bit", False)) else "adamw_torch"))
- Example Values:
  - "adamw_torch" - AdamW (standard)
  - "paged_adamw_8bit" - Paged AdamW with 8-bit optimizer states
  - "sgd" - SGD
  - "adafactor" - Adafactor
train.max_grad_norm
- Type: Float
- Required: No
- Default: 1.0
- Description: Maximum gradient norm for clipping
- Used in: Line ~483 in run_cpt.py
- Implementation:
  max_grad_norm=float(tr_cfg.get("max_grad_norm", 1.0))
- Example Values: 0.5, 1.0, 2.0
train.gradient_checkpointing
- Type: Boolean
- Required: No
- Default: true
- Description: Use gradient checkpointing to save memory, at the cost of recomputing activations during the backward pass
- Used in: Lines ~396-400 in run_cpt.py
- Implementation:
  gradient_checkpointing = bool(tr_cfg.get("gradient_checkpointing", True))
  if gradient_checkpointing:
      model.gradient_checkpointing_enable()
- Example Values: true, false
train.logging_steps
- Type: Integer
- Required: No
- Default: 10 (fallback in run_cpt.py when the key is omitted)
- Description: Log training progress every N steps
- Used in: Line ~485 in run_cpt.py
- Implementation:
  logging_steps=int(tr_cfg.get("logging_steps", 10))
- Example Values: 1, 10, 50, 100
train.save_strategy
- Type: String
- Required: No
- Default: "steps"
- Description: When to save model checkpoints
- Used in: Line ~487 in run_cpt.py
- Implementation:
  save_strategy=str(tr_cfg.get("save_strategy", "steps"))
- Example Values:
  - "steps" - Save every N steps
  - "epoch" - Save every epoch
  - "no" - Don't save
train.save_steps
- Type: Integer
- Required: No
- Default: 200 (fallback in run_cpt.py when the key is omitted)
- Description: Save checkpoint every N steps
- Used in: Line ~488 in run_cpt.py
- Implementation:
  save_steps=int(tr_cfg.get("save_steps", 200))
- Example Values: 50, 100, 200, 500
train.save_total_limit
- Type: Integer
- Required: No
- Default: 3 (fallback in run_cpt.py when the key is omitted)
- Description: Maximum number of checkpoints to keep
- Used in: Line ~489 in run_cpt.py
- Implementation:
  save_total_limit=int(tr_cfg.get("save_total_limit", 3))
- Example Values: 1, 2, 3, 5
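The combined effect of train.save_steps and train.save_total_limit can be estimated before launching a run. The sketch below assumes the simplest rotation policy, in which only the most recent checkpoints are kept; `checkpoints_kept` is a hypothetical helper, and the trainer may additionally retain the best checkpoint when load_best_model_at_end is enabled.

```python
from typing import List


def checkpoints_kept(total_steps: int, save_steps: int, save_total_limit: int) -> List[int]:
    """Steps whose checkpoints remain on disk after training.

    Assumes a checkpoint is written every `save_steps` steps and that
    only the newest `save_total_limit` checkpoints survive rotation.
    """
    if save_steps <= 0 or save_total_limit <= 0:
        raise ValueError("save_steps and save_total_limit must be positive")
    saved = list(range(save_steps, total_steps + 1, save_steps))
    return saved[-save_total_limit:]


if __name__ == "__main__":
    print(checkpoints_kept(total_steps=1000, save_steps=200, save_total_limit=3))
```

A small limit saves disk space but narrows how far back a run can be rolled when a late checkpoint turns out to be worse than an earlier one.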
train.evaluation_strategy
- Type: String
- Required: No
- Default: "steps" (if eval data), "no" (otherwise)
- Description: When to evaluate model
- Used in: Line ~494 in run_cpt.py
- Implementation:
  evaluation_strategy=str(tr_cfg.get("evaluation_strategy", "steps" if eval_ds is not None else "no"))
- Example Values:
  - "steps" - Evaluate every N steps
  - "epoch" - Evaluate every epoch
  - "no" - Don't evaluate
train.eval_steps
- Type: Integer
- Required: No
- Default: 200 (fallback in run_cpt.py when the key is omitted)
- Description: Evaluate every N steps
- Used in: Line ~491 in run_cpt.py
- Implementation:
  eval_steps=int(tr_cfg.get("eval_steps", 200))
- Example Values: 25, 50, 100, 200
train.load_best_model_at_end
- Type: Boolean
- Required: No
- Default: true (if eval data), false (otherwise)
- Description: Load best model at end of training
- Used in: Lines ~492-493 in run_cpt.py
- Implementation:
  load_best_model_at_end=bool(tr_cfg.get("load_best_model_at_end", True)) if eval_ds is not None else False
- Example Values: true, false
train.resume_from_checkpoint
- Type: String or null
- Required: No
- Default: null (fallback in run_cpt.py when the key is omitted; training starts from scratch)
- Description: Resume training from checkpoint
- Used in: Lines ~510-520 in run_cpt.py
- Implementation:
  resume_from = tr_cfg.get("resume_from_checkpoint", None)
  if resume_from == "auto":
      last = get_last_checkpoint(str(ckpt_dir))
      resume_from = last if last else None
- Example Values:
  - "auto" - Auto-detect latest checkpoint
  - "checkpoint-100" - Specific checkpoint
  - null - Start from scratch
Merge Parameters
merge.enabled
- Type: Boolean
- Required: No
- Default: false
- Description: Whether to merge LoRA adapters with base model
- Used in: Line ~545 in run_cpt.py
- Implementation:
  if bool(cfg.get("merge", {}).get("enabled", False)):
      merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
- Example Values: true, false
merge.merged_dtype
- Type: String
- Required: No
- Default: "float16"
- Description: Data type for merged model
- Used in: Line ~430 in run_cpt.py
- Implementation:
  merged_dtype = _dtype_from_str(merge_cfg.get("merged_dtype", "float16"))
- Example Values: "float16", "bfloat16", "float32"
merge.max_shard_size
- Type: String
- Required: No
- Default: "2GB"
- Description: Maximum size per shard when saving
- Used in: Line ~445 in run_cpt.py
- Implementation:
  merged.save_pretrained(str(final_dir), safe_serialization=True, max_shard_size=max_shard_size)
- Example Values: "1GB", "2GB", "5GB"
merge.output_dir
- Type: String (path)
- Required: No
- Default: "final_model" under run_dir (used in run_cpt.py when output_dir is unset)
- Description: Directory for merged model output. Relative paths are resolved against run_dir; absolute paths are used as-is.
- Used in: Lines ~505-510 in run_cpt.py
- Implementation:
  if merge_cfg.get("output_dir"):
      od = Path(str(merge_cfg["output_dir"]))
      final_dir = od if od.is_absolute() else (run_dir / od)
  else:
      final_dir = run_dir / "final_model"
- Example Values: "./merged_model", "/workspace/final_model", "./models/merged"
Parameter Dependencies and Interactions
Memory-Related Dependencies
- per_device_train_batch_size × gradient_accumulation_steps × number of devices = effective batch size
- block_size affects memory usage significantly
- use_4bit and the bnb_4bit_* parameters work together for quantization; the bnb_4bit_* values only take effect when use_4bit is true
- gradient_checkpointing can enable a larger block_size or batch size
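As a quick check of the batch-size arithmetic, the sketch below computes the effective batch size and the number of tokens processed per optimizer step. The function names and the `num_devices` argument are illustrative assumptions; the token count is an upper bound, because padded positions in "pad" mode do not contribute to the loss.

```python
def effective_batch_size(per_device_batch_size: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Sequences contributing to one optimizer step."""
    return per_device_batch_size * grad_accum_steps * num_devices


def tokens_per_optimizer_step(
    per_device_batch_size: int,
    grad_accum_steps: int,
    block_size: int,
    num_devices: int = 1,
) -> int:
    """Upper bound on tokens seen per optimizer step."""
    return effective_batch_size(per_device_batch_size, grad_accum_steps, num_devices) * block_size


if __name__ == "__main__":
    # Batch size 1, 16 accumulation steps, 4096-token blocks, one GPU.
    print(effective_batch_size(1, 16))
    print(tokens_per_optimizer_step(1, 16, 4096))
```

Raising gradient_accumulation_steps increases the effective batch size without increasing peak activation memory, whereas raising per_device_train_batch_size or block_size does increase it.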
Training Strategy Dependencies
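- evaluation_strategy requires either eval_jsonl or eval_split_ratio > 0
- load_best_model_at_end requires evaluation_strategy to be enabled
- save_strategy should match evaluation_strategy when load_best_model_at_end is enabled, so that the best-scoring checkpoint is actually saved
- lr_scheduler_type determines how the learning rate evolves after the warmup phase set by warmup_ratio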
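These dependencies can be checked before a run starts. The sketch below is a hypothetical pre-flight validator, not part of run_cpt.py. It applies the defaults documented in this guide and adds one rule based on my understanding of the Hugging Face Trainer: with load_best_model_at_end and step-based strategies, save_steps should be a whole multiple of eval_steps.

```python
from typing import Any, Dict, List


def check_strategy_config(data_cfg: Dict[str, Any], tr_cfg: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems: List[str] = []
    ratio = float(data_cfg.get("eval_split_ratio", 0.0))
    has_eval = bool(data_cfg.get("eval_jsonl")) or 0.0 < ratio < 1.0

    eval_strategy = str(tr_cfg.get("evaluation_strategy", "steps" if has_eval else "no"))
    save_strategy = str(tr_cfg.get("save_strategy", "steps"))
    load_best = bool(tr_cfg.get("load_best_model_at_end", True)) if has_eval else False

    if eval_strategy != "no" and not has_eval:
        problems.append("evaluation_strategy is enabled but no eval data is configured")
    if load_best:
        if eval_strategy == "no":
            problems.append("load_best_model_at_end needs evaluation to be enabled")
        elif eval_strategy != save_strategy:
            problems.append("save_strategy must match evaluation_strategy")
        elif eval_strategy == "steps":
            save_steps = int(tr_cfg.get("save_steps", 200))
            eval_steps = int(tr_cfg.get("eval_steps", 200))
            if save_steps % eval_steps != 0:
                problems.append("save_steps should be a multiple of eval_steps")
    return problems


if __name__ == "__main__":
    print(check_strategy_config({"eval_split_ratio": 0.1}, {"save_steps": 100, "eval_steps": 30}))
```

Returning a list rather than raising on the first issue lets all configuration problems be reported in one pass.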
Model-Specific Dependencies
- target_modules must match the actual module names in your model
- torch_dtype should be compatible with your GPU hardware (for example, bfloat16 needs hardware support)
- device_map affects whether you can use certain optimizations
Data Processing Dependencies
- text_field must exist in your JSONL data
- pack_mode: "pad" pads the final incomplete block up to block_size, so block_size determines how much padding is added
- eval_split_ratio is ignored if eval_jsonl is provided
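The first requirement is easy to verify up front. The sketch below scans a JSONL file and reports the line numbers of records that lack a string value under the configured text field. `records_missing_text_field` and its `limit` argument are hypothetical names introduced for illustration.

```python
import json
from typing import List


def records_missing_text_field(jsonl_path: str, text_field: str = "text", limit: int = 1000) -> List[int]:
    """Return 1-based line numbers of records without a string `text_field`.

    Only the first `limit` lines are inspected; blank lines are skipped.
    """
    missing: List[int] = []
    with open(jsonl_path, "r", encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if lineno > limit:
                break
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if not isinstance(record, dict) or not isinstance(record.get(text_field), str):
                missing.append(lineno)
    return missing
```

Running this against the file named in data.train_jsonl before tokenization gives a clearer error than a failure inside the tokenization map step.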
This comprehensive documentation should help you understand and configure all parameters in the CPT training system according to your specific needs and constraints.