# CPT Configuration Parameters: Detailed Guide

This document provides a comprehensive explanation of all configuration parameters in `config.yaml` and how they're implemented in `run_cpt.py`.

## Table of Contents

- [Run Parameters](#run-parameters)
- [Model Parameters](#model-parameters)
- [Data Parameters](#data-parameters)
- [PEFT Parameters](#peft-parameters)
- [Training Parameters](#training-parameters)
- [Merge Parameters](#merge-parameters)

---

## Run Parameters

### `run.run_dir`

- **Type**: String (path)
- **Required**: Yes
- **Default**: No default
- **Description**: Directory where training outputs will be saved
- **Used in**: Line ~480 in `run_cpt.py`
- **Implementation**:

```python
run_dir = _ensure_dir(Path(cfg["run"]["run_dir"]))
```

- **Example Values**:
  - `./runs/cpt_run_v1`
  - `/workspace/outputs/my_experiment`
  - `./checkpoints/cpt_experiment`

### `run.seed`

- **Type**: Integer
- **Required**: No
- **Default**: `42`
- **Description**: Random seed for reproducibility
- **Used in**: Lines ~460, ~240 in `run_cpt.py`
- **Implementation**:

```python
set_seed(int(cfg["run"].get("seed", 42)))
# Used in data shuffling and train/test split
```

- **Example Values**: `42`, `123`, `2023`

---

## Model Parameters

### `model.repo_id`

- **Type**: String (path or HuggingFace repo)
- **Required**: Yes
- **Default**: No default
- **Description**: Model identifier - can be a local path or a HuggingFace repository
- **Used in**: Lines ~480-500 in `run_cpt.py`
- **Implementation**:

```python
repo_id = str(model_cfg["repo_id"]).strip()
repo_path = Path(repo_id)
if repo_path.exists() and repo_path.is_dir():
    base_dir = repo_path  # Local path
else:
    # Download from HuggingFace
    snapshot_download(repo_id=repo_id, ...)
```
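For illustration, the branch above can be wrapped as a standalone helper (`resolve_base_dir` is a hypothetical name; `run_cpt.py` inlines this logic):

```python
from pathlib import Path


def resolve_base_dir(repo_id: str, run_dir: Path) -> Path:
    """Return a local directory containing the model, downloading it if needed."""
    repo_path = Path(repo_id)
    if repo_path.exists() and repo_path.is_dir():
        return repo_path  # already a local checkout
    # Not a local path: fetch the snapshot from the Hugging Face Hub
    from huggingface_hub import snapshot_download
    base_dir = run_dir / "base_model"
    base_dir.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(base_dir))
    return base_dir
```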
- **Example Values**:
  - Local: `/workspace/Models/Devstral-Small-2-24B-Instruct-2512`
  - HF Repo: `meta-llama/Llama-2-7b-hf`

### `model.revision`

- **Type**: String or null
- **Required**: No
- **Default**: null
- **Description**: Specific model revision/branch/tag from HuggingFace
- **Used in**: Line ~495 in `run_cpt.py`
- **Implementation**:

```python
snapshot_download(
    repo_id=repo_id,
    revision=model_cfg.get("revision", None),
    ...
)
```

- **Example Values**:
  - `"main"` - Main branch
  - `"v1.0"` - Specific tag
  - `"abc123def"` - Specific commit hash
  - `null` - Latest version

### `model.base_local_dir`

- **Type**: String (path)
- **Required**: No
- **Default**: `"base_model"`
- **Description**: Directory name for the downloaded model when using an HF repo
- **Used in**: Line ~495 in `run_cpt.py`
- **Implementation**:

```python
base_dir = _ensure_dir(run_dir / model_cfg.get("base_local_dir", "base_model"))
```

- **Example Values**: `"base_model"`, `"downloaded_model"`, `"model_files"`

### `model.trust_remote_code`

- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Allow loading models with custom code
- **Used in**: Lines ~320, ~340, ~450 in `run_cpt.py`
- **Implementation**:

```python
tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
model = AutoModelForCausalLM.from_pretrained(..., trust_remote_code=trust_remote_code, ...)
```
- **Example Values**: `true`, `false`

### `model.tokenizer_use_fast`

- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Use fast tokenizer implementation
- **Used in**: Lines ~320, ~450 in `run_cpt.py`
- **Implementation**:

```python
tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
```

- **Example Values**: `true`, `false`

### `model.device_map`

- **Type**: String or dict
- **Required**: No
- **Default**: `"auto"`
- **Description**: How to distribute the model across devices
- **Used in**: Lines ~350, ~370 in `run_cpt.py`
- **Implementation**:

```python
model = AutoModelForCausalLM.from_pretrained(..., device_map=device_map, ...)
```

- **Example Values**:
  - `"auto"` - Automatic distribution
  - `"cpu"` - CPU only
  - `"cuda:0"` - Single GPU
  - `{"": 0}` - Manual mapping

### `model.torch_dtype`

- **Type**: String
- **Required**: No
- **Default**: `"bfloat16"`
- **Description**: Data type for model tensors
- **Used in**: Lines ~45, ~350 in `run_cpt.py`
- **Implementation**:

```python
def _dtype_from_str(s: str) -> torch.dtype:
    if s in ("float16", "fp16"):
        return torch.float16
    if s in ("bfloat16", "bf16"):
        return torch.bfloat16
    if s in ("float32", "fp32"):
        return torch.float32
```

- **Example Values**:
  - `"float16"` - 16-bit floats (faster, less memory, less stable)
  - `"bfloat16"` - Brain float16 (stable, good for training)
  - `"float32"` - 32-bit floats (slowest, most memory)

### `model.use_4bit`

- **Type**: Boolean
- **Required**: No
- **Default**: `false`
- **Description**: Use 4-bit quantization for memory efficiency
- **Used in**: Lines ~325, ~395 in `run_cpt.py`
- **Implementation**:

```python
use_4bit = bool(model_cfg.get("use_4bit", False))
if use_4bit:
    quant_cfg = BitsAndBytesConfig(load_in_4bit=True, ...)
```
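The `use_4bit` flag gates the whole `bnb_4bit_*` group documented below. A dict-level sketch of how those keys combine (the script passes the equivalent values to `transformers.BitsAndBytesConfig`; the helper name here is hypothetical):

```python
from typing import Optional


def quantization_kwargs(model_cfg: dict) -> Optional[dict]:
    """Assemble 4-bit quantization arguments from the model.* section, or None."""
    if not bool(model_cfg.get("use_4bit", False)):
        return None  # quantization disabled; load in torch_dtype instead
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": str(model_cfg.get("bnb_4bit_quant_type", "nf4")),
        "bnb_4bit_use_double_quant": bool(model_cfg.get("bnb_4bit_use_double_quant", True)),
        "bnb_4bit_compute_dtype": model_cfg.get("bnb_4bit_compute_dtype", "bfloat16"),
    }
```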
- **Example Values**: `true`, `false`

### `model.bnb_4bit_quant_type`

- **Type**: String
- **Required**: No
- **Default**: `"nf4"`
- **Description**: 4-bit quantization type
- **Used in**: Line ~328 in `run_cpt.py`
- **Implementation**:

```python
bnb_4bit_quant_type=str(model_cfg.get("bnb_4bit_quant_type", "nf4"))
```

- **Example Values**:
  - `"nf4"` - NormalFloat4 (recommended)
  - `"fp4"` - FloatingPoint4 (these two are the only types bitsandbytes supports)

### `model.bnb_4bit_use_double_quant`

- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Use double quantization for additional memory savings
- **Used in**: Line ~329 in `run_cpt.py`
- **Implementation**:

```python
bnb_4bit_use_double_quant=bool(model_cfg.get("bnb_4bit_use_double_quant", True))
```

- **Example Values**: `true`, `false`

### `model.bnb_4bit_compute_dtype`

- **Type**: String
- **Required**: No
- **Default**: `"bfloat16"`
- **Description**: Compute dtype for 4-bit quantization
- **Used in**: Line ~330 in `run_cpt.py`
- **Implementation**:

```python
bnb_4bit_compute_dtype=_dtype_from_str(model_cfg.get("bnb_4bit_compute_dtype", "bfloat16"))
```

- **Example Values**: `"float16"`, `"bfloat16"`, `"float32"`

### `model.attn_implementation`

- **Type**: String or null
- **Required**: No
- **Default**: `null`
- **Description**: Attention implementation to use
- **Used in**: Lines ~155, ~350 in `run_cpt.py`
- **Implementation**:

```python
def _choose_attn_impl(cfg: Dict[str, Any]) -> Optional[str]:
    return cfg.get("model", {}).get("attn_implementation", None)

# Used in model.from_pretrained(..., attn_implementation=attn_impl, ...)
```
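Requesting `"flash_attention_2"` fails at load time if the `flash-attn` package is not installed. A fallback guard in the following style is a common pattern (an assumption; `run_cpt.py` passes the configured value through unchanged):

```python
def pick_attn_impl(requested):
    """Fall back to PyTorch's SDPA when flash-attn is unavailable."""
    if requested == "flash_attention_2":
        try:
            import flash_attn  # noqa: F401  (presence check only)
        except ImportError:
            return "sdpa"
    return requested
```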
- **Example Values**:
  - `"flash_attention_2"` - Flash Attention 2 (fastest)
  - `"sdpa"` - Scaled Dot-Product Attention
  - `null` - Default implementation

---

## Data Parameters

### `data.train_jsonl`

- **Type**: String (path)
- **Required**: Yes
- **Default**: No default
- **Description**: Path to training data in JSONL format
- **Used in**: Line ~170 in `run_cpt.py`
- **Implementation**:

```python
train_path = data_cfg["train_jsonl"]
ds = load_dataset("json", data_files={"train": train_path})
```

- **Example Values**: `"/workspace/all_data_with_descriptions.jsonl"`

### `data.eval_jsonl`

- **Type**: String (path) or null
- **Required**: No
- **Default**: `null`
- **Description**: Path to evaluation data in JSONL format
- **Used in**: Line ~175 in `run_cpt.py`
- **Implementation**:

```python
eval_path = data_cfg.get("eval_jsonl", None)
if eval_path:
    ds_eval = load_dataset("json", data_files={"eval": eval_path})
```

- **Example Values**: `null` (no separate eval file), `"/workspace/eval_data.jsonl"`

### `data.eval_split_ratio`

- **Type**: Float
- **Required**: No
- **Default**: `0.0` (no split)
- **Description**: Ratio of training data to hold out for evaluation; values outside (0, 1) disable the split
- **Used in**: Line ~177 in `run_cpt.py`
- **Implementation**:

```python
split_ratio = float(data_cfg.get("eval_split_ratio", 0.0))
if 0.0 < split_ratio < 1.0:
    split = ds["train"].train_test_split(test_size=split_ratio, seed=seed)
```

- **Example Values**: `0.1` (10%), `0.2` (20%), `0.05` (5%)

### `data.text_field`

- **Type**: String
- **Required**: No
- **Default**: `"text"`
- **Description**: Field name in JSONL containing the text data
- **Used in**: Line ~185 in `run_cpt.py`
- **Implementation**:

```python
text_field = data_cfg.get("text_field", "text")
# Used in tokenization (tokenize_fn reads example[text_field])
tokenized = dsd["train"].map(
    tokenize_fn,
    batched=True,
    remove_columns=dsd["train"].column_names,
    desc="Tokenizing train",
)
```

- **Example Values**: `"text"`, `"content"`, `"prompt"`, `"input"`

### `data.block_size`

- **Type**: Integer
- **Required**: No
- **Default**: `2048`
- **Description**: Maximum sequence length for training
- **Used in**: Line ~180 in `run_cpt.py`
- **Implementation**:

```python
block_size = int(data_cfg.get("block_size", 2048))
# Used in grouping texts into blocks
for i in range(0, full_len, block_size):
    chunk = concatenated["input_ids"][i:i + block_size]
```

- **Example Values**: `2048`, `4096`, `8192`

### `data.shuffle`

- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Whether to shuffle training data
- **Used in**: Line ~235 in `run_cpt.py`
- **Implementation**:

```python
if shuffle:
    tokenized_train = tokenized_train.shuffle(seed=int(cfg["run"].get("seed", 42)))
```

- **Example Values**: `true`, `false`

### `data.num_proc`

- **Type**: Integer
- **Required**: No
- **Default**: `4`
- **Description**: Number of worker processes for tokenization and mapping (`datasets.map`)
- **Used in**: Lines ~200, ~210 in `run_cpt.py`
- **Implementation**:

```python
num_proc = int(data_cfg.get("num_proc", 4))
tokenized_train = dsd["train"].map(
    tokenize_fn,
    batched=True,
    num_proc=num_proc,
    ...
)
```
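The grouping loop shown under `data.block_size`, combined with `data.pack_mode` below, reduces to the following self-contained sketch (simplified to plain lists; the real code operates on batched dicts inside `datasets.map`):

```python
def group_into_blocks(ids, block_size, pack_mode="drop", pad_id=0):
    """Split a flat token-id list into fixed-size blocks for CPT."""
    blocks, labels = [], []
    for i in range(0, len(ids), block_size):
        chunk = ids[i:i + block_size]
        if len(chunk) < block_size:  # incomplete final block
            if pack_mode == "pad":
                pad_len = block_size - len(chunk)
                labels.append(chunk + [-100] * pad_len)  # mask loss on padding
                blocks.append(chunk + [pad_id] * pad_len)
            break  # "drop": discard the remainder entirely
        blocks.append(chunk)
        labels.append(list(chunk))
    return blocks, labels
```

For example, 10 tokens with `block_size=4` yield two blocks under `"drop"` and three under `"pad"`, with the padded label positions masked to `-100`.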
- **Example Values**: `1`, `4`, `8`, `16`

### `data.pack_mode`

- **Type**: String
- **Required**: No
- **Default**: `"drop"`
- **Description**: How to handle remainder tokens in the final block
- **Used in**: Lines ~150-230 in `run_cpt.py`
- **Implementation**:

```python
pack_mode = str(data_cfg.get("pack_mode", "drop")).lower().strip()
if pack_mode == "pad":
    # Pad remainder and mask loss
    labels[-pad_len:] = [-100] * pad_len
# If "drop": ignore remainder entirely
```

- **Example Values**:
  - `"drop"` - Drop incomplete blocks (strict CPT)
  - `"pad"` - Pad incomplete blocks with masked loss

---

## PEFT Parameters

### `peft.enabled`

- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Whether to use PEFT (Parameter-Efficient Fine-Tuning)
- **Used in**: Line ~395 in `run_cpt.py`
- **Implementation**:

```python
if not bool(peft_cfg.get("enabled", True)):
    return model, None
# Otherwise proceed with LoRA configuration
```

- **Example Values**: `true`, `false`

### `peft.r`

- **Type**: Integer
- **Required**: No
- **Default**: `16`
- **Description**: LoRA rank - dimension of the low-rank matrices
- **Used in**: Line ~415 in `run_cpt.py`
- **Implementation**:

```python
lora_config = LoraConfig(
    r=int(peft_cfg.get("r", 16)),
    ...
)
```

- **Example Values**: `8`, `16`, `32`, `64`, `128`
- **Note**: Higher values = more parameters but potentially better performance

### `peft.lora_alpha`

- **Type**: Integer
- **Required**: No
- **Default**: `32`
- **Description**: LoRA alpha scaling parameter
- **Used in**: Line ~416 in `run_cpt.py`
- **Implementation**:

```python
lora_config = LoraConfig(
    lora_alpha=int(peft_cfg.get("lora_alpha", 32)),
    ...
)
```
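Taken together, the `peft.*` keys assemble a single `LoraConfig`. A dict-level sketch mirroring the `get(...)` fallbacks in the snippets (the `task_type` value is an assumption, and the helper name is hypothetical):

```python
def lora_kwargs(peft_cfg: dict) -> dict:
    """Collect LoRA hyperparameters with the code's fallback defaults."""
    return {
        "r": int(peft_cfg.get("r", 16)),
        "lora_alpha": int(peft_cfg.get("lora_alpha", 32)),
        "lora_dropout": float(peft_cfg.get("lora_dropout", 0.05)),
        "bias": str(peft_cfg.get("bias", "none")),
        "target_modules": peft_cfg.get("target_modules", "auto"),
        "task_type": "CAUSAL_LM",  # assumed; LoRA for causal LM training
    }
```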
- **Example Values**: `16`, `32`, `64`, `128`, `256`

### `peft.lora_dropout`

- **Type**: Float
- **Required**: No
- **Default**: `0.05`
- **Description**: Dropout rate for LoRA layers
- **Used in**: Line ~417 in `run_cpt.py`
- **Implementation**:

```python
lora_config = LoraConfig(
    lora_dropout=float(peft_cfg.get("lora_dropout", 0.05)),
    ...
)
```

- **Example Values**: `0.0`, `0.05`, `0.1`, `0.2`

### `peft.bias`

- **Type**: String
- **Required**: No
- **Default**: `"none"`
- **Description**: Bias training strategy
- **Used in**: Line ~418 in `run_cpt.py`
- **Implementation**:

```python
lora_config = LoraConfig(
    bias=str(peft_cfg.get("bias", "none")),
    ...
)
```

- **Example Values**:
  - `"none"` - No bias training
  - `"all"` - Train all biases
  - `"lora_only"` - Only LoRA bias

### `peft.target_modules`

- **Type**: String or List
- **Required**: No
- **Default**: `"auto"`
- **Description**: Which modules to apply LoRA to
- **Used in**: Lines ~405, ~140-170 in `run_cpt.py`
- **Implementation**:

```python
target_modules = peft_cfg.get("target_modules", "auto")
if target_modules == "auto":
    target_modules = _infer_target_modules(model)
```

- **Example Values**:
  - `"auto"` - Automatic detection
  - `["q_proj", "k_proj", "v_proj", "o_proj"]` - Explicit list
  - `["mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]` - MLP only

---

## Training Parameters

### `train.num_train_epochs`

- **Type**: Float
- **Required**: No
- **Default**: `1`
- **Description**: Number of epochs to train
- **Used in**: Line ~470 in `run_cpt.py`
- **Implementation**:

```python
num_train_epochs = float(tr_cfg.get("num_train_epochs", 1))
# Used in TrainingArguments
```

- **Example Values**: `1.0`, `2.0`, `3.5`

### `train.per_device_train_batch_size`

- **Type**: Integer
- **Required**: No
- **Default**: `1`
- **Description**: Training batch size per device
- **Used in**: Line ~475 in `run_cpt.py`
- **Implementation**:

```python
per_device_train_batch_size=int(tr_cfg.get("per_device_train_batch_size", 1))
```
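The per-device batch size multiplies with `gradient_accumulation_steps` (and the number of devices) to give the effective batch size per optimizer step; multiplying further by `block_size` gives tokens per step. A quick sanity-check helper (hypothetical, for planning only):

```python
def effective_batch(per_device_bs, grad_accum, n_gpus=1, block_size=None):
    """Sequences (and optionally tokens) consumed per optimizer step."""
    seqs = per_device_bs * grad_accum * n_gpus
    if block_size is None:
        return seqs
    return seqs, seqs * block_size
```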
- **Example Values**: `1`, `2`, `4`, `8`

### `train.per_device_eval_batch_size`

- **Type**: Integer
- **Required**: No
- **Default**: Same as train batch size
- **Description**: Evaluation batch size per device
- **Used in**: Line ~476 in `run_cpt.py`
- **Implementation**:

```python
per_device_eval_batch_size=int(tr_cfg.get("per_device_eval_batch_size", tr_cfg.get("per_device_train_batch_size", 1)))
```

- **Example Values**: `1`, `2`, `4`, `8`

### `train.gradient_accumulation_steps`

- **Type**: Integer
- **Required**: No
- **Default**: `1`
- **Description**: Number of steps to accumulate gradients
- **Used in**: Line ~477 in `run_cpt.py`
- **Implementation**:

```python
gradient_accumulation_steps=int(tr_cfg.get("gradient_accumulation_steps", 1))
```

- **Example Values**: `1`, `4`, `8`, `16`, `32`

### `train.learning_rate`

- **Type**: Float
- **Required**: No
- **Default**: `2e-5`
- **Description**: Learning rate for optimizer
- **Used in**: Line ~478 in `run_cpt.py`
- **Implementation**:

```python
learning_rate=float(tr_cfg.get("learning_rate", 2e-5))
```

- **Example Values**: `1e-5`, `2e-5`, `5e-5`, `1e-4`

### `train.weight_decay`

- **Type**: Float
- **Required**: No
- **Default**: `0.0`
- **Description**: Weight decay for regularization
- **Used in**: Line ~479 in `run_cpt.py`
- **Implementation**:

```python
weight_decay=float(tr_cfg.get("weight_decay", 0.0))
```

- **Example Values**: `0.0`, `0.01`, `0.1`

### `train.warmup_ratio`

- **Type**: Float
- **Required**: No
- **Default**: `0.0`
- **Description**: Ratio of steps for learning rate warmup
- **Used in**: Line ~480 in `run_cpt.py`
- **Implementation**:

```python
warmup_ratio=float(tr_cfg.get("warmup_ratio", 0.0))
```

- **Example Values**: `0.0`, `0.1`, `0.2`

### `train.lr_scheduler_type`

- **Type**: String
- **Required**: No
- **Default**: `"cosine"`
- **Description**: Learning rate scheduler type
- **Used in**: Line ~481 in `run_cpt.py`
- **Implementation**:

```python
lr_scheduler_type=str(tr_cfg.get("lr_scheduler_type", "cosine"))
```

- **Example Values**:
  - `"cosine"` - Cosine annealing
  - `"linear"` - Linear decay
  - `"constant"` - Constant rate
  - `"polynomial"` - Polynomial decay

### `train.optim`

- **Type**: String
- **Required**: No
- **Default**: `"paged_adamw_8bit"` (if 4-bit), `"adamw_torch"` (otherwise)
- **Description**: Optimizer type
- **Used in**: Line ~482 in `run_cpt.py`
- **Implementation**:

```python
optim=str(tr_cfg.get("optim", "paged_adamw_8bit" if bool(model_cfg.get("use_4bit", False)) else "adamw_torch"))
```

- **Example Values**:
  - `"adamw_torch"` - AdamW (standard)
  - `"paged_adamw_8bit"` - Paged AdamW with 8-bit optimizer states (pairs well with 4-bit training)
  - `"sgd"` - SGD
  - `"adafactor"` - Adafactor

### `train.max_grad_norm`

- **Type**: Float
- **Required**: No
- **Default**: `1.0`
- **Description**: Maximum gradient norm for clipping
- **Used in**: Line ~483 in `run_cpt.py`
- **Implementation**:

```python
max_grad_norm=float(tr_cfg.get("max_grad_norm", 1.0))
```

- **Example Values**: `0.5`, `1.0`, `2.0`

### `train.gradient_checkpointing`

- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Use gradient checkpointing to save memory
- **Used in**: Lines ~396-400 in `run_cpt.py`
- **Implementation**:

```python
gradient_checkpointing = bool(tr_cfg.get("gradient_checkpointing", True))
if gradient_checkpointing:
    model.gradient_checkpointing_enable()
```

- **Example Values**: `true`, `false`

### `train.logging_steps`

- **Type**: Integer
- **Required**: No
- **Default**: `10`
- **Description**: Log training progress every N steps
- **Used in**: Line ~485 in `run_cpt.py`
- **Implementation**:

```python
logging_steps=int(tr_cfg.get("logging_steps", 10))
```

- **Example Values**: `1`, `10`, `50`, `100`

### `train.save_strategy`

- **Type**: String
- **Required**: No
- **Default**: `"steps"`
- **Description**: When to save model checkpoints
- **Used in**: Line ~487 in `run_cpt.py`
- **Implementation**:

```python
save_strategy=str(tr_cfg.get("save_strategy", "steps"))
```

- **Example Values**:
  - `"steps"` - Save every N steps
  - `"epoch"` - Save every epoch
  - `"no"` - Don't save

### `train.save_steps`

- **Type**: Integer
- **Required**: No
- **Default**: `200`
- **Description**: Save checkpoint every N steps
- **Used in**: Line ~488 in `run_cpt.py`
- **Implementation**:

```python
save_steps=int(tr_cfg.get("save_steps", 200))
```

- **Example Values**: `50`, `100`, `200`, `500`

### `train.save_total_limit`

- **Type**: Integer
- **Required**: No
- **Default**: `3`
- **Description**: Maximum number of checkpoints to keep
- **Used in**: Line ~489 in `run_cpt.py`
- **Implementation**:

```python
save_total_limit=int(tr_cfg.get("save_total_limit", 3))
```

- **Example Values**: `1`, `2`, `3`, `5`

### `train.evaluation_strategy`

- **Type**: String
- **Required**: No
- **Default**: `"steps"` (if eval data), `"no"` (otherwise)
- **Description**: When to evaluate model
- **Used in**: Line ~494 in `run_cpt.py`
- **Implementation**:

```python
evaluation_strategy=str(tr_cfg.get("evaluation_strategy", "steps" if eval_ds is not None else "no"))
```

- **Example Values**:
  - `"steps"` - Evaluate every N steps
  - `"epoch"` - Evaluate every epoch
  - `"no"` - Don't evaluate

### `train.eval_steps`

- **Type**: Integer
- **Required**: No
- **Default**: `200`
- **Description**: Evaluate every N steps
- **Used in**: Line ~491 in `run_cpt.py`
- **Implementation**:

```python
eval_steps=int(tr_cfg.get("eval_steps", 200))
```

- **Example Values**: `25`, `50`, `100`, `200`

### `train.load_best_model_at_end`

- **Type**: Boolean
- **Required**: No
- **Default**: `true` (if eval data), `false` (otherwise)
- **Description**: Load best model at end of training
- **Used in**: Lines ~492-493 in `run_cpt.py`
- **Implementation**:

```python
load_best_model_at_end=bool(tr_cfg.get("load_best_model_at_end", True)) if eval_ds is not None else False
```

- **Example Values**: `true`, `false`
### `train.resume_from_checkpoint`

- **Type**: String or null
- **Required**: No
- **Default**: `null`
- **Description**: Resume training from a checkpoint
- **Used in**: Lines ~510-520 in `run_cpt.py`
- **Implementation**:

```python
resume_from = tr_cfg.get("resume_from_checkpoint", None)
if resume_from == "auto":
    last = get_last_checkpoint(str(ckpt_dir))
    resume_from = last if last else None
```

- **Example Values**:
  - `"auto"` - Auto-detect latest checkpoint
  - `"checkpoint-100"` - Specific checkpoint
  - `null` - Start from scratch

---

## Merge Parameters

### `merge.enabled`

- **Type**: Boolean
- **Required**: No
- **Default**: `false`
- **Description**: Whether to merge LoRA adapters with the base model
- **Used in**: Line ~545 in `run_cpt.py`
- **Implementation**:

```python
if bool(cfg.get("merge", {}).get("enabled", False)):
    merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
```

- **Example Values**: `true`, `false`

### `merge.merged_dtype`

- **Type**: String
- **Required**: No
- **Default**: `"float16"`
- **Description**: Data type for merged model
- **Used in**: Line ~430 in `run_cpt.py`
- **Implementation**:

```python
merged_dtype = _dtype_from_str(merge_cfg.get("merged_dtype", "float16"))
```

- **Example Values**: `"float16"`, `"bfloat16"`, `"float32"`

### `merge.max_shard_size`

- **Type**: String
- **Required**: No
- **Default**: `"2GB"`
- **Description**: Maximum size per shard when saving
- **Used in**: Line ~445 in `run_cpt.py`
- **Implementation**:

```python
merged.save_pretrained(str(final_dir), safe_serialization=True, max_shard_size=max_shard_size)
```

- **Example Values**: `"1GB"`, `"2GB"`, `"5GB"`

### `merge.output_dir`

- **Type**: String (path)
- **Required**: No
- **Default**: `null` (falls back to `<run_dir>/final_model`)
- **Description**: Directory for merged model output
- **Used in**: Lines ~505-510 in `run_cpt.py`
- **Implementation**:

```python
if merge_cfg.get("output_dir"):
    od = Path(str(merge_cfg["output_dir"]))
    final_dir = od if od.is_absolute() else (run_dir / od)
else:
    final_dir = run_dir / "final_model"
```

- **Example Values**: `"./merged_model"`, `"/workspace/final_model"`, `"./models/merged"`

---

## Parameter Dependencies and Interactions

### Memory-Related Dependencies
- `per_device_train_batch_size` × `gradient_accumulation_steps` = effective batch size
- `block_size` affects memory usage significantly
- `use_4bit` + `bnb_4bit_*` parameters work together for quantization
- `gradient_checkpointing` can enable a larger `block_size` or `batch_size`

### Training Strategy Dependencies
- `evaluation_strategy` requires either `eval_jsonl` or `eval_split_ratio > 0`
- `load_best_model_at_end` requires `evaluation_strategy` to be enabled
- `save_strategy` should be compatible with `evaluation_strategy`
- `lr_scheduler_type` affects warmup calculations

### Model-Specific Dependencies
- `target_modules` must match the actual module names in your model
- `torch_dtype` should be compatible with your GPU hardware
- `device_map` affects whether you can use certain optimizations

### Data Processing Dependencies
- `text_field` must exist in your JSONL data
- `pack_mode: "pad"` requires `block_size` to be set appropriately
- `eval_split_ratio` is ignored if `eval_jsonl` is provided

This documentation should help you understand and configure every parameter in the CPT training system according to your needs and constraints.
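As a closing illustration, the dependency rules above can be encoded as a small sanity checker (a hypothetical helper, not part of `run_cpt.py`):

```python
def check_config(cfg: dict) -> list:
    """Return human-readable warnings for common CPT mis-configurations."""
    warnings = []
    data = cfg.get("data", {})
    train = cfg.get("train", {})
    has_eval = bool(data.get("eval_jsonl")) or float(data.get("eval_split_ratio", 0.0)) > 0.0
    if train.get("evaluation_strategy", "no") != "no" and not has_eval:
        warnings.append("evaluation_strategy is set but no eval data is configured")
    if train.get("load_best_model_at_end") and train.get("evaluation_strategy", "no") == "no":
        warnings.append("load_best_model_at_end requires evaluation to be enabled")
    if data.get("eval_jsonl") and float(data.get("eval_split_ratio", 0.0)) > 0.0:
        warnings.append("eval_split_ratio is ignored because eval_jsonl is provided")
    return warnings
```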