| # CPT Configuration Parameters: Detailed Guide | |
| This document provides a comprehensive explanation of all configuration parameters in `config.yaml` and how they're implemented in `run_cpt.py`. | |
| ## Table of Contents | |
| - [Run Parameters](#run-parameters) | |
| - [Model Parameters](#model-parameters) | |
| - [Data Parameters](#data-parameters) | |
| - [PEFT Parameters](#peft-parameters) | |
| - [Training Parameters](#training-parameters) | |
| - [Merge Parameters](#merge-parameters) | |
| --- | |
| ## Run Parameters | |
| ### `run.run_dir` | |
| - **Type**: String (path) | |
| - **Required**: Yes | |
| - **Default**: No default | |
| - **Description**: Directory where training outputs will be saved | |
| - **Used in**: Line ~480 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| run_dir = _ensure_dir(Path(cfg["run"]["run_dir"])) | |
| ``` | |
| - **Example Values**: | |
| - `./runs/cpt_run_v1` | |
| - `/workspace/outputs/my_experiment` | |
| - `./checkpoints/cpt_experiment` | |
| ### `run.seed` | |
| - **Type**: Integer | |
| - **Required**: No | |
| - **Default**: `42` when the key is omitted (see the fallback in the implementation below) | |
| - **Description**: Random seed for reproducibility | |
| - **Used in**: Lines ~460, ~240 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| set_seed(int(cfg["run"].get("seed", 42))) | |
| # Used in data shuffling and train/test split | |
| ``` | |
| - **Example Values**: `42`, `123`, `2023` | |
| --- | |
| ## Model Parameters | |
| ### `model.repo_id` | |
| - **Type**: String (path or HuggingFace repo) | |
| - **Required**: Yes | |
| - **Default**: No default | |
| - **Description**: Model identifier - can be local path or HuggingFace repository | |
| - **Used in**: Lines ~480-500 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| repo_id = str(model_cfg["repo_id"]).strip() | |
| repo_path = Path(repo_id) | |
| if repo_path.exists() and repo_path.is_dir(): | |
|     base_dir = repo_path  # Local path | |
| else: | |
|     # Download from HuggingFace | |
|     snapshot_download(repo_id=repo_id, ...) | |
| ``` | |
| - **Example Values**: | |
| - Local: `/workspace/Models/Devstral-Small-2-24B-Instruct-2512` | |
| - HF Repo: `meta-llama/Llama-2-7b-hf` | |
| ### `model.revision` | |
| - **Type**: String or null | |
| - **Required**: No | |
| - **Default**: null | |
| - **Description**: Specific model revision/branch/tag from HuggingFace | |
| - **Used in**: Line ~495 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| snapshot_download( | |
|     repo_id=repo_id, | |
|     revision=model_cfg.get("revision", None), | |
|     ... | |
| ) | |
| ``` | |
| - **Example Values**: | |
| - `"main"` - Main branch | |
| - `"v1.0"` - Specific tag | |
| - `"abc123def"` - Specific commit hash | |
| - `null` - Repository default branch (normally `main`) | |
| ### `model.base_local_dir` | |
| - **Type**: String (path) | |
| - **Required**: No | |
| - **Default**: `"base_model"` | |
| - **Description**: Directory name for downloaded model when using HF repo | |
| - **Used in**: Line ~495 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| base_dir = _ensure_dir(run_dir / model_cfg.get("base_local_dir", "base_model")) | |
| ``` | |
| - **Example Values**: `"base_model"`, `"downloaded_model"`, `"model_files"` | |
| ### `model.trust_remote_code` | |
| - **Type**: Boolean | |
| - **Required**: No | |
| - **Default**: `true` | |
| - **Description**: Allow loading models with custom code | |
| - **Used in**: Lines ~320, ~340, ~450 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code) | |
| model = AutoModelForCausalLM.from_pretrained(..., trust_remote_code=trust_remote_code, ...) | |
| ``` | |
| - **Example Values**: `true`, `false` | |
| ### `model.tokenizer_use_fast` | |
| - **Type**: Boolean | |
| - **Required**: No | |
| - **Default**: `true` | |
| - **Description**: Use fast tokenizer implementation | |
| - **Used in**: Lines ~320, ~450 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code) | |
| ``` | |
| - **Example Values**: `true`, `false` | |
| ### `model.device_map` | |
| - **Type**: String or mapping | |
| - **Required**: No | |
| - **Default**: `"auto"` | |
| - **Description**: How to distribute model across devices | |
| - **Used in**: Lines ~350, ~370 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| model = AutoModelForCausalLM.from_pretrained(..., device_map=device_map, ...) | |
| ``` | |
| - **Example Values**: | |
| - `"auto"` - Automatic distribution | |
| - `"cpu"` - CPU only | |
| - `"cuda:0"` - Single GPU | |
| - `{"": 0}` - Manual mapping | |
| ### `model.torch_dtype` | |
| - **Type**: String | |
| - **Required**: No | |
| - **Default**: `"bfloat16"` | |
| - **Description**: Data type for model tensors | |
| - **Used in**: Lines ~45, ~350 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
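| def _dtype_from_str(s: str) -> torch.dtype: | |
|     # Map a config string to the corresponding torch dtype | |
|     if s in ("float16", "fp16"): return torch.float16 | |
|     if s in ("bfloat16", "bf16"): return torch.bfloat16 | |
|     if s in ("float32", "fp32"): return torch.float32 | |
|     raise ValueError(f"Unsupported dtype string: {s!r}") | |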
| ``` | |
| - **Example Values**: | |
| - `"float16"` - 16-bit floats (faster, less memory, less stable) | |
| - `"bfloat16"` - Brain float16 (stable, good for training) | |
| - `"float32"` - 32-bit floats (slowest, most memory) | |
| ### `model.use_4bit` | |
| - **Type**: Boolean | |
| - **Required**: No | |
| - **Default**: `false` | |
| - **Description**: Use 4-bit quantization for memory efficiency | |
| - **Used in**: Lines ~325, ~395 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| use_4bit = bool(model_cfg.get("use_4bit", False)) | |
| if use_4bit: | |
|     quant_cfg = BitsAndBytesConfig(load_in_4bit=True, ...) | |
| ``` | |
| - **Example Values**: `true`, `false` | |
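To put the memory saving in rough numbers, the sketch below estimates the size of the weights alone at different precisions. It is a back-of-the-envelope calculation, not something `run_cpt.py` does: the helper name `approx_weight_gib` and the 24-billion-parameter count are illustrative assumptions, and the estimate ignores quantization constants, activations, gradients, and optimizer state.

```python
def approx_weight_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


num_params = 24_000_000_000  # hypothetical 24B-parameter model
for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{approx_weight_gib(num_params, bits):.1f} GiB")
```

Real usage will be higher than these figures, so treat them as a lower bound when sizing hardware.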
| ### `model.bnb_4bit_quant_type` | |
| - **Type**: String | |
| - **Required**: No | |
| - **Default**: `"nf4"` | |
| - **Description**: 4-bit quantization type | |
| - **Used in**: Lines ~328 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| bnb_4bit_quant_type=str(model_cfg.get("bnb_4bit_quant_type", "nf4")) | |
| ``` | |
| - **Example Values**: | |
| - `"nf4"` - NormalFloat4 (recommended) | |
| - `"fp4"` - 4-bit floating point | |
| ### `model.bnb_4bit_use_double_quant` | |
| - **Type**: Boolean | |
| - **Required**: No | |
| - **Default**: `false` (the implementation below falls back to `True` when the key is omitted) | |
| - **Description**: Use double quantization for memory efficiency | |
| - **Used in**: Lines ~329 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| bnb_4bit_use_double_quant=bool(model_cfg.get("bnb_4bit_use_double_quant", True)) | |
| ``` | |
| - **Example Values**: `true`, `false` | |
| ### `model.bnb_4bit_compute_dtype` | |
| - **Type**: String | |
| - **Required**: No | |
| - **Default**: `"bfloat16"` | |
| - **Description**: Compute dtype for 4-bit quantization | |
| - **Used in**: Lines ~330 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| bnb_4bit_compute_dtype=_dtype_from_str(model_cfg.get("bnb_4bit_compute_dtype", "bfloat16")) | |
| ``` | |
| - **Example Values**: `"float16"`, `"bfloat16"`, `"float32"` | |
| ### `model.attn_implementation` | |
| - **Type**: String or null | |
| - **Required**: No | |
| - **Default**: `null` | |
| - **Description**: Attention implementation to use | |
| - **Used in**: Lines ~155, ~350 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| def _choose_attn_impl(cfg: Dict[str, Any]) -> Optional[str]: | |
|     return cfg.get("model", {}).get("attn_implementation", None) | |
| # Used in model.from_pretrained(..., attn_implementation=attn_impl, ...) | |
| ``` | |
| - **Example Values**: | |
| - `"flash_attention_2"` - Flash Attention 2 (usually the fastest option on supported GPUs; requires the `flash-attn` package) | |
| - `"sdpa"` - Scaled Dot-Product Attention | |
| - `null` - Default implementation | |
| --- | |
| ## Data Parameters | |
| ### `data.train_jsonl` | |
| - **Type**: String (path) | |
| - **Required**: Yes | |
| - **Default**: No default | |
| - **Description**: Path to training data in JSONL format | |
| - **Used in**: Lines ~170 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| train_path = data_cfg["train_jsonl"] | |
| ds = load_dataset("json", data_files={"train": train_path}) | |
| ``` | |
| - **Example Values**: `"/workspace/all_data_with_descriptions.jsonl"` | |
| ### `data.eval_jsonl` | |
| - **Type**: String (path) or null | |
| - **Required**: No | |
| - **Default**: `null` | |
| - **Description**: Path to evaluation data in JSONL format | |
| - **Used in**: Lines ~175 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| eval_path = data_cfg.get("eval_jsonl", None) | |
| if eval_path: | |
|     ds_eval = load_dataset("json", data_files={"eval": eval_path}) | |
| ``` | |
| - **Example Values**: `null` (no separate eval file), `"/workspace/eval_data.jsonl"` | |
| ### `data.eval_split_ratio` | |
| - **Type**: Float | |
| - **Required**: No | |
| - **Default**: `0.1` (the implementation below falls back to `0.0`, i.e. no split, when the key is omitted) | |
| - **Description**: Ratio of training data to use for evaluation split | |
| - **Used in**: Lines ~177 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| split_ratio = float(data_cfg.get("eval_split_ratio", 0.0)) | |
| if 0.0 < split_ratio < 1.0: | |
|     split = ds["train"].train_test_split(test_size=split_ratio, seed=seed) | |
| ``` | |
| - **Example Values**: `0.1` (10%), `0.2` (20%), `0.05` (5%) | |
| ### `data.text_field` | |
| - **Type**: String | |
| - **Required**: No | |
| - **Default**: `"text"` | |
| - **Description**: Field name in JSONL containing the text data | |
| - **Used in**: Lines ~185 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| text_field = data_cfg.get("text_field", "text") | |
| # Used in tokenization | |
| tokenized = dsd["train"].map( | |
|     tokenize_fn, | |
|     batched=True, | |
|     remove_columns=dsd["train"].column_names, | |
|     desc="Tokenizing train", | |
| ) | |
| ``` | |
| - **Example Values**: `"text"`, `"content"`, `"prompt"`, `"input"` | |
| ### `data.block_size` | |
| - **Type**: Integer | |
| - **Required**: No | |
| - **Default**: `4096` (the implementation below falls back to `2048` when the key is omitted) | |
| - **Description**: Maximum sequence length for training | |
| - **Used in**: Lines ~180 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| block_size = int(data_cfg.get("block_size", 2048)) | |
| # Used in grouping texts into blocks | |
| for i in range(0, full_len, block_size): | |
|     chunk = concatenated["input_ids"][i:i + block_size] | |
| ``` | |
| - **Example Values**: `2048`, `4096`, `8192` | |
| ### `data.shuffle` | |
| - **Type**: Boolean | |
| - **Required**: No | |
| - **Default**: `true` | |
| - **Description**: Whether to shuffle training data | |
| - **Used in**: Lines ~235 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| if shuffle: | |
|     tokenized_train = tokenized_train.shuffle(seed=int(cfg["run"].get("seed", 42))) | |
| ``` | |
| - **Example Values**: `true`, `false` | |
| ### `data.num_proc` | |
| - **Type**: Integer | |
| - **Required**: No | |
| - **Default**: `4` | |
| - **Description**: Number of processes for data loading | |
| - **Used in**: Lines ~200, ~210 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| num_proc = int(data_cfg.get("num_proc", 4)) | |
| tokenized_train = dsd["train"].map( | |
|     tokenize_fn, | |
|     batched=True, | |
|     num_proc=num_proc, | |
|     ... | |
| ) | |
| ``` | |
| - **Example Values**: `1`, `4`, `8`, `16` | |
| ### `data.pack_mode` | |
| - **Type**: String | |
| - **Required**: No | |
| - **Default**: `"pad"` (the implementation below falls back to `"drop"` when the key is omitted) | |
| - **Description**: How to handle remainder tokens in final block | |
| - **Used in**: Lines ~150-230 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| pack_mode = str(data_cfg.get("pack_mode", "drop")).lower().strip() | |
| if pack_mode == "pad": | |
|     # Pad remainder and mask loss | |
|     labels[-pad_len:] = [-100] * pad_len | |
| # If "drop": ignore remainder entirely | |
| ``` | |
| - **Example Values**: | |
| - `"drop"` - Drop incomplete blocks (strict CPT) | |
| - `"pad"` - Pad incomplete blocks with masked loss | |
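The difference between the two modes is easiest to see on a tiny token stream. The sketch below is a simplified, pure-Python illustration of fixed-size packing and is not the code from `run_cpt.py`: the function name `pack_blocks`, the `pad_token_id` argument, and the toy input are assumptions, and the attention mask is left out for brevity.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that the loss skips, as in the snippet above


def pack_blocks(token_ids: List[int], block_size: int, pack_mode: str = "drop",
                pad_token_id: int = 0) -> List[Dict[str, List[int]]]:
    """Split one concatenated token stream into fixed-size blocks."""
    blocks = []
    for start in range(0, len(token_ids), block_size):
        chunk = token_ids[start:start + block_size]
        labels = list(chunk)
        if len(chunk) < block_size:
            if pack_mode == "drop":
                break  # discard the incomplete final block
            pad_len = block_size - len(chunk)
            chunk = chunk + [pad_token_id] * pad_len
            labels = labels + [IGNORE_INDEX] * pad_len  # no loss on padding
        blocks.append({"input_ids": chunk, "labels": labels})
    return blocks


stream = list(range(1, 11))  # 10 tokens, block_size=4 leaves a remainder of 2
print(len(pack_blocks(stream, block_size=4, pack_mode="drop")))  # 2
print(pack_blocks(stream, block_size=4, pack_mode="pad")[-1])
```

With `"drop"` the last two tokens are never trained on; with `"pad"` they are kept and the padded positions contribute nothing to the loss.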
| --- | |
| ## PEFT Parameters | |
| ### `peft.enabled` | |
| - **Type**: Boolean | |
| - **Required**: No | |
| - **Default**: `true` | |
| - **Description**: Whether to use PEFT (Parameter-Efficient Fine-Tuning) | |
| - **Used in**: Lines ~395 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| if not bool(peft_cfg.get("enabled", True)): | |
|     return model, None | |
| # Otherwise proceed with LoRA configuration | |
| ``` | |
| - **Example Values**: `true`, `false` | |
| ### `peft.r` | |
| - **Type**: Integer | |
| - **Required**: No | |
| - **Default**: `64` (the implementation below falls back to `16` when the key is omitted) | |
| - **Description**: LoRA rank - dimension of low-rank matrices | |
| - **Used in**: Lines ~415 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| lora_config = LoraConfig( | |
|     r=int(peft_cfg.get("r", 16)), | |
|     ... | |
| ) | |
| ``` | |
| - **Example Values**: `8`, `16`, `32`, `64`, `128` | |
| - **Note**: Higher values = more parameters but potentially better performance | |
| ### `peft.lora_alpha` | |
| - **Type**: Integer | |
| - **Required**: No | |
| - **Default**: `128` (the implementation below falls back to `32` when the key is omitted) | |
| - **Description**: LoRA alpha scaling parameter | |
| - **Used in**: Lines ~416 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| lora_config = LoraConfig( | |
|     lora_alpha=int(peft_cfg.get("lora_alpha", 32)), | |
|     ... | |
| ) | |
| ``` | |
| - **Example Values**: `16`, `32`, `64`, `128`, `256` | |
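`lora_alpha` only has meaning relative to `r`: in standard LoRA (without rank-stabilized scaling) the adapter update is multiplied by `lora_alpha / r`. The short sketch below works through that ratio for a few combinations; the helper name `lora_scaling` is introduced here for illustration and is not part of `run_cpt.py`.

```python
def lora_scaling(lora_alpha: int, r: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


for r, alpha in [(16, 32), (64, 128), (64, 32)]:
    print(f"r={r:<3} lora_alpha={alpha:<4} scaling={lora_scaling(alpha, r)}")
```

Raising `r` while leaving `lora_alpha` unchanged therefore shrinks the effective update, which is why the two values are usually adjusted together.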
| ### `peft.lora_dropout` | |
| - **Type**: Float | |
| - **Required**: No | |
| - **Default**: `0.05` | |
| - **Description**: Dropout rate for LoRA layers | |
| - **Used in**: Lines ~417 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| lora_config = LoraConfig( | |
|     lora_dropout=float(peft_cfg.get("lora_dropout", 0.05)), | |
|     ... | |
| ) | |
| ``` | |
| - **Example Values**: `0.0`, `0.05`, `0.1`, `0.2` | |
| ### `peft.bias` | |
| - **Type**: String | |
| - **Required**: No | |
| - **Default**: `"none"` | |
| - **Description**: Bias training strategy | |
| - **Used in**: Lines ~418 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| lora_config = LoraConfig( | |
|     bias=str(peft_cfg.get("bias", "none")), | |
|     ... | |
| ) | |
| ``` | |
| - **Example Values**: | |
| - `"none"` - No bias training | |
| - `"all"` - Train all biases | |
| - `"lora_only"` - Only LoRA bias | |
| ### `peft.target_modules` | |
| - **Type**: String or List | |
| - **Required**: No | |
| - **Default**: `"auto"` | |
| - **Description**: Which modules to apply LoRA to | |
| - **Used in**: Lines ~405, ~140-170 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| target_modules = peft_cfg.get("target_modules", "auto") | |
| if target_modules == "auto": | |
|     target_modules = _infer_target_modules(model) | |
| ``` | |
| - **Example Values**: | |
| - `"auto"` - Automatic detection | |
| - `["q_proj", "k_proj", "v_proj", "o_proj"]` - Explicit list | |
| - `["mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]` - MLP only | |
| --- | |
| ## Training Parameters | |
| ### `train.num_train_epochs` | |
| - **Type**: Float | |
| - **Required**: No | |
| - **Default**: `2` (the implementation below falls back to `1` when the key is omitted) | |
| - **Description**: Number of epochs to train | |
| - **Used in**: Lines ~470 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| num_train_epochs = float(tr_cfg.get("num_train_epochs", 1)) | |
| # Used in TrainingArguments | |
| ``` | |
| - **Example Values**: `1.0`, `2.0`, `3.5` | |
| ### `train.per_device_train_batch_size` | |
| - **Type**: Integer | |
| - **Required**: No | |
| - **Default**: `1` | |
| - **Description**: Training batch size per device | |
| - **Used in**: Lines ~475 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| per_device_train_batch_size=int(tr_cfg.get("per_device_train_batch_size", 1)) | |
| ``` | |
| - **Example Values**: `1`, `2`, `4`, `8` | |
| ### `train.per_device_eval_batch_size` | |
| - **Type**: Integer | |
| - **Required**: No | |
| - **Default**: Same as train batch size | |
| - **Description**: Evaluation batch size per device | |
| - **Used in**: Lines ~476 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| per_device_eval_batch_size=int(tr_cfg.get("per_device_eval_batch_size", tr_cfg.get("per_device_train_batch_size", 1))) | |
| ``` | |
| - **Example Values**: `1`, `2`, `4`, `8` | |
| ### `train.gradient_accumulation_steps` | |
| - **Type**: Integer | |
| - **Required**: No | |
| - **Default**: `16` (the implementation below falls back to `1` when the key is omitted) | |
| - **Description**: Number of steps to accumulate gradients | |
| - **Used in**: Lines ~477 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| gradient_accumulation_steps=int(tr_cfg.get("gradient_accumulation_steps", 1)) | |
| ``` | |
| - **Example Values**: `1`, `4`, `8`, `16`, `32` | |
| ### `train.learning_rate` | |
| - **Type**: Float | |
| - **Required**: No | |
| - **Default**: `2e-5` | |
| - **Description**: Learning rate for optimizer | |
| - **Used in**: Lines ~478 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| learning_rate=float(tr_cfg.get("learning_rate", 2e-5)) | |
| ``` | |
| - **Example Values**: `1e-5`, `2e-5`, `5e-5`, `1e-4` | |
| ### `train.weight_decay` | |
| - **Type**: Float | |
| - **Required**: No | |
| - **Default**: `0.0` | |
| - **Description**: Weight decay for regularization | |
| - **Used in**: Lines ~479 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| weight_decay=float(tr_cfg.get("weight_decay", 0.0)) | |
| ``` | |
| - **Example Values**: `0.0`, `0.01`, `0.1` | |
| ### `train.warmup_ratio` | |
| - **Type**: Float | |
| - **Required**: No | |
| - **Default**: `0.1` (the implementation below falls back to `0.0` when the key is omitted) | |
| - **Description**: Ratio of steps for learning rate warmup | |
| - **Used in**: Lines ~480 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| warmup_ratio=float(tr_cfg.get("warmup_ratio", 0.0)) | |
| ``` | |
| - **Example Values**: `0.0`, `0.1`, `0.2` | |
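As a worked example of what the ratio means in steps: the Hugging Face `TrainingArguments` convert `warmup_ratio` into a step count by rounding up its product with the total number of optimizer steps. The sketch below reproduces that arithmetic in isolation; the function name `warmup_steps` and the step counts are illustrative assumptions.

```python
import math


def warmup_steps(total_optimizer_steps: int, warmup_ratio: float) -> int:
    """Number of optimizer steps spent ramping the learning rate up."""
    return math.ceil(total_optimizer_steps * warmup_ratio)


# Hypothetical run lengths, all with the same warmup_ratio
for total in (105, 500, 2000):
    print(total, "steps ->", warmup_steps(total, 0.1), "warmup steps")
```

On short runs a ratio of `0.1` can amount to only a handful of steps, so it is worth checking the resulting count against `logging_steps` and `eval_steps`.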
| ### `train.lr_scheduler_type` | |
| - **Type**: String | |
| - **Required**: No | |
| - **Default**: `"cosine"` | |
| - **Description**: Learning rate scheduler type | |
| - **Used in**: Lines ~481 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| lr_scheduler_type=str(tr_cfg.get("lr_scheduler_type", "cosine")) | |
| ``` | |
| - **Example Values**: | |
| - `"cosine"` - Cosine annealing | |
| - `"linear"` - Linear decay | |
| - `"constant"` - Constant rate | |
| - `"polynomial"` - Polynomial decay | |
| ### `train.optim` | |
| - **Type**: String | |
| - **Required**: No | |
| - **Default**: `"paged_adamw_8bit"` (if 4-bit), `"adamw_torch"` (otherwise) | |
| - **Description**: Optimizer type | |
| - **Used in**: Lines ~482 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| optim=str(tr_cfg.get("optim", "paged_adamw_8bit" if bool(model_cfg.get("use_4bit", False)) else "adamw_torch")) | |
| ``` | |
| - **Example Values**: | |
| - `"adamw_torch"` - AdamW (standard) | |
| - `"paged_adamw_8bit"` - Paged AdamW with 8-bit optimizer states (bitsandbytes) | |
| - `"sgd"` - SGD | |
| - `"adafactor"` - Adafactor | |
| ### `train.max_grad_norm` | |
| - **Type**: Float | |
| - **Required**: No | |
| - **Default**: `1.0` | |
| - **Description**: Maximum gradient norm for clipping | |
| - **Used in**: Lines ~483 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| max_grad_norm=float(tr_cfg.get("max_grad_norm", 1.0)) | |
| ``` | |
| - **Example Values**: `0.5`, `1.0`, `2.0` | |
| ### `train.gradient_checkpointing` | |
| - **Type**: Boolean | |
| - **Required**: No | |
| - **Default**: `true` | |
| - **Description**: Use gradient checkpointing to save memory | |
| - **Used in**: Lines ~396-400 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| gradient_checkpointing = bool(tr_cfg.get("gradient_checkpointing", True)) | |
| if gradient_checkpointing: | |
|     model.gradient_checkpointing_enable() | |
| ``` | |
| - **Example Values**: `true`, `false` | |
| ### `train.logging_steps` | |
| - **Type**: Integer | |
| - **Required**: No | |
| - **Default**: `1` (the implementation below falls back to `10` when the key is omitted) | |
| - **Description**: Log training progress every N steps | |
| - **Used in**: Lines ~485 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| logging_steps=int(tr_cfg.get("logging_steps", 10)) | |
| ``` | |
| - **Example Values**: `1`, `10`, `50`, `100` | |
| ### `train.save_strategy` | |
| - **Type**: String | |
| - **Required**: No | |
| - **Default**: `"steps"` | |
| - **Description**: When to save model checkpoints | |
| - **Used in**: Lines ~487 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| save_strategy=str(tr_cfg.get("save_strategy", "steps")) | |
| ``` | |
| - **Example Values**: | |
| - `"steps"` - Save every N steps | |
| - `"epoch"` - Save every epoch | |
| - `"no"` - Don't save | |
| ### `train.save_steps` | |
| - **Type**: Integer | |
| - **Required**: No | |
| - **Default**: `100` (the implementation below falls back to `200` when the key is omitted) | |
| - **Description**: Save checkpoint every N steps | |
| - **Used in**: Lines ~488 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| save_steps=int(tr_cfg.get("save_steps", 200)) | |
| ``` | |
| - **Example Values**: `50`, `100`, `200`, `500` | |
| ### `train.save_total_limit` | |
| - **Type**: Integer | |
| - **Required**: No | |
| - **Default**: `4` (the implementation below falls back to `3` when the key is omitted) | |
| - **Description**: Maximum number of checkpoints to keep | |
| - **Used in**: Lines ~489 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| save_total_limit=int(tr_cfg.get("save_total_limit", 3)) | |
| ``` | |
| - **Example Values**: `1`, `2`, `3`, `5` | |
| ### `train.evaluation_strategy` | |
| - **Type**: String | |
| - **Required**: No | |
| - **Default**: `"steps"` (if eval data), `"no"` (otherwise) | |
| - **Description**: When to evaluate model | |
| - **Used in**: Lines ~494 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| evaluation_strategy=str(tr_cfg.get("evaluation_strategy", "steps" if eval_ds is not None else "no")) | |
| ``` | |
| - **Example Values**: | |
| - `"steps"` - Evaluate every N steps | |
| - `"epoch"` - Evaluate every epoch | |
| - `"no"` - Don't evaluate | |
| ### `train.eval_steps` | |
| - **Type**: Integer | |
| - **Required**: No | |
| - **Default**: `50` (the implementation below falls back to `200` when the key is omitted) | |
| - **Description**: Evaluate every N steps | |
| - **Used in**: Lines ~491 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| eval_steps=int(tr_cfg.get("eval_steps", 200)) | |
| ``` | |
| - **Example Values**: `25`, `50`, `100`, `200` | |
| ### `train.load_best_model_at_end` | |
| - **Type**: Boolean | |
| - **Required**: No | |
| - **Default**: `true` (if eval data), `false` (otherwise) | |
| - **Description**: Load best model at end of training | |
| - **Used in**: Lines ~492-493 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| load_best_model_at_end=bool(tr_cfg.get("load_best_model_at_end", True)) if eval_ds is not None else False | |
| ``` | |
| - **Example Values**: `true`, `false` | |
| ### `train.resume_from_checkpoint` | |
| - **Type**: String or null | |
| - **Required**: No | |
| - **Default**: `"auto"` (the implementation below falls back to `None`, i.e. no resume, when the key is omitted) | |
| - **Description**: Resume training from checkpoint | |
| - **Used in**: Lines ~510-520 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| resume_from = tr_cfg.get("resume_from_checkpoint", None) | |
| if resume_from == "auto": | |
|     last = get_last_checkpoint(str(ckpt_dir)) | |
|     resume_from = last if last else None | |
| ``` | |
| - **Example Values**: | |
| - `"auto"` - Auto-detect latest checkpoint | |
| - `"checkpoint-100"` - Specific checkpoint | |
| - `null` - Start from scratch | |
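To make the `"auto"` behaviour concrete, the sketch below is a simplified stand-in for `get_last_checkpoint`: it scans a directory for `checkpoint-<step>` folders and returns the one with the highest step number. The function name `find_last_checkpoint` and the temporary directory layout are assumptions for illustration; the real helper lives in `transformers.trainer_utils`.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the checkpoint-<step> subdirectory with the largest step, if any."""
    if not ckpt_dir.is_dir():
        return None
    candidates = [
        (int(m.group(1)), p)
        for p in ckpt_dir.iterdir()
        if p.is_dir() and (m := _CKPT_RE.match(p.name))
    ]
    if not candidates:
        return None
    # Compare step numbers numerically, not the folder names as strings
    return max(candidates, key=lambda c: c[0])[1]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for step in (100, 200, 1000):
        (root / f"checkpoint-{step}").mkdir()
    print(find_last_checkpoint(root).name)  # checkpoint-1000
```

Sorting numerically matters: a plain string comparison would rank `checkpoint-200` above `checkpoint-1000`.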
| --- | |
| ## Merge Parameters | |
| ### `merge.enabled` | |
| - **Type**: Boolean | |
| - **Required**: No | |
| - **Default**: `false` | |
| - **Description**: Whether to merge LoRA adapters with base model | |
| - **Used in**: Lines ~545 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| if bool(cfg.get("merge", {}).get("enabled", False)): | |
|     merge_adapter(cfg, base_dir, best_adapter_dir, final_dir) | |
| ``` | |
| - **Example Values**: `true`, `false` | |
| ### `merge.merged_dtype` | |
| - **Type**: String | |
| - **Required**: No | |
| - **Default**: `"float16"` | |
| - **Description**: Data type for merged model | |
| - **Used in**: Lines ~430 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| merged_dtype = _dtype_from_str(merge_cfg.get("merged_dtype", "float16")) | |
| ``` | |
| - **Example Values**: `"float16"`, `"bfloat16"`, `"float32"` | |
| ### `merge.max_shard_size` | |
| - **Type**: String | |
| - **Required**: No | |
| - **Default**: `"2GB"` | |
| - **Description**: Maximum size per shard when saving | |
| - **Used in**: Lines ~445 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| merged.save_pretrained(str(final_dir), safe_serialization=True, max_shard_size=max_shard_size) | |
| ``` | |
| - **Example Values**: `"1GB"`, `"2GB"`, `"5GB"` | |
| ### `merge.output_dir` | |
| - **Type**: String (path) | |
| - **Required**: No | |
| - **Default**: `"./merged_model"` (the implementation below falls back to `run_dir / "final_model"` when the key is omitted) | |
| - **Description**: Directory for merged model output | |
| - **Used in**: Lines ~505-510 in `run_cpt.py` | |
| - **Implementation**: | |
| ```python | |
| if merge_cfg.get("output_dir"): | |
|     od = Path(str(merge_cfg["output_dir"])) | |
|     final_dir = od if od.is_absolute() else (run_dir / od) | |
| else: | |
|     final_dir = run_dir / "final_model" | |
| ``` | |
| - **Example Values**: `"./merged_model"`, `"/workspace/final_model"`, `"./models/merged"` | |
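Because relative paths are resolved against `run_dir` rather than the current working directory, it can help to see the three cases side by side. The sketch below wraps the logic shown above in a standalone function; the name `resolve_final_dir` and the example paths are assumptions for illustration.

```python
from pathlib import Path
from typing import Any, Dict


def resolve_final_dir(run_dir: Path, merge_cfg: Dict[str, Any]) -> Path:
    """Pick the merged-model directory the way the snippet above does."""
    output_dir = merge_cfg.get("output_dir")
    if output_dir:
        od = Path(str(output_dir))
        # Absolute paths are used as-is; relative ones live under run_dir
        return od if od.is_absolute() else run_dir / od
    return run_dir / "final_model"


run_dir = Path("/workspace/runs/cpt_run_v1")  # hypothetical run directory
print(resolve_final_dir(run_dir, {}))
print(resolve_final_dir(run_dir, {"output_dir": "./merged_model"}))
print(resolve_final_dir(run_dir, {"output_dir": "/workspace/final_model"}))
```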
| --- | |
| ## Parameter Dependencies and Interactions | |
| ### Memory-Related Dependencies | |
| - `per_device_train_batch_size` × `gradient_accumulation_steps` × number of devices = effective batch size | |
| - `block_size` affects memory usage significantly | |
| - `use_4bit` + `bnb_4bit_*` parameters work together for quantization | |
| - `gradient_checkpointing` can enable larger `block_size` or `batch_size` | |
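The effective batch size is a product of the per-device batch size, the accumulation steps, and the number of devices, and multiplying it by `block_size` gives the number of tokens consumed per optimizer step. The sketch below works through that arithmetic; the function names and the single-GPU setup are illustrative assumptions.

```python
def effective_batch_size(per_device_batch: int, grad_accum: int, num_devices: int = 1) -> int:
    """Sequences contributing to each optimizer step."""
    return per_device_batch * grad_accum * num_devices


def tokens_per_optimizer_step(per_device_batch: int, grad_accum: int,
                              block_size: int, num_devices: int = 1) -> int:
    """Tokens seen per optimizer step when every sequence is a full block."""
    return effective_batch_size(per_device_batch, grad_accum, num_devices) * block_size


# Example: batch size 1, 16 accumulation steps, 4096-token blocks, one GPU
print(effective_batch_size(1, 16))             # 16
print(tokens_per_optimizer_step(1, 16, 4096))  # 65536
```

Memory pressure depends mainly on `per_device_train_batch_size` and `block_size`; `gradient_accumulation_steps` raises the effective batch size without increasing the memory needed for each forward pass.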
| ### Training Strategy Dependencies | |
| - `evaluation_strategy` requires either `eval_jsonl` or `eval_split_ratio > 0` | |
| - `load_best_model_at_end` requires `evaluation_strategy` to be enabled | |
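| - When `load_best_model_at_end` is enabled, `save_strategy` must match `evaluation_strategy`, and with `"steps"` the value of `save_steps` must be a multiple of `eval_steps` | |
| - `warmup_ratio` sets how long the warmup lasts; `lr_scheduler_type` sets the shape of the schedule after warmup | |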
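These constraints can be checked before a long run starts. The sketch below is a hypothetical pre-flight check, not part of `run_cpt.py`: the function name `check_train_cfg` and its messages are assumptions, the fallback values copy the ones shown in the Training Parameters section, and the rules for `load_best_model_at_end` follow the Hugging Face `TrainingArguments` requirements.

```python
from typing import Any, Dict, List


def check_train_cfg(train_cfg: Dict[str, Any], has_eval_data: bool) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    eval_strategy = str(train_cfg.get("evaluation_strategy", "steps" if has_eval_data else "no"))
    save_strategy = str(train_cfg.get("save_strategy", "steps"))
    # Best-model loading is forced off when there is no eval data
    load_best = bool(train_cfg.get("load_best_model_at_end", True)) if has_eval_data else False

    if eval_strategy != "no" and not has_eval_data:
        problems.append("evaluation_strategy is set but no eval data is configured")
    if load_best:
        if eval_strategy == "no":
            problems.append("load_best_model_at_end needs evaluation enabled")
        elif save_strategy != eval_strategy:
            problems.append("save_strategy must match evaluation_strategy")
        elif eval_strategy == "steps":
            save_steps = int(train_cfg.get("save_steps", 200))
            eval_steps = int(train_cfg.get("eval_steps", 200))
            if save_steps % eval_steps != 0:
                problems.append("save_steps must be a multiple of eval_steps")
    return problems


print(check_train_cfg({"save_steps": 100, "eval_steps": 50}, has_eval_data=True))
print(check_train_cfg({"save_steps": 100, "eval_steps": 30}, has_eval_data=True))
```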
| ### Model-Specific Dependencies | |
| - `target_modules` must match the actual module names in your model | |
| - `torch_dtype` should be compatible with your GPU hardware (for example, `bfloat16` needs native support such as NVIDIA Ampere or newer) | |
| - `device_map` affects whether you can use certain optimizations | |
| ### Data Processing Dependencies | |
| - `text_field` must exist in your JSONL data | |
| - `pack_mode: "pad"` pads the final partial block up to `block_size` and masks the padded positions in the loss; `"drop"` discards that block | |
| - `eval_split_ratio` is ignored if `eval_jsonl` is provided | |
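The precedence between the two evaluation sources can be summarized in a few lines. The sketch below is an illustration of the rule stated above, using the `0.0` fallback and the open interval check from the `data.eval_split_ratio` snippet; the function name `choose_eval_source` and its return labels are assumptions.

```python
from typing import Any, Dict


def choose_eval_source(data_cfg: Dict[str, Any]) -> str:
    """Decide where evaluation data comes from: 'file', 'split', or 'none'."""
    if data_cfg.get("eval_jsonl"):
        return "file"  # a dedicated eval file takes precedence
    ratio = float(data_cfg.get("eval_split_ratio", 0.0))
    if 0.0 < ratio < 1.0:
        return "split"  # carve the eval set out of the training data
    return "none"


print(choose_eval_source({"eval_jsonl": "/workspace/eval_data.jsonl", "eval_split_ratio": 0.1}))
print(choose_eval_source({"eval_split_ratio": 0.1}))
print(choose_eval_source({}))
```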
| This comprehensive documentation should help you understand and configure all parameters in the CPT training system according to your specific needs and constraints. | |