# CPT Configuration Parameters: Detailed Guide
This document provides a comprehensive explanation of all configuration parameters in `config.yaml` and how they're implemented in `run_cpt.py`.
## Table of Contents
- [Run Parameters](#run-parameters)
- [Model Parameters](#model-parameters)
- [Data Parameters](#data-parameters)
- [PEFT Parameters](#peft-parameters)
- [Training Parameters](#training-parameters)
- [Merge Parameters](#merge-parameters)
---
## Run Parameters
### `run.run_dir`
- **Type**: String (path)
- **Required**: Yes
- **Default**: No default
- **Description**: Directory where training outputs will be saved
- **Used in**: Line ~480 in `run_cpt.py`
- **Implementation**:
```python
run_dir = _ensure_dir(Path(cfg["run"]["run_dir"]))
```
- **Example Values**:
- `./runs/cpt_run_v1`
- `/workspace/outputs/my_experiment`
- `./checkpoints/cpt_experiment`
### `run.seed`
- **Type**: Integer
- **Required**: No
- **Default**: `42`
- **Description**: Random seed for reproducibility
- **Used in**: Lines ~460, ~240 in `run_cpt.py`
- **Implementation**:
```python
set_seed(int(cfg["run"].get("seed", 42)))
# Used in data shuffling and train/test split
```
- **Example Values**: `42`, `123`, `2023`
---
## Model Parameters
### `model.repo_id`
- **Type**: String (path or HuggingFace repo)
- **Required**: Yes
- **Default**: No default
- **Description**: Model identifier - can be local path or HuggingFace repository
- **Used in**: Lines ~480-500 in `run_cpt.py`
- **Implementation**:
```python
repo_id = str(model_cfg["repo_id"]).strip()
repo_path = Path(repo_id)
if repo_path.exists() and repo_path.is_dir():
    base_dir = repo_path  # Local path
else:
    # Download from HuggingFace
    snapshot_download(repo_id=repo_id, ...)
```
- **Example Values**:
- Local: `/workspace/Models/Devstral-Small-2-24B-Instruct-2512`
- HF Repo: `meta-llama/Llama-2-7b-hf`
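The local-vs-Hub decision above can be isolated into a small helper; this is an illustrative sketch (the function name is an assumption, not from `run_cpt.py`):

```python
from pathlib import Path

def resolve_model_source(repo_id: str):
    # Mirrors the branch in the snippet above: an existing local directory
    # wins; any other string is treated as a Hub repo id to download.
    repo_id = repo_id.strip()
    p = Path(repo_id)
    if p.exists() and p.is_dir():
        return ("local", p)
    return ("hub", repo_id)
```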
### `model.revision`
- **Type**: String or null
- **Required**: No
- **Default**: null
- **Description**: Specific model revision/branch/tag from HuggingFace
- **Used in**: Line ~495 in `run_cpt.py`
- **Implementation**:
```python
snapshot_download(
    repo_id=repo_id,
    revision=model_cfg.get("revision", None),
    ...
)
```
- **Example Values**:
- `"main"` - Main branch
- `"v1.0"` - Specific tag
- `"abc123def"` - Specific commit hash
- `null` - Latest version
### `model.base_local_dir`
- **Type**: String (path)
- **Required**: No
- **Default**: `"base_model"`
- **Description**: Directory name for downloaded model when using HF repo
- **Used in**: Line ~495 in `run_cpt.py`
- **Implementation**:
```python
base_dir = _ensure_dir(run_dir / model_cfg.get("base_local_dir", "base_model"))
```
- **Example Values**: `"base_model"`, `"downloaded_model"`, `"model_files"`
### `model.trust_remote_code`
- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Allow loading models with custom code
- **Used in**: Lines ~320, ~340, ~450 in `run_cpt.py`
- **Implementation**:
```python
tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
model = AutoModelForCausalLM.from_pretrained(..., trust_remote_code=trust_remote_code, ...)
```
- **Example Values**: `true`, `false`
### `model.tokenizer_use_fast`
- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Use fast tokenizer implementation
- **Used in**: Lines ~320, ~450 in `run_cpt.py`
- **Implementation**:
```python
tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
```
- **Example Values**: `true`, `false`
### `model.device_map`
- **Type**: String
- **Required**: No
- **Default**: `"auto"`
- **Description**: How to distribute model across devices
- **Used in**: Lines ~350, ~370 in `run_cpt.py`
- **Implementation**:
```python
model = AutoModelForCausalLM.from_pretrained(..., device_map=device_map, ...)
```
- **Example Values**:
- `"auto"` - Automatic distribution
- `"cpu"` - CPU only
- `"cuda:0"` - Single GPU
- `{"": 0}` - Manual mapping
### `model.torch_dtype`
- **Type**: String
- **Required**: No
- **Default**: `"bfloat16"`
- **Description**: Data type for model tensors
- **Used in**: Lines ~45, ~350 in `run_cpt.py`
- **Implementation**:
```python
def _dtype_from_str(s: str) -> torch.dtype:
    if s in ("float16", "fp16"): return torch.float16
    if s in ("bfloat16", "bf16"): return torch.bfloat16
    if s in ("float32", "fp32"): return torch.float32
```
- **Example Values**:
- `"float16"` - 16-bit floats (faster, less memory, less stable)
- `"bfloat16"` - Brain float16 (stable, good for training)
- `"float32"` - 32-bit floats (slowest, most memory)
### `model.use_4bit`
- **Type**: Boolean
- **Required**: No
- **Default**: `false`
- **Description**: Use 4-bit quantization for memory efficiency
- **Used in**: Lines ~325, ~395 in `run_cpt.py`
- **Implementation**:
```python
use_4bit = bool(model_cfg.get("use_4bit", False))
if use_4bit:
    quant_cfg = BitsAndBytesConfig(load_in_4bit=True, ...)
```
- **Example Values**: `true`, `false`
### `model.bnb_4bit_quant_type`
- **Type**: String
- **Required**: No
- **Default**: `"nf4"`
- **Description**: 4-bit quantization type
- **Used in**: Lines ~328 in `run_cpt.py`
- **Implementation**:
```python
bnb_4bit_quant_type=str(model_cfg.get("bnb_4bit_quant_type", "nf4"))
```
- **Example Values**:
- `"nf4"` - NormalFloat4 (recommended)
- `"fp4"` - FloatingPoint4
### `model.bnb_4bit_use_double_quant`
- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Use double quantization for memory efficiency
- **Used in**: Lines ~329 in `run_cpt.py`
- **Implementation**:
```python
bnb_4bit_use_double_quant=bool(model_cfg.get("bnb_4bit_use_double_quant", True))
```
- **Example Values**: `true`, `false`
### `model.bnb_4bit_compute_dtype`
- **Type**: String
- **Required**: No
- **Default**: `"bfloat16"`
- **Description**: Compute dtype for 4-bit quantization
- **Used in**: Lines ~330 in `run_cpt.py`
- **Implementation**:
```python
bnb_4bit_compute_dtype=_dtype_from_str(model_cfg.get("bnb_4bit_compute_dtype", "bfloat16"))
```
- **Example Values**: `"float16"`, `"bfloat16"`, `"float32"`
### `model.attn_implementation`
- **Type**: String or null
- **Required**: No
- **Default**: `null`
- **Description**: Attention implementation to use
- **Used in**: Lines ~155, ~350 in `run_cpt.py`
- **Implementation**:
```python
def _choose_attn_impl(cfg: Dict[str, Any]) -> Optional[str]:
    return cfg.get("model", {}).get("attn_implementation", None)
# Used in model.from_pretrained(..., attn_implementation=attn_impl, ...)
```
- **Example Values**:
- `"flash_attention_2"` - Flash Attention 2 (fastest)
- `"sdpa"` - Scaled Dot-Product Attention
- `null` - Default implementation
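For reference, a typical way to pick a sensible value when none is configured is to probe for the `flash-attn` package. This is illustrative only, not the logic in `run_cpt.py`, which simply passes the configured value through:

```python
def pick_attn_implementation(requested=None):
    # Honor an explicit config value; otherwise prefer flash_attention_2
    # when the flash-attn package is importable, else PyTorch's built-in SDPA.
    if requested is not None:
        return requested
    try:
        import flash_attn  # noqa: F401
        return "flash_attention_2"
    except ImportError:
        return "sdpa"
```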
---
## Data Parameters
### `data.train_jsonl`
- **Type**: String (path)
- **Required**: Yes
- **Default**: No default
- **Description**: Path to training data in JSONL format
- **Used in**: Lines ~170 in `run_cpt.py`
- **Implementation**:
```python
train_path = data_cfg["train_jsonl"]
ds = load_dataset("json", data_files={"train": train_path})
```
- **Example Values**: `"/workspace/all_data_with_descriptions.jsonl"`
### `data.eval_jsonl`
- **Type**: String (path) or null
- **Required**: No
- **Default**: `null`
- **Description**: Path to evaluation data in JSONL format
- **Used in**: Lines ~175 in `run_cpt.py`
- **Implementation**:
```python
eval_path = data_cfg.get("eval_jsonl", None)
if eval_path:
    ds_eval = load_dataset("json", data_files={"eval": eval_path})
```
- **Example Values**: `null` (no separate eval file), `"/workspace/eval_data.jsonl"`
### `data.eval_split_ratio`
- **Type**: Float
- **Required**: No
- **Default**: `0.0` (no split)
- **Description**: Ratio of training data to use for evaluation split
- **Used in**: Lines ~177 in `run_cpt.py`
- **Implementation**:
```python
split_ratio = float(data_cfg.get("eval_split_ratio", 0.0))
if 0.0 < split_ratio < 1.0:
    split = ds["train"].train_test_split(test_size=split_ratio, seed=seed)
```
- **Example Values**: `0.1` (10%), `0.2` (20%), `0.05` (5%)
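Conceptually, `train_test_split` performs a seeded shuffle and holds out `test_size` of the examples. A rough stdlib-only stand-in (names are illustrative, not from the `datasets` library):

```python
import random

def split_train_eval(examples, ratio, seed):
    # Seeded shuffle of indices, then hold out `ratio` of the examples
    # for evaluation; the same seed always yields the same split.
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    eval_idx = set(idx[:int(len(examples) * ratio)])
    train = [ex for i, ex in enumerate(examples) if i not in eval_idx]
    evals = [ex for i, ex in enumerate(examples) if i in eval_idx]
    return train, evals
```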
### `data.text_field`
- **Type**: String
- **Required**: No
- **Default**: `"text"`
- **Description**: Field name in JSONL containing the text data
- **Used in**: Lines ~185 in `run_cpt.py`
- **Implementation**:
```python
text_field = data_cfg.get("text_field", "text")
# Used in tokenization
tokenized = dsd["train"].map(
    tokenize_fn,
    batched=True,
    remove_columns=dsd["train"].column_names,
    desc="Tokenizing train",
)
```
- **Example Values**: `"text"`, `"content"`, `"prompt"`, `"input"`
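The `tokenize_fn` passed to `map` typically closes over the configured field name. A hypothetical sketch of that closure (`make_tokenize_fn` is an assumed name, not shown in `run_cpt.py`):

```python
def make_tokenize_fn(tokenizer, text_field="text"):
    # With batched=True, `map` passes a dict of column -> list of values;
    # pull out the configured text column and hand it to the tokenizer.
    def tokenize_fn(batch):
        return tokenizer(batch[text_field])
    return tokenize_fn
```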
### `data.block_size`
- **Type**: Integer
- **Required**: No
- **Default**: `2048`
- **Description**: Maximum sequence length for training
- **Used in**: Lines ~180 in `run_cpt.py`
- **Implementation**:
```python
block_size = int(data_cfg.get("block_size", 2048))
# Used in grouping texts into blocks
for i in range(0, full_len, block_size):
    chunk = concatenated["input_ids"][i:i + block_size]
```
- **Example Values**: `2048`, `4096`, `8192`
### `data.shuffle`
- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Whether to shuffle training data
- **Used in**: Lines ~235 in `run_cpt.py`
- **Implementation**:
```python
if shuffle:
    tokenized_train = tokenized_train.shuffle(seed=int(cfg["run"].get("seed", 42)))
```
- **Example Values**: `true`, `false`
### `data.num_proc`
- **Type**: Integer
- **Required**: No
- **Default**: `4`
- **Description**: Number of processes for data loading
- **Used in**: Lines ~200, ~210 in `run_cpt.py`
- **Implementation**:
```python
num_proc = int(data_cfg.get("num_proc", 4))
tokenized_train = dsd["train"].map(
    tokenize_fn,
    batched=True,
    num_proc=num_proc,
    ...
)
```
- **Example Values**: `1`, `4`, `8`, `16`
### `data.pack_mode`
- **Type**: String
- **Required**: No
- **Default**: `"drop"`
- **Description**: How to handle remainder tokens in final block
- **Used in**: Lines ~150-230 in `run_cpt.py`
- **Implementation**:
```python
pack_mode = str(data_cfg.get("pack_mode", "drop")).lower().strip()
if pack_mode == "pad":
    # Pad remainder and mask loss
    labels[-pad_len:] = [-100] * pad_len
# If "drop": ignore remainder entirely
```
- **Example Values**:
- `"drop"` - Drop incomplete blocks (strict CPT)
- `"pad"` - Pad incomplete blocks with masked loss
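Putting `block_size` and `pack_mode` together, the packing step can be sketched as follows. This is a simplified stand-in for the grouping code in `run_cpt.py`; the function name and `pad_id` default are assumptions, and `-100` is the label value HF causal-LM losses ignore:

```python
IGNORE_INDEX = -100  # label value excluded from the loss

def pack_blocks(token_ids, block_size, pack_mode="drop", pad_id=0):
    # Slice a concatenated token stream into fixed-size blocks; the final
    # partial block is either discarded ("drop") or padded with masked loss.
    blocks = []
    for i in range(0, len(token_ids), block_size):
        chunk = list(token_ids[i:i + block_size])
        if len(chunk) < block_size:
            if pack_mode == "drop":
                break  # strict CPT: discard the remainder
            pad_len = block_size - len(chunk)
            labels = chunk + [IGNORE_INDEX] * pad_len  # mask loss on padding
            chunk = chunk + [pad_id] * pad_len
            blocks.append({"input_ids": chunk, "labels": labels})
            break
        blocks.append({"input_ids": chunk, "labels": list(chunk)})
    return blocks
```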
---
## PEFT Parameters
### `peft.enabled`
- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Whether to use PEFT (Parameter-Efficient Fine-Tuning)
- **Used in**: Lines ~395 in `run_cpt.py`
- **Implementation**:
```python
if not bool(peft_cfg.get("enabled", True)):
    return model, None
# Otherwise proceed with LoRA configuration
```
- **Example Values**: `true`, `false`
### `peft.r`
- **Type**: Integer
- **Required**: No
- **Default**: `16`
- **Description**: LoRA rank - dimension of low-rank matrices
- **Used in**: Lines ~415 in `run_cpt.py`
- **Implementation**:
```python
lora_config = LoraConfig(
    r=int(peft_cfg.get("r", 16)),
    ...
)
```
- **Example Values**: `8`, `16`, `32`, `64`, `128`
- **Note**: Higher values = more parameters but potentially better performance
### `peft.lora_alpha`
- **Type**: Integer
- **Required**: No
- **Default**: `32`
- **Description**: LoRA alpha scaling parameter
- **Used in**: Lines ~416 in `run_cpt.py`
- **Implementation**:
```python
lora_config = LoraConfig(
    lora_alpha=int(peft_cfg.get("lora_alpha", 32)),
    ...
)
```
- **Example Values**: `16`, `32`, `64`, `128`, `256`
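`lora_alpha` does not act alone: PEFT scales the low-rank update by `alpha / r` before adding it to the frozen weight, so the two should be tuned together. A one-line illustration:

```python
def lora_scaling(lora_alpha, r):
    # PEFT applies the LoRA update as W + (alpha / r) * (B @ A),
    # so doubling r while holding alpha fixed halves the update's scale.
    return lora_alpha / r
```

With the snippet's fallbacks (`lora_alpha=32`, `r=16`) the scale is 2.0; `alpha=128, r=64` gives the same scale with more trainable parameters.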
### `peft.lora_dropout`
- **Type**: Float
- **Required**: No
- **Default**: `0.05`
- **Description**: Dropout rate for LoRA layers
- **Used in**: Lines ~417 in `run_cpt.py`
- **Implementation**:
```python
lora_config = LoraConfig(
    lora_dropout=float(peft_cfg.get("lora_dropout", 0.05)),
    ...
)
```
- **Example Values**: `0.0`, `0.05`, `0.1`, `0.2`
### `peft.bias`
- **Type**: String
- **Required**: No
- **Default**: `"none"`
- **Description**: Bias training strategy
- **Used in**: Lines ~418 in `run_cpt.py`
- **Implementation**:
```python
lora_config = LoraConfig(
    bias=str(peft_cfg.get("bias", "none")),
    ...
)
```
- **Example Values**:
- `"none"` - No bias training
- `"all"` - Train all biases
- `"lora_only"` - Only LoRA bias
### `peft.target_modules`
- **Type**: String or List
- **Required**: No
- **Default**: `"auto"`
- **Description**: Which modules to apply LoRA to
- **Used in**: Lines ~405, ~140-170 in `run_cpt.py`
- **Implementation**:
```python
target_modules = peft_cfg.get("target_modules", "auto")
if target_modules == "auto":
    target_modules = _infer_target_modules(model)
```
- **Example Values**:
- `"auto"` - Automatic detection
- `["q_proj", "k_proj", "v_proj", "o_proj"]` - Explicit list
- `["mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]` - MLP only
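A hypothetical sketch of what an `"auto"` mode typically does: scan the model's module names and keep the leaf names that match the usual LoRA targets. The real `_infer_target_modules` may differ; the candidate set below is an assumption based on LLaMA-style architectures:

```python
LORA_CANDIDATES = {
    "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
    "gate_proj", "up_proj", "down_proj",      # LLaMA-style MLP
}

def infer_target_modules(module_names):
    # Keep the leaf segment of each dotted module name, then intersect
    # with the candidate set of commonly targeted linear layers.
    found = {name.rsplit(".", 1)[-1] for name in module_names}
    return sorted(found & LORA_CANDIDATES)
```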
---
## Training Parameters
### `train.num_train_epochs`
- **Type**: Float
- **Required**: No
- **Default**: `1`
- **Description**: Number of epochs to train
- **Used in**: Lines ~470 in `run_cpt.py`
- **Implementation**:
```python
num_train_epochs = float(tr_cfg.get("num_train_epochs", 1))
# Used in TrainingArguments
```
- **Example Values**: `1.0`, `2.0`, `3.5`
### `train.per_device_train_batch_size`
- **Type**: Integer
- **Required**: No
- **Default**: `1`
- **Description**: Training batch size per device
- **Used in**: Lines ~475 in `run_cpt.py`
- **Implementation**:
```python
per_device_train_batch_size=int(tr_cfg.get("per_device_train_batch_size", 1))
```
- **Example Values**: `1`, `2`, `4`, `8`
### `train.per_device_eval_batch_size`
- **Type**: Integer
- **Required**: No
- **Default**: Same as train batch size
- **Description**: Evaluation batch size per device
- **Used in**: Lines ~476 in `run_cpt.py`
- **Implementation**:
```python
per_device_eval_batch_size=int(tr_cfg.get("per_device_eval_batch_size", tr_cfg.get("per_device_train_batch_size", 1)))
```
- **Example Values**: `1`, `2`, `4`, `8`
### `train.gradient_accumulation_steps`
- **Type**: Integer
- **Required**: No
- **Default**: `1`
- **Description**: Number of steps to accumulate gradients
- **Used in**: Lines ~477 in `run_cpt.py`
- **Implementation**:
```python
gradient_accumulation_steps=int(tr_cfg.get("gradient_accumulation_steps", 1))
```
- **Example Values**: `1`, `4`, `8`, `16`, `32`
### `train.learning_rate`
- **Type**: Float
- **Required**: No
- **Default**: `2e-5`
- **Description**: Learning rate for optimizer
- **Used in**: Lines ~478 in `run_cpt.py`
- **Implementation**:
```python
learning_rate=float(tr_cfg.get("learning_rate", 2e-5))
```
- **Example Values**: `1e-5`, `2e-5`, `5e-5`, `1e-4`
### `train.weight_decay`
- **Type**: Float
- **Required**: No
- **Default**: `0.0`
- **Description**: Weight decay for regularization
- **Used in**: Lines ~479 in `run_cpt.py`
- **Implementation**:
```python
weight_decay=float(tr_cfg.get("weight_decay", 0.0))
```
- **Example Values**: `0.0`, `0.01`, `0.1`
### `train.warmup_ratio`
- **Type**: Float
- **Required**: No
- **Default**: `0.0`
- **Description**: Ratio of steps for learning rate warmup
- **Used in**: Lines ~480 in `run_cpt.py`
- **Implementation**:
```python
warmup_ratio=float(tr_cfg.get("warmup_ratio", 0.0))
```
- **Example Values**: `0.0`, `0.1`, `0.2`
### `train.lr_scheduler_type`
- **Type**: String
- **Required**: No
- **Default**: `"cosine"`
- **Description**: Learning rate scheduler type
- **Used in**: Lines ~481 in `run_cpt.py`
- **Implementation**:
```python
lr_scheduler_type=str(tr_cfg.get("lr_scheduler_type", "cosine"))
```
- **Example Values**:
- `"cosine"` - Cosine annealing
- `"linear"` - Linear decay
- `"constant"` - Constant rate
- `"polynomial"` - Polynomial decay
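The `"cosine"` schedule combines linear warmup with a cosine decay toward zero. A sketch of the shape, not HF's exact scheduler code:

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr):
    # Linear ramp up to base_lr over warmup_steps, then cosine decay
    # from base_lr down to zero over the remaining steps.
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```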
### `train.optim`
- **Type**: String
- **Required**: No
- **Default**: `"paged_adamw_8bit"` (if 4-bit), `"adamw_torch"` (otherwise)
- **Description**: Optimizer type
- **Used in**: Lines ~482 in `run_cpt.py`
- **Implementation**:
```python
optim=str(tr_cfg.get("optim", "paged_adamw_8bit" if bool(model_cfg.get("use_4bit", False)) else "adamw_torch"))
```
- **Example Values**:
- `"adamw_torch"` - AdamW (standard)
- `"paged_adamw_8bit"` - Paged AdamW with 8-bit optimizer states (pairs well with 4-bit training)
- `"sgd"` - SGD
- `"adafactor"` - Adafactor
### `train.max_grad_norm`
- **Type**: Float
- **Required**: No
- **Default**: `1.0`
- **Description**: Maximum gradient norm for clipping
- **Used in**: Lines ~483 in `run_cpt.py`
- **Implementation**:
```python
max_grad_norm=float(tr_cfg.get("max_grad_norm", 1.0))
```
- **Example Values**: `0.5`, `1.0`, `2.0`
### `train.gradient_checkpointing`
- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Use gradient checkpointing to save memory
- **Used in**: Lines ~396-400 in `run_cpt.py`
- **Implementation**:
```python
gradient_checkpointing = bool(tr_cfg.get("gradient_checkpointing", True))
if gradient_checkpointing:
    model.gradient_checkpointing_enable()
```
- **Example Values**: `true`, `false`
### `train.logging_steps`
- **Type**: Integer
- **Required**: No
- **Default**: `10`
- **Description**: Log training progress every N steps
- **Used in**: Lines ~485 in `run_cpt.py`
- **Implementation**:
```python
logging_steps=int(tr_cfg.get("logging_steps", 10))
```
- **Example Values**: `1`, `10`, `50`, `100`
### `train.save_strategy`
- **Type**: String
- **Required**: No
- **Default**: `"steps"`
- **Description**: When to save model checkpoints
- **Used in**: Lines ~487 in `run_cpt.py`
- **Implementation**:
```python
save_strategy=str(tr_cfg.get("save_strategy", "steps"))
```
- **Example Values**:
- `"steps"` - Save every N steps
- `"epoch"` - Save every epoch
- `"no"` - Don't save
### `train.save_steps`
- **Type**: Integer
- **Required**: No
- **Default**: `200`
- **Description**: Save checkpoint every N steps
- **Used in**: Lines ~488 in `run_cpt.py`
- **Implementation**:
```python
save_steps=int(tr_cfg.get("save_steps", 200))
```
- **Example Values**: `50`, `100`, `200`, `500`
### `train.save_total_limit`
- **Type**: Integer
- **Required**: No
- **Default**: `3`
- **Description**: Maximum number of checkpoints to keep
- **Used in**: Lines ~489 in `run_cpt.py`
- **Implementation**:
```python
save_total_limit=int(tr_cfg.get("save_total_limit", 3))
```
- **Example Values**: `1`, `2`, `3`, `5`
### `train.evaluation_strategy`
- **Type**: String
- **Required**: No
- **Default**: `"steps"` (if eval data), `"no"` (otherwise)
- **Description**: When to evaluate model
- **Used in**: Lines ~494 in `run_cpt.py`
- **Implementation**:
```python
evaluation_strategy=str(tr_cfg.get("evaluation_strategy", "steps" if eval_ds is not None else "no"))
```
- **Example Values**:
- `"steps"` - Evaluate every N steps
- `"epoch"` - Evaluate every epoch
- `"no"` - Don't evaluate
### `train.eval_steps`
- **Type**: Integer
- **Required**: No
- **Default**: `200`
- **Description**: Evaluate every N steps
- **Used in**: Lines ~491 in `run_cpt.py`
- **Implementation**:
```python
eval_steps=int(tr_cfg.get("eval_steps", 200))
```
- **Example Values**: `25`, `50`, `100`, `200`
### `train.load_best_model_at_end`
- **Type**: Boolean
- **Required**: No
- **Default**: `true` (if eval data), `false` (otherwise)
- **Description**: Load best model at end of training
- **Used in**: Lines ~492-493 in `run_cpt.py`
- **Implementation**:
```python
load_best_model_at_end=bool(tr_cfg.get("load_best_model_at_end", True)) if eval_ds is not None else False
```
- **Example Values**: `true`, `false`
### `train.resume_from_checkpoint`
- **Type**: String
- **Required**: No
- **Default**: `null` (auto-detection only happens when explicitly set to `"auto"`)
- **Description**: Resume training from checkpoint
- **Used in**: Lines ~510-520 in `run_cpt.py`
- **Implementation**:
```python
resume_from = tr_cfg.get("resume_from_checkpoint", None)
if resume_from == "auto":
    last = get_last_checkpoint(str(ckpt_dir))
    resume_from = last if last else None
```
- **Example Values**:
- `"auto"` - Auto-detect latest checkpoint
- `"checkpoint-100"` - Specific checkpoint
- `null` - Start from scratch
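For intuition, `get_last_checkpoint` essentially scans the run directory for `checkpoint-N` subdirectories and returns the one with the largest step number. A stdlib-only sketch of that behavior (not the transformers implementation itself):

```python
import re
from pathlib import Path

def find_last_checkpoint(ckpt_dir):
    # Return the checkpoint-N subdirectory with the highest N, or None.
    best = None
    for p in Path(ckpt_dir).iterdir():
        m = re.fullmatch(r"checkpoint-(\d+)", p.name)
        if m and p.is_dir():
            step = int(m.group(1))
            if best is None or step > best[0]:
                best = (step, p)
    return best[1] if best else None
```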
---
## Merge Parameters
### `merge.enabled`
- **Type**: Boolean
- **Required**: No
- **Default**: `false`
- **Description**: Whether to merge LoRA adapters with base model
- **Used in**: Lines ~545 in `run_cpt.py`
- **Implementation**:
```python
if bool(cfg.get("merge", {}).get("enabled", False)):
    merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
```
- **Example Values**: `true`, `false`
### `merge.merged_dtype`
- **Type**: String
- **Required**: No
- **Default**: `"float16"`
- **Description**: Data type for merged model
- **Used in**: Lines ~430 in `run_cpt.py`
- **Implementation**:
```python
merged_dtype = _dtype_from_str(merge_cfg.get("merged_dtype", "float16"))
```
- **Example Values**: `"float16"`, `"bfloat16"`, `"float32"`
### `merge.max_shard_size`
- **Type**: String
- **Required**: No
- **Default**: `"2GB"`
- **Description**: Maximum size per shard when saving
- **Used in**: Lines ~445 in `run_cpt.py`
- **Implementation**:
```python
merged.save_pretrained(str(final_dir), safe_serialization=True, max_shard_size=max_shard_size)
```
- **Example Values**: `"1GB"`, `"2GB"`, `"5GB"`
### `merge.output_dir`
- **Type**: String (path)
- **Required**: No
- **Default**: `final_model` (created under `run_dir`)
- **Description**: Directory for merged model output
- **Used in**: Lines ~505-510 in `run_cpt.py`
- **Implementation**:
```python
if merge_cfg.get("output_dir"):
    od = Path(str(merge_cfg["output_dir"]))
    final_dir = od if od.is_absolute() else (run_dir / od)
else:
    final_dir = run_dir / "final_model"
```
- **Example Values**: `"./merged_model"`, `"/workspace/final_model"`, `"./models/merged"`
---
## Parameter Dependencies and Interactions
### Memory-Related Dependencies
- `per_device_train_batch_size` + `gradient_accumulation_steps` = effective batch size
- `block_size` affects memory usage significantly
- `use_4bit` + `bnb_4bit_*` parameters work together for quantization
- `gradient_checkpointing` can enable larger `block_size` or `batch_size`
### Training Strategy Dependencies
- `evaluation_strategy` requires either `eval_jsonl` or `eval_split_ratio > 0`
- `load_best_model_at_end` requires `evaluation_strategy` to be enabled
- `save_strategy` should be compatible with `evaluation_strategy`
- `lr_scheduler_type` affects warmup calculations
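As a worked example of these interactions (all input values below are hypothetical), the effective batch size, total optimizer steps, and warmup steps can be computed as:

```python
import math

def training_schedule(num_blocks, per_device_bs, grad_accum, num_devices,
                      epochs, warmup_ratio):
    # Back-of-the-envelope schedule math: one optimizer step consumes
    # per_device_bs * grad_accum * num_devices packed blocks.
    effective_bs = per_device_bs * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_blocks / effective_bs)
    total_steps = math.ceil(steps_per_epoch * epochs)
    warmup_steps = int(total_steps * warmup_ratio)
    return effective_bs, total_steps, warmup_steps
```

For example, 10,000 packed blocks with `per_device_train_batch_size: 1`, `gradient_accumulation_steps: 16`, one GPU, 2 epochs, and `warmup_ratio: 0.1` gives an effective batch size of 16, 1,250 optimizer steps, and 125 warmup steps.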
### Model-Specific Dependencies
- `target_modules` must match the actual module names in your model
- `torch_dtype` should be compatible with your GPU hardware
- `device_map` affects whether you can use certain optimizations
### Data Processing Dependencies
- `text_field` must exist in your JSONL data
- `pack_mode: "pad"` requires `block_size` to be set appropriately
- `eval_split_ratio` is ignored if `eval_jsonl` is provided
This guide should help you understand and configure every parameter in the CPT training system to fit your data, hardware, and training goals.