	# CPT Configuration Parameters: Detailed Guide

	This document provides a comprehensive explanation of all configuration parameters in `config.yaml` and how they're implemented in `run_cpt.py`.

	## Table of Contents
	- [Run Parameters](#run-parameters)
	- [Model Parameters](#model-parameters)
	- [Data Parameters](#data-parameters)
	- [PEFT Parameters](#peft-parameters)
	- [Training Parameters](#training-parameters)
	- [Merge Parameters](#merge-parameters)
	- [Parameter Dependencies and Interactions](#parameter-dependencies-and-interactions)

	---

	## Run Parameters

	### `run.run_dir`
	- Type: String (path)
	- Required: Yes
	- Default: No default
	- Description: Directory where training outputs will be saved
	- Used in: Line ~480 in `run_cpt.py`
	- Implementation:
	```python
	run_dir = _ensure_dir(Path(cfg["run"]["run_dir"]))
	```
	- Example Values:
	- `./runs/cpt_run_v1`
	- `/workspace/outputs/my_experiment`
	- `./checkpoints/cpt_experiment`

	### `run.seed`
	- Type: Integer
	- Required: No
	- Default: `42` (the fallback used in `run_cpt.py` when the key is omitted)
	- Description: Random seed for reproducibility
	- Used in: Lines ~460, ~240 in `run_cpt.py`
	- Implementation:
	```python
	set_seed(int(cfg["run"].get("seed", 42)))
	# Used in data shuffling and train/test split
	```
	- Example Values: `42`, `123`, `2023`

	---

	## Model Parameters

	### `model.repo_id`
	- Type: String (path or HuggingFace repo)
	- Required: Yes
	- Default: No default
	- Description: Model identifier - can be local path or HuggingFace repository
	- Used in: Lines ~480-500 in `run_cpt.py`
	- Implementation:
	```python
	repo_id = str(model_cfg["repo_id"]).strip()
	repo_path = Path(repo_id)
	if repo_path.exists() and repo_path.is_dir():
	    base_dir = repo_path  # Local path
	else:
	    # Download from HuggingFace
	    snapshot_download(repo_id=repo_id, ...)
	```
	- Example Values:
	- Local: `/workspace/Models/Devstral-Small-2-24B-Instruct-2512`
	- HF Repo: `meta-llama/Llama-2-7b-hf`

	### `model.revision`
	- Type: String or null
	- Required: No
	- Default: null
	- Description: Specific model revision/branch/tag from HuggingFace
	- Used in: Line ~495 in `run_cpt.py`
	- Implementation:
	```python
	snapshot_download(
	    repo_id=repo_id,
	    revision=model_cfg.get("revision", None),
	    ...
	)
	```
	- Example Values:
	- `"main"` - Main branch
	- `"v1.0"` - Specific tag
	- `"abc123def"` - Specific commit hash
	- `null` - Default branch (`main`)

	### `model.base_local_dir`
	- Type: String (path)
	- Required: No
	- Default: `"base_model"`
	- Description: Directory name for downloaded model when using HF repo
	- Used in: Line ~495 in `run_cpt.py`
	- Implementation:
	```python
	base_dir = _ensure_dir(run_dir / model_cfg.get("base_local_dir", "base_model"))
	```
	- Example Values: `"base_model"`, `"downloaded_model"`, `"model_files"`

	### `model.trust_remote_code`
	- Type: Boolean
	- Required: No
	- Default: `true`
	- Description: Allow loading models with custom code
	- Used in: Lines ~320, ~340, ~450 in `run_cpt.py`
	- Implementation:
	```python
	tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
	model = AutoModelForCausalLM.from_pretrained(..., trust_remote_code=trust_remote_code, ...)
	```
	- Example Values: `true`, `false`

	### `model.tokenizer_use_fast`
	- Type: Boolean
	- Required: No
	- Default: `true`
	- Description: Use fast tokenizer implementation
	- Used in: Lines ~320, ~450 in `run_cpt.py`
	- Implementation:
	```python
	tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
	```
	- Example Values: `true`, `false`

	### `model.device_map`
	- Type: String or mapping
	- Required: No
	- Default: `"auto"`
	- Description: How to distribute model across devices
	- Used in: Lines ~350, ~370 in `run_cpt.py`
	- Implementation:
	```python
	model = AutoModelForCausalLM.from_pretrained(..., device_map=device_map, ...)
	```
	- Example Values:
	- `"auto"` - Automatic distribution
	- `"cpu"` - CPU only
	- `"cuda:0"` - Single GPU
	- `{"": 0}` - Manual mapping

	### `model.torch_dtype`
	- Type: String
	- Required: No
	- Default: `"bfloat16"`
	- Description: Data type for model tensors
	- Used in: Lines ~45, ~350 in `run_cpt.py`
	- Implementation:
	```python
	def _dtype_from_str(s: str) -> torch.dtype:
	    if s in ("float16", "fp16"):
	        return torch.float16
	    if s in ("bfloat16", "bf16"):
	        return torch.bfloat16
	    if s in ("float32", "fp32"):
	        return torch.float32
	```
	- Example Values:
	- `"float16"` - 16-bit floats (faster, less memory, less stable)
	- `"bfloat16"` - Brain float16 (stable, good for training)
	- `"float32"` - 32-bit floats (slowest, most memory)

	### `model.use_4bit`
	- Type: Boolean
	- Required: No
	- Default: `false`
	- Description: Use 4-bit quantization for memory efficiency
	- Used in: Lines ~325, ~395 in `run_cpt.py`
	- Implementation:
	```python
	use_4bit = bool(model_cfg.get("use_4bit", False))
	if use_4bit:
	    quant_cfg = BitsAndBytesConfig(load_in_4bit=True, ...)
	```
	- Example Values: `true`, `false`

	### `model.bnb_4bit_quant_type`
	- Type: String
	- Required: No
	- Default: `"nf4"`
	- Description: 4-bit quantization type
	- Used in: Lines ~328 in `run_cpt.py`
	- Implementation:
	```python
	bnb_4bit_quant_type=str(model_cfg.get("bnb_4bit_quant_type", "nf4"))
	```
	- Example Values:
	- `"nf4"` - 4-bit NormalFloat (recommended)
	- `"fp4"` - 4-bit floating point
	- Note: bitsandbytes accepts only these two values; `"int4"` is not a valid quant type

	### `model.bnb_4bit_use_double_quant`
	- Type: Boolean
	- Required: No
	- Default: `false` (likely the sample `config.yaml` value); the code falls back to `true` if the key is omitted
	- Description: Use double quantization for memory efficiency
	- Used in: Lines ~329 in `run_cpt.py`
	- Implementation:
	```python
	bnb_4bit_use_double_quant=bool(model_cfg.get("bnb_4bit_use_double_quant", True))
	```
	- Example Values: `true`, `false`

	### `model.bnb_4bit_compute_dtype`
	- Type: String
	- Required: No
	- Default: `"bfloat16"`
	- Description: Compute dtype for 4-bit quantization
	- Used in: Lines ~330 in `run_cpt.py`
	- Implementation:
	```python
	bnb_4bit_compute_dtype=_dtype_from_str(model_cfg.get("bnb_4bit_compute_dtype", "bfloat16"))
	```
	- Example Values: `"float16"`, `"bfloat16"`, `"float32"`
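
The four quantization settings above are read together. The following is a minimal sketch of how they could be collected and validated before being handed to `transformers.BitsAndBytesConfig`. The helper name `build_4bit_kwargs` is hypothetical and not part of `run_cpt.py`; the fallbacks mirror the excerpts above, and the compute dtype is left as a string here rather than converted with `_dtype_from_str`.

```python
from typing import Any, Dict, Optional

# bitsandbytes supports exactly these two 4-bit quantization types.
VALID_QUANT_TYPES = ("nf4", "fp4")


def build_4bit_kwargs(model_cfg: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Collect the 4-bit settings, or return None when 4-bit loading is off.

    Hypothetical helper: the keys of the returned dict match the keyword
    names of transformers.BitsAndBytesConfig.
    """
    if not bool(model_cfg.get("use_4bit", False)):
        return None  # the bnb_4bit_* keys are ignored in this case

    quant_type = str(model_cfg.get("bnb_4bit_quant_type", "nf4"))
    if quant_type not in VALID_QUANT_TYPES:
        raise ValueError(
            f"bnb_4bit_quant_type must be one of {VALID_QUANT_TYPES}, got {quant_type!r}"
        )

    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": quant_type,
        "bnb_4bit_use_double_quant": bool(
            model_cfg.get("bnb_4bit_use_double_quant", True)
        ),
        # Kept as a string in this sketch; the caller converts it to a dtype.
        "bnb_4bit_compute_dtype": str(
            model_cfg.get("bnb_4bit_compute_dtype", "bfloat16")
        ),
    }
```

Returning `None` when `use_4bit` is off makes it explicit that the `bnb_4bit_*` keys have no effect on their own.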

	### `model.attn_implementation`
	- Type: String or null
	- Required: No
	- Default: `null`
	- Description: Attention implementation to use
	- Used in: Lines ~155, ~350 in `run_cpt.py`
	- Implementation:
	```python
	def _choose_attn_impl(cfg: Dict[str, Any]) -> Optional[str]:
	    return cfg.get("model", {}).get("attn_implementation", None)
	# Used in model.from_pretrained(..., attn_implementation=attn_impl, ...)
	```
	- Example Values:
	- `"flash_attention_2"` - FlashAttention-2 (usually the fastest; requires the `flash-attn` package and a supported GPU)
	- `"sdpa"` - Scaled Dot-Product Attention
	- `null` - Default implementation

	---

	## Data Parameters

	### `data.train_jsonl`
	- Type: String (path)
	- Required: Yes
	- Default: No default
	- Description: Path to training data in JSONL format
	- Used in: Lines ~170 in `run_cpt.py`
	- Implementation:
	```python
	train_path = data_cfg["train_jsonl"]
	ds = load_dataset("json", data_files={"train": train_path})
	```
	- Example Values: `"/workspace/all_data_with_descriptions.jsonl"`

	### `data.eval_jsonl`
	- Type: String (path) or null
	- Required: No
	- Default: `null`
	- Description: Path to evaluation data in JSONL format
	- Used in: Lines ~175 in `run_cpt.py`
	- Implementation:
	```python
	eval_path = data_cfg.get("eval_jsonl", None)
	if eval_path:
	    ds_eval = load_dataset("json", data_files={"eval": eval_path})
	```
	- Example Values: `null` (no separate eval file), `"/workspace/eval_data.jsonl"`

	### `data.eval_split_ratio`
	- Type: Float
	- Required: No
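	- Default: `0.1` (likely the sample `config.yaml` value); the code falls back to `0.0` (no split) if the key is omitted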
	- Description: Ratio of training data to use for evaluation split
	- Used in: Lines ~177 in `run_cpt.py`
	- Implementation:
	```python
	split_ratio = float(data_cfg.get("eval_split_ratio", 0.0))
	if 0.0 < split_ratio < 1.0:
	    split = ds["train"].train_test_split(test_size=split_ratio, seed=seed)
	```
	- Example Values: `0.1` (10%), `0.2` (20%), `0.05` (5%)
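
The interplay between `data.eval_jsonl` and `data.eval_split_ratio` can be sketched in plain Python. This is an illustration rather than the code in `run_cpt.py`: the function name `choose_eval_source` and its return labels are hypothetical. It assumes, as the dependencies section of this guide states, that a separate eval file takes precedence over the split ratio.

```python
from typing import Any, Dict, Tuple


def choose_eval_source(data_cfg: Dict[str, Any]) -> Tuple[str, Any]:
    """Decide where evaluation data comes from (hypothetical helper)."""
    eval_path = data_cfg.get("eval_jsonl")
    if eval_path:
        # A dedicated eval file wins; the split ratio is ignored.
        return ("file", eval_path)

    ratio = float(data_cfg.get("eval_split_ratio", 0.0))
    if 0.0 < ratio < 1.0:
        # Hold out this fraction of the training set for evaluation.
        return ("split", ratio)

    # A ratio of 0 (or one outside the open interval) means no evaluation set.
    return ("none", None)
```

With neither key set, the result is `("none", None)`, which is the case where evaluation-related training options fall back to their "no eval data" defaults.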

	### `data.text_field`
	- Type: String
	- Required: No
	- Default: `"text"`
	- Description: Field name in JSONL containing the text data
	- Used in: Lines ~185 in `run_cpt.py`
	- Implementation:
	```python
	text_field = data_cfg.get("text_field", "text")
	# Used in tokenization
	tokenized = dsd["train"].map(
	    tokenize_fn,
	    batched=True,
	    remove_columns=dsd["train"].column_names,
	    desc="Tokenizing train",
	)
	```
	- Example Values: `"text"`, `"content"`, `"prompt"`, `"input"`

	### `data.block_size`
	- Type: Integer
	- Required: No
	- Default: `4096` (likely the sample `config.yaml` value); the code falls back to `2048` if the key is omitted
	- Description: Maximum sequence length for training
	- Used in: Lines ~180 in `run_cpt.py`
	- Implementation:
	```python
	block_size = int(data_cfg.get("block_size", 2048))
	# Used in grouping texts into blocks
	for i in range(0, full_len, block_size):
	    chunk = concatenated["input_ids"][i:i + block_size]
	```
	- Example Values: `2048`, `4096`, `8192`

	### `data.shuffle`
	- Type: Boolean
	- Required: No
	- Default: `true`
	- Description: Whether to shuffle training data
	- Used in: Lines ~235 in `run_cpt.py`
	- Implementation:
	```python
	if shuffle:
	    tokenized_train = tokenized_train.shuffle(seed=int(cfg["run"].get("seed", 42)))
	```
	- Example Values: `true`, `false`

	### `data.num_proc`
	- Type: Integer
	- Required: No
	- Default: `4`
	- Description: Number of worker processes used for dataset preprocessing (`Dataset.map`), such as tokenization
	- Used in: Lines ~200, ~210 in `run_cpt.py`
	- Implementation:
	```python
	num_proc = int(data_cfg.get("num_proc", 4))
	tokenized_train = dsd["train"].map(
	    tokenize_fn,
	    batched=True,
	    num_proc=num_proc,
	    ...
	)
	```
	- Example Values: `1`, `4`, `8`, `16`

	### `data.pack_mode`
	- Type: String
	- Required: No
	- Default: `"pad"` (likely the sample `config.yaml` value); the code falls back to `"drop"` if the key is omitted
	- Description: How to handle remainder tokens in final block
	- Used in: Lines ~150-230 in `run_cpt.py`
	- Implementation:
	```python
	pack_mode = str(data_cfg.get("pack_mode", "drop")).lower().strip()
	if pack_mode == "pad":
	    # Pad remainder and mask loss
	    labels[-pad_len:] = [-100] * pad_len
	# If "drop": ignore remainder entirely
	```
	- Example Values:
	- `"drop"` - Drop incomplete blocks (strict CPT)
	- `"pad"` - Pad incomplete blocks with masked loss
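
The difference between the two pack modes is easiest to see on a small token stream. Below is a self-contained sketch of block packing, assuming the tokens of all documents have already been concatenated into one list. The function name `pack_blocks` and the `pad_id` argument are hypothetical; only the `-100` label masking is taken from the excerpt above.

```python
from typing import Dict, List


def pack_blocks(
    token_ids: List[int],
    block_size: int,
    pack_mode: str = "drop",
    pad_id: int = 0,
) -> List[Dict[str, List[int]]]:
    """Split one concatenated token stream into fixed-size training blocks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if pack_mode not in ("drop", "pad"):
        raise ValueError(f"unknown pack_mode: {pack_mode!r}")

    blocks = []
    for start in range(0, len(token_ids), block_size):
        chunk = token_ids[start:start + block_size]
        pad_len = block_size - len(chunk)
        if pad_len and pack_mode == "drop":
            break  # the incomplete final block is discarded
        blocks.append({
            "input_ids": chunk + [pad_id] * pad_len,
            "attention_mask": [1] * len(chunk) + [0] * pad_len,
            # -100 is the label value the cross-entropy loss ignores,
            # so padded positions do not contribute to training.
            "labels": chunk + [-100] * pad_len,
        })
    return blocks
```

For ten tokens and `block_size=4`, `"drop"` yields two full blocks and discards the last two tokens, while `"pad"` yields three blocks with the final one padded and masked.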

	---

	## PEFT Parameters

	### `peft.enabled`
	- Type: Boolean
	- Required: No
	- Default: `true`
	- Description: Whether to use PEFT (Parameter-Efficient Fine-Tuning)
	- Used in: Lines ~395 in `run_cpt.py`
	- Implementation:
	```python
	if not bool(peft_cfg.get("enabled", True)):
	    return model, None
	# Otherwise proceed with LoRA configuration
	```
	- Example Values: `true`, `false`

	### `peft.r`
	- Type: Integer
	- Required: No
	- Default: `64` (likely the sample `config.yaml` value); the code falls back to `16` if the key is omitted
	- Description: LoRA rank - dimension of low-rank matrices
	- Used in: Lines ~415 in `run_cpt.py`
	- Implementation:
	```python
	lora_config = LoraConfig(
	    r=int(peft_cfg.get("r", 16)),
	    ...
	)
	```
	- Example Values: `8`, `16`, `32`, `64`, `128`
	- Note: A higher rank adds trainable parameters (and memory use) in exchange for more adapter capacity

	### `peft.lora_alpha`
	- Type: Integer
	- Required: No
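	- Default: `128` (likely the sample `config.yaml` value); the code falls back to `32` if the key is omitted
	- Description: LoRA scaling parameter; the adapter update is scaled by `lora_alpha / r`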
	- Used in: Lines ~416 in `run_cpt.py`
	- Implementation:
	```python
	lora_config = LoraConfig(
	    lora_alpha=int(peft_cfg.get("lora_alpha", 32)),
	    ...
	)
	```
	- Example Values: `16`, `32`, `64`, `128`, `256`
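
Because `lora_alpha` only matters relative to `r`, it helps to compute the resulting scale directly. The sketch below assumes standard LoRA scaling, where the low-rank update is multiplied by `lora_alpha / r`; the function name `lora_scaling` is hypothetical.

```python
def lora_scaling(lora_alpha: int, r: int) -> float:
    """Scale factor applied to the low-rank update in standard LoRA."""
    if r <= 0:
        raise ValueError("r must be a positive integer")
    return lora_alpha / r
```

Both value pairs mentioned in this guide, `r=64` with `lora_alpha=128` and `r=16` with `lora_alpha=32`, give the same scale of 2.0, so changing the rank while keeping `lora_alpha = 2 * r` leaves the update magnitude comparable.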

	### `peft.lora_dropout`
	- Type: Float
	- Required: No
	- Default: `0.05`
	- Description: Dropout rate for LoRA layers
	- Used in: Lines ~417 in `run_cpt.py`
	- Implementation:
	```python
	lora_config = LoraConfig(
	    lora_dropout=float(peft_cfg.get("lora_dropout", 0.05)),
	    ...
	)
	```
	- Example Values: `0.0`, `0.05`, `0.1`, `0.2`

	### `peft.bias`
	- Type: String
	- Required: No
	- Default: `"none"`
	- Description: Bias training strategy
	- Used in: Lines ~418 in `run_cpt.py`
	- Implementation:
	```python
	lora_config = LoraConfig(
	    bias=str(peft_cfg.get("bias", "none")),
	    ...
	)
	```
	- Example Values:
	- `"none"` - No bias training
	- `"all"` - Train all biases
	- `"lora_only"` - Only LoRA bias

	### `peft.target_modules`
	- Type: String or List
	- Required: No
	- Default: `"auto"`
	- Description: Which modules to apply LoRA to
	- Used in: Lines ~405, ~140-170 in `run_cpt.py`
	- Implementation:
	```python
	target_modules = peft_cfg.get("target_modules", "auto")
	if target_modules == "auto":
	    target_modules = _infer_target_modules(model)
	```
	- Example Values:
	- `"auto"` - Automatic detection
	- `["q_proj", "k_proj", "v_proj", "o_proj"]` - Explicit list
	- `["mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]` - MLP only
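
The body of `_infer_target_modules` is not shown in this guide. The sketch below is one plausible heuristic for `"auto"` detection, not the actual implementation: it keeps the commonly used projection names that occur as the last component of some module name. The function name `guess_target_modules` and the candidate list are assumptions.

```python
from typing import Iterable, List

# Projection names used by many decoder-only architectures (an assumption,
# not a list taken from run_cpt.py).
COMMON_LORA_TARGETS = (
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
)


def guess_target_modules(module_names: Iterable[str]) -> List[str]:
    """Return the common projection names that actually occur in the model."""
    # Keep only the last dotted component, e.g. "...self_attn.q_proj" -> "q_proj".
    present = {name.rsplit(".", 1)[-1] for name in module_names}
    found = [target for target in COMMON_LORA_TARGETS if target in present]
    if not found:
        raise ValueError(
            "no known projection layers found; set peft.target_modules explicitly"
        )
    return found
```

With a PyTorch model, the input could be produced with `(name for name, _ in model.named_modules())`. Raising on an empty result avoids silently training an adapter that attaches to nothing.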

	---

	## Training Parameters

	### `train.num_train_epochs`
	- Type: Float
	- Required: No
	- Default: `2` (likely the sample `config.yaml` value); the code falls back to `1` if the key is omitted
	- Description: Number of epochs to train
	- Used in: Lines ~470 in `run_cpt.py`
	- Implementation:
	```python
	num_train_epochs = float(tr_cfg.get("num_train_epochs", 1))
	# Used in TrainingArguments
	```
	- Example Values: `1.0`, `2.0`, `3.5`

	### `train.per_device_train_batch_size`
	- Type: Integer
	- Required: No
	- Default: `1`
	- Description: Training batch size per device
	- Used in: Lines ~475 in `run_cpt.py`
	- Implementation:
	```python
	per_device_train_batch_size=int(tr_cfg.get("per_device_train_batch_size", 1))
	```
	- Example Values: `1`, `2`, `4`, `8`

	### `train.per_device_eval_batch_size`
	- Type: Integer
	- Required: No
	- Default: Same as train batch size
	- Description: Evaluation batch size per device
	- Used in: Lines ~476 in `run_cpt.py`
	- Implementation:
	```python
	per_device_eval_batch_size=int(tr_cfg.get("per_device_eval_batch_size", tr_cfg.get("per_device_train_batch_size", 1)))
	```
	- Example Values: `1`, `2`, `4`, `8`

	### `train.gradient_accumulation_steps`
	- Type: Integer
	- Required: No
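	- Default: `16` (likely the sample `config.yaml` value); the code falls back to `1` if the key is omitted
	- Description: Number of forward/backward passes whose gradients are accumulated before each optimizer step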
	- Used in: Lines ~477 in `run_cpt.py`
	- Implementation:
	```python
	gradient_accumulation_steps=int(tr_cfg.get("gradient_accumulation_steps", 1))
	```
	- Example Values: `1`, `4`, `8`, `16`, `32`

	### `train.learning_rate`
	- Type: Float
	- Required: No
	- Default: `2e-5`
	- Description: Learning rate for optimizer
	- Used in: Lines ~478 in `run_cpt.py`
	- Implementation:
	```python
	learning_rate=float(tr_cfg.get("learning_rate", 2e-5))
	```
	- Example Values: `1e-5`, `2e-5`, `5e-5`, `1e-4`

	### `train.weight_decay`
	- Type: Float
	- Required: No
	- Default: `0.0`
	- Description: Weight decay for regularization
	- Used in: Lines ~479 in `run_cpt.py`
	- Implementation:
	```python
	weight_decay=float(tr_cfg.get("weight_decay", 0.0))
	```
	- Example Values: `0.0`, `0.01`, `0.1`

	### `train.warmup_ratio`
	- Type: Float
	- Required: No
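	- Default: `0.1` (likely the sample `config.yaml` value); the code falls back to `0.0` (no warmup) if the key is omitted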
	- Description: Ratio of steps for learning rate warmup
	- Used in: Lines ~480 in `run_cpt.py`
	- Implementation:
	```python
	warmup_ratio=float(tr_cfg.get("warmup_ratio", 0.0))
	```
	- Example Values: `0.0`, `0.1`, `0.2`
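
To see what a given `warmup_ratio` means in optimizer steps, the total step count has to be estimated first. The following sketch is an approximation under stated assumptions: the dataset size counts packed blocks, and the rounding is chosen for illustration, so the Trainer's exact step count may differ by a step or two. The function name `estimate_warmup_steps` is hypothetical.

```python
import math


def estimate_warmup_steps(
    num_examples: int,
    per_device_batch_size: int,
    grad_accum_steps: int,
    num_epochs: float,
    warmup_ratio: float,
    num_devices: int = 1,
) -> int:
    """Approximate the number of warmup optimizer steps for a run."""
    # Examples consumed by one optimizer step across all devices.
    examples_per_step = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = max(1, math.ceil(num_examples / examples_per_step))
    total_steps = math.ceil(steps_per_epoch * num_epochs)
    # The ratio is a fraction of total optimizer steps, rounded up.
    return math.ceil(total_steps * warmup_ratio)
```

A short run with a large `gradient_accumulation_steps` can end up with only a handful of warmup steps, which is worth checking before launching.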

	### `train.lr_scheduler_type`
	- Type: String
	- Required: No
	- Default: `"cosine"`
	- Description: Learning rate scheduler type
	- Used in: Lines ~481 in `run_cpt.py`
	- Implementation:
	```python
	lr_scheduler_type=str(tr_cfg.get("lr_scheduler_type", "cosine"))
	```
	- Example Values:
	- `"cosine"` - Cosine annealing
	- `"linear"` - Linear decay
	- `"constant"` - Constant rate
	- `"polynomial"` - Polynomial decay
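
The shape of the `"cosine"` schedule combined with warmup can be written out in a few lines. This is a sketch of a single half-cycle cosine decay with linear warmup, the same shape as the default cosine schedule in Transformers; the function name `cosine_lr` is hypothetical and the code is for illustration, not a replacement for the library scheduler.

```python
import math


def cosine_lr(step: int, total_steps: int, warmup_steps: int, base_lr: float) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half a cosine wave: 1.0 at progress 0, down to 0.0 at progress 1.
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))
```

The rate peaks at `base_lr` exactly when warmup ends and reaches roughly half of `base_lr` midway through the decay phase.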

	### `train.optim`
	- Type: String
	- Required: No
	- Default: `"paged_adamw_8bit"` (if 4-bit), `"adamw_torch"` (otherwise)
	- Description: Optimizer type
	- Used in: Lines ~482 in `run_cpt.py`
	- Implementation:
	```python
	optim=str(tr_cfg.get("optim", "paged_adamw_8bit" if bool(model_cfg.get("use_4bit", False)) else "adamw_torch"))
	```
	- Example Values:
	- `"adamw_torch"` - AdamW (standard)
	- `"paged_adamw_8bit"` - Paged AdamW with 8-bit optimizer states (bitsandbytes)
	- `"sgd"` - SGD
	- `"adafactor"` - Adafactor

	### `train.max_grad_norm`
	- Type: Float
	- Required: No
	- Default: `1.0`
	- Description: Maximum gradient norm for clipping
	- Used in: Lines ~483 in `run_cpt.py`
	- Implementation:
	```python
	max_grad_norm=float(tr_cfg.get("max_grad_norm", 1.0))
	```
	- Example Values: `0.5`, `1.0`, `2.0`

	### `train.gradient_checkpointing`
	- Type: Boolean
	- Required: No
	- Default: `true`
	- Description: Use gradient checkpointing to save memory
	- Used in: Lines ~396-400 in `run_cpt.py`
	- Implementation:
	```python
	gradient_checkpointing = bool(tr_cfg.get("gradient_checkpointing", True))
	if gradient_checkpointing:
	    model.gradient_checkpointing_enable()
	```
	- Example Values: `true`, `false`

	### `train.logging_steps`
	- Type: Integer
	- Required: No
	- Default: `1` (likely the sample `config.yaml` value); the code falls back to `10` if the key is omitted
	- Description: Log training progress every N steps
	- Used in: Lines ~485 in `run_cpt.py`
	- Implementation:
	```python
	logging_steps=int(tr_cfg.get("logging_steps", 10))
	```
	- Example Values: `1`, `10`, `50`, `100`

	### `train.save_strategy`
	- Type: String
	- Required: No
	- Default: `"steps"`
	- Description: When to save model checkpoints
	- Used in: Lines ~487 in `run_cpt.py`
	- Implementation:
	```python
	save_strategy=str(tr_cfg.get("save_strategy", "steps"))
	```
	- Example Values:
	- `"steps"` - Save every N steps
	- `"epoch"` - Save every epoch
	- `"no"` - Don't save

	### `train.save_steps`
	- Type: Integer
	- Required: No
	- Default: `100` (likely the sample `config.yaml` value); the code falls back to `200` if the key is omitted
	- Description: Save checkpoint every N steps
	- Used in: Lines ~488 in `run_cpt.py`
	- Implementation:
	```python
	save_steps=int(tr_cfg.get("save_steps", 200))
	```
	- Example Values: `50`, `100`, `200`, `500`

	### `train.save_total_limit`
	- Type: Integer
	- Required: No
	- Default: `4` (likely the sample `config.yaml` value); the code falls back to `3` if the key is omitted
	- Description: Maximum number of checkpoints to keep
	- Used in: Lines ~489 in `run_cpt.py`
	- Implementation:
	```python
	save_total_limit=int(tr_cfg.get("save_total_limit", 3))
	```
	- Example Values: `1`, `2`, `3`, `5`

	### `train.evaluation_strategy`
	- Type: String
	- Required: No
	- Default: `"steps"` (if eval data), `"no"` (otherwise)
	- Description: When to evaluate model
	- Used in: Lines ~494 in `run_cpt.py`
	- Implementation:
	```python
	evaluation_strategy=str(tr_cfg.get("evaluation_strategy", "steps" if eval_ds is not None else "no"))
	```
	- Example Values:
	- `"steps"` - Evaluate every N steps
	- `"epoch"` - Evaluate every epoch
	- `"no"` - Don't evaluate
	- Note: Recent Transformers releases rename this `TrainingArguments` field to `eval_strategy`; check which name your installed version expects

	### `train.eval_steps`
	- Type: Integer
	- Required: No
	- Default: `50` (likely the sample `config.yaml` value); the code falls back to `200` if the key is omitted
	- Description: Evaluate every N steps
	- Used in: Lines ~491 in `run_cpt.py`
	- Implementation:
	```python
	eval_steps=int(tr_cfg.get("eval_steps", 200))
	```
	- Example Values: `25`, `50`, `100`, `200`

	### `train.load_best_model_at_end`
	- Type: Boolean
	- Required: No
	- Default: `true` (if eval data), `false` (otherwise)
	- Description: Load best model at end of training
	- Used in: Lines ~492-493 in `run_cpt.py`
	- Implementation:
	```python
	load_best_model_at_end=bool(tr_cfg.get("load_best_model_at_end", True)) if eval_ds is not None else False
	```
	- Example Values: `true`, `false`

	### `train.resume_from_checkpoint`
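	- Type: String or null
	- Required: No
	- Default: `"auto"` (likely the sample `config.yaml` value); the code falls back to `null` (start from scratch) if the key is omitted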
	- Description: Resume training from checkpoint
	- Used in: Lines ~510-520 in `run_cpt.py`
	- Implementation:
	```python
	resume_from = tr_cfg.get("resume_from_checkpoint", None)
	if resume_from == "auto":
	    last = get_last_checkpoint(str(ckpt_dir))
	    resume_from = last if last else None
	```
	- Example Values:
	- `"auto"` - Auto-detect latest checkpoint
	- `"checkpoint-100"` - Specific checkpoint
	- `null` - Start from scratch
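
The `"auto"` mode relies on finding the newest `checkpoint-<step>` directory. Below is a standard-library sketch of that lookup, written to illustrate the idea behind `get_last_checkpoint`; it is not the library's code, and the function name `find_last_checkpoint` is hypothetical. Unlike a plain string sort, it compares step numbers numerically.

```python
import os
import re
from typing import Optional

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Return the checkpoint-<step> subdirectory with the highest step, if any."""
    if not os.path.isdir(ckpt_dir):
        return None
    candidates = []
    for name in os.listdir(ckpt_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(ckpt_dir, name)):
            # Store the numeric step so that 1000 sorts after 50 and 100.
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None  # nothing to resume from: training starts from scratch
    _, latest = max(candidates)
    return os.path.join(ckpt_dir, latest)
```

Returning `None` for an empty or missing directory matches the behavior described above, where `"auto"` quietly starts a fresh run when no checkpoint exists.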

	---

	## Merge Parameters

	### `merge.enabled`
	- Type: Boolean
	- Required: No
	- Default: `false`
	- Description: Whether to merge LoRA adapters with base model
	- Used in: Lines ~545 in `run_cpt.py`
	- Implementation:
	```python
	if bool(cfg.get("merge", {}).get("enabled", False)):
	    merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
	```
	- Example Values: `true`, `false`

	### `merge.merged_dtype`
	- Type: String
	- Required: No
	- Default: `"float16"`
	- Description: Data type for merged model
	- Used in: Lines ~430 in `run_cpt.py`
	- Implementation:
	```python
	merged_dtype = _dtype_from_str(merge_cfg.get("merged_dtype", "float16"))
	```
	- Example Values: `"float16"`, `"bfloat16"`, `"float32"`

	### `merge.max_shard_size`
	- Type: String
	- Required: No
	- Default: `"2GB"`
	- Description: Maximum size per shard when saving
	- Used in: Lines ~445 in `run_cpt.py`
	- Implementation:
	```python
	merged.save_pretrained(str(final_dir), safe_serialization=True, max_shard_size=max_shard_size)
	```
	- Example Values: `"1GB"`, `"2GB"`, `"5GB"`

	### `merge.output_dir`
	- Type: String (path)
	- Required: No
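	- Default: `"./merged_model"` (likely the sample `config.yaml` value); if the key is unset, the code uses `final_model` inside `run.run_dir`
	- Description: Directory for merged model output; relative paths are resolved against `run.run_dir`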
	- Used in: Lines ~505-510 in `run_cpt.py`
	- Implementation:
	```python
	if merge_cfg.get("output_dir"):
	    od = Path(str(merge_cfg["output_dir"]))
	    final_dir = od if od.is_absolute() else (run_dir / od)
	else:
	    final_dir = run_dir / "final_model"
	```
	- Example Values: `"./merged_model"`, `"/workspace/final_model"`, `"./models/merged"`
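
A consequence of the resolution logic above is that a relative `output_dir` ends up inside the run directory, not the current working directory. The sketch below wraps the same logic in a function so the three cases can be tried directly; the name `resolve_final_dir` is hypothetical and the example paths are illustrative.

```python
from pathlib import Path
from typing import Any, Dict


def resolve_final_dir(run_dir: Path, merge_cfg: Dict[str, Any]) -> Path:
    """Resolve where the merged model is written."""
    out = merge_cfg.get("output_dir")
    if not out:
        # No output_dir configured: fall back to <run_dir>/final_model.
        return run_dir / "final_model"
    od = Path(str(out))
    # Absolute paths are used as-is; relative ones live under run_dir.
    return od if od.is_absolute() else run_dir / od
```

For a run directory of `/runs/a`, `"./merged_model"` resolves to `/runs/a/merged_model`, while `"/workspace/final_model"` is used unchanged.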

	---

	## Parameter Dependencies and Interactions

	### Memory-Related Dependencies
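	- `per_device_train_batch_size` × `gradient_accumulation_steps` × number of devices = effective batch size
	- Activation memory grows with `block_size` (the sequence length), so it strongly affects peak memory
	- `use_4bit` and the `bnb_4bit_*` parameters work together: the `bnb_4bit_*` values are only read when `use_4bit` is `true`
	- `gradient_checkpointing` trades extra compute for memory, which can allow a larger `block_size` or `per_device_train_batch_size`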
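
The effective batch size is a product, and multiplying it by `block_size` gives the number of tokens behind each optimizer step. The helpers below are a small sketch of that arithmetic; the function names are hypothetical, and the token count is an upper bound when `pack_mode` is `"pad"` because padded positions carry no loss.

```python
def effective_batch_size(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_devices: int = 1,
) -> int:
    """Number of sequences that contribute to one optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


def tokens_per_optimizer_step(
    block_size: int,
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_devices: int = 1,
) -> int:
    """Upper bound on tokens per optimizer step when every block is full."""
    return block_size * effective_batch_size(
        per_device_train_batch_size, gradient_accumulation_steps, num_devices
    )
```

With a batch size of 1, 16 accumulation steps, and a `block_size` of 4096 on one device, each optimizer step covers 16 sequences and up to 65,536 tokens.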

	### Training Strategy Dependencies
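	- An `evaluation_strategy` other than `"no"` requires evaluation data: either `eval_jsonl` or `eval_split_ratio > 0`
	- `load_best_model_at_end` requires evaluation to be enabled; it is forced to `false` when there is no evaluation data
	- With `load_best_model_at_end`, `save_strategy` must match `evaluation_strategy`, and `save_steps` must be a multiple of `eval_steps`
	- `warmup_ratio` sets the warmup length as a fraction of total training steps; `lr_scheduler_type` controls how the rate decays afterwards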
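
These constraints can be checked before a run starts. The following sketch validates the `train` section against the rules that Transformers enforces when `load_best_model_at_end` is on. The function name `check_train_strategy` is hypothetical, and the fallback values are the code defaults quoted earlier in this guide.

```python
from typing import Any, Dict, List


def check_train_strategy(train_cfg: Dict[str, Any], has_eval_data: bool) -> List[str]:
    """Return human-readable problems in the save/eval settings (empty if none)."""
    problems: List[str] = []
    eval_strategy = str(
        train_cfg.get("evaluation_strategy", "steps" if has_eval_data else "no")
    )
    save_strategy = str(train_cfg.get("save_strategy", "steps"))
    # Mirrors the excerpt: best-model loading is forced off without eval data.
    load_best = (
        bool(train_cfg.get("load_best_model_at_end", True)) if has_eval_data else False
    )

    if eval_strategy != "no" and not has_eval_data:
        problems.append("evaluation_strategy is enabled but there is no evaluation data")

    if load_best:
        if eval_strategy == "no":
            problems.append("load_best_model_at_end needs evaluation to be enabled")
        elif save_strategy != eval_strategy:
            problems.append("save_strategy must match evaluation_strategy")
        elif eval_strategy == "steps":
            save_steps = int(train_cfg.get("save_steps", 200))
            eval_steps = int(train_cfg.get("eval_steps", 200))
            if eval_steps <= 0 or save_steps % eval_steps != 0:
                problems.append("save_steps must be a multiple of eval_steps")
    return problems
```

For example, `save_steps: 100` with `eval_steps: 50` passes, whereas `save_steps: 100` with `eval_steps: 30` is reported because a saved checkpoint would not line up with an evaluation.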

	### Model-Specific Dependencies
	- `target_modules` must match the actual module names in your model
	- `torch_dtype` should be compatible with your GPU hardware (for example, `bfloat16` needs hardware support such as NVIDIA Ampere or newer)
	- `device_map` affects whether you can use certain optimizations

	### Data Processing Dependencies
	- `text_field` must exist in your JSONL data
	- With `pack_mode: "pad"`, the final block is padded up to `block_size` and the padded positions are excluded from the loss
	- `eval_split_ratio` is ignored if `eval_jsonl` is provided
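
A mismatched `text_field` is cheap to catch before tokenization. The sketch below scans the start of a JSONL file and reports lines whose record lacks a string value under the configured field. It is a hypothetical pre-flight check, not part of `run_cpt.py`; the function name `rows_missing_text_field` and the `limit` argument are assumptions.

```python
import json
from typing import List


def rows_missing_text_field(
    jsonl_path: str, text_field: str = "text", limit: int = 1000
) -> List[int]:
    """Return 1-based line numbers whose record lacks a string `text_field`.

    Only the first `limit` non-empty lines are inspected.
    """
    bad_lines: List[int] = []
    checked = 0
    with open(jsonl_path, "r", encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # blank lines are skipped and not counted
            if checked >= limit:
                break
            checked += 1
            record = json.loads(line)
            if not isinstance(record, dict) or not isinstance(record.get(text_field), str):
                bad_lines.append(line_no)
    return bad_lines
```

An empty result for the first few hundred lines is a reasonable sign that `data.text_field` matches the file, although it does not prove that every record is well-formed.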

	This comprehensive documentation should help you understand and configure all parameters in the CPT training system according to your specific needs and constraints.