# CPT Configuration Parameters: Detailed Guide

This document provides a comprehensive explanation of all configuration parameters in `config.yaml` and how they're implemented in `run_cpt.py`.

## Table of Contents

- [Run Parameters](#run-parameters)
- [Model Parameters](#model-parameters)
- [Data Parameters](#data-parameters)
- [PEFT Parameters](#peft-parameters)
- [Training Parameters](#training-parameters)
- [Merge Parameters](#merge-parameters)

---

## Run Parameters

### `run.run_dir`

- **Type**: String (path)
- **Required**: Yes
- **Default**: No default
- **Description**: Directory where training outputs will be saved
- **Used in**: Line ~480 in `run_cpt.py`
- **Implementation**:

```python
run_dir = _ensure_dir(Path(cfg["run"]["run_dir"]))
```

- **Example Values**:
  - `./runs/cpt_run_v1`
  - `/workspace/outputs/my_experiment`
  - `./checkpoints/cpt_experiment`

### `run.seed`

- **Type**: Integer
- **Required**: No
- **Default**: `42`
- **Description**: Random seed for reproducibility
- **Used in**: Lines ~460, ~240 in `run_cpt.py`
- **Implementation**:

```python
set_seed(int(cfg["run"].get("seed", 42)))
# Used in data shuffling and train/test split
```

- **Example Values**: `42`, `123`, `2023`

---

## Model Parameters

### `model.repo_id`

- **Type**: String (path or HuggingFace repo)
- **Required**: Yes
- **Default**: No default
- **Description**: Model identifier - can be a local path or a HuggingFace repository
- **Used in**: Lines ~480-500 in `run_cpt.py`
- **Implementation**:

```python
repo_id = str(model_cfg["repo_id"]).strip()
repo_path = Path(repo_id)
if repo_path.exists() and repo_path.is_dir():
    base_dir = repo_path  # Local path
else:
    # Download from HuggingFace
    snapshot_download(repo_id=repo_id, ...)
```
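For illustration, the branch above can be wrapped as a standalone helper (`resolve_base_dir` is a hypothetical name; `run_cpt.py` inlines this logic):

```python
from pathlib import Path


def resolve_base_dir(repo_id: str, run_dir: Path) -> Path:
    """Return a local directory containing the model, downloading it if needed."""
    repo_path = Path(repo_id)
    if repo_path.exists() and repo_path.is_dir():
        return repo_path  # already a local checkout
    # Not a local path: fetch the snapshot from the Hugging Face Hub
    from huggingface_hub import snapshot_download
    base_dir = run_dir / "base_model"
    base_dir.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(base_dir))
    return base_dir
```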
- **Example Values**:
  - Local: `/workspace/Models/Devstral-Small-2-24B-Instruct-2512`
  - HF Repo: `meta-llama/Llama-2-7b-hf`

### `model.revision`

- **Type**: String or null
- **Required**: No
- **Default**: null
- **Description**: Specific model revision/branch/tag from HuggingFace
- **Used in**: Line ~495 in `run_cpt.py`
- **Implementation**:

```python
snapshot_download(
    repo_id=repo_id,
    revision=model_cfg.get("revision", None),
    ...
)
```

- **Example Values**:
  - `"main"` - Main branch
  - `"v1.0"` - Specific tag
  - `"abc123def"` - Specific commit hash
  - `null` - Latest version

### `model.base_local_dir`

- **Type**: String (path)
- **Required**: No
- **Default**: `"base_model"`
- **Description**: Directory name for the downloaded model when using an HF repo
- **Used in**: Line ~495 in `run_cpt.py`
- **Implementation**:

```python
base_dir = _ensure_dir(run_dir / model_cfg.get("base_local_dir", "base_model"))
```

- **Example Values**: `"base_model"`, `"downloaded_model"`, `"model_files"`

### `model.trust_remote_code`

- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Allow loading models with custom code
- **Used in**: Lines ~320, ~340, ~450 in `run_cpt.py`
- **Implementation**:

```python
tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
model = AutoModelForCausalLM.from_pretrained(..., trust_remote_code=trust_remote_code, ...)
```
- **Example Values**: `true`, `false`

### `model.tokenizer_use_fast`

- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Use fast tokenizer implementation
- **Used in**: Lines ~320, ~450 in `run_cpt.py`
- **Implementation**:

```python
tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
```

- **Example Values**: `true`, `false`

### `model.device_map`

- **Type**: String or dict
- **Required**: No
- **Default**: `"auto"`
- **Description**: How to distribute the model across devices
- **Used in**: Lines ~350, ~370 in `run_cpt.py`
- **Implementation**:

```python
model = AutoModelForCausalLM.from_pretrained(..., device_map=device_map, ...)
```

- **Example Values**:
  - `"auto"` - Automatic distribution
  - `"cpu"` - CPU only
  - `"cuda:0"` - Single GPU
  - `{"": 0}` - Manual mapping

### `model.torch_dtype`

- **Type**: String
- **Required**: No
- **Default**: `"bfloat16"`
- **Description**: Data type for model tensors
- **Used in**: Lines ~45, ~350 in `run_cpt.py`
- **Implementation**:

```python
def _dtype_from_str(s: str) -> torch.dtype:
    if s in ("float16", "fp16"):
        return torch.float16
    if s in ("bfloat16", "bf16"):
        return torch.bfloat16
    if s in ("float32", "fp32"):
        return torch.float32
```

- **Example Values**:
  - `"float16"` - 16-bit floats (faster, less memory, less stable)
  - `"bfloat16"` - Brain float16 (stable, good for training)
  - `"float32"` - 32-bit floats (slowest, most memory)

### `model.use_4bit`

- **Type**: Boolean
- **Required**: No
- **Default**: `false`
- **Description**: Use 4-bit quantization for memory efficiency
- **Used in**: Lines ~325, ~395 in `run_cpt.py`
- **Implementation**:

```python
use_4bit = bool(model_cfg.get("use_4bit", False))
if use_4bit:
    quant_cfg = BitsAndBytesConfig(load_in_4bit=True, ...)
```
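The `use_4bit` flag gates the whole `bnb_4bit_*` group documented below. A dict-level sketch of how those keys combine (the script passes the equivalent values to `transformers.BitsAndBytesConfig`; the helper name here is hypothetical):

```python
from typing import Optional


def quantization_kwargs(model_cfg: dict) -> Optional[dict]:
    """Assemble 4-bit quantization arguments from the model.* section, or None."""
    if not bool(model_cfg.get("use_4bit", False)):
        return None  # quantization disabled; load in torch_dtype instead
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": str(model_cfg.get("bnb_4bit_quant_type", "nf4")),
        "bnb_4bit_use_double_quant": bool(model_cfg.get("bnb_4bit_use_double_quant", True)),
        "bnb_4bit_compute_dtype": model_cfg.get("bnb_4bit_compute_dtype", "bfloat16"),
    }
```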
- **Example Values**: `true`, `false`

### `model.bnb_4bit_quant_type`

- **Type**: String
- **Required**: No
- **Default**: `"nf4"`
- **Description**: 4-bit quantization type
- **Used in**: Line ~328 in `run_cpt.py`
- **Implementation**:

```python
bnb_4bit_quant_type=str(model_cfg.get("bnb_4bit_quant_type", "nf4"))
```

- **Example Values**:
  - `"nf4"` - NormalFloat4 (recommended)
  - `"fp4"` - FloatingPoint4 (these two are the only types bitsandbytes supports)

### `model.bnb_4bit_use_double_quant`

- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Use double quantization for additional memory savings
- **Used in**: Line ~329 in `run_cpt.py`
- **Implementation**:

```python
bnb_4bit_use_double_quant=bool(model_cfg.get("bnb_4bit_use_double_quant", True))
```

- **Example Values**: `true`, `false`

### `model.bnb_4bit_compute_dtype`

- **Type**: String
- **Required**: No
- **Default**: `"bfloat16"`
- **Description**: Compute dtype for 4-bit quantization
- **Used in**: Line ~330 in `run_cpt.py`
- **Implementation**:

```python
bnb_4bit_compute_dtype=_dtype_from_str(model_cfg.get("bnb_4bit_compute_dtype", "bfloat16"))
```

- **Example Values**: `"float16"`, `"bfloat16"`, `"float32"`

### `model.attn_implementation`

- **Type**: String or null
- **Required**: No
- **Default**: `null`
- **Description**: Attention implementation to use
- **Used in**: Lines ~155, ~350 in `run_cpt.py`
- **Implementation**:

```python
def _choose_attn_impl(cfg: Dict[str, Any]) -> Optional[str]:
    return cfg.get("model", {}).get("attn_implementation", None)

# Used in model.from_pretrained(..., attn_implementation=attn_impl, ...)
```
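Requesting `"flash_attention_2"` fails at load time if the `flash-attn` package is not installed. A fallback guard in the following style is a common pattern (an assumption; `run_cpt.py` passes the configured value through unchanged):

```python
def pick_attn_impl(requested):
    """Fall back to PyTorch's SDPA when flash-attn is unavailable."""
    if requested == "flash_attention_2":
        try:
            import flash_attn  # noqa: F401  (presence check only)
        except ImportError:
            return "sdpa"
    return requested
```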
- **Example Values**:
  - `"flash_attention_2"` - Flash Attention 2 (fastest)
  - `"sdpa"` - Scaled Dot-Product Attention
  - `null` - Default implementation

---

## Data Parameters

### `data.train_jsonl`

- **Type**: String (path)
- **Required**: Yes
- **Default**: No default
- **Description**: Path to training data in JSONL format
- **Used in**: Line ~170 in `run_cpt.py`
- **Implementation**:

```python
train_path = data_cfg["train_jsonl"]
ds = load_dataset("json", data_files={"train": train_path})
```

- **Example Values**: `"/workspace/all_data_with_descriptions.jsonl"`

### `data.eval_jsonl`

- **Type**: String (path) or null
- **Required**: No
- **Default**: `null`
- **Description**: Path to evaluation data in JSONL format
- **Used in**: Line ~175 in `run_cpt.py`
- **Implementation**:

```python
eval_path = data_cfg.get("eval_jsonl", None)
if eval_path:
    ds_eval = load_dataset("json", data_files={"eval": eval_path})
```

- **Example Values**: `null` (no separate eval file), `"/workspace/eval_data.jsonl"`

### `data.eval_split_ratio`

- **Type**: Float
- **Required**: No
- **Default**: `0.0` (no split)
- **Description**: Ratio of training data to hold out for evaluation; values outside (0, 1) disable the split
- **Used in**: Line ~177 in `run_cpt.py`
- **Implementation**:

```python
split_ratio = float(data_cfg.get("eval_split_ratio", 0.0))
if 0.0 < split_ratio < 1.0:
    split = ds["train"].train_test_split(test_size=split_ratio, seed=seed)
```

- **Example Values**: `0.1` (10%), `0.2` (20%), `0.05` (5%)

### `data.text_field`

- **Type**: String
- **Required**: No
- **Default**: `"text"`
- **Description**: Field name in JSONL containing the text data
- **Used in**: Line ~185 in `run_cpt.py`
- **Implementation**:

```python
text_field = data_cfg.get("text_field", "text")
# Used in tokenization (tokenize_fn reads example[text_field])
tokenized = dsd["train"].map(
    tokenize_fn,
    batched=True,
    remove_columns=dsd["train"].column_names,
    desc="Tokenizing train",
)
```

- **Example Values**: `"text"`, `"content"`, `"prompt"`, `"input"`

### `data.block_size`

- **Type**: Integer
- **Required**: No
- **Default**: `2048`
- **Description**: Maximum sequence length for training
- **Used in**: Line ~180 in `run_cpt.py`
- **Implementation**:

```python
block_size = int(data_cfg.get("block_size", 2048))
# Used in grouping texts into blocks
for i in range(0, full_len, block_size):
    chunk = concatenated["input_ids"][i:i + block_size]
```

- **Example Values**: `2048`, `4096`, `8192`

### `data.shuffle`

- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Whether to shuffle training data
- **Used in**: Line ~235 in `run_cpt.py`
- **Implementation**:

```python
if shuffle:
    tokenized_train = tokenized_train.shuffle(seed=int(cfg["run"].get("seed", 42)))
```

- **Example Values**: `true`, `false`

### `data.num_proc`

- **Type**: Integer
- **Required**: No
- **Default**: `4`
- **Description**: Number of worker processes for tokenization and mapping (`datasets.map`)
- **Used in**: Lines ~200, ~210 in `run_cpt.py`
- **Implementation**:

```python
num_proc = int(data_cfg.get("num_proc", 4))
tokenized_train = dsd["train"].map(
    tokenize_fn,
    batched=True,
    num_proc=num_proc,
    ...
)
```
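The grouping loop shown under `data.block_size`, combined with `data.pack_mode` below, reduces to the following self-contained sketch (simplified to plain lists; the real code operates on batched dicts inside `datasets.map`):

```python
def group_into_blocks(ids, block_size, pack_mode="drop", pad_id=0):
    """Split a flat token-id list into fixed-size blocks for CPT."""
    blocks, labels = [], []
    for i in range(0, len(ids), block_size):
        chunk = ids[i:i + block_size]
        if len(chunk) < block_size:  # incomplete final block
            if pack_mode == "pad":
                pad_len = block_size - len(chunk)
                labels.append(chunk + [-100] * pad_len)  # mask loss on padding
                blocks.append(chunk + [pad_id] * pad_len)
            break  # "drop": discard the remainder entirely
        blocks.append(chunk)
        labels.append(list(chunk))
    return blocks, labels
```

For example, 10 tokens with `block_size=4` yield two blocks under `"drop"` and three under `"pad"`, with the padded label positions masked to `-100`.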
- **Example Values**: `1`, `4`, `8`, `16`

### `data.pack_mode`

- **Type**: String
- **Required**: No
- **Default**: `"drop"`
- **Description**: How to handle remainder tokens in the final block
- **Used in**: Lines ~150-230 in `run_cpt.py`
- **Implementation**:

```python
pack_mode = str(data_cfg.get("pack_mode", "drop")).lower().strip()
if pack_mode == "pad":
    # Pad remainder and mask loss
    labels[-pad_len:] = [-100] * pad_len
# If "drop": ignore remainder entirely
```

- **Example Values**:
  - `"drop"` - Drop incomplete blocks (strict CPT)
  - `"pad"` - Pad incomplete blocks with masked loss

---

## PEFT Parameters

### `peft.enabled`

- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Whether to use PEFT (Parameter-Efficient Fine-Tuning)
- **Used in**: Line ~395 in `run_cpt.py`
- **Implementation**:

```python
if not bool(peft_cfg.get("enabled", True)):
    return model, None
# Otherwise proceed with LoRA configuration
```

- **Example Values**: `true`, `false`

### `peft.r`

- **Type**: Integer
- **Required**: No
- **Default**: `16`
- **Description**: LoRA rank - dimension of the low-rank matrices
- **Used in**: Line ~415 in `run_cpt.py`
- **Implementation**:

```python
lora_config = LoraConfig(
    r=int(peft_cfg.get("r", 16)),
    ...
)
```

- **Example Values**: `8`, `16`, `32`, `64`, `128`
- **Note**: Higher values = more parameters but potentially better performance

### `peft.lora_alpha`

- **Type**: Integer
- **Required**: No
- **Default**: `32`
- **Description**: LoRA alpha scaling parameter
- **Used in**: Line ~416 in `run_cpt.py`
- **Implementation**:

```python
lora_config = LoraConfig(
    lora_alpha=int(peft_cfg.get("lora_alpha", 32)),
    ...
)
```
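Taken together, the `peft.*` keys assemble a single `LoraConfig`. A dict-level sketch mirroring the `get(...)` fallbacks in the snippets (the `task_type` value is an assumption, and the helper name is hypothetical):

```python
def lora_kwargs(peft_cfg: dict) -> dict:
    """Collect LoRA hyperparameters with the code's fallback defaults."""
    return {
        "r": int(peft_cfg.get("r", 16)),
        "lora_alpha": int(peft_cfg.get("lora_alpha", 32)),
        "lora_dropout": float(peft_cfg.get("lora_dropout", 0.05)),
        "bias": str(peft_cfg.get("bias", "none")),
        "target_modules": peft_cfg.get("target_modules", "auto"),
        "task_type": "CAUSAL_LM",  # assumed; LoRA for causal LM training
    }
```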
- **Example Values**: `16`, `32`, `64`, `128`, `256`

### `peft.lora_dropout`

- **Type**: Float
- **Required**: No
- **Default**: `0.05`
- **Description**: Dropout rate for LoRA layers
- **Used in**: Line ~417 in `run_cpt.py`
- **Implementation**:

```python
lora_config = LoraConfig(
    lora_dropout=float(peft_cfg.get("lora_dropout", 0.05)),
    ...
)
```

- **Example Values**: `0.0`, `0.05`, `0.1`, `0.2`

### `peft.bias`

- **Type**: String
- **Required**: No
- **Default**: `"none"`
- **Description**: Bias training strategy
- **Used in**: Line ~418 in `run_cpt.py`
- **Implementation**:

```python
lora_config = LoraConfig(
    bias=str(peft_cfg.get("bias", "none")),
    ...
)
```

- **Example Values**:
  - `"none"` - No bias training
  - `"all"` - Train all biases
  - `"lora_only"` - Only LoRA bias

### `peft.target_modules`

- **Type**: String or List
- **Required**: No
- **Default**: `"auto"`
- **Description**: Which modules to apply LoRA to
- **Used in**: Lines ~405, ~140-170 in `run_cpt.py`
- **Implementation**:

```python
target_modules = peft_cfg.get("target_modules", "auto")
if target_modules == "auto":
    target_modules = _infer_target_modules(model)
```

- **Example Values**:
  - `"auto"` - Automatic detection
  - `["q_proj", "k_proj", "v_proj", "o_proj"]` - Explicit list
  - `["mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]` - MLP only

---

## Training Parameters

### `train.num_train_epochs`

- **Type**: Float
- **Required**: No
- **Default**: `1`
- **Description**: Number of epochs to train
- **Used in**: Line ~470 in `run_cpt.py`
- **Implementation**:

```python
num_train_epochs = float(tr_cfg.get("num_train_epochs", 1))
# Used in TrainingArguments
```

- **Example Values**: `1.0`, `2.0`, `3.5`

### `train.per_device_train_batch_size`

- **Type**: Integer
- **Required**: No
- **Default**: `1`
- **Description**: Training batch size per device
- **Used in**: Line ~475 in `run_cpt.py`
- **Implementation**:

```python
per_device_train_batch_size=int(tr_cfg.get("per_device_train_batch_size", 1))
```
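The per-device batch size multiplies with `gradient_accumulation_steps` (and the number of devices) to give the effective batch size per optimizer step; multiplying further by `block_size` gives tokens per step. A quick sanity-check helper (hypothetical, for planning only):

```python
def effective_batch(per_device_bs, grad_accum, n_gpus=1, block_size=None):
    """Sequences (and optionally tokens) consumed per optimizer step."""
    seqs = per_device_bs * grad_accum * n_gpus
    if block_size is None:
        return seqs
    return seqs, seqs * block_size
```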
- **Example Values**: `1`, `2`, `4`, `8`

### `train.per_device_eval_batch_size`

- **Type**: Integer
- **Required**: No
- **Default**: Same as train batch size
- **Description**: Evaluation batch size per device
- **Used in**: Line ~476 in `run_cpt.py`
- **Implementation**:

```python
per_device_eval_batch_size=int(tr_cfg.get("per_device_eval_batch_size", tr_cfg.get("per_device_train_batch_size", 1)))
```

- **Example Values**: `1`, `2`, `4`, `8`

### `train.gradient_accumulation_steps`

- **Type**: Integer
- **Required**: No
- **Default**: `1`
- **Description**: Number of steps to accumulate gradients
- **Used in**: Line ~477 in `run_cpt.py`
- **Implementation**:

```python
gradient_accumulation_steps=int(tr_cfg.get("gradient_accumulation_steps", 1))
```

- **Example Values**: `1`, `4`, `8`, `16`, `32`

### `train.learning_rate`

- **Type**: Float
- **Required**: No
- **Default**: `2e-5`
- **Description**: Learning rate for optimizer
- **Used in**: Line ~478 in `run_cpt.py`
- **Implementation**:

```python
learning_rate=float(tr_cfg.get("learning_rate", 2e-5))
```

- **Example Values**: `1e-5`, `2e-5`, `5e-5`, `1e-4`

### `train.weight_decay`

- **Type**: Float
- **Required**: No
- **Default**: `0.0`
- **Description**: Weight decay for regularization
- **Used in**: Line ~479 in `run_cpt.py`
- **Implementation**:

```python
weight_decay=float(tr_cfg.get("weight_decay", 0.0))
```

- **Example Values**: `0.0`, `0.01`, `0.1`

### `train.warmup_ratio`

- **Type**: Float
- **Required**: No
- **Default**: `0.0`
- **Description**: Ratio of steps for learning rate warmup
- **Used in**: Line ~480 in `run_cpt.py`
- **Implementation**:

```python
warmup_ratio=float(tr_cfg.get("warmup_ratio", 0.0))
```

- **Example Values**: `0.0`, `0.1`, `0.2`

### `train.lr_scheduler_type`

- **Type**: String
- **Required**: No
- **Default**: `"cosine"`
- **Description**: Learning rate scheduler type
- **Used in**: Line ~481 in `run_cpt.py`
- **Implementation**:

```python
lr_scheduler_type=str(tr_cfg.get("lr_scheduler_type", "cosine"))
```

- **Example Values**:
  - `"cosine"` - Cosine annealing
  - `"linear"` - Linear decay
  - `"constant"` - Constant rate
  - `"polynomial"` - Polynomial decay

### `train.optim`

- **Type**: String
- **Required**: No
- **Default**: `"paged_adamw_8bit"` (if 4-bit), `"adamw_torch"` (otherwise)
- **Description**: Optimizer type
- **Used in**: Line ~482 in `run_cpt.py`
- **Implementation**:

```python
optim=str(tr_cfg.get("optim", "paged_adamw_8bit" if bool(model_cfg.get("use_4bit", False)) else "adamw_torch"))
```

- **Example Values**:
  - `"adamw_torch"` - AdamW (standard)
  - `"paged_adamw_8bit"` - Paged AdamW with 8-bit optimizer states (pairs well with 4-bit training)
  - `"sgd"` - SGD
  - `"adafactor"` - Adafactor

### `train.max_grad_norm`

- **Type**: Float
- **Required**: No
- **Default**: `1.0`
- **Description**: Maximum gradient norm for clipping
- **Used in**: Line ~483 in `run_cpt.py`
- **Implementation**:

```python
max_grad_norm=float(tr_cfg.get("max_grad_norm", 1.0))
```

- **Example Values**: `0.5`, `1.0`, `2.0`

### `train.gradient_checkpointing`

- **Type**: Boolean
- **Required**: No
- **Default**: `true`
- **Description**: Use gradient checkpointing to save memory
- **Used in**: Lines ~396-400 in `run_cpt.py`
- **Implementation**:

```python
gradient_checkpointing = bool(tr_cfg.get("gradient_checkpointing", True))
if gradient_checkpointing:
    model.gradient_checkpointing_enable()
```

- **Example Values**: `true`, `false`

### `train.logging_steps`

- **Type**: Integer
- **Required**: No
- **Default**: `10`
- **Description**: Log training progress every N steps
- **Used in**: Line ~485 in `run_cpt.py`
- **Implementation**:

```python
logging_steps=int(tr_cfg.get("logging_steps", 10))
```

- **Example Values**: `1`, `10`, `50`, `100`

### `train.save_strategy`

- **Type**: String
- **Required**: No
- **Default**: `"steps"`
- **Description**: When to save model checkpoints
- **Used in**: Line ~487 in `run_cpt.py`
- **Implementation**:

```python
save_strategy=str(tr_cfg.get("save_strategy", "steps"))
```

- **Example Values**:
  - `"steps"` - Save every N steps
  - `"epoch"` - Save every epoch
  - `"no"` - Don't save

### `train.save_steps`

- **Type**: Integer
- **Required**: No
- **Default**: `200`
- **Description**: Save checkpoint every N steps
- **Used in**: Line ~488 in `run_cpt.py`
- **Implementation**:

```python
save_steps=int(tr_cfg.get("save_steps", 200))
```

- **Example Values**: `50`, `100`, `200`, `500`

### `train.save_total_limit`

- **Type**: Integer
- **Required**: No
- **Default**: `3`
- **Description**: Maximum number of checkpoints to keep
- **Used in**: Line ~489 in `run_cpt.py`
- **Implementation**:

```python
save_total_limit=int(tr_cfg.get("save_total_limit", 3))
```

- **Example Values**: `1`, `2`, `3`, `5`

### `train.evaluation_strategy`

- **Type**: String
- **Required**: No
- **Default**: `"steps"` (if eval data), `"no"` (otherwise)
- **Description**: When to evaluate model
- **Used in**: Line ~494 in `run_cpt.py`
- **Implementation**:

```python
evaluation_strategy=str(tr_cfg.get("evaluation_strategy", "steps" if eval_ds is not None else "no"))
```

- **Example Values**:
  - `"steps"` - Evaluate every N steps
  - `"epoch"` - Evaluate every epoch
  - `"no"` - Don't evaluate

### `train.eval_steps`

- **Type**: Integer
- **Required**: No
- **Default**: `200`
- **Description**: Evaluate every N steps
- **Used in**: Line ~491 in `run_cpt.py`
- **Implementation**:

```python
eval_steps=int(tr_cfg.get("eval_steps", 200))
```

- **Example Values**: `25`, `50`, `100`, `200`

### `train.load_best_model_at_end`

- **Type**: Boolean
- **Required**: No
- **Default**: `true` (if eval data), `false` (otherwise)
- **Description**: Load best model at end of training
- **Used in**: Lines ~492-493 in `run_cpt.py`
- **Implementation**:

```python
load_best_model_at_end=bool(tr_cfg.get("load_best_model_at_end", True)) if eval_ds is not None else False
```

- **Example Values**: `true`, `false`
### `train.resume_from_checkpoint`

- **Type**: String or null
- **Required**: No
- **Default**: `null`
- **Description**: Resume training from a checkpoint
- **Used in**: Lines ~510-520 in `run_cpt.py`
- **Implementation**:

```python
resume_from = tr_cfg.get("resume_from_checkpoint", None)
if resume_from == "auto":
    last = get_last_checkpoint(str(ckpt_dir))
    resume_from = last if last else None
```

- **Example Values**:
  - `"auto"` - Auto-detect latest checkpoint
  - `"checkpoint-100"` - Specific checkpoint
  - `null` - Start from scratch

---

## Merge Parameters

### `merge.enabled`

- **Type**: Boolean
- **Required**: No
- **Default**: `false`
- **Description**: Whether to merge LoRA adapters with the base model
- **Used in**: Line ~545 in `run_cpt.py`
- **Implementation**:

```python
if bool(cfg.get("merge", {}).get("enabled", False)):
    merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
```

- **Example Values**: `true`, `false`

### `merge.merged_dtype`

- **Type**: String
- **Required**: No
- **Default**: `"float16"`
- **Description**: Data type for merged model
- **Used in**: Line ~430 in `run_cpt.py`
- **Implementation**:

```python
merged_dtype = _dtype_from_str(merge_cfg.get("merged_dtype", "float16"))
```

- **Example Values**: `"float16"`, `"bfloat16"`, `"float32"`

### `merge.max_shard_size`

- **Type**: String
- **Required**: No
- **Default**: `"2GB"`
- **Description**: Maximum size per shard when saving
- **Used in**: Line ~445 in `run_cpt.py`
- **Implementation**:

```python
merged.save_pretrained(str(final_dir), safe_serialization=True, max_shard_size=max_shard_size)
```

- **Example Values**: `"1GB"`, `"2GB"`, `"5GB"`

### `merge.output_dir`

- **Type**: String (path)
- **Required**: No
- **Default**: `null` (falls back to `<run_dir>/final_model`)
- **Description**: Directory for merged model output
- **Used in**: Lines ~505-510 in `run_cpt.py`
- **Implementation**:

```python
if merge_cfg.get("output_dir"):
    od = Path(str(merge_cfg["output_dir"]))
    final_dir = od if od.is_absolute() else (run_dir / od)
else:
    final_dir = run_dir / "final_model"
```

- **Example Values**: `"./merged_model"`, `"/workspace/final_model"`, `"./models/merged"`

---

## Parameter Dependencies and Interactions

### Memory-Related Dependencies
- `per_device_train_batch_size` × `gradient_accumulation_steps` = effective batch size
- `block_size` affects memory usage significantly
- `use_4bit` + `bnb_4bit_*` parameters work together for quantization
- `gradient_checkpointing` can enable a larger `block_size` or `batch_size`

### Training Strategy Dependencies
- `evaluation_strategy` requires either `eval_jsonl` or `eval_split_ratio > 0`
- `load_best_model_at_end` requires `evaluation_strategy` to be enabled
- `save_strategy` should be compatible with `evaluation_strategy`
- `lr_scheduler_type` affects warmup calculations

### Model-Specific Dependencies
- `target_modules` must match the actual module names in your model
- `torch_dtype` should be compatible with your GPU hardware
- `device_map` affects whether you can use certain optimizations

### Data Processing Dependencies
- `text_field` must exist in your JSONL data
- `pack_mode: "pad"` requires `block_size` to be set appropriately
- `eval_split_ratio` is ignored if `eval_jsonl` is provided

This documentation should help you understand and configure every parameter in the CPT training system according to your needs and constraints.
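As a closing illustration, the dependency rules above can be encoded as a small sanity checker (a hypothetical helper, not part of `run_cpt.py`):

```python
def check_config(cfg: dict) -> list:
    """Return human-readable warnings for common CPT mis-configurations."""
    warnings = []
    data = cfg.get("data", {})
    train = cfg.get("train", {})
    has_eval = bool(data.get("eval_jsonl")) or float(data.get("eval_split_ratio", 0.0)) > 0.0
    if train.get("evaluation_strategy", "no") != "no" and not has_eval:
        warnings.append("evaluation_strategy is set but no eval data is configured")
    if train.get("load_best_model_at_end") and train.get("evaluation_strategy", "no") == "no":
        warnings.append("load_best_model_at_end requires evaluation to be enabled")
    if data.get("eval_jsonl") and float(data.get("eval_split_ratio", 0.0)) > 0.0:
        warnings.append("eval_split_ratio is ignored because eval_jsonl is provided")
    return warnings
```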