CPT Configuration Parameters: Detailed Guide

This document provides a comprehensive explanation of all configuration parameters in config.yaml and how they're implemented in run_cpt.py.

Table of Contents

  • Run Parameters
  • Model Parameters
  • Data Parameters
  • PEFT Parameters
  • Training Parameters
  • Merge Parameters
  • Parameter Dependencies and Interactions

Run Parameters

run.run_dir

  • Type: String (path)
  • Required: Yes
  • Default: No default
  • Description: Directory where training outputs will be saved
  • Used in: Line ~480 in run_cpt.py
  • Implementation:
    run_dir = _ensure_dir(Path(cfg["run"]["run_dir"]))
    
  • Example Values:
    • ./runs/cpt_run_v1
    • /workspace/outputs/my_experiment
    • ./checkpoints/cpt_experiment

run.seed

  • Type: Integer
  • Required: No
  • Default: 42
  • Description: Random seed for reproducibility
  • Used in: Lines ~460, ~240 in run_cpt.py
  • Implementation:
    set_seed(int(cfg["run"].get("seed", 42)))
    # Used in data shuffling and train/test split
    
  • Example Values: 42, 123, 2023
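
A minimal stdlib sketch of why a fixed seed matters, using Python's `random` module as a stand-in for the dataset shuffle in run_cpt.py (the helper name `shuffled` is illustrative):

```python
import random

def shuffled(items, seed):
    """Return a shuffled copy of items, deterministic for a given seed."""
    rng = random.Random(seed)  # independent RNG; global state untouched
    out = list(items)
    rng.shuffle(out)
    return out

# The same seed always reproduces the same data order,
# which is what makes runs comparable and resumable.
a = shuffled(range(10), seed=42)
b = shuffled(range(10), seed=42)
```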

Model Parameters

model.repo_id

  • Type: String (path or HuggingFace repo)
  • Required: Yes
  • Default: No default
  • Description: Model identifier - can be local path or HuggingFace repository
  • Used in: Lines ~480-500 in run_cpt.py
  • Implementation:
    repo_id = str(model_cfg["repo_id"]).strip()
    repo_path = Path(repo_id)
    if repo_path.exists() and repo_path.is_dir():
        base_dir = repo_path  # Local path
    else:
        # Download from HuggingFace
        snapshot_download(repo_id=repo_id, ...)
    
  • Example Values:
    • Local: /workspace/Models/Devstral-Small-2-24B-Instruct-2512
    • HF Repo: meta-llama/Llama-2-7b-hf
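
The branch above can be sketched with `pathlib` alone; `resolve_model_source` is a hypothetical helper, and the Hub branch returns a marker instead of calling `snapshot_download` so the sketch stays dependency-free:

```python
from pathlib import Path
import tempfile

def resolve_model_source(repo_id: str):
    """("local", path) if repo_id is an existing directory on disk,
    otherwise ("hub", repo_id), signalling a snapshot_download."""
    p = Path(repo_id.strip())
    if p.exists() and p.is_dir():
        return ("local", p)
    return ("hub", repo_id)

with tempfile.TemporaryDirectory() as d:
    kind_local, _ = resolve_model_source(d)  # existing directory -> "local"
kind_hub, _ = resolve_model_source("meta-llama/Llama-2-7b-hf")  # -> "hub"
```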

model.revision

  • Type: String or null
  • Required: No
  • Default: null
  • Description: Specific model revision/branch/tag from HuggingFace
  • Used in: Line ~495 in run_cpt.py
  • Implementation:
    snapshot_download(
        repo_id=repo_id,
        revision=model_cfg.get("revision", None),
        ...
    )
    
  • Example Values:
    • "main" - Main branch
    • "v1.0" - Specific tag
    • "abc123def" - Specific commit hash
    • null - Latest version

model.base_local_dir

  • Type: String (path)
  • Required: No
  • Default: "base_model"
  • Description: Directory name for downloaded model when using HF repo
  • Used in: Line ~495 in run_cpt.py
  • Implementation:
    base_dir = _ensure_dir(run_dir / model_cfg.get("base_local_dir", "base_model"))
    
  • Example Values: "base_model", "downloaded_model", "model_files"

model.trust_remote_code

  • Type: Boolean
  • Required: No
  • Default: true
  • Description: Allow loading models with custom code
  • Used in: Lines ~320, ~340, ~450 in run_cpt.py
  • Implementation:
    tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
    model = AutoModelForCausalLM.from_pretrained(..., trust_remote_code=trust_remote_code, ...)
    
  • Example Values: true, false

model.tokenizer_use_fast

  • Type: Boolean
  • Required: No
  • Default: true
  • Description: Use fast tokenizer implementation
  • Used in: Lines ~320, ~450 in run_cpt.py
  • Implementation:
    tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
    
  • Example Values: true, false

model.device_map

  • Type: String
  • Required: No
  • Default: "auto"
  • Description: How to distribute model across devices
  • Used in: Lines ~350, ~370 in run_cpt.py
  • Implementation:
    model = AutoModelForCausalLM.from_pretrained(..., device_map=device_map, ...)
    
  • Example Values:
    • "auto" - Automatic distribution
    • "cpu" - CPU only
    • "cuda:0" - Single GPU
    • {"": 0} - Manual mapping

model.torch_dtype

  • Type: String
  • Required: No
  • Default: "bfloat16"
  • Description: Data type for model tensors
  • Used in: Lines ~45, ~350 in run_cpt.py
  • Implementation:
    def _dtype_from_str(s: str) -> torch.dtype:
        if s in ("float16", "fp16"): return torch.float16
        if s in ("bfloat16", "bf16"): return torch.bfloat16
        if s in ("float32", "fp32"): return torch.float32
        raise ValueError(f"Unknown dtype string: {s}")  # sensible fallback; exact handling may differ
    
  • Example Values:
    • "float16" - 16-bit floats (faster, less memory, less stable)
    • "bfloat16" - Brain float16 (stable, good for training)
    • "float32" - 32-bit floats (slowest, most memory)

model.use_4bit

  • Type: Boolean
  • Required: No
  • Default: false
  • Description: Use 4-bit quantization for memory efficiency
  • Used in: Lines ~325, ~395 in run_cpt.py
  • Implementation:
    use_4bit = bool(model_cfg.get("use_4bit", False))
    if use_4bit:
        quant_cfg = BitsAndBytesConfig(load_in_4bit=True, ...)
    
  • Example Values: true, false

model.bnb_4bit_quant_type

  • Type: String
  • Required: No
  • Default: "nf4"
  • Description: 4-bit quantization type
  • Used in: Lines ~328 in run_cpt.py
  • Implementation:
    bnb_4bit_quant_type=str(model_cfg.get("bnb_4bit_quant_type", "nf4"))
    
  • Example Values:
    • "nf4" - NormalFloat4 (recommended)
    • "fp4" - FloatingPoint4
  • Note: bitsandbytes supports only "nf4" and "fp4" for 4-bit quantization

model.bnb_4bit_use_double_quant

  • Type: Boolean
  • Required: No
  • Default: true
  • Description: Also quantize the quantization constants, saving additional memory
  • Used in: Lines ~329 in run_cpt.py
  • Implementation:
    bnb_4bit_use_double_quant=bool(model_cfg.get("bnb_4bit_use_double_quant", True))
    
  • Example Values: true, false

model.bnb_4bit_compute_dtype

  • Type: String
  • Required: No
  • Default: "bfloat16"
  • Description: Compute dtype for 4-bit quantization
  • Used in: Lines ~330 in run_cpt.py
  • Implementation:
    bnb_4bit_compute_dtype=_dtype_from_str(model_cfg.get("bnb_4bit_compute_dtype", "bfloat16"))
    
  • Example Values: "float16", "bfloat16", "float32"

model.attn_implementation

  • Type: String or null
  • Required: No
  • Default: null
  • Description: Attention implementation to use
  • Used in: Lines ~155, ~350 in run_cpt.py
  • Implementation:
    def _choose_attn_impl(cfg: Dict[str, Any]) -> Optional[str]:
        return cfg.get("model", {}).get("attn_implementation", None)
    # Used in model.from_pretrained(..., attn_implementation=attn_impl, ...)
    
  • Example Values:
    • "flash_attention_2" - Flash Attention 2 (fastest)
    • "sdpa" - Scaled Dot-Product Attention
    • null - Default implementation

Data Parameters

data.train_jsonl

  • Type: String (path)
  • Required: Yes
  • Default: No default
  • Description: Path to training data in JSONL format
  • Used in: Lines ~170 in run_cpt.py
  • Implementation:
    train_path = data_cfg["train_jsonl"]
    ds = load_dataset("json", data_files={"train": train_path})
    
  • Example Values: "/workspace/all_data_with_descriptions.jsonl"

data.eval_jsonl

  • Type: String (path) or null
  • Required: No
  • Default: null
  • Description: Path to evaluation data in JSONL format
  • Used in: Lines ~175 in run_cpt.py
  • Implementation:
    eval_path = data_cfg.get("eval_jsonl", None)
    if eval_path:
        ds_eval = load_dataset("json", data_files={"eval": eval_path})
    
  • Example Values: null (no separate eval file), "/workspace/eval_data.jsonl"

data.eval_split_ratio

  • Type: Float
  • Required: No
  • Default: 0.0 (no automatic split)
  • Description: Ratio of training data to use for evaluation split
  • Used in: Lines ~177 in run_cpt.py
  • Implementation:
    split_ratio = float(data_cfg.get("eval_split_ratio", 0.0))
    if 0.0 < split_ratio < 1.0:
        split = ds["train"].train_test_split(test_size=split_ratio, seed=seed)
    
  • Example Values: 0.1 (10%), 0.2 (20%), 0.05 (5%)
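
The resulting split sizes are simple arithmetic; a quick sketch (exact rounding may differ slightly from `datasets.train_test_split`):

```python
def approx_split_sizes(n_examples: int, eval_split_ratio: float):
    """Approximate train/eval sizes produced by a fractional test_size."""
    n_eval = int(n_examples * eval_split_ratio)
    return n_examples - n_eval, n_eval

# 10% of 5000 examples held out for evaluation:
train_n, eval_n = approx_split_sizes(5000, 0.1)  # -> (4500, 500)
```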

data.text_field

  • Type: String
  • Required: No
  • Default: "text"
  • Description: Field name in JSONL containing the text data
  • Used in: Lines ~185 in run_cpt.py
  • Implementation:
    text_field = data_cfg.get("text_field", "text")
    # tokenize_fn reads examples[text_field] for each batch
    tokenized = dsd["train"].map(
        tokenize_fn,
        batched=True,
        remove_columns=dsd["train"].column_names,
        desc="Tokenizing train",
    )
    
  • Example Values: "text", "content", "prompt", "input"

data.block_size

  • Type: Integer
  • Required: No
  • Default: 2048
  • Description: Maximum sequence length for training
  • Used in: Lines ~180 in run_cpt.py
  • Implementation:
    block_size = int(data_cfg.get("block_size", 2048))
    # Used in grouping texts into blocks
    for i in range(0, full_len, block_size):
        chunk = concatenated["input_ids"][i:i + block_size]
    
  • Example Values: 2048, 4096, 8192

data.shuffle

  • Type: Boolean
  • Required: No
  • Default: true
  • Description: Whether to shuffle training data
  • Used in: Lines ~235 in run_cpt.py
  • Implementation:
    if shuffle:
        tokenized_train = tokenized_train.shuffle(seed=int(cfg["run"].get("seed", 42)))
    
  • Example Values: true, false

data.num_proc

  • Type: Integer
  • Required: No
  • Default: 4
  • Description: Number of worker processes for dataset map operations (tokenization and packing)
  • Used in: Lines ~200, ~210 in run_cpt.py
  • Implementation:
    num_proc = int(data_cfg.get("num_proc", 4))
    tokenized_train = dsd["train"].map(
        tokenize_fn,
        batched=True,
        num_proc=num_proc,
        ...
    )
    
  • Example Values: 1, 4, 8, 16

data.pack_mode

  • Type: String
  • Required: No
  • Default: "drop"
  • Description: How to handle remainder tokens in final block
  • Used in: Lines ~150-230 in run_cpt.py
  • Implementation:
    pack_mode = str(data_cfg.get("pack_mode", "drop")).lower().strip()
    if pack_mode == "pad":
        # Pad remainder and mask loss
        labels[-pad_len:] = [-100] * pad_len
    # If "drop": ignore remainder entirely
    
  • Example Values:
    • "drop" - Drop incomplete blocks (strict CPT)
    • "pad" - Pad incomplete blocks with masked loss

PEFT Parameters

peft.enabled

  • Type: Boolean
  • Required: No
  • Default: true
  • Description: Whether to use PEFT (Parameter-Efficient Fine-Tuning)
  • Used in: Lines ~395 in run_cpt.py
  • Implementation:
    if not bool(peft_cfg.get("enabled", True)):
        return model, None
    # Otherwise proceed with LoRA configuration
    
  • Example Values: true, false

peft.r

  • Type: Integer
  • Required: No
  • Default: 16
  • Description: LoRA rank - dimension of low-rank matrices
  • Used in: Lines ~415 in run_cpt.py
  • Implementation:
    lora_config = LoraConfig(
        r=int(peft_cfg.get("r", 16)),
        ...
    )
    
  • Example Values: 8, 16, 32, 64, 128
  • Note: Higher values = more parameters but potentially better performance

peft.lora_alpha

  • Type: Integer
  • Required: No
  • Default: 32
  • Description: LoRA alpha scaling parameter
  • Used in: Lines ~416 in run_cpt.py
  • Implementation:
    lora_config = LoraConfig(
        lora_alpha=int(peft_cfg.get("lora_alpha", 32)),
        ...
    )
    
  • Example Values: 16, 32, 64, 128, 256
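
`lora_alpha` only matters relative to `r`: in standard LoRA the low-rank update is scaled by `alpha / r` before being added to the frozen weights, so raising the rank without raising alpha shrinks the effective update. Quick arithmetic:

```python
def lora_scaling(lora_alpha: int, r: int) -> float:
    """Scaling factor applied to the LoRA update BA."""
    return lora_alpha / r

# Different ranks, same effective scaling of 2.0:
s1 = lora_scaling(32, 16)
s2 = lora_scaling(128, 64)
```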

peft.lora_dropout

  • Type: Float
  • Required: No
  • Default: 0.05
  • Description: Dropout rate for LoRA layers
  • Used in: Lines ~417 in run_cpt.py
  • Implementation:
    lora_config = LoraConfig(
        lora_dropout=float(peft_cfg.get("lora_dropout", 0.05)),
        ...
    )
    
  • Example Values: 0.0, 0.05, 0.1, 0.2

peft.bias

  • Type: String
  • Required: No
  • Default: "none"
  • Description: Bias training strategy
  • Used in: Lines ~418 in run_cpt.py
  • Implementation:
    lora_config = LoraConfig(
        bias=str(peft_cfg.get("bias", "none")),
        ...
    )
    
  • Example Values:
    • "none" - No bias training
    • "all" - Train all biases
    • "lora_only" - Only LoRA bias

peft.target_modules

  • Type: String or List
  • Required: No
  • Default: "auto"
  • Description: Which modules to apply LoRA to
  • Used in: Lines ~405, ~140-170 in run_cpt.py
  • Implementation:
    target_modules = peft_cfg.get("target_modules", "auto")
    if target_modules == "auto":
        target_modules = _infer_target_modules(model)
    
  • Example Values:
    • "auto" - Automatic detection
    • ["q_proj", "k_proj", "v_proj", "o_proj"] - Explicit list
    • ["mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"] - MLP only
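
`_infer_target_modules` is internal to run_cpt.py; a common approach, sketched here over mock module names, is to collect the unique leaf names of the model's Linear layers (in a real model the names would come from `model.named_modules()` filtered to `torch.nn.Linear`):

```python
def infer_target_modules(linear_module_names):
    """Guess LoRA targets by collecting unique leaf names of Linear layers."""
    leaves = {name.rsplit(".", 1)[-1] for name in linear_module_names}
    leaves.discard("lm_head")  # the output head is usually excluded
    return sorted(leaves)

# Mock fully-qualified names standing in for a scan of the model:
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.1.self_attn.q_proj",
    "lm_head",
]
targets = infer_target_modules(names)  # -> ["q_proj", "v_proj"]
```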

Training Parameters

train.num_train_epochs

  • Type: Float
  • Required: No
  • Default: 1
  • Description: Number of epochs to train
  • Used in: Lines ~470 in run_cpt.py
  • Implementation:
    num_train_epochs = float(tr_cfg.get("num_train_epochs", 1))
    # Used in TrainingArguments
    
  • Example Values: 1.0, 2.0, 3.5

train.per_device_train_batch_size

  • Type: Integer
  • Required: No
  • Default: 1
  • Description: Training batch size per device
  • Used in: Lines ~475 in run_cpt.py
  • Implementation:
    per_device_train_batch_size=int(tr_cfg.get("per_device_train_batch_size", 1))
    
  • Example Values: 1, 2, 4, 8

train.per_device_eval_batch_size

  • Type: Integer
  • Required: No
  • Default: Same as train batch size
  • Description: Evaluation batch size per device
  • Used in: Lines ~476 in run_cpt.py
  • Implementation:
    per_device_eval_batch_size=int(tr_cfg.get("per_device_eval_batch_size", tr_cfg.get("per_device_train_batch_size", 1)))
    
  • Example Values: 1, 2, 4, 8

train.gradient_accumulation_steps

  • Type: Integer
  • Required: No
  • Default: 1
  • Description: Number of steps to accumulate gradients
  • Used in: Lines ~477 in run_cpt.py
  • Implementation:
    gradient_accumulation_steps=int(tr_cfg.get("gradient_accumulation_steps", 1))
    
  • Example Values: 1, 4, 8, 16, 32

train.learning_rate

  • Type: Float
  • Required: No
  • Default: 2e-5
  • Description: Learning rate for optimizer
  • Used in: Lines ~478 in run_cpt.py
  • Implementation:
    learning_rate=float(tr_cfg.get("learning_rate", 2e-5))
    
  • Example Values: 1e-5, 2e-5, 5e-5, 1e-4

train.weight_decay

  • Type: Float
  • Required: No
  • Default: 0.0
  • Description: Weight decay for regularization
  • Used in: Lines ~479 in run_cpt.py
  • Implementation:
    weight_decay=float(tr_cfg.get("weight_decay", 0.0))
    
  • Example Values: 0.0, 0.01, 0.1

train.warmup_ratio

  • Type: Float
  • Required: No
  • Default: 0.0
  • Description: Ratio of steps for learning rate warmup
  • Used in: Lines ~480 in run_cpt.py
  • Implementation:
    warmup_ratio=float(tr_cfg.get("warmup_ratio", 0.0))
    
  • Example Values: 0.0, 0.1, 0.2

train.lr_scheduler_type

  • Type: String
  • Required: No
  • Default: "cosine"
  • Description: Learning rate scheduler type
  • Used in: Lines ~481 in run_cpt.py
  • Implementation:
    lr_scheduler_type=str(tr_cfg.get("lr_scheduler_type", "cosine"))
    
  • Example Values:
    • "cosine" - Cosine annealing
    • "linear" - Linear decay
    • "constant" - Constant rate
    • "polynomial" - Polynomial decay

train.optim

  • Type: String
  • Required: No
  • Default: "paged_adamw_8bit" (if 4-bit), "adamw_torch" (otherwise)
  • Description: Optimizer type
  • Used in: Lines ~482 in run_cpt.py
  • Implementation:
    optim=str(tr_cfg.get("optim", "paged_adamw_8bit" if bool(model_cfg.get("use_4bit", False)) else "adamw_torch"))
    
  • Example Values:
    • "adamw_torch" - AdamW (standard)
    • "paged_adamw_8bit" - Paged AdamW for 8-bit training
    • "sgd" - SGD
    • "adafactor" - Adafactor

train.max_grad_norm

  • Type: Float
  • Required: No
  • Default: 1.0
  • Description: Maximum gradient norm for clipping
  • Used in: Lines ~483 in run_cpt.py
  • Implementation:
    max_grad_norm=float(tr_cfg.get("max_grad_norm", 1.0))
    
  • Example Values: 0.5, 1.0, 2.0

train.gradient_checkpointing

  • Type: Boolean
  • Required: No
  • Default: true
  • Description: Use gradient checkpointing to save memory
  • Used in: Lines ~396-400 in run_cpt.py
  • Implementation:
    gradient_checkpointing = bool(tr_cfg.get("gradient_checkpointing", True))
    if gradient_checkpointing:
        model.gradient_checkpointing_enable()
    
  • Example Values: true, false

train.logging_steps

  • Type: Integer
  • Required: No
  • Default: 10
  • Description: Log training progress every N steps
  • Used in: Lines ~485 in run_cpt.py
  • Implementation:
    logging_steps=int(tr_cfg.get("logging_steps", 10))
    
  • Example Values: 1, 10, 50, 100

train.save_strategy

  • Type: String
  • Required: No
  • Default: "steps"
  • Description: When to save model checkpoints
  • Used in: Lines ~487 in run_cpt.py
  • Implementation:
    save_strategy=str(tr_cfg.get("save_strategy", "steps"))
    
  • Example Values:
    • "steps" - Save every save_steps steps
    • "epoch" - Save at the end of each epoch
    • "no" - Don't save

train.save_steps

  • Type: Integer
  • Required: No
  • Default: 200
  • Description: Save checkpoint every N steps
  • Used in: Lines ~488 in run_cpt.py
  • Implementation:
    save_steps=int(tr_cfg.get("save_steps", 200))
    
  • Example Values: 50, 100, 200, 500

train.save_total_limit

  • Type: Integer
  • Required: No
  • Default: 3
  • Description: Maximum number of checkpoints to keep
  • Used in: Lines ~489 in run_cpt.py
  • Implementation:
    save_total_limit=int(tr_cfg.get("save_total_limit", 3))
    
  • Example Values: 1, 2, 3, 5

train.evaluation_strategy

  • Type: String
  • Required: No
  • Default: "steps" (if eval data), "no" (otherwise)
  • Description: When to evaluate model
  • Used in: Lines ~494 in run_cpt.py
  • Implementation:
    evaluation_strategy=str(tr_cfg.get("evaluation_strategy", "steps" if eval_ds is not None else "no"))
    
  • Example Values:
    • "steps" - Evaluate every eval_steps steps
    • "epoch" - Evaluate at the end of each epoch
    • "no" - Don't evaluate

train.eval_steps

  • Type: Integer
  • Required: No
  • Default: 200
  • Description: Evaluate every N steps
  • Used in: Lines ~491 in run_cpt.py
  • Implementation:
    eval_steps=int(tr_cfg.get("eval_steps", 200))
    
  • Example Values: 25, 50, 100, 200

train.load_best_model_at_end

  • Type: Boolean
  • Required: No
  • Default: true (if eval data), false (otherwise)
  • Description: Load best model at end of training
  • Used in: Lines ~492-493 in run_cpt.py
  • Implementation:
    load_best_model_at_end=bool(tr_cfg.get("load_best_model_at_end", True)) if eval_ds is not None else False
    
  • Example Values: true, false

train.resume_from_checkpoint

  • Type: String or null
  • Required: No
  • Default: null
  • Description: Resume training from checkpoint
  • Used in: Lines ~510-520 in run_cpt.py
  • Implementation:
    resume_from = tr_cfg.get("resume_from_checkpoint", None)
    if resume_from == "auto":
        last = get_last_checkpoint(str(ckpt_dir))
        resume_from = last if last else None
    
  • Example Values:
    • "auto" - Auto-detect latest checkpoint
    • "checkpoint-100" - Specific checkpoint
    • null - Start from scratch
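
The "auto" branch relies on `transformers.trainer_utils.get_last_checkpoint`; a hedged stdlib approximation that scans for `checkpoint-N` directories (the real function also validates directory contents):

```python
import re
import tempfile
from pathlib import Path

def last_checkpoint(ckpt_dir):
    """Return the checkpoint-N subdirectory with the highest step, or None."""
    best, best_step = None, -1
    for p in Path(ckpt_dir).iterdir():
        m = re.fullmatch(r"checkpoint-(\d+)", p.name)
        if p.is_dir() and m and int(m.group(1)) > best_step:
            best, best_step = p, int(m.group(1))
    return best

with tempfile.TemporaryDirectory() as d:
    for step in (100, 200, 50):
        (Path(d) / f"checkpoint-{step}").mkdir()
    latest_name = last_checkpoint(d).name  # -> "checkpoint-200"
```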

Merge Parameters

merge.enabled

  • Type: Boolean
  • Required: No
  • Default: false
  • Description: Whether to merge LoRA adapters with base model
  • Used in: Lines ~545 in run_cpt.py
  • Implementation:
    if bool(cfg.get("merge", {}).get("enabled", False)):
        merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
    
  • Example Values: true, false

merge.merged_dtype

  • Type: String
  • Required: No
  • Default: "float16"
  • Description: Data type for merged model
  • Used in: Lines ~430 in run_cpt.py
  • Implementation:
    merged_dtype = _dtype_from_str(merge_cfg.get("merged_dtype", "float16"))
    
  • Example Values: "float16", "bfloat16", "float32"

merge.max_shard_size

  • Type: String
  • Required: No
  • Default: "2GB"
  • Description: Maximum size per shard when saving
  • Used in: Lines ~445 in run_cpt.py
  • Implementation:
    merged.save_pretrained(str(final_dir), safe_serialization=True, max_shard_size=max_shard_size)
    
  • Example Values: "1GB", "2GB", "5GB"

merge.output_dir

  • Type: String (path)
  • Required: No
  • Default: null (falls back to run_dir/final_model)
  • Description: Directory for merged model output
  • Used in: Lines ~505-510 in run_cpt.py
  • Implementation:
    if merge_cfg.get("output_dir"):
        od = Path(str(merge_cfg["output_dir"]))
        final_dir = od if od.is_absolute() else (run_dir / od)
    else:
        final_dir = run_dir / "final_model"
    
  • Example Values: "./merged_model", "/workspace/final_model", "./models/merged"
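
The path resolution above, as a standalone `pathlib` sketch (`resolve_merge_dir` is an illustrative name):

```python
from pathlib import Path

def resolve_merge_dir(run_dir: Path, output_dir):
    """Absolute paths win; relative paths are anchored under run_dir;
    an unset value falls back to run_dir / 'final_model'."""
    if output_dir:
        od = Path(str(output_dir))
        return od if od.is_absolute() else run_dir / od
    return run_dir / "final_model"

run_dir = Path("runs/cpt_run_v1")
a = resolve_merge_dir(run_dir, "merged_model")            # anchored under run_dir
b = resolve_merge_dir(run_dir, "/workspace/final_model")  # absolute path kept
c = resolve_merge_dir(run_dir, None)                      # fallback
```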

Parameter Dependencies and Interactions

Memory-Related Dependencies

  • per_device_train_batch_size × gradient_accumulation_steps (× number of devices) = effective batch size
  • block_size affects memory usage significantly
  • use_4bit + bnb_4bit_* parameters work together for quantization
  • gradient_checkpointing can enable larger block_size or batch_size
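
The first item above is worth making concrete; a small sketch of the arithmetic (`num_devices` is an assumption for multi-GPU setups, not a config key):

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Examples contributing to each optimizer update."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices

# batch size 1 with 16 accumulation steps on one GPU:
ebs = effective_batch_size(1, 16)  # -> 16
```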

Training Strategy Dependencies

  • evaluation_strategy requires either eval_jsonl or eval_split_ratio > 0
  • load_best_model_at_end requires evaluation_strategy to be enabled
  • save_strategy should be compatible with evaluation_strategy
  • lr_scheduler_type affects warmup calculations

Model-Specific Dependencies

  • target_modules must match the actual module names in your model
  • torch_dtype should be compatible with your GPU hardware
  • device_map affects whether you can use certain optimizations

Data Processing Dependencies

  • text_field must exist in your JSONL data
  • pack_mode: "pad" requires block_size to be set appropriately
  • eval_split_ratio is ignored if eval_jsonl is provided

This comprehensive documentation should help you understand and configure all parameters in the CPT training system according to your specific needs and constraints.