SirajRLX committed on
Commit e527a65 · verified · 1 Parent(s): 89ec0cd

Add Training Scripts
trainer-kit/.gitignore ADDED
@@ -0,0 +1,55 @@
+ # Python
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.pyd
+ .Python
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+ pythonenv*
+ .pytest_cache/
+ ipynb_checkpoints/
+
+ # Virtualenv
+ .venv
+ venv/
+ virtualenv/
+ env/
+
+ # IDE
+ .vscode/
+ .idea/
+ *.sublime-workspace
+ *.sublime-project
+ *.swp
+ *.swo
+
+ # Build
+ build/
+ dist/
+ *.egg-info/
+
+ # Data and logs
+ data/
+ logs/
+ *.log
+ runs/**
+ output/**
+
+ # Jupyter
+ .ipynb_checkpoints/
+
+ # Environment
+ .env
+ .ENV
+ .env.bak
+ .venv
+ venv
+ venv.bak
+
+ # OS generated files
+ .DS_Store
+ Thumbs.db
trainer-kit/CPT-14b/README.md ADDED
@@ -0,0 +1,189 @@
+ # Trainer‑Kit: Config‑Driven CPT (LoRA / QLoRA) with Packing, Logging, Resume, and Merge
+
+ Trainer‑Kit is a small, config‑driven training runner for **continued pretraining (CPT)** on causal LMs.
+ It supports **LoRA** and **QLoRA**, data **packing** (strict or padding‑masked), **checkpointing + resume**, **JSONL logging**, periodic **eval with perplexity**, and an optional **merge** step to export a final merged model.
+
+ ---
+
+ ## What we built
+
+ ### ✅ Core goals implemented
+
+ * **CPT training loop** controlled entirely via a **YAML config**
+ * **Local model support** (load from filesystem) and optional **HF download** (if `repo_id` is a hub id)
+ * **JSONL datasets** for train (+ optional eval split)
+ * **CPT‑style token stream packing** into fixed‑length blocks
+ * **Two packing modes**
+   * `drop`: strict CPT, drop remainder tokens (preferred for real CPT)
+   * `pad`: pad the remainder to `block_size` and **mask loss** on padding (useful for small datasets / debugging)
+ * **Checkpointing + resume**
+   * `resume_from_checkpoint: "auto"` resumes from the latest checkpoint under `run_dir/checkpoints`
+ * **JSONL logs** written locally
+   * training logs: `run_dir/logs/train.jsonl`
+   * eval logs: `run_dir/logs/eval.jsonl`
+ * **Evaluation**
+   * logs `eval_loss` and the computed `perplexity = exp(eval_loss)` (with a safe overflow guard)
+ * **Adapter output**
+   * saves the final/best adapter to `run_dir/best_adapter`
+ * **Merge workflow**
+   * `--merge-only` merges an existing adapter later
+   * merging is done **on CPU** to avoid GPU OOM
+   * the merged model is stored under the configured merge output directory (relative to `run_dir` if a relative path)
+
+ ---
+
+ ## Repository layout (outputs)
+
+ A run produces the following structure under `run.run_dir`:
+
+ ```
+ runs/<run_name>/
+ ├─ checkpoints/          # trainer checkpoints (for resume)
+ ├─ best_adapter/         # saved LoRA adapter
+ ├─ logs/
+ │  ├─ train.jsonl        # step-wise training logs
+ │  └─ eval.jsonl         # eval logs (eval_loss + perplexity)
+ ├─ eval_final.json       # final eval metrics summary (if eval is enabled)
+ └─ config_resolved.yaml  # exact config used for the run
+ ```
+
+ If merge is used, the merged model is written to:
+
+ * `run_dir/<merge.output_dir>` if `merge.output_dir` is relative (e.g. `./merged_model`)
+ * or the absolute path if it is absolute.
+
+ ---
+
+ ## Supported training modes
+
+ ### 1) LoRA vs QLoRA (same script)
+
+ * **QLoRA** is used when `model.use_4bit: true`
+   * base weights are loaded in 4‑bit using bitsandbytes
+   * training updates only the LoRA parameters
+ * **LoRA** is used when `model.use_4bit: false`
+   * base weights are loaded in fp16/bf16 (as configured)
+   * training updates only the LoRA parameters
+
+ No "full finetune" mode is enabled by default in this runner.
+
+ ---
+
+ ## Data pipeline (CPT behavior)
+
+ ### Input format
+
+ * JSONL file where each line contains a text field (default `"text"`).
+ * Example: `{"text": "some training text..."}`
+
+ ### Packing (token stream → fixed blocks)
+
+ * Each sample is tokenized without truncation.
+ * An **EOS token is appended** per document to preserve boundaries.
+ * Token lists are concatenated and split into **fixed‑length blocks** of `data.block_size`.
+
+ Two modes:
+
+ * **`drop` (strict CPT):** remainder tokens that don't fill a full block are discarded.
+ * **`pad` (debug/small data):** the remainder is padded to `block_size`:
+   * `attention_mask = 0` for padded positions
+   * `labels = -100` for padded positions (loss masking)
+
+ This is what allowed training to proceed even with tiny dummy datasets at `block_size=1024`.
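The two packing modes above can be sketched as a small function over plain token-id lists (this mirrors the `group_texts` logic in `run_cpt.py`; `pad_id=0` here is only an illustrative choice):

```python
# Minimal sketch of CPT packing: split a concatenated token stream into
# fixed-length blocks. "drop" discards the remainder; "pad" keeps it,
# padding input_ids, zeroing attention_mask, and masking labels with -100.
def pack_tokens(token_ids, block_size, pack_mode="drop", pad_id=0):
    blocks = []
    full_len = (len(token_ids) // block_size) * block_size
    for i in range(0, full_len, block_size):
        chunk = token_ids[i:i + block_size]
        blocks.append({"input_ids": chunk,
                       "attention_mask": [1] * block_size,
                       "labels": chunk.copy()})
    remainder = len(token_ids) - full_len
    if remainder > 0 and pack_mode == "pad":
        pad_len = block_size - remainder
        chunk = token_ids[full_len:] + [pad_id] * pad_len
        labels = chunk.copy()
        labels[-pad_len:] = [-100] * pad_len  # loss mask on padding
        blocks.append({"input_ids": chunk,
                       "attention_mask": [1] * remainder + [0] * pad_len,
                       "labels": labels})
    return blocks
```

With 10 tokens and `block_size=4`, `drop` yields 2 blocks while `pad` yields 3, the last one loss-masked on its final two positions.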
+
+ ---
+
+ ## Logging
+
+ Trainer‑Kit writes **machine‑readable logs** in JSONL.
+
+ ### Training logs (`logs/train.jsonl`)
+
+ Entries include:
+
+ * `step`
+ * `loss`
+ * `grad_norm`
+ * `learning_rate`
+ * `progress_pct` (step progress when `max_steps` is active)
+ * an ETA estimate
+
+ ### Eval logs (`logs/eval.jsonl`)
+
+ Entries include:
+
+ * `eval_loss`
+ * `perplexity`
+
+ Notes:
+
+ * When using `max_steps`, the Trainer's internal `epoch` counter can grow unexpectedly on tiny datasets (because steps/epoch becomes ~1).
+   **Use `progress_pct` as the reliable indicator** for step‑based runs.
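The overflow guard mentioned above is just a clamp on the loss before exponentiating (this mirrors `_safe_exp` in `run_cpt.py`; the `50.0` cap comes from that helper):

```python
import math

# Clamp eval_loss before exponentiating so perplexity never overflows
# to inf, even for pathological losses early in training.
def safe_perplexity(eval_loss: float) -> float:
    return math.exp(min(float(eval_loss), 50.0))
```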
+
+ ---
+
+ ## Checkpointing and resume
+
+ The trainer saves checkpoints under `run_dir/checkpoints/`.
+
+ Resume options:
+
+ * `resume_from_checkpoint: "auto"` → picks the latest checkpoint automatically
+ * `resume_from_checkpoint: "/path/to/checkpoint"` → resumes from a specific checkpoint
+ * `resume_from_checkpoint: null` → fresh run
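The `"auto"` behavior can be sketched as picking the `checkpoint-<step>` directory with the highest step number (a simplified stand-in: `run_cpt.py` itself delegates this to `get_last_checkpoint` from `transformers.trainer_utils`):

```python
import re
from pathlib import Path
from typing import Optional

# Simplified sketch of resume_from_checkpoint: "auto" — return the
# checkpoint-<step> subdirectory with the largest step, or None.
def latest_checkpoint(checkpoints_dir: str) -> Optional[str]:
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for p in Path(checkpoints_dir).glob("checkpoint-*"):
        m = pattern.match(p.name)
        if p.is_dir() and m and int(m.group(1)) > best_step:
            best_step, best_path = int(m.group(1)), str(p)
    return best_path
```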
+
+ ---
+
+ ## Merging adapters into a final model
+
+ Trainer‑Kit supports exporting a merged model:
+
+ ### Merge after training
+
+ * Enable merge in the config (`merge.enabled: true`)
+ * The script will:
+   1. save the adapter
+   2. free GPU memory
+   3. reload the base model on **CPU**
+   4. load the adapter
+   5. call `merge_and_unload()`
+   6. save the final merged model
+
+ ### Merge later
+
+ Run:
+
+ ```
+ python run_cpt.py --config config.yaml --merge-only
+ ```
+
+ This skips training and merges `run_dir/best_adapter` into the base model.
+
+ ---
+
+ ## How to run
+
+ ### Train
+
+ ```
+ python run_cpt.py --config config.yaml
+ ```
+
+ ### Merge only
+
+ ```
+ python run_cpt.py --config config.yaml --merge-only
+ ```
trainer-kit/CPT-14b/README_instruct.md ADDED
@@ -0,0 +1,168 @@
+ # Instruction Fine-Tuning Script
+
+ This script (`run_instruct.py`) is designed for fine-tuning language models on instruction-following tasks. It is based on the original CPT script but adapted specifically for instruction input/output pairs.
+
+ ## Key Differences from CPT
+
+ 1. **Data Format**: Handles structured instruction data with separate fields for instruction, input, and output
+ 2. **Formatting Options**: Supports multiple instruction formats (ChatML, Alpaca, custom templates)
+ 3. **No Text Packing**: Each example is treated as a complete instruction-response pair
+ 4. **Proper Loss Masking**: Loss is only computed on the response/output portion, not on the instruction and input
+ 5. **Automatic Label Creation**: Labels are automatically created with -100 masking for instruction tokens
+
+ ## Supported Data Formats
+
+ ### JSONL Structure
+ Each line should be a JSON object with the following fields:
+ ```json
+ {
+   "instruction": "Your instruction here",
+   "input": "Optional input context (can be empty string)",
+   "output": "Expected response"
+ }
+ ```
+
+ ### Formatting Options
+
+ #### 1. ChatML Format (Default)
+ Uses the model's chat template with system/user/assistant roles:
+ ```yaml
+ data:
+   format_type: "chatml"
+   system_prompt: "You are a helpful assistant."
+ ```
+
+ #### 2. Alpaca Format
+ Uses the classic Alpaca instruction format:
+ ```yaml
+ data:
+   format_type: "alpaca"
+ ```
+
+ #### 3. Custom Format
+ Define your own template:
+ ```yaml
+ data:
+   format_type: "custom"
+   custom_template: "Instruction: {instruction}\nInput: {input}\nOutput: {output}"
+ ```
+
+ ## Configuration
+
+ Key configuration options in `config_instruct.yaml`:
+
+ ### Data Configuration
+ ```yaml
+ data:
+   train_jsonl: "path/to/your/train.jsonl"
+   eval_jsonl: "path/to/your/eval.jsonl"  # optional
+   eval_split_ratio: 0.1                  # if no eval file provided
+
+   # Field names in your data
+   instruction_field: "instruction"
+   input_field: "input"
+   output_field: "output"
+
+   # Formatting
+   format_type: "chatml"  # "chatml" | "alpaca" | "custom"
+   system_prompt: "You are a helpful assistant."
+
+   # Tokenization
+   max_length: 2048
+ ```
+
+ ### Training Configuration
+ ```yaml
+ train:
+   max_steps: 100
+   num_train_epochs: 3
+   per_device_train_batch_size: 1
+   gradient_accumulation_steps: 16
+   learning_rate: 5e-5
+   # ... other training parameters
+ ```
+
+ ## Usage
+
+ ### Basic Usage
+ ```bash
+ python run_instruct.py --config config_instruct.yaml
+ ```
+
+ ### Merge Only (after training)
+ ```bash
+ python run_instruct.py --config config_instruct.yaml --merge-only
+ ```
+
+ ## Example Data Format
+
+ See `instruct_data.jsonl` for examples of the expected data format. Here are a few examples:
+
+ ```json
+ {"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
+ {"instruction": "Translate the following English text to French.", "input": "Hello, how are you today?", "output": "Bonjour, comment allez-vous aujourd'hui?"}
+ {"instruction": "Write a Python function that calculates factorial.", "input": "", "output": "def factorial(n):\n    if n < 0:\n        raise ValueError(...)"}
+ ```
+
+ ## Key Features
+
+ 1. **Multiple Format Support**: ChatML, Alpaca, and custom templates
+ 2. **Flexible Field Mapping**: Configure custom field names for your data
+ 3. **Proper Loss Masking**: Only computes loss on the response portion
+ 4. **PEFT/LoRA Support**: Efficient fine-tuning with LoRA
+ 5. **Evaluation Support**: Automatic evaluation split or separate eval file
+ 6. **Checkpointing**: Resume training from checkpoints
+ 7. **Model Merging**: Merge trained adapters with the base model
+
+ ## Best Practices
+
+ 1. **Data Quality**: Ensure your instruction-response pairs are high-quality and consistent
+ 2. **Format Consistency**: Use the same format for training and inference
+ 3. **System Prompts**: Choose appropriate system prompts for your use case
+ 4. **Token Length**: Set an appropriate `max_length` based on your model and data
+ 5. **Batch Size**: Adjust batch size and gradient accumulation based on your GPU memory
+
+ ## Troubleshooting
+
+ ### Common Issues
+
+ 1. **CUDA Out of Memory**: Reduce batch size or enable 4-bit quantization
+ 2. **Slow Training**: Increase `gradient_accumulation_steps` or reduce `max_length`
+ 3. **Poor Quality**: Check data format consistency and quality
+ 4. **Tokenizer Issues**: Ensure your model has proper chat template support
+
+ ### Debug Mode
+ Add logging to see formatted examples:
+ ```python
+ # In format_instruction function, add:
+ print(f"Formatted: {formatted_text}")
+ ```
+
+ ## File Structure
+
+ ```
+ CPT/
+ ├── run_instruct.py        # Main instruction fine-tuning script
+ ├── config_instruct.yaml   # Configuration file
+ ├── instruct_data.jsonl    # Example instruction data
+ ├── README_instruct.md     # This documentation
+ └── runs/                  # Training outputs
+     └── instruct_run_v1/
+         ├── logs/
+         ├── checkpoints/
+         ├── best_adapter/
+         └── final_model/
+ ```
+
+ ## Migration from CPT
+
+ To migrate from the original CPT script:
+
+ 1. Convert your text data to instruction format
+ 2. Update your configuration file
+ 3. Choose appropriate formatting options
+ 4. Adjust training parameters (instruction fine-tuning typically needs fewer steps)
+
+ The script maintains the same CLI interface and most configuration options for easy migration.
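The loss-masking behavior the README describes (labels mirror the input ids, with `-100` on every prompt token so loss falls only on the response) can be sketched over plain token-id lists; real tokenizer output and the exact helper name in `run_instruct.py` are not shown here, so treat this as illustrative:

```python
# Build (input_ids, labels) for one instruction example: prompt tokens
# (instruction + input) get label -100, response tokens keep their ids.
def build_labels(prompt_ids, response_ids, max_length=2048):
    input_ids = (prompt_ids + response_ids)[:max_length]
    labels = [-100] * min(len(prompt_ids), max_length)
    labels += input_ids[len(labels):]  # response tokens keep their ids
    return input_ids, labels
```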
trainer-kit/CPT-14b/commands.md ADDED
@@ -0,0 +1,15 @@
+ ## Commands
+
+ Train (no merge):
+
+ ```bash
+ python run_cpt.py --config config.yaml
+ ```
+
+ Merge later:
+
+ ```bash
+ python run_cpt.py --config config.yaml --merge-only
+ ```
+
+ ---
trainer-kit/CPT-14b/config.yaml ADDED
@@ -0,0 +1,91 @@
+ run:
+   run_dir: "./runs/cpt_run_14b"
+   seed: 42
+
+ # WandB integration for experiment tracking
+ wandb:
+   enabled: true                  # Set to true to enable wandb logging
+   project: "cpt-training"        # WandB project name
+   entity: null                   # WandB entity/team (optional)
+   name: null                     # Run name (optional, will auto-generate if null)
+   tags: ["cpt-lora", "sft-14b"]  # List of tags for the run (e.g. ["lora", "qlora", "experiment-1"])
+   notes: null                    # Run description/notes (optional)
+
+ model:
+   # Local model path (no download)
+   repo_id: "/workspace/Models/Qwen2.5-Coder-14B"
+   revision: null
+
+   # Used only when repo_id is a HF repo (not a local path)
+   base_local_dir: "base_model"
+
+   trust_remote_code: true
+   tokenizer_use_fast: true
+   device_map: "auto"
+
+   torch_dtype: "bfloat16"  # "float16" | "bfloat16" | "float32"
+
+   # QLoRA
+   use_4bit: false
+   bnb_4bit_quant_type: "nf4"
+   bnb_4bit_use_double_quant: false
+   bnb_4bit_compute_dtype: "bfloat16"
+
+   # optional: "flash_attention_2" | "sdpa" | null
+   attn_implementation: null
+
+ data:
+   train_jsonl: "all_data_with_descriptions.jsonl"
+   eval_jsonl: null
+   eval_split_ratio: 0.1
+   text_field: "text"
+   block_size: 4096
+   shuffle: true
+   num_proc: 4
+
+   # ✅ NEW: packing behavior
+   # "drop" = strict CPT (drop remainder)
+   # "pad"  = pad remainder to block_size + loss mask (-100) + attention_mask=0
+   pack_mode: "pad"
+
+ peft:
+   enabled: true
+   r: 32
+   lora_alpha: 64
+   lora_dropout: 0.05
+   bias: "none"
+   target_modules: "auto"
+
+ train:
+   # max_steps: 1000
+   num_train_epochs: 2
+
+   per_device_train_batch_size: 1
+   per_device_eval_batch_size: 1
+   gradient_accumulation_steps: 16
+
+   learning_rate: 2e-5
+   weight_decay: 0.0
+   warmup_ratio: 0.1
+   lr_scheduler_type: "cosine"
+
+   optim: "paged_adamw_8bit"
+   max_grad_norm: 1.0
+   gradient_checkpointing: true
+
+   logging_steps: 1
+   save_strategy: "steps"
+   save_steps: 100
+   save_total_limit: 7
+
+   evaluation_strategy: "steps"
+   eval_steps: 50
+   load_best_model_at_end: true
+
+   resume_from_checkpoint: "auto"
+
+ merge:
+   enabled: true
+   merged_dtype: "float16"
+   max_shard_size: "2GB"
+   output_dir: "./merged_14b_cpt_lora"
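As the README notes, `merge.output_dir` resolves relative to `run_dir` when it is a relative path (so the config above writes the merged model under `runs/cpt_run_14b/`). A minimal sketch of that resolution rule; the helper name is illustrative, not taken from `run_cpt.py`:

```python
from pathlib import Path

# Resolve merge.output_dir: relative paths land under run_dir,
# absolute paths are used as-is.
def resolve_merge_dir(run_dir: str, output_dir: str) -> Path:
    out = Path(output_dir)
    return out if out.is_absolute() else Path(run_dir) / out
```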
trainer-kit/CPT-14b/dummy_data.jsonl ADDED
@@ -0,0 +1,6 @@
+ {"text": "This is a test sentence for the dummy dataset."}
+ {"text": "Another sentence to check if training works."}
+ {"text": "We need enough data to form a batch."}
+ {"text": "FSDP and LoRA are cool technologies."}
+ {"text": "Fine-tuning LLMs is fun and useful."}
+ {"text": "This is the end of the dummy dataset."}
trainer-kit/CPT-14b/requirements.txt ADDED
@@ -0,0 +1,25 @@
+ # Core
+ torch>=2.1.0
+ transformers>=4.41.0
+ datasets>=2.18.0
+ accelerate>=0.30.0
+
+ # PEFT / QLoRA
+ peft>=0.11.1
+ bitsandbytes>=0.43.1
+
+ # Hugging Face Hub (local + download support)
+ huggingface_hub>=0.23.0
+
+ # Config + utilities
+ pyyaml>=6.0
+ tqdm>=4.66.0
+
+ # Optional but recommended (tokenizers speed)
+ tokenizers>=0.15.0
+ safetensors>=0.4.2
+ # Optional (for eval)
+ rouge-score>=0.1.2
+
+ # Experiment tracking
+ wandb>=0.16.0
trainer-kit/CPT-14b/run_cpt.py ADDED
@@ -0,0 +1,772 @@
1
+ import argparse
2
+ import json
3
+ import inspect # Added for Transformers version compatibility
4
+ import math
5
+ import time
6
+ from pathlib import Path
7
+ from typing import Any, Dict, Optional, Tuple, List
8
+
9
+ import torch
10
+ import yaml
11
+ from datasets import load_dataset, DatasetDict
12
+ from huggingface_hub import snapshot_download
13
+ from transformers import (
14
+ AutoModelForCausalLM,
15
+ AutoTokenizer,
16
+ PreTrainedTokenizerFast,
17
+ TrainingArguments,
18
+ Trainer,
19
+ TrainerCallback,
20
+ default_data_collator,
21
+ set_seed,
22
+ )
23
+ from transformers.trainer_utils import get_last_checkpoint
24
+ from peft import (
25
+ LoraConfig,
26
+ get_peft_model,
27
+ prepare_model_for_kbit_training,
28
+ PeftModel,
29
+ )
30
+
31
+ try:
32
+ from transformers import BitsAndBytesConfig
33
+ except ImportError: # older transformers
34
+ BitsAndBytesConfig = None
35
+
36
+ try:
37
+ import wandb
38
+ WANDB_AVAILABLE = True
39
+ except ImportError:
40
+ WANDB_AVAILABLE = False
41
+ wandb = None
42
+
43
+
44
+ # --------------------------
45
+ # Helpers
46
+ # --------------------------
47
+
48
+ def _dtype_from_str(s: str) -> torch.dtype:
49
+ s = (s or "").lower()
50
+ if s in ("float16", "fp16"):
51
+ return torch.float16
52
+ if s in ("bfloat16", "bf16"):
53
+ return torch.bfloat16
54
+ if s in ("float32", "fp32"):
55
+ return torch.float32
56
+ raise ValueError(f"Unknown torch_dtype: {s}")
57
+
58
+ def _now_iso() -> str:
59
+ return time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime())
60
+
61
+ def _safe_exp(x: float) -> float:
62
+ x = min(float(x), 50.0)
63
+ return float(math.exp(x))
64
+
65
+ def _ensure_dir(p: Path) -> Path:
66
+ p.mkdir(parents=True, exist_ok=True)
67
+ return p
68
+
69
+ def _looks_like_model_dir(p: Path) -> bool:
70
+ if not p.exists() or not p.is_dir():
71
+ return False
72
+ if (p / "config.json").exists():
73
+ return True
74
+ if any(p.glob("*.safetensors")) or any(p.glob("pytorch_model*.bin")):
75
+ return True
76
+ return False
77
+
78
+ def _detect_text_field(example: Dict[str, Any]) -> Optional[str]:
79
+ for k, v in example.items():
80
+ if isinstance(v, str) and v.strip():
81
+ return k
82
+ return None
83
+
84
+ def _load_tokenizer(base_dir: Path, use_fast: bool, trust_remote_code: bool):
85
+ try:
86
+ return AutoTokenizer.from_pretrained(
87
+ str(base_dir),
88
+ use_fast=use_fast,
89
+ trust_remote_code=trust_remote_code,
90
+ )
91
+ except ValueError as e:
92
+ if "TokenizersBackend" not in str(e):
93
+ raise
94
+ tok_file = base_dir / "tokenizer.json"
95
+ tok_cfg_path = base_dir / "tokenizer_config.json"
96
+ if not tok_file.exists():
97
+ raise
98
+
99
+ tok_kwargs: Dict[str, Any] = {}
100
+ if tok_cfg_path.exists():
101
+ with tok_cfg_path.open("r", encoding="utf-8") as f:
102
+ tok_cfg = json.load(f)
103
+ for key in ("bos_token", "eos_token", "pad_token", "unk_token", "model_max_length"):
104
+ if tok_cfg.get(key) is not None:
105
+ tok_kwargs[key] = tok_cfg[key]
106
+ extra = tok_cfg.get("additional_special_tokens") or tok_cfg.get("extra_special_tokens")
107
+ if extra:
108
+ tok_kwargs["additional_special_tokens"] = extra
109
+
110
+ return PreTrainedTokenizerFast(tokenizer_file=str(tok_file), **tok_kwargs)
111
+
112
+ def _infer_target_modules(model) -> List[str]:
113
+ names = set()
114
+ for n, _ in model.named_modules():
115
+ names.add(n.split(".")[-1])
116
+
117
+ for group in [
118
+ ["q_proj", "k_proj", "v_proj", "o_proj"],
119
+ ["Wqkv", "out_proj"],
120
+ ["query_key_value", "dense"],
121
+ ["c_attn", "c_proj"],
122
+ ]:
123
+ if all(x in names for x in group):
124
+ return group
125
+
126
+ fallback = [x for x in ["q_proj", "k_proj", "v_proj", "o_proj", "c_attn", "c_proj", "out_proj", "dense"] if x in names]
127
+ if fallback:
128
+ return fallback
129
+
130
+ raise ValueError("Could not auto-infer target_modules. Set peft.target_modules explicitly.")
131
+
132
+ def _choose_attn_impl(cfg: Dict[str, Any]) -> Optional[str]:
133
+ return cfg.get("model", {}).get("attn_implementation", None)
134
+
135
+
136
+ # --------------------------
137
+ # Wandb Integration
138
+ # --------------------------
139
+
140
+ def setup_wandb(cfg: Dict[str, Any], run_dir: Path):
141
+ """Initialize Wandb if enabled in configuration."""
142
+ wandb_cfg = cfg.get("wandb", {})
143
+
144
+ if not wandb_cfg.get("enabled", False):
145
+ print("Wandb logging disabled")
146
+ return None
147
+
148
+ if not WANDB_AVAILABLE:
149
+ print("Wandb not available. Install with: pip install wandb")
150
+ return None
151
+
152
+ # Extract wandb configuration
153
+ project = wandb_cfg.get("project", "cpt-training")
154
+ entity = wandb_cfg.get("entity", None)
155
+ name = wandb_cfg.get("name", None)
156
+ tags = wandb_cfg.get("tags", [])
157
+ notes = wandb_cfg.get("notes", None)
158
+
159
+ # Initialize wandb
160
+ try:
161
+ wandb.init(
162
+ project=project,
163
+ entity=entity,
164
+ name=name,
165
+ tags=tags,
166
+ notes=notes,
167
+ dir=str(run_dir),
168
+ config={
169
+ "model": cfg.get("model", {}),
170
+ "data": cfg.get("data", {}),
171
+ "peft": cfg.get("peft", {}),
172
+ "train": cfg.get("train", {}),
173
+ "run_dir": str(run_dir),
174
+ }
175
+ )
176
+ print(f"Wandb initialized: project='{project}', name='{name or 'auto-generated'}'")
177
+ return wandb
178
+ except Exception as e:
179
+ print(f"Failed to initialize Wandb: {e}")
180
+ return None
181
+
182
+
183
+ def finish_wandb():
184
+ """Finish Wandb run if active."""
185
+ if WANDB_AVAILABLE and wandb.run is not None:
186
+ wandb.finish()
187
+ print("Wandb run finished")
188
+
189
+
190
+ # --------------------------
191
+ # JSONL Logger Callback
192
+ # --------------------------
193
+
194
+ class JsonlLoggerCallback(TrainerCallback):
195
+ def __init__(self, run_dir: Path):
196
+ self.run_dir = run_dir
197
+ self.train_log_path = _ensure_dir(run_dir / "logs") / "train.jsonl"
198
+ self.eval_log_path = _ensure_dir(run_dir / "logs") / "eval.jsonl"
199
+ self.start_time = None
200
+
201
+ def _eta(self, global_step: int, max_steps: int) -> Optional[str]:
202
+ if self.start_time is None or global_step <= 0 or max_steps <= 0:
203
+ return None
204
+ elapsed = time.time() - self.start_time
205
+ sec_per_step = elapsed / global_step
206
+ remaining = max(0, max_steps - global_step) * sec_per_step
207
+ h = int(remaining // 3600)
208
+ m = int((remaining % 3600) // 60)
209
+ s = int(remaining % 60)
210
+ return f"{h:02d}:{m:02d}:{s:02d}"
211
+
212
+ def on_train_begin(self, args, state, control, **kwargs):
213
+ self.start_time = time.time()
214
+
215
+ def on_log(self, args, state, control, logs=None, **kwargs):
216
+ if not logs:
217
+ return
218
+
219
+ max_steps = int(state.max_steps) if getattr(state, "max_steps", None) else 0
220
+ progress_pct = (100.0 * state.global_step / max_steps) if max_steps > 0 else None
221
+ epoch_pct = None
222
+ if state.epoch is not None and args.num_train_epochs and args.num_train_epochs > 0:
223
+ epoch_pct = 100.0 * (float(state.epoch) / float(args.num_train_epochs))
224
+
225
+ payload = {
226
+ "ts": _now_iso(),
227
+ "event": "train_log",
228
+ "step": int(state.global_step),
229
+ "epoch": round(float(state.epoch), 4) if state.epoch is not None else None,
230
+ "progress_pct": round(progress_pct, 2) if progress_pct is not None else None,
231
+ "epoch_pct": round(epoch_pct, 2) if epoch_pct is not None else None,
232
+ "eta": self._eta(int(state.global_step), max_steps),
233
+ "max_grad_norm": getattr(args, "max_grad_norm", None),
234
+ **logs,
235
+ }
236
+
237
+ with self.train_log_path.open("a", encoding="utf-8") as f:
238
+ f.write(json.dumps(payload, ensure_ascii=False) + "\n")
239
+
240
+ def on_evaluate(self, args, state, control, metrics=None, **kwargs):
241
+ if not metrics:
242
+ return
243
+ eval_loss = metrics.get("eval_loss", None)
244
+ ppl = _safe_exp(eval_loss) if eval_loss is not None else None
245
+
246
+ payload = {
247
+ "ts": _now_iso(),
248
+ "event": "eval",
249
+ "step": int(state.global_step),
250
+ "epoch": float(state.epoch) if state.epoch is not None else None,
251
+ **metrics,
252
+ "perplexity": ppl,
253
+ }
254
+ with self.eval_log_path.open("a", encoding="utf-8") as f:
255
+ f.write(json.dumps(payload, ensure_ascii=False) + "\n")
256
+
257
+
258
+ # --------------------------
259
+ # Data Pipeline (EOS + Packing)
260
+ # --------------------------
261
+
262
+ def build_datasets(cfg: Dict[str, Any], tokenizer) -> Tuple[Any, Any]:
263
+ data_cfg = cfg["data"]
264
+ train_path = data_cfg["train_jsonl"]
265
+ eval_path = data_cfg.get("eval_jsonl", None)
266
+ split_ratio = float(data_cfg.get("eval_split_ratio", 0.0))
267
+ text_field = data_cfg.get("text_field", "text")
268
+ block_size = int(data_cfg.get("block_size", 2048))
269
+ shuffle = bool(data_cfg.get("shuffle", True))
270
+ num_proc = int(data_cfg.get("num_proc", 4))
271
+
272
+ pack_mode = str(data_cfg.get("pack_mode", "drop")).lower().strip()
273
+ if pack_mode not in ("drop", "pad"):
274
+ raise ValueError(f"data.pack_mode must be 'drop' or 'pad', got: {pack_mode}")
275
+
276
+ eos_id = tokenizer.eos_token_id
277
+ if eos_id is None:
278
+ raise ValueError("Tokenizer has no eos_token_id; CPT packing needs an EOS delimiter.")
279
+
280
+ if tokenizer.pad_token_id is None:
281
+ # safe default for many causal LMs
282
+ tokenizer.pad_token = tokenizer.eos_token
283
+ pad_id = tokenizer.pad_token_id
284
+
285
+ ds = load_dataset("json", data_files={"train": train_path})
286
+
287
+ if eval_path:
288
+ ds_eval = load_dataset("json", data_files={"eval": eval_path})
289
+ dsd = DatasetDict({"train": ds["train"], "eval": ds_eval["eval"]})
290
+ else:
291
+ if 0.0 < split_ratio < 1.0:
292
+ split = ds["train"].train_test_split(test_size=split_ratio, seed=int(cfg["run"].get("seed", 42)))
293
+ dsd = DatasetDict({"train": split["train"], "eval": split["test"]})
294
+ else:
295
+ dsd = DatasetDict({"train": ds["train"], "eval": None})
296
+
297
+ if text_field not in dsd["train"].column_names:
298
+ auto_field = _detect_text_field(dsd["train"][0])
299
+ if not auto_field:
300
+ raise ValueError(f"Could not find text field. Columns: {dsd['train'].column_names}")
301
+ text_field = auto_field
302
+
303
+ def tokenize_fn(examples):
304
+ out = tokenizer(
305
+ examples[text_field],
306
+ add_special_tokens=False,
307
+ truncation=False,
308
+ padding=False,
309
+ )
310
+ if "token_type_ids" in out:
311
+ del out["token_type_ids"]
312
+ # Add EOS between docs
313
+ out["input_ids"] = [ids + [eos_id] for ids in out["input_ids"]]
314
+ out["attention_mask"] = [m + [1] for m in out["attention_mask"]]
315
+ return out
316
+
317
+ tokenized_train = dsd["train"].map(
        tokenize_fn,
        batched=True,
        num_proc=num_proc,
        remove_columns=dsd["train"].column_names,
        desc="Tokenizing train",
    )

    tokenized_eval = None
    if dsd["eval"] is not None:
        tokenized_eval = dsd["eval"].map(
            tokenize_fn,
            batched=True,
            num_proc=num_proc,
            remove_columns=dsd["eval"].column_names,
            desc="Tokenizing eval",
        )

    def group_texts(examples):
        # Concatenate all tokenized documents into one flat stream per field.
        concatenated = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated["input_ids"])

        if total_length == 0:
            return {"input_ids": [], "attention_mask": [], "labels": []}

        full_len = (total_length // block_size) * block_size
        blocks_input, blocks_attn, blocks_labels = [], [], []

        # Full blocks: labels are a copy of input_ids (standard causal-LM loss).
        for i in range(0, full_len, block_size):
            chunk = concatenated["input_ids"][i:i + block_size]
            attn = concatenated["attention_mask"][i:i + block_size]
            blocks_input.append(chunk)
            blocks_attn.append(attn)
            blocks_labels.append(chunk.copy())

        # Remainder: dropped in "drop" mode, padded + loss-masked in "pad" mode.
        remainder = total_length - full_len
        if remainder > 0 and pack_mode == "pad":
            chunk = concatenated["input_ids"][full_len:full_len + remainder]
            attn = concatenated["attention_mask"][full_len:full_len + remainder]

            pad_len = block_size - remainder
            chunk_padded = chunk + [pad_id] * pad_len
            attn_padded = attn + [0] * pad_len

            labels = chunk_padded.copy()
            labels[-pad_len:] = [-100] * pad_len  # loss mask on padding

            blocks_input.append(chunk_padded)
            blocks_attn.append(attn_padded)
            blocks_labels.append(labels)

        return {
            "input_ids": blocks_input,
            "attention_mask": blocks_attn,
            "labels": blocks_labels,
        }

    tokenized_train = tokenized_train.map(
        group_texts,
        batched=True,
        num_proc=num_proc,
        desc=f"Packing train blocks (mode={pack_mode})",
    )
    if tokenized_eval is not None:
        tokenized_eval = tokenized_eval.map(
            group_texts,
            batched=True,
            num_proc=num_proc,
            desc=f"Packing eval blocks (mode={pack_mode})",
        )

    if len(tokenized_train) == 0:
        raise ValueError(
            "Train dataset is empty after packing. "
            "Either increase data, reduce block_size, or set data.pack_mode='pad'."
        )

    if shuffle:
        tokenized_train = tokenized_train.shuffle(seed=int(cfg["run"].get("seed", 42)))

    return tokenized_train, tokenized_eval


# --------------------------
# Model Loading + PEFT
# --------------------------

def _select_model_loader(base_dir: Path):
    """Inspect config.json to decide whether this is a plain causal LM
    or a *ForConditionalGeneration architecture."""
    cfg_path = base_dir / "config.json"
    if not cfg_path.exists():
        return {"kind": "causal", "arch": None}
    with cfg_path.open("r", encoding="utf-8") as f:
        cfg = json.load(f)
    arch = cfg.get("architectures") or []
    arch_name = arch[0] if arch else None
    if any("ForConditionalGeneration" in a for a in arch):
        return {"kind": "conditional", "arch": arch_name}
    return {"kind": "causal", "arch": arch_name}


def _resolve_model_class(arch_name: str):
    import transformers
    cls = getattr(transformers, arch_name, None)
    if cls is None:
        raise ValueError(f"Model class '{arch_name}' is not available in installed transformers.")
    return cls


def load_base_model_and_tokenizer(cfg: Dict[str, Any], base_dir: Path):
    model_cfg = cfg["model"]
    trust_remote_code = bool(model_cfg.get("trust_remote_code", True))
    use_fast = bool(model_cfg.get("tokenizer_use_fast", True))
    device_map = model_cfg.get("device_map", "auto")

    tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    torch_dtype = _dtype_from_str(model_cfg.get("torch_dtype", "bfloat16"))
    use_4bit = bool(model_cfg.get("use_4bit", False))

    quant_cfg = None
    if use_4bit:
        if BitsAndBytesConfig is None:
            raise ImportError("BitsAndBytesConfig is not available in this transformers version.")
        quant_cfg = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type=str(model_cfg.get("bnb_4bit_quant_type", "nf4")),
            bnb_4bit_use_double_quant=bool(model_cfg.get("bnb_4bit_use_double_quant", True)),
            bnb_4bit_compute_dtype=_dtype_from_str(model_cfg.get("bnb_4bit_compute_dtype", "bfloat16")),
        )

    attn_impl = _choose_attn_impl(cfg)
    model_meta = _select_model_loader(base_dir)

    def _load(attn_implementation):
        kwargs = dict(
            device_map=device_map,
            trust_remote_code=trust_remote_code,
            low_cpu_mem_usage=True,
            torch_dtype=(torch_dtype if not use_4bit else None),
            quantization_config=quant_cfg,
        )
        if attn_implementation is not None:
            kwargs["attn_implementation"] = attn_implementation
        if model_meta["kind"] == "conditional":
            model_cls = _resolve_model_class(model_meta["arch"]) if model_meta["arch"] else None
            if model_cls is None:
                raise ValueError("Conditional model architecture not specified in config.json.")
            return model_cls.from_pretrained(str(base_dir), **kwargs)
        return AutoModelForCausalLM.from_pretrained(str(base_dir), **kwargs)

    try:
        model = _load(attn_impl)
    except Exception as e:
        if attn_impl is not None:
            print(f"[warn] attn_implementation='{attn_impl}' failed: {e}")
            print("[warn] Falling back to default attention implementation.")
        model = _load(None)

    return model, tokenizer


def apply_peft(cfg: Dict[str, Any], model):
    peft_cfg = cfg["peft"]
    model_cfg = cfg["model"]
    tr_cfg = cfg["train"]

    if not bool(peft_cfg.get("enabled", True)):
        return model, None

    use_4bit = bool(model_cfg.get("use_4bit", False))
    gradient_checkpointing = bool(tr_cfg.get("gradient_checkpointing", True))

    if gradient_checkpointing and hasattr(model, "gradient_checkpointing_enable"):
        model.gradient_checkpointing_enable()
        if hasattr(model, "config"):
            model.config.use_cache = False

    if use_4bit:
        model = prepare_model_for_kbit_training(
            model,
            use_gradient_checkpointing=gradient_checkpointing,
        )

    target_modules = peft_cfg.get("target_modules", "auto")
    if target_modules == "auto":
        target_modules = _infer_target_modules(model)

    lora_config = LoraConfig(
        r=int(peft_cfg.get("r", 16)),
        lora_alpha=int(peft_cfg.get("lora_alpha", 32)),
        lora_dropout=float(peft_cfg.get("lora_dropout", 0.05)),
        bias=str(peft_cfg.get("bias", "none")),
        task_type="CAUSAL_LM",
        target_modules=target_modules,
    )
    model = get_peft_model(model, lora_config)
    return model, lora_config


# --------------------------
# Merge Logic
# --------------------------

def merge_adapter(cfg: Dict[str, Any], base_dir: Path, adapter_dir: Path, final_dir: Path):
    print(f"--- Merge: {adapter_dir} + {base_dir} -> {final_dir} ---")

    model_cfg = cfg["model"]
    merge_cfg = cfg.get("merge", {})
    trust_remote_code = bool(model_cfg.get("trust_remote_code", True))
    use_fast = bool(model_cfg.get("tokenizer_use_fast", True))

    merged_dtype = _dtype_from_str(merge_cfg.get("merged_dtype", "float16"))
    max_shard_size = str(merge_cfg.get("max_shard_size", "2GB"))

    # Reload the base model on CPU to avoid GPU OOM during the merge.
    model_meta = _select_model_loader(base_dir)
    if model_meta["kind"] == "conditional":
        base_cls = _resolve_model_class(model_meta["arch"]) if model_meta["arch"] else None
        if base_cls is None:
            raise ValueError("Conditional model architecture not specified in config.json.")
        base = base_cls.from_pretrained(
            str(base_dir),
            torch_dtype=merged_dtype,
            device_map="cpu",
            low_cpu_mem_usage=True,
            trust_remote_code=trust_remote_code,
        )
    else:
        base = AutoModelForCausalLM.from_pretrained(
            str(base_dir),
            torch_dtype=merged_dtype,
            device_map="cpu",
            low_cpu_mem_usage=True,
            trust_remote_code=trust_remote_code,
        )

    merged = PeftModel.from_pretrained(base, str(adapter_dir))
    merged = merged.merge_and_unload()

    _ensure_dir(final_dir)
    merged.save_pretrained(str(final_dir), safe_serialization=True, max_shard_size=max_shard_size)

    tok = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    tok.save_pretrained(str(final_dir))

    print("--- Merge complete ---")


# --------------------------
# Main
# --------------------------

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--config", required=True, help="Path to YAML config")
    ap.add_argument("--merge-only", action="store_true", help="Skip training, just merge adapter")
    args = ap.parse_args()

    with open(args.config, "r", encoding="utf-8") as f:
        cfg = yaml.safe_load(f)

    run_dir = _ensure_dir(Path(cfg["run"]["run_dir"]))
    _ensure_dir(run_dir / "logs")

    # Persist the exact config used for this run.
    with (run_dir / "config_resolved.yaml").open("w", encoding="utf-8") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)

    model_cfg = cfg["model"]
    repo_id = str(model_cfg["repo_id"]).strip()
    repo_path = Path(repo_id)

    # ✅ Local model path -> load directly; no download
    if repo_path.exists() and repo_path.is_dir():
        base_dir = repo_path
        if not _looks_like_model_dir(base_dir):
            raise ValueError(f"model.repo_id points to a directory, but it doesn't look like a HF model dir: {base_dir}")
    else:
        # HF repo_id -> download into run_dir/base_local_dir
        base_dir = _ensure_dir(run_dir / model_cfg.get("base_local_dir", "base_model"))
        if not _looks_like_model_dir(base_dir):
            print(f"Base model not found at {base_dir}, downloading from {repo_id} ...")
            snapshot_download(
                repo_id=repo_id,
                revision=model_cfg.get("revision", None),
                local_dir=str(base_dir),
                local_dir_use_symlinks=False,
            )

    ckpt_dir = _ensure_dir(run_dir / "checkpoints")
    best_adapter_dir = _ensure_dir(run_dir / "best_adapter")

    merge_cfg = cfg.get("merge", {}) or {}
    if merge_cfg.get("output_dir"):
        od = Path(str(merge_cfg["output_dir"]))
        final_dir = od if od.is_absolute() else (run_dir / od)
    else:
        final_dir = run_dir / "final_model"

    # Merge-only
    if args.merge_only:
        if not _looks_like_model_dir(best_adapter_dir):
            raise FileNotFoundError(f"Adapter not found at {best_adapter_dir}")
        merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
        return

    # Initialize Wandb
    wandb_run = setup_wandb(cfg, run_dir)

    # Training
    set_seed(int(cfg["run"].get("seed", 42)))

    model, tokenizer = load_base_model_and_tokenizer(cfg, base_dir)
    model, _ = apply_peft(cfg, model)

    train_ds, eval_ds = build_datasets(cfg, tokenizer)

    tr_cfg = cfg["train"]

    dtype = _dtype_from_str(model_cfg.get("torch_dtype", "bfloat16"))
    use_fp16 = (dtype == torch.float16)
    use_bf16 = (dtype == torch.bfloat16)

    max_steps = int(tr_cfg.get("max_steps", 0))
    num_train_epochs = float(tr_cfg.get("num_train_epochs", 1))

    # --- Dynamic evaluation strategy parameter handling ---
    # transformers renamed "evaluation_strategy" to "eval_strategy"; support both.
    ta_params = inspect.signature(TrainingArguments.__init__).parameters
    eval_key = "eval_strategy" if "eval_strategy" in ta_params else "evaluation_strategy"

    # Setup reporting based on wandb availability
    report_to = []
    if wandb_run is not None:
        report_to.append("wandb")

    desired_ta_kwargs = dict(
        output_dir=str(ckpt_dir),
        max_steps=max_steps if max_steps > 0 else -1,
        num_train_epochs=num_train_epochs,

        per_device_train_batch_size=int(tr_cfg.get("per_device_train_batch_size", 1)),
        per_device_eval_batch_size=int(tr_cfg.get("per_device_eval_batch_size", tr_cfg.get("per_device_train_batch_size", 1))),
        gradient_accumulation_steps=int(tr_cfg.get("gradient_accumulation_steps", 1)),

        learning_rate=float(tr_cfg.get("learning_rate", 2e-5)),
        weight_decay=float(tr_cfg.get("weight_decay", 0.0)),
        warmup_ratio=float(tr_cfg.get("warmup_ratio", 0.0)),
        lr_scheduler_type=str(tr_cfg.get("lr_scheduler_type", "cosine")),

        optim=str(tr_cfg.get("optim", "paged_adamw_8bit" if bool(model_cfg.get("use_4bit", False)) else "adamw_torch")),
        max_grad_norm=float(tr_cfg.get("max_grad_norm", 1.0)),

        logging_steps=int(tr_cfg.get("logging_steps", 10)),

        save_strategy=str(tr_cfg.get("save_strategy", "steps")),
        save_steps=int(tr_cfg.get("save_steps", 200)),
        save_total_limit=int(tr_cfg.get("save_total_limit", 3)),

        eval_steps=int(tr_cfg.get("eval_steps", 200)),

        load_best_model_at_end=bool(tr_cfg.get("load_best_model_at_end", True)) if eval_ds is not None else False,
        metric_for_best_model="eval_loss",
        greater_is_better=False,

        fp16=use_fp16,
        bf16=use_bf16,

        report_to=report_to,
        remove_unused_columns=False,
        save_safetensors=True,
        overwrite_output_dir=False,
    )

    # Set the correct argument name for this transformers version
    desired_ta_kwargs[eval_key] = str(tr_cfg.get("evaluation_strategy", "steps" if eval_ds is not None else "no"))
    ta_kwargs = {k: v for k, v in desired_ta_kwargs.items() if k in ta_params}

    training_args = TrainingArguments(**ta_kwargs)

    trainer_params = inspect.signature(Trainer.__init__).parameters
    desired_trainer_kwargs = dict(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        data_collator=default_data_collator,
        callbacks=[JsonlLoggerCallback(run_dir)],
    )
    # Newer transformers renamed Trainer's "tokenizer" argument to
    # "processing_class"; pass the tokenizer under whichever name this
    # version accepts (never both, to avoid a duplicate-argument error).
    if "processing_class" in trainer_params:
        desired_trainer_kwargs["processing_class"] = tokenizer
    else:
        desired_trainer_kwargs["tokenizer"] = tokenizer
    trainer_kwargs = {k: v for k, v in desired_trainer_kwargs.items() if k in trainer_params}
    trainer = Trainer(**trainer_kwargs)

    # Resume
    resume_from = tr_cfg.get("resume_from_checkpoint", None)
    if resume_from == "auto":
        last = get_last_checkpoint(str(ckpt_dir))
        resume_from = last if last else None
    if resume_from:
        print(f"Resuming from {resume_from}")

    print("Starting training...")
    trainer.train(resume_from_checkpoint=resume_from)

    trainer.save_model(str(best_adapter_dir))
    print(f"Saved best adapter -> {best_adapter_dir}")

    if eval_ds is not None:
        metrics = trainer.evaluate()
        eval_loss = metrics.get("eval_loss", None)
        metrics["perplexity"] = _safe_exp(eval_loss) if eval_loss is not None else None
        with (run_dir / "eval_final.json").open("w", encoding="utf-8") as f:
            json.dump(metrics, f, indent=2)
        print(f"Final eval_loss={eval_loss}, ppl={metrics['perplexity']}")

    if bool(cfg.get("merge", {}).get("enabled", False)):
        del trainer, model
        torch.cuda.empty_cache()
        merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
    else:
        print("Merge disabled. Run with --merge-only later if needed.")

    # Finish Wandb run
    finish_wandb()


if __name__ == "__main__":
    main()
trainer-kit/CPT/README.md ADDED
@@ -0,0 +1,189 @@
# Trainer‑Kit : Config‑Driven CPT (LoRA / QLoRA) with Packing, Logging, Resume, and Merge

Trainer‑Kit is a small, config‑driven training runner for **continued pretraining (CPT)** on causal LMs.
It supports **LoRA** and **QLoRA**, data **packing** (strict or padding‑masked), **checkpointing + resume**, **JSONL logging**, periodic **eval with perplexity**, and an optional **merge** step to export a final merged model.

---

## What we built

### ✅ Core goals implemented

* **CPT training loop** controlled entirely via a **YAML config**
* **Local model support** (load from filesystem) and optional **HF download** (if `repo_id` is a hub id)
* **JSONL datasets** for train (+ optional eval split)
* **CPT‑style token stream packing** into fixed‑length blocks
* **Two packing modes**
  * `drop`: strict CPT, drop remainder tokens (preferred for real CPT)
  * `pad`: pad the remainder to `block_size` and **mask loss** on padding (useful for small datasets / debugging)
* **Checkpointing + resume**
  * `resume_from_checkpoint: "auto"` resumes from the latest checkpoint under `run_dir/checkpoints`
* **JSONL logs** written locally
  * training logs: `run_dir/logs/train.jsonl`
  * eval logs: `run_dir/logs/eval.jsonl`
* **Evaluation**
  * logs `eval_loss` and computed `perplexity = exp(eval_loss)` (with a safe overflow guard)
* **Adapter output**
  * saves the final/best adapter to `run_dir/best_adapter`
* **Merge workflow**
  * `--merge-only` merges an existing adapter later
  * the merge runs **on CPU** to avoid GPU OOM
  * the merged model is stored under the configured merge output directory (relative to `run_dir` if given as a relative path)

---

## Repository layout (outputs)

A run produces the following structure under `run.run_dir`:

```
runs/<run_name>/
├─ checkpoints/          # trainer checkpoints (for resume)
├─ best_adapter/         # saved LoRA adapter
├─ logs/
│  ├─ train.jsonl        # step-wise training logs
│  └─ eval.jsonl         # eval logs (eval_loss + perplexity)
├─ eval_final.json       # final eval metrics summary (if eval is enabled)
└─ config_resolved.yaml  # exact config used for the run
```

If merge is used, the merged model is written to:

* `run_dir/<merge.output_dir>` if `merge.output_dir` is relative (e.g. `./merged_model`)
* or the absolute path if it is absolute.

---

## Supported training modes

### 1) LoRA vs QLoRA (same script)

* **QLoRA** is used when `model.use_4bit: true`
  * base weights are loaded in 4‑bit using bitsandbytes
  * training updates only the LoRA parameters
* **LoRA** is used when `model.use_4bit: false`
  * base weights are loaded in fp16/bf16 (as configured)
  * training updates only the LoRA parameters

No "full finetune" mode is enabled by default in this runner.

---
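The switch between the two modes boils down to a single flag in the `model` section of the config. A minimal illustration (the values shown are examples, not defaults):

```yaml
model:
  torch_dtype: "bfloat16"
  use_4bit: false    # LoRA: base weights kept in bf16/fp16
  # use_4bit: true   # QLoRA: base weights loaded in 4-bit via bitsandbytes
```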

## Data pipeline (CPT behavior)

### Input format

* A JSONL file where each line contains a text field (default `"text"`), e.g.:
  * `{"text": "some training text..."}`

### Packing (token stream → fixed blocks)

* Each sample is tokenized without truncation.
* An **EOS token is appended** per document to preserve boundaries.
* Token lists are concatenated and converted into **fixed‑length blocks** of `data.block_size`.

Two modes:

* **`drop` (strict CPT):** remainder tokens that don't fill a full block are discarded.
* **`pad` (debug/small data):** the remainder is padded to `block_size`:
  * `attention_mask = 0` for padded positions
  * `labels = -100` for padded positions (loss masking)

This is what allowed training to proceed even with tiny dummy datasets at `block_size=1024`.
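The block arithmetic described above can be sketched as a standalone function. This is a simplified toy version of the real `group_texts` in `run_cpt.py`; `pad_id=0` and the 10-token stream are made-up example values:

```python
def pack(tokens, block_size, pack_mode="drop", pad_id=0):
    """Cut a flat token stream into fixed-length blocks (toy sketch)."""
    blocks = []
    full_len = (len(tokens) // block_size) * block_size
    for i in range(0, full_len, block_size):
        chunk = tokens[i:i + block_size]
        # Full blocks: labels mirror input_ids (standard causal-LM loss).
        blocks.append({"input_ids": chunk,
                       "attention_mask": [1] * block_size,
                       "labels": list(chunk)})
    remainder = len(tokens) - full_len
    if remainder > 0 and pack_mode == "pad":
        pad_len = block_size - remainder
        chunk = tokens[full_len:] + [pad_id] * pad_len
        labels = list(chunk)
        labels[-pad_len:] = [-100] * pad_len  # mask loss on padding
        blocks.append({"input_ids": chunk,
                       "attention_mask": [1] * remainder + [0] * pad_len,
                       "labels": labels})
    return blocks

stream = list(range(10))                # 10 tokens, block_size = 4
print(len(pack(stream, 4, "drop")))     # 2 full blocks; 2 remainder tokens dropped
print(len(pack(stream, 4, "pad")))      # 3 blocks; last one padded + loss-masked
```

With 10 tokens and `block_size=4`, `drop` keeps 2 blocks (8 tokens) while `pad` emits a third block `[8, 9, pad, pad]` whose padded positions are excluded from both attention and loss.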

---

## Logging

Trainer‑Kit writes **machine‑readable logs** in JSONL.

### Training logs (`logs/train.jsonl`)

Each entry includes:

* `step`
* `loss`
* `grad_norm`
* `learning_rate`
* `progress_pct` (step progress when `max_steps` is active)
* an ETA estimate

### Eval logs (`logs/eval.jsonl`)

Each entry includes:

* `eval_loss`
* `perplexity`

Notes:

* When using `max_steps`, the Trainer's internal `epoch` counter can grow unexpectedly on tiny datasets (because steps/epoch becomes ~1).
  **Use `progress_pct` as the reliable indicator** for step‑based runs.

---
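Because both logs are plain JSONL, they can be consumed with a few lines of Python. The sketch below reads a log file and recomputes guarded perplexity; `safe_ppl` mirrors the "safe overflow guard" mentioned above but is an illustrative stand-in, not the exact `_safe_exp` used by `run_cpt.py`:

```python
import json
import math

def safe_ppl(eval_loss, cap=700.0):
    # math.exp overflows for arguments above ~709; return inf instead of raising.
    if eval_loss is None or eval_loss > cap:
        return float("inf")
    return math.exp(eval_loss)

def read_jsonl(path):
    # One JSON object per line; skip blank lines.
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# e.g. records = read_jsonl("runs/cpt_run_v1/logs/eval.jsonl")
#      print(records[-1]["eval_loss"], safe_ppl(records[-1]["eval_loss"]))
print(round(safe_ppl(2.0), 3))  # exp(2.0) ≈ 7.389
```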

## Checkpointing and resume

The trainer saves checkpoints under:

* `run_dir/checkpoints/`

Resume options:

* `resume_from_checkpoint: "auto"` → picks the latest checkpoint automatically
* `resume_from_checkpoint: "/path/to/checkpoint"` → resumes from a specific checkpoint
* `resume_from_checkpoint: null` → fresh run

---

## Merging adapters into a final model

Trainer‑Kit supports exporting a merged model:

### Merge after training

* Enable merge in the config (`merge.enabled: true`)
* The script will then:
  1. save the adapter
  2. free GPU memory
  3. reload the base model on **CPU**
  4. load the adapter
  5. `merge_and_unload()`
  6. save the final merged model

### Merge later

Run:

```
python run_cpt.py --config config.yaml --merge-only
```

This skips training and merges `run_dir/best_adapter` into the base model.

---

## How to run

### Train

```
python run_cpt.py --config config.yaml
```

### Merge only

```
python run_cpt.py --config config.yaml --merge-only
```
trainer-kit/CPT/commands.md ADDED
@@ -0,0 +1,15 @@
## Commands

Train (no merge):

```bash
python run_cpt.py --config config.yaml
```

Merge later:

```bash
python run_cpt.py --config config.yaml --merge-only
```

---
trainer-kit/CPT/config.yaml ADDED
@@ -0,0 +1,82 @@
run:
  run_dir: "./runs/cpt_run_v1"
  seed: 42

model:
  # Local model path (no download)
  repo_id: "/workspace/Models/Devstral-Small-2-24B-Instruct-2512"
  revision: null

  # Used only when repo_id is a HF repo (not a local path)
  base_local_dir: "base_model"

  trust_remote_code: true
  tokenizer_use_fast: true
  device_map: "auto"

  torch_dtype: "bfloat16"  # "float16" | "bfloat16" | "float32"

  # QLoRA
  use_4bit: false
  bnb_4bit_quant_type: "nf4"
  bnb_4bit_use_double_quant: false
  bnb_4bit_compute_dtype: "bfloat16"

  # optional: "flash_attention_2" | "sdpa" | null
  attn_implementation: null

data:
  train_jsonl: "/workspace/all_data_with_descriptions.jsonl"
  eval_jsonl: null
  eval_split_ratio: 0.1
  text_field: "text"
  block_size: 4096
  shuffle: true
  num_proc: 4

  # ✅ NEW: packing behavior
  #   "drop" = strict CPT (drop remainder)
  #   "pad"  = pad remainder to block_size + loss mask (-100) + attention_mask=0
  pack_mode: "pad"

peft:
  enabled: true
  r: 64
  lora_alpha: 128
  lora_dropout: 0.05
  bias: "none"
  target_modules: "auto"

train:
  # max_steps: 1000
  num_train_epochs: 2

  per_device_train_batch_size: 1
  per_device_eval_batch_size: 1
  gradient_accumulation_steps: 16

  learning_rate: 2e-5
  weight_decay: 0.0
  warmup_ratio: 0.1
  lr_scheduler_type: "cosine"

  optim: "paged_adamw_8bit"
  max_grad_norm: 1.0
  gradient_checkpointing: true

  logging_steps: 1
  save_strategy: "steps"
  save_steps: 100
  save_total_limit: 4

  evaluation_strategy: "steps"
  eval_steps: 50
  load_best_model_at_end: true

  resume_from_checkpoint: "auto"

merge:
  enabled: true
  merged_dtype: "float16"
  max_shard_size: "2GB"
  output_dir: "./merged_24b_cpt_lora"
trainer-kit/CPT/detailed_parameter_documentation.md ADDED
@@ -0,0 +1,795 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # CPT Configuration Parameters: Detailed Guide
2
+
3
+ This document provides a comprehensive explanation of all configuration parameters in `config.yaml` and how they're implemented in `run_cpt.py`.
4
+
5
+ ## Table of Contents
6
+ - [Run Parameters](#run-parameters)
7
+ - [Model Parameters](#model-parameters)
8
+ - [Data Parameters](#data-parameters)
9
+ - [PEFT Parameters](#peft-parameters)
10
+ - [Training Parameters](#training-parameters)
11
+ - [Merge Parameters](#merge-parameters)
12
+
13
+ ---
14
+
15
+ ## Run Parameters
16
+
17
+ ### `run.run_dir`
18
+ - **Type**: String (path)
19
+ - **Required**: Yes
20
+ - **Default**: No default
21
+ - **Description**: Directory where training outputs will be saved
22
+ - **Used in**: Line ~480 in `run_cpt.py`
23
+ - **Implementation**:
24
+ ```python
25
+ run_dir = _ensure_dir(Path(cfg["run"]["run_dir"]))
26
+ ```
27
+ - **Example Values**:
28
+ - `./runs/cpt_run_v1`
29
+ - `/workspace/outputs/my_experiment`
30
+ - `./checkpoints/cpt_experiment`
31
+
32
+ ### `run.seed`
33
+ - **Type**: Integer
34
+ - **Required**: No
35
+ - **Default**: None
36
+ - **Description**: Random seed for reproducibility
37
+ - **Used in**: Lines ~460, ~240 in `run_cpt.py`
38
+ - **Implementation**:
39
+ ```python
40
+ set_seed(int(cfg["run"].get("seed", 42)))
41
+ # Used in data shuffling and train/test split
42
+ ```
43
+ - **Example Values**: `42`, `123`, `2023`
44
+
45
+ ---
46
+
47
+ ## Model Parameters
48
+
49
+ ### `model.repo_id`
50
+ - **Type**: String (path or HuggingFace repo)
51
+ - **Required**: Yes
52
+ - **Default**: No default
53
+ - **Description**: Model identifier - can be local path or HuggingFace repository
54
+ - **Used in**: Lines ~480-500 in `run_cpt.py`
55
+ - **Implementation**:
56
+ ```python
57
+ repo_id = str(model_cfg["repo_id"]).strip()
58
+ repo_path = Path(repo_id)
59
+ if repo_path.exists() and repo_path.is_dir():
60
+ base_dir = repo_path # Local path
61
+ else:
62
+ # Download from HuggingFace
63
+ snapshot_download(repo_id=repo_id, ...)
64
+ ```
65
+ - **Example Values**:
66
+ - Local: `/workspace/Models/Devstral-Small-2-24B-Instruct-2512`
67
+ - HF Repo: `meta-llama/Llama-2-7b-hf`
68
+
69
+ ### `model.revision`
70
+ - **Type**: String or null
71
+ - **Required**: No
72
+ - **Default**: null
73
+ - **Description**: Specific model revision/branch/tag from HuggingFace
74
+ - **Used in**: Line ~495 in `run_cpt.py`
75
+ - **Implementation**:
76
+ ```python
77
+ snapshot_download(
78
+ repo_id=repo_id,
79
+ revision=model_cfg.get("revision", None),
80
+ ...
81
+ )
82
+ ```
83
+ - **Example Values**:
84
+ - `"main"` - Main branch
85
+ - `"v1.0"` - Specific tag
86
+ - `"abc123def"` - Specific commit hash
87
+ - `null` - Latest version
88
+
89
+ ### `model.base_local_dir`
90
+ - **Type**: String (path)
91
+ - **Required**: No
92
+ - **Default**: `"base_model"`
93
+ - **Description**: Directory name for downloaded model when using HF repo
94
+ - **Used in**: Line ~495 in `run_cpt.py`
95
+ - **Implementation**:
96
+ ```python
97
+ base_dir = _ensure_dir(run_dir / model_cfg.get("base_local_dir", "base_model"))
98
+ ```
99
+ - **Example Values**: `"base_model"`, `"downloaded_model"`, `"model_files"`
100
+
101
+ ### `model.trust_remote_code`
102
+ - **Type**: Boolean
103
+ - **Required**: No
104
+ - **Default**: `true`
105
+ - **Description**: Allow loading models with custom code
106
+ - **Used in**: Lines ~320, ~340, ~450 in `run_cpt.py`
107
+ - **Implementation**:
108
+ ```python
109
+ tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
110
+ model = AutoModelForCausalLM.from_pretrained(..., trust_remote_code=trust_remote_code, ...)
111
+ ```
112
+ - **Example Values**: `true`, `false`
113
+
114
+ ### `model.tokenizer_use_fast`
115
+ - **Type**: Boolean
116
+ - **Required**: No
117
+ - **Default**: `true`
118
+ - **Description**: Use fast tokenizer implementation
119
+ - **Used in**: Lines ~320, ~450 in `run_cpt.py`
120
+ - **Implementation**:
121
+ ```python
122
+ tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
123
+ ```
124
+ - **Example Values**: `true`, `false`
125
+
126
+ ### `model.device_map`
127
+ - **Type**: String
128
+ - **Required**: No
129
+ - **Default**: `"auto"`
130
+ - **Description**: How to distribute model across devices
131
+ - **Used in**: Lines ~350, ~370 in `run_cpt.py`
132
+ - **Implementation**:
133
+ ```python
134
+ model = AutoModelForCausalLM.from_pretrained(..., device_map=device_map, ...)
135
+ ```
136
+ - **Example Values**:
137
+ - `"auto"` - Automatic distribution
138
+ - `"cpu"` - CPU only
139
+ - `"cuda:0"` - Single GPU
140
+ - `{"": 0}` - Manual mapping
141
+
142
+ ### `model.torch_dtype`
143
+ - **Type**: String
144
+ - **Required**: No
145
+ - **Default**: `"bfloat16"`
146
+ - **Description**: Data type for model tensors
147
+ - **Used in**: Lines ~45, ~350 in `run_cpt.py`
148
+ - **Implementation**:
149
+ ```python
150
+ def _dtype_from_str(s: str) -> torch.dtype:
151
+ if s in ("float16", "fp16"): return torch.float16
152
+ if s in ("bfloat16", "bf16"): return torch.bfloat16
153
+ if s in ("float32", "fp32"): return torch.float32
154
+ ```
155
+ - **Example Values**:
156
+ - `"float16"` - 16-bit floats (faster, less memory, less stable)
157
+ - `"bfloat16"` - Brain float16 (stable, good for training)
158
+ - `"float32"` - 32-bit floats (slowest, most memory)
159
+
160
+ ### `model.use_4bit`
161
+ - **Type**: Boolean
162
+ - **Required**: No
163
+ - **Default**: `false`
164
+ - **Description**: Use 4-bit quantization for memory efficiency
165
+ - **Used in**: Lines ~325, ~395 in `run_cpt.py`
166
+ - **Implementation**:
167
+ ```python
168
+ use_4bit = bool(model_cfg.get("use_4bit", False))
169
+ if use_4bit:
170
+ quant_cfg = BitsAndBytesConfig(load_in_4bit=True, ...)
171
+ ```
172
+ - **Example Values**: `true`, `false`
173
+
174
+ ### `model.bnb_4bit_quant_type`
175
+ - **Type**: String
176
+ - **Required**: No
177
+ - **Default**: `"nf4"`
178
+ - **Description**: 4-bit quantization type
179
+ - **Used in**: Lines ~328 in `run_cpt.py`
180
+ - **Implementation**:
181
+ ```python
182
+ bnb_4bit_quant_type=str(model_cfg.get("bnb_4bit_quant_type", "nf4"))
183
+ ```
184
+ - **Example Values**:
185
+ - `"nf4"` - NormalFloat4 (recommended)
186
+ - `"fp4"` - FloatingPoint4
187
+ - `"int4"` - Integer4
188
+
189
+ ### `model.bnb_4bit_use_double_quant`
190
+ - **Type**: Boolean
191
+ - **Required**: No
192
+ - **Default**: `false`
193
+ - **Description**: Use double quantization for memory efficiency
194
+ - **Used in**: Lines ~329 in `run_cpt.py`
195
+ - **Implementation**:
196
+ ```python
197
+ bnb_4bit_use_double_quant=bool(model_cfg.get("bnb_4bit_use_double_quant", True))
198
+ ```
199
+ - **Example Values**: `true`, `false`
200
+
201
+ ### `model.bnb_4bit_compute_dtype`
202
+ - **Type**: String
203
+ - **Required**: No
204
+ - **Default**: `"bfloat16"`
205
+ - **Description**: Compute dtype for 4-bit quantization
206
+ - **Used in**: Lines ~330 in `run_cpt.py`
207
+ - **Implementation**:
208
+ ```python
209
+ bnb_4bit_compute_dtype=_dtype_from_str(model_cfg.get("bnb_4bit_compute_dtype", "bfloat16"))
210
+ ```
211
+ - **Example Values**: `"float16"`, `"bfloat16"`, `"float32"`
212
+
213
+ ### `model.attn_implementation`
214
+ - **Type**: String or null
215
+ - **Required**: No
216
+ - **Default**: `null`
217
+ - **Description**: Attention implementation to use
218
+ - **Used in**: Lines ~155, ~350 in `run_cpt.py`
219
+ - **Implementation**:
220
+ ```python
221
+ def _choose_attn_impl(cfg: Dict[str, Any]) -> Optional[str]:
222
+ return cfg.get("model", {}).get("attn_implementation", None)
223
+ # Used in model.from_pretrained(..., attn_implementation=attn_impl, ...)
224
+ ```
225
+ - **Example Values**:
226
+ - `"flash_attention_2"` - Flash Attention 2 (fastest; requires the `flash-attn` package)
227
+ - `"sdpa"` - Scaled Dot-Product Attention
228
+ - `null` - Default implementation
229
+
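Taken together, the `model.*` quantization keys above are consumed as one unit when the quantization config is built. A minimal sketch of that mapping (a plain dict so it runs without `transformers` installed; the defaults mirror the fallbacks quoted from `run_cpt.py`, and the `model_cfg` values are illustrative):

```python
# Sketch: how the model.* quantization keys combine into BitsAndBytesConfig
# kwargs. Defaults mirror the .get(...) fallbacks used in run_cpt.py.
def bnb_kwargs(model_cfg: dict) -> dict:
    if not model_cfg.get("use_4bit", False):
        return {}  # no quantization config at all
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": str(model_cfg.get("bnb_4bit_quant_type", "nf4")),
        "bnb_4bit_use_double_quant": bool(model_cfg.get("bnb_4bit_use_double_quant", True)),
        "bnb_4bit_compute_dtype": model_cfg.get("bnb_4bit_compute_dtype", "bfloat16"),
    }

print(bnb_kwargs({"use_4bit": True, "bnb_4bit_quant_type": "fp4"}))
```

In the real script these kwargs feed `BitsAndBytesConfig(...)`, with `bnb_4bit_compute_dtype` first converted by `_dtype_from_str`.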
230
+ ---
231
+
232
+ ## Data Parameters
233
+
234
+ ### `data.train_jsonl`
235
+ - **Type**: String (path)
236
+ - **Required**: Yes
237
+ - **Default**: No default
238
+ - **Description**: Path to training data in JSONL format
239
+ - **Used in**: Lines ~170 in `run_cpt.py`
240
+ - **Implementation**:
241
+ ```python
242
+ train_path = data_cfg["train_jsonl"]
243
+ ds = load_dataset("json", data_files={"train": train_path})
244
+ ```
245
+ - **Example Values**: `"/workspace/all_data_with_descriptions.jsonl"`
246
+
247
+ ### `data.eval_jsonl`
248
+ - **Type**: String (path) or null
249
+ - **Required**: No
250
+ - **Default**: `null`
251
+ - **Description**: Path to evaluation data in JSONL format
252
+ - **Used in**: Lines ~175 in `run_cpt.py`
253
+ - **Implementation**:
254
+ ```python
255
+ eval_path = data_cfg.get("eval_jsonl", None)
256
+ if eval_path:
257
+ ds_eval = load_dataset("json", data_files={"eval": eval_path})
258
+ ```
259
+ - **Example Values**: `null` (no separate eval file), `"/workspace/eval_data.jsonl"`
260
+
261
+ ### `data.eval_split_ratio`
262
+ - **Type**: Float
263
+ - **Required**: No
264
+ - **Default**: `0.0` (no split; the code falls back to `0.0` when the key is absent)
265
+ - **Description**: Ratio of training data to use for evaluation split
266
+ - **Used in**: Lines ~177 in `run_cpt.py`
267
+ - **Implementation**:
268
+ ```python
269
+ split_ratio = float(data_cfg.get("eval_split_ratio", 0.0))
270
+ if 0.0 < split_ratio < 1.0:
271
+ split = ds["train"].train_test_split(test_size=split_ratio, seed=seed)
272
+ ```
273
+ - **Example Values**: `0.1` (10%), `0.2` (20%), `0.05` (5%)
274
+
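The precedence between `eval_jsonl` and `eval_split_ratio` can be summarized in a few lines (a sketch of the selection logic in `build_datasets`; the helper name `eval_source` is ours):

```python
# Which eval set does a run get? An explicit eval_jsonl wins; otherwise a
# valid eval_split_ratio carves eval data out of the training set.
def eval_source(data_cfg: dict) -> str:
    if data_cfg.get("eval_jsonl"):
        return "file"
    if 0.0 < float(data_cfg.get("eval_split_ratio", 0.0)) < 1.0:
        return "split"
    return "none"

print(eval_source({"eval_jsonl": "/workspace/eval_data.jsonl"}))  # file
print(eval_source({"eval_split_ratio": 0.1}))                     # split
```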
275
+ ### `data.text_field`
276
+ - **Type**: String
277
+ - **Required**: No
278
+ - **Default**: `"text"`
279
+ - **Description**: Field name in JSONL containing the text data
280
+ - **Used in**: Lines ~185 in `run_cpt.py`
281
+ - **Implementation**:
282
+ ```python
283
+ text_field = data_cfg.get("text_field", "text")
284
+ # Read inside tokenize_fn for each example
+ out = tokenizer(
+ examples[text_field],
+ add_special_tokens=False,
+ truncation=False,
+ padding=False,
+ )
291
+ ```
292
+ - **Example Values**: `"text"`, `"content"`, `"prompt"`, `"input"`
293
+
294
+ ### `data.block_size`
295
+ - **Type**: Integer
296
+ - **Required**: No
297
+ - **Default**: `2048` (the code falls back to `2048` when the key is absent)
298
+ - **Description**: Maximum sequence length for training
299
+ - **Used in**: Lines ~180 in `run_cpt.py`
300
+ - **Implementation**:
301
+ ```python
302
+ block_size = int(data_cfg.get("block_size", 2048))
303
+ # Used in grouping texts into blocks
304
+ for i in range(0, full_len, block_size):
305
+ chunk = concatenated["input_ids"][i:i + block_size]
306
+ ```
307
+ - **Example Values**: `2048`, `4096`, `8192`
308
+
309
+ ### `data.shuffle`
310
+ - **Type**: Boolean
311
+ - **Required**: No
312
+ - **Default**: `true`
313
+ - **Description**: Whether to shuffle training data
314
+ - **Used in**: Lines ~235 in `run_cpt.py`
315
+ - **Implementation**:
316
+ ```python
317
+ if shuffle:
318
+ tokenized_train = tokenized_train.shuffle(seed=int(cfg["run"].get("seed", 42)))
319
+ ```
320
+ - **Example Values**: `true`, `false`
321
+
322
+ ### `data.num_proc`
323
+ - **Type**: Integer
324
+ - **Required**: No
325
+ - **Default**: `4`
326
+ - **Description**: Number of worker processes used by `datasets.map` for tokenization and packing
327
+ - **Used in**: Lines ~200, ~210 in `run_cpt.py`
328
+ - **Implementation**:
329
+ ```python
330
+ num_proc = int(data_cfg.get("num_proc", 4))
331
+ tokenized_train = dsd["train"].map(
332
+ tokenize_fn,
333
+ batched=True,
334
+ num_proc=num_proc,
335
+ ...
336
+ )
337
+ ```
338
+ - **Example Values**: `1`, `4`, `8`, `16`
339
+
340
+ ### `data.pack_mode`
341
+ - **Type**: String
342
+ - **Required**: No
343
+ - **Default**: `"drop"` (the code falls back to `"drop"` when the key is absent)
344
+ - **Description**: How to handle remainder tokens in final block
345
+ - **Used in**: Lines ~150-230 in `run_cpt.py`
346
+ - **Implementation**:
347
+ ```python
348
+ pack_mode = str(data_cfg.get("pack_mode", "drop")).lower().strip()
349
+ if pack_mode == "pad":
350
+ # Pad remainder and mask loss
351
+ labels[-pad_len:] = [-100] * pad_len
352
+ # If "drop": ignore remainder entirely
353
+ ```
354
+ - **Example Values**:
355
+ - `"drop"` - Drop incomplete blocks (strict CPT)
356
+ - `"pad"` - Pad incomplete blocks with masked loss
357
+
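A toy illustration of the two packing modes (this mirrors the `group_texts` logic on plain Python lists; the token ids and helper name are made up):

```python
# Concatenate token ids, cut them into block_size chunks, and either drop
# the remainder ("drop") or pad it with loss-masked labels ("pad").
def pack(ids, block_size, pack_mode="drop", pad_id=0):
    blocks = []
    full_len = (len(ids) // block_size) * block_size
    for i in range(0, full_len, block_size):
        chunk = ids[i:i + block_size]
        blocks.append({"input_ids": chunk, "labels": chunk.copy()})
    rem = len(ids) - full_len
    if rem and pack_mode == "pad":
        pad_len = block_size - rem
        chunk = ids[full_len:] + [pad_id] * pad_len
        labels = chunk.copy()
        labels[-pad_len:] = [-100] * pad_len  # -100 masks the loss on padding
        blocks.append({"input_ids": chunk, "labels": labels})
    return blocks

print(len(pack(list(range(10)), 4)))          # drop -> 2 full blocks
print(len(pack(list(range(10)), 4, "pad")))   # pad  -> 3 blocks
```

With 10 tokens and `block_size=4`, `"drop"` loses the last 2 tokens, while `"pad"` keeps them in a third block whose padded positions contribute no loss.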
358
+ ---
359
+
360
+ ## PEFT Parameters
361
+
362
+ ### `peft.enabled`
363
+ - **Type**: Boolean
364
+ - **Required**: No
365
+ - **Default**: `true`
366
+ - **Description**: Whether to use PEFT (Parameter-Efficient Fine-Tuning)
367
+ - **Used in**: Lines ~395 in `run_cpt.py`
368
+ - **Implementation**:
369
+ ```python
370
+ if not bool(peft_cfg.get("enabled", True)):
371
+ return model, None
372
+ # Otherwise proceed with LoRA configuration
373
+ ```
374
+ - **Example Values**: `true`, `false`
375
+
376
+ ### `peft.r`
377
+ - **Type**: Integer
378
+ - **Required**: No
379
+ - **Default**: `16` (the code falls back to `16` when the key is absent)
380
+ - **Description**: LoRA rank - dimension of low-rank matrices
381
+ - **Used in**: Lines ~415 in `run_cpt.py`
382
+ - **Implementation**:
383
+ ```python
384
+ lora_config = LoraConfig(
385
+ r=int(peft_cfg.get("r", 16)),
386
+ ...
387
+ )
388
+ ```
389
+ - **Example Values**: `8`, `16`, `32`, `64`, `128`
390
+ - **Note**: Higher values = more parameters but potentially better performance
391
+
392
+ ### `peft.lora_alpha`
393
+ - **Type**: Integer
394
+ - **Required**: No
395
+ - **Default**: `32` (the code falls back to `32` when the key is absent)
396
+ - **Description**: LoRA alpha scaling parameter
397
+ - **Used in**: Lines ~416 in `run_cpt.py`
398
+ - **Implementation**:
399
+ ```python
400
+ lora_config = LoraConfig(
401
+ lora_alpha=int(peft_cfg.get("lora_alpha", 32)),
402
+ ...
403
+ )
404
+ ```
405
+ - **Example Values**: `16`, `32`, `64`, `128`, `256`
406
+
407
+ ### `peft.lora_dropout`
408
+ - **Type**: Float
409
+ - **Required**: No
410
+ - **Default**: `0.05`
411
+ - **Description**: Dropout rate for LoRA layers
412
+ - **Used in**: Lines ~417 in `run_cpt.py`
413
+ - **Implementation**:
414
+ ```python
415
+ lora_config = LoraConfig(
416
+ lora_dropout=float(peft_cfg.get("lora_dropout", 0.05)),
417
+ ...
418
+ )
419
+ ```
420
+ - **Example Values**: `0.0`, `0.05`, `0.1`, `0.2`
421
+
422
+ ### `peft.bias`
423
+ - **Type**: String
424
+ - **Required**: No
425
+ - **Default**: `"none"`
426
+ - **Description**: Bias training strategy
427
+ - **Used in**: Lines ~418 in `run_cpt.py`
428
+ - **Implementation**:
429
+ ```python
430
+ lora_config = LoraConfig(
431
+ bias=str(peft_cfg.get("bias", "none")),
432
+ ...
433
+ )
434
+ ```
435
+ - **Example Values**:
436
+ - `"none"` - No bias training
437
+ - `"all"` - Train all biases
438
+ - `"lora_only"` - Only LoRA bias
439
+
440
+ ### `peft.target_modules`
441
+ - **Type**: String or List
442
+ - **Required**: No
443
+ - **Default**: `"auto"`
444
+ - **Description**: Which modules to apply LoRA to
445
+ - **Used in**: Lines ~405, ~140-170 in `run_cpt.py`
446
+ - **Implementation**:
447
+ ```python
448
+ target_modules = peft_cfg.get("target_modules", "auto")
449
+ if target_modules == "auto":
450
+ target_modules = _infer_target_modules(model)
451
+ ```
452
+ - **Example Values**:
453
+ - `"auto"` - Automatic detection
454
+ - `["q_proj", "k_proj", "v_proj", "o_proj"]` - Explicit list
455
+ - `["mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]` - MLP only
456
+
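The `"auto"` detection can be sketched on plain sets of leaf-module names (same candidate groups as `_infer_target_modules`; the standalone function here is ours):

```python
# Pick the first known attention-projection group that the model's leaf
# module names fully contain, in the same order run_cpt.py tries them.
def infer_target_modules(leaf_names: set) -> list:
    for group in (["q_proj", "k_proj", "v_proj", "o_proj"],   # Llama-style
                  ["Wqkv", "out_proj"],                        # MPT-style
                  ["query_key_value", "dense"],                # Falcon-style
                  ["c_attn", "c_proj"]):                       # GPT-2-style
        if all(n in leaf_names for n in group):
            return group
    raise ValueError("set peft.target_modules explicitly")

print(infer_target_modules({"q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"}))
```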
457
+ ---
458
+
459
+ ## Training Parameters
460
+
461
+ ### `train.num_train_epochs`
462
+ - **Type**: Float
463
+ - **Required**: No
464
+ - **Default**: `1` (the code falls back to `1` when the key is absent)
465
+ - **Description**: Number of epochs to train
466
+ - **Used in**: Lines ~470 in `run_cpt.py`
467
+ - **Implementation**:
468
+ ```python
469
+ num_train_epochs = float(tr_cfg.get("num_train_epochs", 1))
470
+ # Used in TrainingArguments
471
+ ```
472
+ - **Example Values**: `1.0`, `2.0`, `3.5`
473
+
474
+ ### `train.per_device_train_batch_size`
475
+ - **Type**: Integer
476
+ - **Required**: No
477
+ - **Default**: `1`
478
+ - **Description**: Training batch size per device
479
+ - **Used in**: Lines ~475 in `run_cpt.py`
480
+ - **Implementation**:
481
+ ```python
482
+ per_device_train_batch_size=int(tr_cfg.get("per_device_train_batch_size", 1))
483
+ ```
484
+ - **Example Values**: `1`, `2`, `4`, `8`
485
+
486
+ ### `train.per_device_eval_batch_size`
487
+ - **Type**: Integer
488
+ - **Required**: No
489
+ - **Default**: Same as train batch size
490
+ - **Description**: Evaluation batch size per device
491
+ - **Used in**: Lines ~476 in `run_cpt.py`
492
+ - **Implementation**:
493
+ ```python
494
+ per_device_eval_batch_size=int(tr_cfg.get("per_device_eval_batch_size", tr_cfg.get("per_device_train_batch_size", 1)))
495
+ ```
496
+ - **Example Values**: `1`, `2`, `4`, `8`
497
+
498
+ ### `train.gradient_accumulation_steps`
499
+ - **Type**: Integer
500
+ - **Required**: No
501
+ - **Default**: `1` (the code falls back to `1` when the key is absent)
502
+ - **Description**: Number of steps to accumulate gradients
503
+ - **Used in**: Lines ~477 in `run_cpt.py`
504
+ - **Implementation**:
505
+ ```python
506
+ gradient_accumulation_steps=int(tr_cfg.get("gradient_accumulation_steps", 1))
507
+ ```
508
+ - **Example Values**: `1`, `4`, `8`, `16`, `32`
509
+
510
+ ### `train.learning_rate`
511
+ - **Type**: Float
512
+ - **Required**: No
513
+ - **Default**: `2e-5`
514
+ - **Description**: Learning rate for optimizer
515
+ - **Used in**: Lines ~478 in `run_cpt.py`
516
+ - **Implementation**:
517
+ ```python
518
+ learning_rate=float(tr_cfg.get("learning_rate", 2e-5))
519
+ ```
520
+ - **Example Values**: `1e-5`, `2e-5`, `5e-5`, `1e-4`
521
+
522
+ ### `train.weight_decay`
523
+ - **Type**: Float
524
+ - **Required**: No
525
+ - **Default**: `0.0`
526
+ - **Description**: Weight decay for regularization
527
+ - **Used in**: Lines ~479 in `run_cpt.py`
528
+ - **Implementation**:
529
+ ```python
530
+ weight_decay=float(tr_cfg.get("weight_decay", 0.0))
531
+ ```
532
+ - **Example Values**: `0.0`, `0.01`, `0.1`
533
+
534
+ ### `train.warmup_ratio`
535
+ - **Type**: Float
536
+ - **Required**: No
537
+ - **Default**: `0.0` (the code falls back to `0.0` when the key is absent)
538
+ - **Description**: Ratio of steps for learning rate warmup
539
+ - **Used in**: Lines ~480 in `run_cpt.py`
540
+ - **Implementation**:
541
+ ```python
542
+ warmup_ratio=float(tr_cfg.get("warmup_ratio", 0.0))
543
+ ```
544
+ - **Example Values**: `0.0`, `0.1`, `0.2`
545
+
546
+ ### `train.lr_scheduler_type`
547
+ - **Type**: String
548
+ - **Required**: No
549
+ - **Default**: `"cosine"`
550
+ - **Description**: Learning rate scheduler type
551
+ - **Used in**: Lines ~481 in `run_cpt.py`
552
+ - **Implementation**:
553
+ ```python
554
+ lr_scheduler_type=str(tr_cfg.get("lr_scheduler_type", "cosine"))
555
+ ```
556
+ - **Example Values**:
557
+ - `"cosine"` - Cosine annealing
558
+ - `"linear"` - Linear decay
559
+ - `"constant"` - Constant rate
560
+ - `"polynomial"` - Polynomial decay
561
+
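For reference, `warmup_ratio` is turned into a warmup step count roughly like this (HF `Trainer` uses the ratio only when no explicit `warmup_steps` is given; this is a sketch of that arithmetic, not the exact library code):

```python
import math

# warmup_ratio -> number of warmup steps, given the total optimizer steps.
def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    return math.ceil(total_steps * warmup_ratio)

print(warmup_steps(1000, 0.1))  # 100
```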
562
+ ### `train.optim`
563
+ - **Type**: String
564
+ - **Required**: No
565
+ - **Default**: `"paged_adamw_8bit"` (if 4-bit), `"adamw_torch"` (otherwise)
566
+ - **Description**: Optimizer type
567
+ - **Used in**: Lines ~482 in `run_cpt.py`
568
+ - **Implementation**:
569
+ ```python
570
+ optim=str(tr_cfg.get("optim", "paged_adamw_8bit" if bool(model_cfg.get("use_4bit", False)) else "adamw_torch"))
571
+ ```
572
+ - **Example Values**:
573
+ - `"adamw_torch"` - AdamW (standard)
574
+ - `"paged_adamw_8bit"` - paged AdamW with 8-bit optimizer states (default when `use_4bit` is on)
575
+ - `"sgd"` - SGD
576
+ - `"adafactor"` - Adafactor
577
+
578
+ ### `train.max_grad_norm`
579
+ - **Type**: Float
580
+ - **Required**: No
581
+ - **Default**: `1.0`
582
+ - **Description**: Maximum gradient norm for clipping
583
+ - **Used in**: Lines ~483 in `run_cpt.py`
584
+ - **Implementation**:
585
+ ```python
586
+ max_grad_norm=float(tr_cfg.get("max_grad_norm", 1.0))
587
+ ```
588
+ - **Example Values**: `0.5`, `1.0`, `2.0`
589
+
590
+ ### `train.gradient_checkpointing`
591
+ - **Type**: Boolean
592
+ - **Required**: No
593
+ - **Default**: `true`
594
+ - **Description**: Use gradient checkpointing to save memory
595
+ - **Used in**: Lines ~396-400 in `run_cpt.py`
596
+ - **Implementation**:
597
+ ```python
598
+ gradient_checkpointing = bool(tr_cfg.get("gradient_checkpointing", True))
599
+ if gradient_checkpointing:
600
+ model.gradient_checkpointing_enable()
601
+ ```
602
+ - **Example Values**: `true`, `false`
603
+
604
+ ### `train.logging_steps`
605
+ - **Type**: Integer
606
+ - **Required**: No
607
+ - **Default**: `10` (the code falls back to `10` when the key is absent)
608
+ - **Description**: Log training progress every N steps
609
+ - **Used in**: Lines ~485 in `run_cpt.py`
610
+ - **Implementation**:
611
+ ```python
612
+ logging_steps=int(tr_cfg.get("logging_steps", 10))
613
+ ```
614
+ - **Example Values**: `1`, `10`, `50`, `100`
615
+
616
+ ### `train.save_strategy`
617
+ - **Type**: String
618
+ - **Required**: No
619
+ - **Default**: `"steps"`
620
+ - **Description**: When to save model checkpoints
621
+ - **Used in**: Lines ~487 in `run_cpt.py`
622
+ - **Implementation**:
623
+ ```python
624
+ save_strategy=str(tr_cfg.get("save_strategy", "steps"))
625
+ ```
626
+ - **Example Values**:
627
+ - `"steps"` - Save every N steps
628
+ - `"epoch"` - Save every epoch
629
+ - `"no"` - Don't save
630
+
631
+ ### `train.save_steps`
632
+ - **Type**: Integer
633
+ - **Required**: No
634
+ - **Default**: `200` (the code falls back to `200` when the key is absent)
635
+ - **Description**: Save checkpoint every N steps
636
+ - **Used in**: Lines ~488 in `run_cpt.py`
637
+ - **Implementation**:
638
+ ```python
639
+ save_steps=int(tr_cfg.get("save_steps", 200))
640
+ ```
641
+ - **Example Values**: `50`, `100`, `200`, `500`
642
+
643
+ ### `train.save_total_limit`
644
+ - **Type**: Integer
645
+ - **Required**: No
646
+ - **Default**: `3` (the code falls back to `3` when the key is absent)
647
+ - **Description**: Maximum number of checkpoints to keep
648
+ - **Used in**: Lines ~489 in `run_cpt.py`
649
+ - **Implementation**:
650
+ ```python
651
+ save_total_limit=int(tr_cfg.get("save_total_limit", 3))
652
+ ```
653
+ - **Example Values**: `1`, `2`, `3`, `5`
654
+
655
+ ### `train.evaluation_strategy`
656
+ - **Type**: String
657
+ - **Required**: No
658
+ - **Default**: `"steps"` (if eval data), `"no"` (otherwise)
659
+ - **Description**: When to evaluate model
660
+ - **Used in**: Lines ~494 in `run_cpt.py`
661
+ - **Implementation**:
662
+ ```python
663
+ evaluation_strategy=str(tr_cfg.get("evaluation_strategy", "steps" if eval_ds is not None else "no"))
664
+ ```
665
+ - **Example Values**:
666
+ - `"steps"` - Evaluate every N steps
667
+ - `"epoch"` - Evaluate every epoch
668
+ - `"no"` - Don't evaluate
669
+
670
+ ### `train.eval_steps`
671
+ - **Type**: Integer
672
+ - **Required**: No
673
+ - **Default**: `200` (the code falls back to `200` when the key is absent)
674
+ - **Description**: Evaluate every N steps
675
+ - **Used in**: Lines ~491 in `run_cpt.py`
676
+ - **Implementation**:
677
+ ```python
678
+ eval_steps=int(tr_cfg.get("eval_steps", 200))
679
+ ```
680
+ - **Example Values**: `25`, `50`, `100`, `200`
681
+
682
+ ### `train.load_best_model_at_end`
683
+ - **Type**: Boolean
684
+ - **Required**: No
685
+ - **Default**: `true` (if eval data), `false` (otherwise)
686
+ - **Description**: Load best model at end of training
687
+ - **Used in**: Lines ~492-493 in `run_cpt.py`
688
+ - **Implementation**:
689
+ ```python
690
+ load_best_model_at_end=bool(tr_cfg.get("load_best_model_at_end", True)) if eval_ds is not None else False
691
+ ```
692
+ - **Example Values**: `true`, `false`
693
+
694
+ ### `train.resume_from_checkpoint`
695
+ - **Type**: String
696
+ - **Required**: No
697
+ - **Default**: `null` (no resume; set `"auto"` to auto-detect the latest checkpoint)
698
+ - **Description**: Resume training from checkpoint
699
+ - **Used in**: Lines ~510-520 in `run_cpt.py`
700
+ - **Implementation**:
701
+ ```python
702
+ resume_from = tr_cfg.get("resume_from_checkpoint", None)
703
+ if resume_from == "auto":
704
+ last = get_last_checkpoint(str(ckpt_dir))
705
+ resume_from = last if last else None
706
+ ```
707
+ - **Example Values**:
708
+ - `"auto"` - Auto-detect latest checkpoint
709
+ - `"checkpoint-100"` - Specific checkpoint
710
+ - `null` - Start from scratch
711
+
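The `"auto"` resolution is essentially "newest `checkpoint-*` directory wins". A standalone sketch (in the script this is delegated to `transformers.trainer_utils.get_last_checkpoint`; the helper name here is ours):

```python
from pathlib import Path

# Resolve "auto" to the highest-numbered checkpoint-* directory under the
# run's checkpoint dir; pass through explicit paths and None unchanged.
def resolve_resume(resume_from, ckpt_dir: Path):
    if resume_from != "auto":
        return resume_from  # explicit checkpoint path, or None for a fresh start
    ckpts = [p for p in ckpt_dir.glob("checkpoint-*") if p.is_dir()]
    if not ckpts:
        return None
    return str(max(ckpts, key=lambda p: int(p.name.rsplit("-", 1)[-1])))
```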
712
+ ---
713
+
714
+ ## Merge Parameters
715
+
716
+ ### `merge.enabled`
717
+ - **Type**: Boolean
718
+ - **Required**: No
719
+ - **Default**: `false`
720
+ - **Description**: Whether to merge LoRA adapters with base model
721
+ - **Used in**: Lines ~545 in `run_cpt.py`
722
+ - **Implementation**:
723
+ ```python
724
+ if bool(cfg.get("merge", {}).get("enabled", False)):
725
+ merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
726
+ ```
727
+ - **Example Values**: `true`, `false`
728
+
729
+ ### `merge.merged_dtype`
730
+ - **Type**: String
731
+ - **Required**: No
732
+ - **Default**: `"float16"`
733
+ - **Description**: Data type for merged model
734
+ - **Used in**: Lines ~430 in `run_cpt.py`
735
+ - **Implementation**:
736
+ ```python
737
+ merged_dtype = _dtype_from_str(merge_cfg.get("merged_dtype", "float16"))
738
+ ```
739
+ - **Example Values**: `"float16"`, `"bfloat16"`, `"float32"`
740
+
741
+ ### `merge.max_shard_size`
742
+ - **Type**: String
743
+ - **Required**: No
744
+ - **Default**: `"2GB"`
745
+ - **Description**: Maximum size per shard when saving
746
+ - **Used in**: Lines ~445 in `run_cpt.py`
747
+ - **Implementation**:
748
+ ```python
749
+ merged.save_pretrained(str(final_dir), safe_serialization=True, max_shard_size=max_shard_size)
750
+ ```
751
+ - **Example Values**: `"1GB"`, `"2GB"`, `"5GB"`
752
+
753
+ ### `merge.output_dir`
754
+ - **Type**: String (path)
755
+ - **Required**: No
756
+ - **Default**: `"final_model"` under the run directory (the code falls back to `run_dir / "final_model"`)
757
+ - **Description**: Directory for merged model output
758
+ - **Used in**: Lines ~505-510 in `run_cpt.py`
759
+ - **Implementation**:
760
+ ```python
761
+ if merge_cfg.get("output_dir"):
762
+ od = Path(str(merge_cfg["output_dir"]))
763
+ final_dir = od if od.is_absolute() else (run_dir / od)
764
+ else:
765
+ final_dir = run_dir / "final_model"
766
+ ```
767
+ - **Example Values**: `"./merged_model"`, `"/workspace/final_model"`, `"./models/merged"`
768
+
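How `output_dir` is resolved against the run directory (a sketch matching the implementation snippet above; `run_dir` stands for wherever the run writes its artifacts):

```python
from pathlib import Path

# Relative output_dir values are resolved against the run directory;
# absolute paths are used as-is; unset falls back to run_dir/final_model.
def resolve_final_dir(merge_cfg: dict, run_dir: Path) -> Path:
    if merge_cfg.get("output_dir"):
        od = Path(str(merge_cfg["output_dir"]))
        return od if od.is_absolute() else run_dir / od
    return run_dir / "final_model"

print(resolve_final_dir({}, Path("/runs/exp1")))  # /runs/exp1/final_model
```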
769
+ ---
770
+
771
+ ## Parameter Dependencies and Interactions
772
+
773
+ ### Memory-Related Dependencies
774
+ - `per_device_train_batch_size` × `gradient_accumulation_steps` × number of devices = effective batch size
775
+ - `block_size` affects memory usage significantly
776
+ - `use_4bit` + `bnb_4bit_*` parameters work together for quantization
777
+ - `gradient_checkpointing` can enable larger `block_size` or `batch_size`
778
+
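The first dependency above is simple arithmetic worth checking before launching a run (a sketch; `world_size` means the number of training processes/GPUs):

```python
# Effective (global) batch size in sequences per optimizer step.
def effective_batch(per_device: int, grad_accum: int, world_size: int = 1) -> int:
    return per_device * grad_accum * world_size

print(effective_batch(1, 16))    # 16 sequences per optimizer step
print(effective_batch(2, 8, 4))  # 64 on 4 GPUs
```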
779
+ ### Training Strategy Dependencies
780
+ - `evaluation_strategy` requires either `eval_jsonl` or `eval_split_ratio > 0`
781
+ - `load_best_model_at_end` requires `evaluation_strategy` to be enabled
782
+ - `save_strategy` should be compatible with `evaluation_strategy`
783
+ - `lr_scheduler_type` affects warmup calculations
784
+
785
+ ### Model-Specific Dependencies
786
+ - `target_modules` must match the actual module names in your model
787
+ - `torch_dtype` should be compatible with your GPU hardware
788
+ - `device_map` affects whether you can use certain optimizations
789
+
790
+ ### Data Processing Dependencies
791
+ - `text_field` must exist in your JSONL data
792
+ - `pack_mode: "pad"` requires `block_size` to be set appropriately
793
+ - `eval_split_ratio` is ignored if `eval_jsonl` is provided
794
+
795
+ This comprehensive documentation should help you understand and configure all parameters in the CPT training system according to your specific needs and constraints.
trainer-kit/CPT/dummy_data.jsonl ADDED
@@ -0,0 +1,6 @@
1
+ {"text": "This is a test sentence for the dummy dataset."}
2
+ {"text": "Another sentence to check if training works."}
3
+ {"text": "We need enough data to form a batch."}
4
+ {"text": "FSDP and LoRA are cool technologies."}
5
+ {"text": "Fine-tuning LLMs is fun and useful."}
6
+ {"text": "This is the end of the dummy dataset."}
trainer-kit/CPT/requirements.txt ADDED
@@ -0,0 +1,22 @@
1
+ # Core
2
+ torch>=2.1.0
3
+ transformers>=4.41.0
4
+ datasets>=2.18.0
5
+ accelerate>=0.30.0
6
+
7
+ # PEFT / QLoRA
8
+ peft>=0.11.1
9
+ bitsandbytes>=0.43.1
10
+
11
+ # Hugging Face Hub (local + download support)
12
+ huggingface_hub>=0.23.0
13
+
14
+ # Config + utilities
15
+ pyyaml>=6.0
16
+ tqdm>=4.66.0
17
+
18
+ # Optional but recommended (tokenizers speed)
19
+ tokenizers>=0.15.0
20
+ safetensors>=0.4.2
21
+ # Optional (for eval)
22
+ rouge-score>=0.1.2
trainer-kit/CPT/run_cpt.py ADDED
@@ -0,0 +1,708 @@
1
+ import argparse
2
+ import json
3
+ import inspect # Added for Transformers version compatibility
4
+ import math
5
+ import time
6
+ from pathlib import Path
7
+ from typing import Any, Dict, Optional, Tuple, List
8
+
9
+ import torch
10
+ import yaml
11
+ from datasets import load_dataset, DatasetDict
12
+ from huggingface_hub import snapshot_download
13
+ from transformers import (
14
+ AutoModelForCausalLM,
15
+ AutoTokenizer,
16
+ PreTrainedTokenizerFast,
17
+ TrainingArguments,
18
+ Trainer,
19
+ TrainerCallback,
20
+ default_data_collator,
21
+ set_seed,
22
+ )
23
+ from transformers.trainer_utils import get_last_checkpoint
24
+ from peft import (
25
+ LoraConfig,
26
+ get_peft_model,
27
+ prepare_model_for_kbit_training,
28
+ PeftModel,
29
+ )
30
+
31
+ try:
32
+ from transformers import BitsAndBytesConfig
33
+ except ImportError: # older transformers
34
+ BitsAndBytesConfig = None
35
+
36
+
37
+ # --------------------------
38
+ # Helpers
39
+ # --------------------------
40
+
41
+ def _dtype_from_str(s: str) -> torch.dtype:
42
+ s = (s or "").lower()
43
+ if s in ("float16", "fp16"):
44
+ return torch.float16
45
+ if s in ("bfloat16", "bf16"):
46
+ return torch.bfloat16
47
+ if s in ("float32", "fp32"):
48
+ return torch.float32
49
+ raise ValueError(f"Unknown torch_dtype: {s}")
50
+
51
+ def _now_iso() -> str:
52
+ return time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime())
53
+
54
+ def _safe_exp(x: float) -> float:
55
+ x = min(float(x), 50.0)
56
+ return float(math.exp(x))
57
+
58
+ def _ensure_dir(p: Path) -> Path:
59
+ p.mkdir(parents=True, exist_ok=True)
60
+ return p
61
+
62
+ def _looks_like_model_dir(p: Path) -> bool:
63
+ if not p.exists() or not p.is_dir():
64
+ return False
65
+ if (p / "config.json").exists():
66
+ return True
67
+ if any(p.glob("*.safetensors")) or any(p.glob("pytorch_model*.bin")):
68
+ return True
69
+ return False
70
+
71
+ def _detect_text_field(example: Dict[str, Any]) -> Optional[str]:
72
+ for k, v in example.items():
73
+ if isinstance(v, str) and v.strip():
74
+ return k
75
+ return None
76
+
77
+ def _load_tokenizer(base_dir: Path, use_fast: bool, trust_remote_code: bool):
78
+ try:
79
+ return AutoTokenizer.from_pretrained(
80
+ str(base_dir),
81
+ use_fast=use_fast,
82
+ trust_remote_code=trust_remote_code,
83
+ )
84
+ except ValueError as e:
85
+ if "TokenizersBackend" not in str(e):
86
+ raise
87
+ tok_file = base_dir / "tokenizer.json"
88
+ tok_cfg_path = base_dir / "tokenizer_config.json"
89
+ if not tok_file.exists():
90
+ raise
91
+
92
+ tok_kwargs: Dict[str, Any] = {}
93
+ if tok_cfg_path.exists():
94
+ with tok_cfg_path.open("r", encoding="utf-8") as f:
95
+ tok_cfg = json.load(f)
96
+ for key in ("bos_token", "eos_token", "pad_token", "unk_token", "model_max_length"):
97
+ if tok_cfg.get(key) is not None:
98
+ tok_kwargs[key] = tok_cfg[key]
99
+ extra = tok_cfg.get("additional_special_tokens") or tok_cfg.get("extra_special_tokens")
100
+ if extra:
101
+ tok_kwargs["additional_special_tokens"] = extra
102
+
103
+ return PreTrainedTokenizerFast(tokenizer_file=str(tok_file), **tok_kwargs)
104
+
105
+ def _infer_target_modules(model) -> List[str]:
106
+ names = set()
107
+ for n, _ in model.named_modules():
108
+ names.add(n.split(".")[-1])
109
+
110
+ for group in [
111
+ ["q_proj", "k_proj", "v_proj", "o_proj"],
112
+ ["Wqkv", "out_proj"],
113
+ ["query_key_value", "dense"],
114
+ ["c_attn", "c_proj"],
115
+ ]:
116
+ if all(x in names for x in group):
117
+ return group
118
+
119
+ fallback = [x for x in ["q_proj", "k_proj", "v_proj", "o_proj", "c_attn", "c_proj", "out_proj", "dense"] if x in names]
120
+ if fallback:
121
+ return fallback
122
+
123
+ raise ValueError("Could not auto-infer target_modules. Set peft.target_modules explicitly.")
124
+
125
+ def _choose_attn_impl(cfg: Dict[str, Any]) -> Optional[str]:
126
+ return cfg.get("model", {}).get("attn_implementation", None)
127
+
128
+
129
+ # --------------------------
130
+ # JSONL Logger Callback
131
+ # --------------------------
132
+
133
+ class JsonlLoggerCallback(TrainerCallback):
134
+ def __init__(self, run_dir: Path):
135
+ self.run_dir = run_dir
136
+ self.train_log_path = _ensure_dir(run_dir / "logs") / "train.jsonl"
137
+ self.eval_log_path = _ensure_dir(run_dir / "logs") / "eval.jsonl"
138
+ self.start_time = None
139
+
140
+ def _eta(self, global_step: int, max_steps: int) -> Optional[str]:
141
+ if self.start_time is None or global_step <= 0 or max_steps <= 0:
142
+ return None
143
+ elapsed = time.time() - self.start_time
144
+ sec_per_step = elapsed / global_step
145
+ remaining = max(0, max_steps - global_step) * sec_per_step
146
+ h = int(remaining // 3600)
147
+ m = int((remaining % 3600) // 60)
148
+ s = int(remaining % 60)
149
+ return f"{h:02d}:{m:02d}:{s:02d}"
150
+
151
+ def on_train_begin(self, args, state, control, **kwargs):
152
+ self.start_time = time.time()
153
+
154
+ def on_log(self, args, state, control, logs=None, **kwargs):
155
+ if not logs:
156
+ return
157
+
158
+ max_steps = int(state.max_steps) if getattr(state, "max_steps", None) else 0
159
+ progress_pct = (100.0 * state.global_step / max_steps) if max_steps > 0 else None
160
+ epoch_pct = None
161
+ if state.epoch is not None and args.num_train_epochs and args.num_train_epochs > 0:
162
+ epoch_pct = 100.0 * (float(state.epoch) / float(args.num_train_epochs))
163
+
164
+ payload = {
165
+ "ts": _now_iso(),
166
+ "event": "train_log",
167
+ "step": int(state.global_step),
168
+ "epoch": round(float(state.epoch), 4) if state.epoch is not None else None,
169
+ "progress_pct": round(progress_pct, 2) if progress_pct is not None else None,
170
+ "epoch_pct": round(epoch_pct, 2) if epoch_pct is not None else None,
171
+ "eta": self._eta(int(state.global_step), max_steps),
172
+ "max_grad_norm": getattr(args, "max_grad_norm", None),
173
+ **logs,
174
+ }
175
+
176
+ with self.train_log_path.open("a", encoding="utf-8") as f:
177
+ f.write(json.dumps(payload, ensure_ascii=False) + "\n")
178
+
179
+ def on_evaluate(self, args, state, control, metrics=None, **kwargs):
180
+ if not metrics:
181
+ return
182
+ eval_loss = metrics.get("eval_loss", None)
183
+ ppl = _safe_exp(eval_loss) if eval_loss is not None else None
184
+
185
+ payload = {
186
+ "ts": _now_iso(),
187
+ "event": "eval",
188
+ "step": int(state.global_step),
189
+ "epoch": float(state.epoch) if state.epoch is not None else None,
190
+ **metrics,
191
+ "perplexity": ppl,
192
+ }
193
+ with self.eval_log_path.open("a", encoding="utf-8") as f:
194
+ f.write(json.dumps(payload, ensure_ascii=False) + "\n")
195
+
196
+
197
+ # --------------------------
198
+ # Data Pipeline (EOS + Packing)
199
+ # --------------------------
200
+
201
+ def build_datasets(cfg: Dict[str, Any], tokenizer) -> Tuple[Any, Any]:
202
+ data_cfg = cfg["data"]
203
+ train_path = data_cfg["train_jsonl"]
204
+ eval_path = data_cfg.get("eval_jsonl", None)
205
+ split_ratio = float(data_cfg.get("eval_split_ratio", 0.0))
206
+ text_field = data_cfg.get("text_field", "text")
207
+ block_size = int(data_cfg.get("block_size", 2048))
208
+ shuffle = bool(data_cfg.get("shuffle", True))
209
+ num_proc = int(data_cfg.get("num_proc", 4))
210
+
211
+ pack_mode = str(data_cfg.get("pack_mode", "drop")).lower().strip()
212
+ if pack_mode not in ("drop", "pad"):
213
+ raise ValueError(f"data.pack_mode must be 'drop' or 'pad', got: {pack_mode}")
214
+
215
+ eos_id = tokenizer.eos_token_id
216
+ if eos_id is None:
217
+ raise ValueError("Tokenizer has no eos_token_id; CPT packing needs an EOS delimiter.")
218
+
219
+ if tokenizer.pad_token_id is None:
220
+ # safe default for many causal LMs
221
+ tokenizer.pad_token = tokenizer.eos_token
222
+ pad_id = tokenizer.pad_token_id
223
+
224
+ ds = load_dataset("json", data_files={"train": train_path})
225
+
226
+ if eval_path:
227
+ ds_eval = load_dataset("json", data_files={"eval": eval_path})
228
+ dsd = DatasetDict({"train": ds["train"], "eval": ds_eval["eval"]})
229
+ else:
230
+ if 0.0 < split_ratio < 1.0:
231
+ split = ds["train"].train_test_split(test_size=split_ratio, seed=int(cfg["run"].get("seed", 42)))
232
+ dsd = DatasetDict({"train": split["train"], "eval": split["test"]})
233
+ else:
234
+ dsd = DatasetDict({"train": ds["train"], "eval": None})
235
+
236
+ if text_field not in dsd["train"].column_names:
237
+ auto_field = _detect_text_field(dsd["train"][0])
238
+ if not auto_field:
239
+ raise ValueError(f"Could not find text field. Columns: {dsd['train'].column_names}")
240
+ text_field = auto_field
241
+
242
+ def tokenize_fn(examples):
243
+ out = tokenizer(
244
+ examples[text_field],
245
+ add_special_tokens=False,
246
+ truncation=False,
247
+ padding=False,
248
+ )
249
+ if "token_type_ids" in out:
250
+ del out["token_type_ids"]
251
+ # Add EOS between docs
252
+ out["input_ids"] = [ids + [eos_id] for ids in out["input_ids"]]
253
+ out["attention_mask"] = [m + [1] for m in out["attention_mask"]]
254
+ return out
255
+
256
+ tokenized_train = dsd["train"].map(
257
+ tokenize_fn,
258
+ batched=True,
259
+ num_proc=num_proc,
260
+ remove_columns=dsd["train"].column_names,
261
+ desc="Tokenizing train",
262
+ )
263
+
264
+ tokenized_eval = None
265
+ if dsd["eval"] is not None:
266
+ tokenized_eval = dsd["eval"].map(
267
+ tokenize_fn,
268
+ batched=True,
269
+ num_proc=num_proc,
270
+ remove_columns=dsd["eval"].column_names,
271
+ desc="Tokenizing eval",
272
+ )
273
+
274
+ def group_texts(examples):
275
+ concatenated = {k: sum(examples[k], []) for k in examples.keys()}
276
+ total_length = len(concatenated["input_ids"])
277
+
278
+ if total_length == 0:
279
+ return {"input_ids": [], "attention_mask": [], "labels": []}
280
+
281
+ full_len = (total_length // block_size) * block_size
282
+ blocks_input, blocks_attn, blocks_labels = [], [], []
283
+
284
+ # full blocks
285
+ for i in range(0, full_len, block_size):
286
+ chunk = concatenated["input_ids"][i:i + block_size]
287
+ attn = concatenated["attention_mask"][i:i + block_size]
288
+ blocks_input.append(chunk)
289
+ blocks_attn.append(attn)
290
+ blocks_labels.append(chunk.copy())
291
+
292
+ # remainder
293
+ remainder = total_length - full_len
294
+ if remainder > 0 and pack_mode == "pad":
295
+ chunk = concatenated["input_ids"][full_len:full_len + remainder]
296
+ attn = concatenated["attention_mask"][full_len:full_len + remainder]
297
+
298
+ pad_len = block_size - remainder
299
+ chunk_padded = chunk + [pad_id] * pad_len
300
+ attn_padded = attn + [0] * pad_len
301
+
302
+ labels = chunk_padded.copy()
303
+ labels[-pad_len:] = [-100] * pad_len # loss mask
304
+
305
+ blocks_input.append(chunk_padded)
306
+ blocks_attn.append(attn_padded)
307
+ blocks_labels.append(labels)
308
+
309
+ return {
310
+ "input_ids": blocks_input,
311
+ "attention_mask": blocks_attn,
312
+ "labels": blocks_labels,
313
+ }
314
+
315
+ tokenized_train = tokenized_train.map(
316
+ group_texts,
317
+ batched=True,
318
+ num_proc=num_proc,
319
+ desc=f"Packing train blocks (mode={pack_mode})",
320
+ )
321
+ if tokenized_eval is not None:
322
+ tokenized_eval = tokenized_eval.map(
323
+ group_texts,
324
+ batched=True,
325
+ num_proc=num_proc,
326
+ desc=f"Packing eval blocks (mode={pack_mode})",
327
+ )
328
+
329
+ if len(tokenized_train) == 0:
330
+ raise ValueError(
331
+ "Train dataset is empty after packing. "
332
+ "Either increase data, reduce block_size, or set data.pack_mode='pad'."
333
+ )
334
+
335
+ if shuffle:
336
+ tokenized_train = tokenized_train.shuffle(seed=int(cfg["run"].get("seed", 42)))
337
+
338
+ return tokenized_train, tokenized_eval
339
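The drop/pad packing rule in `group_texts` can be sketched in isolation: concatenate the token stream, cut full `block_size` blocks, and either drop the remainder or pad it with the loss masked out. This is a minimal standalone sketch (hypothetical `pack_blocks` helper, not part of the kit) mirroring the logic above:

```python
def pack_blocks(ids, block_size, pack_mode="drop", pad_id=0):
    # Cut the concatenated token stream into fixed-size blocks.
    full_len = (len(ids) // block_size) * block_size
    blocks = [ids[i:i + block_size] for i in range(0, full_len, block_size)]
    labels = [b[:] for b in blocks]  # labels == input_ids for full blocks
    remainder = ids[full_len:]
    if remainder and pack_mode == "pad":
        pad_len = block_size - len(remainder)
        blocks.append(remainder + [pad_id] * pad_len)
        labels.append(remainder + [-100] * pad_len)  # -100 masks loss on padding
    return blocks, labels

# 10 tokens, block_size 4: "drop" yields 2 blocks, "pad" keeps the 2-token tail
blocks, labels = pack_blocks(list(range(10)), 4, pack_mode="pad")
```

Note that `pad` mode preserves every token at the cost of one partially padded block per map batch, while `drop` discards up to `block_size - 1` tokens per batch.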
+
340
+
341
+ # --------------------------
342
+ # Model Loading + PEFT
343
+ # --------------------------
344
+
345
+ def _select_model_loader(base_dir: Path):
346
+ cfg_path = base_dir / "config.json"
347
+ if not cfg_path.exists():
348
+ return {"kind": "causal", "arch": None}
349
+ with cfg_path.open("r", encoding="utf-8") as f:
350
+ cfg = json.load(f)
351
+ arch = cfg.get("architectures") or []
352
+ arch_name = arch[0] if arch else None
353
+ if any("ForConditionalGeneration" in a for a in arch):
354
+ return {"kind": "conditional", "arch": arch_name}
355
+ return {"kind": "causal", "arch": arch_name}
356
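The loader-selection rule above reduces to a single check on `config.json`: if any listed architecture name contains `ForConditionalGeneration`, use that class, otherwise fall back to `AutoModelForCausalLM`. A minimal sketch of the same classification (hypothetical `classify_arch` helper operating on an already-parsed config dict):

```python
def classify_arch(config: dict) -> str:
    # Mirrors _select_model_loader: the "architectures" field decides the kind.
    archs = config.get("architectures") or []
    if any("ForConditionalGeneration" in a for a in archs):
        return "conditional"
    return "causal"  # default when the field is absent or unrecognized
```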
+
357
+ def _resolve_model_class(arch_name: str):
358
+ import transformers
359
+ cls = getattr(transformers, arch_name, None)
360
+ if cls is None:
361
+ raise ValueError(f"Model class '{arch_name}' is not available in installed transformers.")
362
+ return cls
363
+
364
+
365
+ def load_base_model_and_tokenizer(cfg: Dict[str, Any], base_dir: Path):
366
+ model_cfg = cfg["model"]
367
+ trust_remote_code = bool(model_cfg.get("trust_remote_code", True))
368
+ use_fast = bool(model_cfg.get("tokenizer_use_fast", True))
369
+ device_map = model_cfg.get("device_map", "auto")
370
+
371
+ tokenizer = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
372
+ if tokenizer.pad_token is None:
373
+ tokenizer.pad_token = tokenizer.eos_token
374
+
375
+ torch_dtype = _dtype_from_str(model_cfg.get("torch_dtype", "bfloat16"))
376
+ use_4bit = bool(model_cfg.get("use_4bit", False))
377
+
378
+ quant_cfg = None
379
+ if use_4bit:
380
+ if BitsAndBytesConfig is None:
381
+ raise ImportError("BitsAndBytesConfig is not available in this transformers version.")
382
+ quant_cfg = BitsAndBytesConfig(
383
+ load_in_4bit=True,
384
+ bnb_4bit_quant_type=str(model_cfg.get("bnb_4bit_quant_type", "nf4")),
385
+ bnb_4bit_use_double_quant=bool(model_cfg.get("bnb_4bit_use_double_quant", True)),
386
+ bnb_4bit_compute_dtype=_dtype_from_str(model_cfg.get("bnb_4bit_compute_dtype", "bfloat16")),
387
+ )
388
+
389
+ attn_impl = _choose_attn_impl(cfg)
390
+ model_meta = _select_model_loader(base_dir)
391
+
392
+ try:
393
+ if model_meta["kind"] == "conditional":
394
+ model_cls = _resolve_model_class(model_meta["arch"]) if model_meta["arch"] else None
395
+ if model_cls is None:
396
+ raise ValueError("Conditional model architecture not specified in config.json.")
397
+ model = model_cls.from_pretrained(
398
+ str(base_dir),
399
+ device_map=device_map,
400
+ trust_remote_code=trust_remote_code,
401
+ low_cpu_mem_usage=True,
402
+ torch_dtype=(torch_dtype if not use_4bit else None),
403
+ quantization_config=quant_cfg,
404
+ attn_implementation=attn_impl,
405
+ )
406
+ else:
407
+ model = AutoModelForCausalLM.from_pretrained(
408
+ str(base_dir),
409
+ device_map=device_map,
410
+ trust_remote_code=trust_remote_code,
411
+ low_cpu_mem_usage=True,
412
+ torch_dtype=(torch_dtype if not use_4bit else None),
413
+ quantization_config=quant_cfg,
414
+ attn_implementation=attn_impl,
415
+ )
416
+ except Exception as e:
417
+ if attn_impl is not None:
418
+ print(f"[warn] attn_implementation='{attn_impl}' failed: {e}")
419
+ print("[warn] Falling back to default attention implementation.")
420
+ if model_meta["kind"] == "conditional":
421
+ model_cls = _resolve_model_class(model_meta["arch"]) if model_meta["arch"] else None
422
+ if model_cls is None:
423
+ raise ValueError("Conditional model architecture not specified in config.json.")
424
+ model = model_cls.from_pretrained(
425
+ str(base_dir),
426
+ device_map=device_map,
427
+ trust_remote_code=trust_remote_code,
428
+ low_cpu_mem_usage=True,
429
+ torch_dtype=(torch_dtype if not use_4bit else None),
430
+ quantization_config=quant_cfg,
431
+ )
432
+ else:
433
+ model = AutoModelForCausalLM.from_pretrained(
434
+ str(base_dir),
435
+ device_map=device_map,
436
+ trust_remote_code=trust_remote_code,
437
+ low_cpu_mem_usage=True,
438
+ torch_dtype=(torch_dtype if not use_4bit else None),
439
+ quantization_config=quant_cfg,
440
+ )
441
+
442
+ return model, tokenizer
443
+
444
+
445
+ def apply_peft(cfg: Dict[str, Any], model):
446
+ peft_cfg = cfg["peft"]
447
+ model_cfg = cfg["model"]
448
+ tr_cfg = cfg["train"]
449
+
450
+ if not bool(peft_cfg.get("enabled", True)):
451
+ return model, None
452
+
453
+ use_4bit = bool(model_cfg.get("use_4bit", False))
454
+ gradient_checkpointing = bool(tr_cfg.get("gradient_checkpointing", True))
455
+
456
+ if gradient_checkpointing and hasattr(model, "gradient_checkpointing_enable"):
457
+ model.gradient_checkpointing_enable()
458
+ if hasattr(model, "config"):
459
+ model.config.use_cache = False
460
+
461
+ if use_4bit:
462
+ model = prepare_model_for_kbit_training(
463
+ model,
464
+ use_gradient_checkpointing=gradient_checkpointing,
465
+ )
466
+
467
+ target_modules = peft_cfg.get("target_modules", "auto")
468
+ if target_modules == "auto":
469
+ target_modules = _infer_target_modules(model)
470
+
471
+ lora_config = LoraConfig(
472
+ r=int(peft_cfg.get("r", 16)),
473
+ lora_alpha=int(peft_cfg.get("lora_alpha", 32)),
474
+ lora_dropout=float(peft_cfg.get("lora_dropout", 0.05)),
475
+ bias=str(peft_cfg.get("bias", "none")),
476
+ task_type="CAUSAL_LM",
477
+ target_modules=target_modules,
478
+ )
479
+ model = get_peft_model(model, lora_config)
480
+ return model, lora_config
481
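The cost of the LoRA config above is easy to estimate: each targeted `d_out x d_in` linear layer gains an `A` matrix of shape `r x d_in` and a `B` matrix of shape `d_out x r`, i.e. `r * (d_in + d_out)` trainable parameters. A small arithmetic sketch (illustrative shapes, not taken from any specific model):

```python
def lora_params(shapes, r):
    # Trainable parameters added by LoRA over the given (d_out, d_in) linears.
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

# e.g. four hypothetical 4096x4096 attention projections at the default r=16
added = lora_params([(4096, 4096)] * 4, r=16)
```

This is why `r` (here defaulting to 16) trades adapter capacity directly against memory: doubling `r` doubles the trainable-parameter count.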
+
482
+
483
+ # --------------------------
484
+ # Merge Logic
485
+ # --------------------------
486
+
487
+ def merge_adapter(cfg: Dict[str, Any], base_dir: Path, adapter_dir: Path, final_dir: Path):
488
+ print(f"--- Merge: {adapter_dir} + {base_dir} -> {final_dir} ---")
489
+
490
+ model_cfg = cfg["model"]
491
+ merge_cfg = cfg.get("merge", {})
492
+ trust_remote_code = bool(model_cfg.get("trust_remote_code", True))
493
+ use_fast = bool(model_cfg.get("tokenizer_use_fast", True))
494
+
495
+ merged_dtype = _dtype_from_str(merge_cfg.get("merged_dtype", "float16"))
496
+ max_shard_size = str(merge_cfg.get("max_shard_size", "2GB"))
497
+
498
+ model_meta = _select_model_loader(base_dir)
499
+ if model_meta["kind"] == "conditional":
500
+ base_cls = _resolve_model_class(model_meta["arch"]) if model_meta["arch"] else None
501
+ if base_cls is None:
502
+ raise ValueError("Conditional model architecture not specified in config.json.")
503
+ base = base_cls.from_pretrained(
504
+ str(base_dir),
505
+ torch_dtype=merged_dtype,
506
+ device_map="cpu",
507
+ low_cpu_mem_usage=True,
508
+ trust_remote_code=trust_remote_code,
509
+ )
510
+ else:
511
+ base = AutoModelForCausalLM.from_pretrained(
512
+ str(base_dir),
513
+ torch_dtype=merged_dtype,
514
+ device_map="cpu",
515
+ low_cpu_mem_usage=True,
516
+ trust_remote_code=trust_remote_code,
517
+ )
518
+
519
+ merged = PeftModel.from_pretrained(base, str(adapter_dir))
520
+ merged = merged.merge_and_unload()
521
+
522
+ _ensure_dir(final_dir)
523
+ # Fix for transformers weight conversion bug with quantized models
524
+ # Clear weight conversions to avoid NotImplementedError in reverse_transform
525
+ if hasattr(merged, '_weight_conversions'):
526
+ merged._weight_conversions = []
527
+ merged.save_pretrained(
528
+ str(final_dir),
529
+ safe_serialization=True,
530
+ max_shard_size=max_shard_size
531
+ )
532
+
533
+ tok = _load_tokenizer(base_dir, use_fast=use_fast, trust_remote_code=trust_remote_code)
534
+ if tok.pad_token is None:
535
+ tok.pad_token = tok.eos_token
536
+ tok.save_pretrained(str(final_dir))
537
+
538
+ print("--- Merge complete ---")
539
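What `merge_and_unload()` computes per targeted layer is `W' = W + (alpha / r) * B @ A`. A toy sketch of that update with plain Python lists (illustrative only; PEFT does this in-place on the real weight tensors):

```python
def merge_lora(W, A, B, alpha, r):
    # W' = W + (alpha / r) * B @ A, with B: d_out x r and A: r x d_in.
    s = alpha / r
    rows, cols, k = len(W), len(W[0]), len(A)
    return [[W[i][j] + s * sum(B[i][t] * A[t][j] for t in range(k))
             for j in range(cols)] for i in range(rows)]

# rank-1 example: identity base weight plus a scaled outer product
merged = merge_lora([[1, 0], [0, 1]], A=[[1, 1]], B=[[1], [1]], alpha=2, r=1)
```

After the merge the adapter matrices are folded into the base weights, which is why the exported `final_model` needs no PEFT at inference time.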
+
540
+
541
+ # --------------------------
542
+ # Main
543
+ # --------------------------
544
+
545
+ def main():
546
+ ap = argparse.ArgumentParser()
547
+ ap.add_argument("--config", required=True, help="Path to YAML config")
548
+ ap.add_argument("--merge-only", action="store_true", help="Skip training, just merge adapter")
549
+ args = ap.parse_args()
550
+
551
+ with open(args.config, "r", encoding="utf-8") as f:
552
+ cfg = yaml.safe_load(f)
553
+
554
+ run_dir = _ensure_dir(Path(cfg["run"]["run_dir"]))
555
+ _ensure_dir(run_dir / "logs")
556
+
557
+ with (run_dir / "config_resolved.yaml").open("w", encoding="utf-8") as f:
558
+ yaml.safe_dump(cfg, f, sort_keys=False)
559
+
560
+ model_cfg = cfg["model"]
561
+ repo_id = str(model_cfg["repo_id"]).strip()
562
+ repo_path = Path(repo_id)
563
+
564
+ # ✅ Local model path -> load directly; no download
565
+ if repo_path.exists() and repo_path.is_dir():
566
+ base_dir = repo_path
567
+ if not _looks_like_model_dir(base_dir):
568
+ raise ValueError(f"model.repo_id points to a directory, but it doesn't look like a HF model dir: {base_dir}")
569
+ else:
570
+ # HF repo_id -> download into run_dir/base_local_dir
571
+ base_dir = _ensure_dir(run_dir / model_cfg.get("base_local_dir", "base_model"))
572
+ if not _looks_like_model_dir(base_dir):
573
+ print(f"Base model not found at {base_dir}, downloading from {repo_id} ...")
574
+ snapshot_download(
575
+ repo_id=repo_id,
576
+ revision=model_cfg.get("revision", None),
577
+ local_dir=str(base_dir),
578
+ local_dir_use_symlinks=False,
579
+ )
580
+
581
+ ckpt_dir = _ensure_dir(run_dir / "checkpoints")
582
+ best_adapter_dir = _ensure_dir(run_dir / "best_adapter")
583
+
584
+ merge_cfg = cfg.get("merge", {}) or {}
585
+ if merge_cfg.get("output_dir"):
586
+ od = Path(str(merge_cfg["output_dir"]))
587
+ final_dir = od if od.is_absolute() else (run_dir / od)
588
+ else:
589
+ final_dir = run_dir / "final_model"
590
+
591
+ # Merge-only
592
+ if args.merge_only:
593
+ if not _looks_like_model_dir(best_adapter_dir):
594
+ raise FileNotFoundError(f"Adapter not found at {best_adapter_dir}")
595
+ merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
596
+ return
597
+
598
+ # Training
599
+ set_seed(int(cfg["run"].get("seed", 42)))
600
+
601
+ model, tokenizer = load_base_model_and_tokenizer(cfg, base_dir)
602
+ model, _ = apply_peft(cfg, model)
603
+
604
+ train_ds, eval_ds = build_datasets(cfg, tokenizer)
605
+
606
+ tr_cfg = cfg["train"]
607
+
608
+ dtype = _dtype_from_str(model_cfg.get("torch_dtype", "bfloat16"))
609
+ use_fp16 = (dtype == torch.float16)
610
+ use_bf16 = (dtype == torch.bfloat16)
611
+
612
+ max_steps = int(tr_cfg.get("max_steps", 0))
613
+ num_train_epochs = float(tr_cfg.get("num_train_epochs", 1))
614
+
615
+ # --- Dynamic evaluation strategy parameter handling ---
616
+ ta_params = inspect.signature(TrainingArguments.__init__).parameters
617
+ eval_key = "eval_strategy" if "eval_strategy" in ta_params else "evaluation_strategy"
618
+
619
+ desired_ta_kwargs = dict(
620
+ output_dir=str(ckpt_dir),
621
+ max_steps=max_steps if max_steps > 0 else -1,
622
+ num_train_epochs=num_train_epochs,
623
+
624
+ per_device_train_batch_size=int(tr_cfg.get("per_device_train_batch_size", 1)),
625
+ per_device_eval_batch_size=int(tr_cfg.get("per_device_eval_batch_size", tr_cfg.get("per_device_train_batch_size", 1))),
626
+ gradient_accumulation_steps=int(tr_cfg.get("gradient_accumulation_steps", 1)),
627
+
628
+ learning_rate=float(tr_cfg.get("learning_rate", 2e-5)),
629
+ weight_decay=float(tr_cfg.get("weight_decay", 0.0)),
630
+ warmup_ratio=float(tr_cfg.get("warmup_ratio", 0.0)),
631
+ lr_scheduler_type=str(tr_cfg.get("lr_scheduler_type", "cosine")),
632
+
633
+ optim=str(tr_cfg.get("optim", "paged_adamw_8bit" if bool(model_cfg.get("use_4bit", False)) else "adamw_torch")),
634
+ max_grad_norm=float(tr_cfg.get("max_grad_norm", 1.0)),
635
+
636
+ logging_steps=int(tr_cfg.get("logging_steps", 10)),
637
+
638
+ save_strategy=str(tr_cfg.get("save_strategy", "steps")),
639
+ save_steps=int(tr_cfg.get("save_steps", 200)),
640
+ save_total_limit=int(tr_cfg.get("save_total_limit", 3)),
641
+
642
+ eval_steps=int(tr_cfg.get("eval_steps", 200)),
643
+
644
+ load_best_model_at_end=bool(tr_cfg.get("load_best_model_at_end", True)) if eval_ds is not None else False,
645
+ metric_for_best_model="eval_loss",
646
+ greater_is_better=False,
647
+
648
+ fp16=use_fp16,
649
+ bf16=use_bf16,
650
+
651
+ report_to=[],
652
+ remove_unused_columns=False,
653
+ save_safetensors=True,
654
+ overwrite_output_dir=False,
655
+ )
656
+
657
+ # Set the correct argument name for this transformers version
658
+ desired_ta_kwargs[eval_key] = str(tr_cfg.get("evaluation_strategy", "steps" if eval_ds is not None else "no"))
659
+ ta_kwargs = {k: v for k, v in desired_ta_kwargs.items() if k in ta_params}
660
+
661
+ training_args = TrainingArguments(**ta_kwargs)
662
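The version-compatibility trick used for both `TrainingArguments` and `Trainer` is the same: build the full kwargs dict, then keep only the keys the installed signature actually accepts. A minimal standalone sketch of that shim (hypothetical `filter_kwargs` helper and `demo` function):

```python
import inspect

def filter_kwargs(fn, kwargs):
    # Keep only kwargs that fn's signature actually accepts.
    params = inspect.signature(fn).parameters
    return {k: v for k, v in kwargs.items() if k in params}

def demo(a, b=2):
    return a + b

# "c" is silently dropped instead of raising TypeError at call time
result = demo(**filter_kwargs(demo, {"a": 1, "b": 3, "c": 9}))
```

The trade-off: unknown keys are dropped silently, so a typo in the YAML config is ignored rather than reported.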
+
663
+ trainer_params = inspect.signature(Trainer.__init__).parameters
664
+ desired_trainer_kwargs = dict(
665
+ model=model,
666
+ args=training_args,
667
+ train_dataset=train_ds,
668
+ eval_dataset=eval_ds,
669
+ tokenizer=tokenizer,
670
+ processing_class=tokenizer,
671
+ data_collator=default_data_collator,
672
+ callbacks=[JsonlLoggerCallback(run_dir)],
673
+ )
674
+ trainer_kwargs = {k: v for k, v in desired_trainer_kwargs.items() if k in trainer_params}
+ # Newer transformers accepts both names but rejects passing them together;
+ # prefer processing_class when the installed version supports it.
+ if "processing_class" in trainer_kwargs:
+     trainer_kwargs.pop("tokenizer", None)
675
+ trainer = Trainer(**trainer_kwargs)
676
+
677
+ # Resume
678
+ resume_from = tr_cfg.get("resume_from_checkpoint", None)
679
+ if resume_from == "auto":
680
+ last = get_last_checkpoint(str(ckpt_dir))
681
+ resume_from = last if last else None
682
+ if resume_from:
683
+ print(f"Resuming from {resume_from}")
684
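`resume_from_checkpoint: "auto"` relies on `get_last_checkpoint` picking the highest-numbered `checkpoint-N` subdirectory. A sketch of equivalent selection logic (hypothetical `last_checkpoint` helper, not the transformers implementation):

```python
import os
import re

def last_checkpoint(ckpt_dir):
    # Return the path of the highest-numbered "checkpoint-N" subdirectory, or None.
    pat = re.compile(r"^checkpoint-(\d+)$")
    best, best_step = None, -1
    for name in os.listdir(ckpt_dir):
        m = pat.match(name)
        if m and os.path.isdir(os.path.join(ckpt_dir, name)):
            step = int(m.group(1))
            if step > best_step:
                best, best_step = os.path.join(ckpt_dir, name), step
    return best
```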
+
685
+ print("Starting training...")
686
+ trainer.train(resume_from_checkpoint=resume_from)
687
+
688
+ trainer.save_model(str(best_adapter_dir))
689
+ print(f"Saved best adapter -> {best_adapter_dir}")
690
+
691
+ if eval_ds is not None:
692
+ metrics = trainer.evaluate()
693
+ eval_loss = metrics.get("eval_loss", None)
694
+ metrics["perplexity"] = _safe_exp(eval_loss) if eval_loss is not None else None
695
+ with (run_dir / "eval_final.json").open("w", encoding="utf-8") as f:
696
+ json.dump(metrics, f, indent=2)
697
+ print(f"Final eval_loss={eval_loss}, ppl={metrics['perplexity']}")
698
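The perplexity written to `eval_final.json` is just `exp(eval_loss)` with the exponent capped (that is what `_safe_exp` does) so a diverged loss cannot overflow to `inf`. A minimal sketch of that computation:

```python
import math

def perplexity(eval_loss, cap=50.0):
    # ppl = exp(mean negative log-likelihood); cap the exponent to avoid overflow.
    return math.exp(min(float(eval_loss), cap))
```

For example, an eval loss of `ln(8) ≈ 2.079` corresponds to a perplexity of 8, i.e. the model is on average as uncertain as a uniform choice over 8 tokens.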
+
699
+ if bool(cfg.get("merge", {}).get("enabled", False)):
700
+ del trainer, model
701
+ torch.cuda.empty_cache()
702
+ merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
703
+ else:
704
+ print("Merge disabled. Run with --merge-only later if needed.")
705
+
706
+
707
+ if __name__ == "__main__":
708
+ main()
trainer-kit/SFT-14b/.DS_Store ADDED
Binary file (6.15 kB). View file
 
trainer-kit/SFT-14b/config_instruct.yaml ADDED
@@ -0,0 +1,144 @@
1
+ run:
2
+ run_dir: "./runs/instruct_run_14b_v1"
3
+ seed: 42
4
+
5
+ # WandB integration for experiment tracking
6
+ wandb:
7
+ enabled: true # Set to true to enable wandb logging
8
+ project: "sft-training" # WandB project name
9
+ entity: null # WandB entity/team (optional)
10
+ name: null # Run name (optional, will auto-generate if null)
11
+ tags: ["sft-lora", "instruction-tuning"] # List of tags for the run (e.g., ["lora", "qlora", "experiment-1"])
12
+ notes: null # Run description/notes (optional)
13
+
14
+ model:
15
+ # Local path to the CPT-merged Qwen2.5-Coder-14B model (a HF repo_id also works here)
16
+ repo_id: "./runs/cpt_run_14b/merged_14b_cpt_lora"
17
+ revision: null
18
+
19
+ # Used only when repo_id is a HF repo (not a local path)
20
+ base_local_dir: "base_model"
21
+
22
+ trust_remote_code: true
23
+ tokenizer_use_fast: true
24
+ device_map: "auto"
25
+
26
+ torch_dtype: "bfloat16" # "float16" | "bfloat16" | "float32"
27
+
28
+ # QLoRA
29
+ use_4bit: false
30
+ bnb_4bit_quant_type: "nf4"
31
+ bnb_4bit_use_double_quant: false
32
+ bnb_4bit_compute_dtype: "bfloat16"
33
+
34
+ # optional: "flash_attention_2" | "sdpa" | null
35
+ attn_implementation: null
36
+
37
+ data:
38
+ train_jsonl: "sft_dataset.jsonl"
39
+ eval_jsonl: null
40
+ eval_split_ratio: 0.1
41
+
42
+ # Field names in your JSONL data
43
+ instruction_field: "instruction" # This will be the system prompt
44
+ input_field: "input" # This is the task description
45
+ output_field: "output" # This is the analysis + selection
46
+
47
+ # Formatting options
48
+ format_type: "custom" # "chatml" | "alpaca" | "custom"
49
+
50
+ # For chatml format
51
+ system_prompt: |
52
+ You are a Hyperswitch Rust code analyzer. Identify functions/structs that need modification for a given task.
53
+
54
+ ## Output Format
55
+
56
+ ##OUTPUT
57
+ Explain the data flow and why each component must change:
58
+ - Flow: [Input → Processing → Output with arrows]
59
+ - For each component: "The [ComponentName] ([path]) must [action] because [reason]—without this, [consequence]"
60
+ - Explain coupling between components
61
+
62
+ ##SELECT
63
+ modify::crates/path/to/file.rs::impl::ComponentName
64
+ add::crates/another/file.rs::function::AnotherComponent
65
+ <EOS>
66
+
67
+ ## Rules
68
+
69
+ 1. Use full paths: `remove::crates/folder/file.rs::Type::Name`
70
+ 2. Use `::` for nested items: `status::StructName::Type::Name`
71
+ 3. Always explain "must change because" and "without this"
72
+ 4. Types of components: function, struct, enum, impl, trait
73
+ 5. If there is extra information (e.g., enum variants), include that too.
74
+ 6. Start with ##OUTPUT, end with ##SELECT, terminate with <EOS>
75
+
76
+ ## Example
77
+
78
+ ##TASK
79
+ Add webhook subscription support
80
+
81
+ ##OUTPUT
82
+ The webhook system routes events via EventClass enum. Flow: webhook → EventClass → handler → processing. The EventClass enum (crates/common_enums/src/enums.rs::EventClass) must add Subscriptions variant because it defines event routing—without this, subscription events cannot be processed. The SubscriptionStatus impl (crates/common_enums/src/transformers.rs::SubscriptionStatus) must map to EventType because it converts status to events—without this, status changes don't trigger webhooks. These are coupled: EventClass routes to handlers that use SubscriptionStatus mappings.
83
+
84
+ ##SELECT
85
+ crates/common_enums/src/enums.rs::EventClass
86
+ crates/common_enums/src/transformers.rs::SubscriptionStatus
87
+ <EOS>
88
+
89
+ # For custom format (only used when format_type="custom")
90
+ custom_template: "##INSTRUCTION\n{instruction}<|im_end|>\n##TASK\n{input}<|im_end|>\n##OUTPUT\n{output}<|im_end|>"
91
+
92
+ max_length: 2048
93
+ shuffle: true
94
+ num_proc: 4
95
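When `format_type: "custom"` is used, the runner fills `custom_template` with the three configured fields. A sketch of that rendering with hypothetical example values (the template string below copies the config; the record contents are made up):

```python
# Same placeholders as the config's custom_template
template = ("##INSTRUCTION\n{instruction}<|im_end|>\n"
            "##TASK\n{input}<|im_end|>\n"
            "##OUTPUT\n{output}<|im_end|>")

example = {
    "instruction": "You are a code analyzer.",      # hypothetical record
    "input": "Add webhook support",
    "output": "##SELECT\ncrates/x.rs::Foo\n<EOS>",
}
text = template.format(**example)
```

Every segment is terminated by `<|im_end|>`, so the tokenizer sees explicit turn boundaries even without a chat template.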
+
96
+ peft:
97
+ enabled: true
98
+ r: 16
99
+ lora_alpha: 32
100
+ lora_dropout: 0.05
101
+ bias: "none"
102
+ target_modules: "auto"
103
+
104
+ train:
105
+ # max_steps: 10
106
+ num_train_epochs: 6
107
+
108
+ per_device_train_batch_size: 1
109
+ per_device_eval_batch_size: 1
110
+ gradient_accumulation_steps: 8
111
+
112
+ learning_rate: 2e-4
113
+ weight_decay: 0.0
114
+ warmup_ratio: 0.08
115
+ lr_scheduler_type: "cosine"
116
+
117
+ optim: "adamw_torch" # ✅ Changed from paged_adamw_8bit (requires use_4bit=true)
118
+ max_grad_norm: 1.0
119
+ gradient_checkpointing: true
120
+
121
+ logging_steps: 2
122
+ save_strategy: "steps"
123
+ save_steps: 500
124
+ save_total_limit: 20
125
+
126
+ evaluation_strategy: "steps"
127
+ eval_steps: 100
128
+ load_best_model_at_end: true
129
+
130
+ # Early stopping
131
+ early_stopping:
132
+ enabled: true
133
+ patience: 3 # Number of evaluations with no improvement before stopping
134
+ min_delta: 0.001 # Minimum change to qualify as improvement
135
+ metric: "eval_loss" # Metric to monitor
136
+ mode: "min" # "min" for loss, "max" for accuracy/etc.
137
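The early-stopping block above maps onto a simple rule: track the best monitored value and stop once `patience` consecutive evaluations fail to improve it by at least `min_delta`. A standalone sketch of that rule for the `mode: "min"` case (hypothetical `should_stop` helper, not the transformers callback):

```python
def should_stop(losses, patience=3, min_delta=0.001):
    # Stop when the loss hasn't improved by min_delta for `patience` evals in a row.
    best, bad = float("inf"), 0
    for loss in losses:
        if loss < best - min_delta:
            best, bad = loss, 0  # real improvement: reset the counter
        else:
            bad += 1
            if bad >= patience:
                return True
    return False
```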
+
138
+ resume_from_checkpoint: "auto"
139
+
140
+ merge:
141
+ enabled: true
142
+ merged_dtype: "float16"
143
+ max_shard_size: "2GB"
144
+ output_dir: "./merged_14b_instruct_lora"
trainer-kit/SFT-14b/instruct_data.jsonl ADDED
@@ -0,0 +1,4 @@
1
+ {"input": "##TASK\nAdd subscription support to webhook event metadata and outgoing webhook content systems to enable proper subscription lifecycle event handling and webhook delivery for subscription-related operations.", "output": "##OUTPUT\nThe webhook system requires subscription event support to handle subscription lifecycle events (creation, payment, cancellation) and deliver them to merchants via outgoing webhooks. This affects the event metadata enumeration and webhook content transformation layers. The complete flow is: subscription event occurs → EventMetadata::Subscription variant created with subscription_id, invoice_id, payment_id → stored in database → retrieved for webhook delivery → OutgoingWebhookEventContent::foreign_from converts EventMetadata to OutgoingWebhookEventContent::Subscription → StripeWebhookObject::From transforms to StripeWebhookObject::Subscriptions for Stripe compatibility. The OutgoingWebhookEventContent::foreign_from function (router/src/core/webhooks/outgoing_v2.rs::OutgoingWebhookEventContent::foreign_from) must add the Subscription variant match arm because it's the bridge from storage EventMetadata to webhook content representation—without this, subscription events cannot be converted to webhook content and would be silently dropped during webhook processing. The StripeWebhookObject::From implementation (router/src/compatibility/stripe/webhooks.rs::StripeWebhookObject::From) must handle OutgoingWebhookContent::SubscriptionDetails because it provides Stripe-compatible webhook formatting—without this, merchants using Stripe webhook endpoints would receive malformed or missing subscription events. The EventMetadata enum (implicitly referenced in foreign_from) requires the Subscription variant to exist in the diesel_models layer, which enables persistent storage of subscription event references.\nThese components form a tightly coupled webhook pipeline where each depends on the previous: EventMetadata → OutgoingWebhookEventContent → StripeWebhookObject. If foreign_from lacks the Subscription match arm, subscription events stored as EventMetadata::Subscription would cause panic during webhook delivery. If StripeWebhookObject::From lacks SubscriptionDetails handling, Stripe webhook endpoints would receive unparseable responses, breaking merchant integrations that rely on Stripe-compatible subscription webhooks for revenue recognition and customer management systems.\n\n##SELECT\ncrates/router/src/core/webhooks/outgoing_v2.rs::OutgoingWebhookEventContent::foreign_from\ncrates/router/src/compatibility/stripe/webhooks.rs::StripeWebhookObject::From\n<EOS>"}
2
+ {"input": "##TASK\nUpdate subscription response mapping to use common_enums::SubscriptionStatus instead of subscription_types::SubscriptionStatus for consistent enum usage across the codebase.", "output": "##OUTPUT\nThe subscription system needs to standardize on a single SubscriptionStatus enum type to avoid type inconsistencies and import conflicts. The current flow is: subscription handler creates SubscriptionResponse → calls to_subscription_response method → converts string status to subscription_types::SubscriptionStatus → returns response. The to_subscription_response method (crates/subscriptions/src/core/subscription_handler.rs::SubscriptionWithHandler::to_subscription_response) must change because it's the conversion point where the status enum is instantiated. Using common_enums::SubscriptionStatus ensures consistency with other parts of the system that already use this common enum type. Without this change, the subscription module would maintain its own duplicate enum, creating potential mismatches and requiring conversions throughout the codebase when interacting with other modules that expect the common enum type.\n\n##SELECT\ncrates/subscriptions/src/core/subscription_handler.rs::SubscriptionWithHandler::to_subscription_response\n<EOS>"}
3
+ {"input": "##TASK\nAdd subscription support to webhook event system by including Subscriptions in EventClass enum and implementing SubscriptionStatus to EventType mapping for proper webhook routing.", "output": "##OUTPUT\nThe webhook system needs to handle subscription-related events for proper event classification and routing. The current flow is: webhook event received → EventClass determines routing category → EventType specifies the exact event type → handlers process based on event type. The EventClass enum (crates/common_enums/src/enums.rs::EventClass) must include Subscriptions because it's the primary classification that determines which subsystem handles the event. The SubscriptionStatus transformer (crates/common_enums/src/transformers.rs::impl::Option<EventType>) must map subscription status changes to appropriate EventType values because it converts domain state changes into webhook events. Without EventClass::Subscriptions, subscription events would be misclassified or unhandled. Without the SubscriptionStatus mapping, subscription status changes wouldn't trigger appropriate webhook notifications, breaking integration with external systems that depend on subscription lifecycle events.\n\n##SELECT\ncrates/common_enums/src/enums.rs::EventClass\ncrates/common_enums/src/transformers.rs::impl::Option<EventType>\n<EOS>"}
4
+ {"input": "##TASK\nAdd subscription ID support to webhook object reference system for proper webhook payload identification and routing.", "output": "##OUTPUT\nThe webhook system needs to reference subscription entities in event payloads for proper event correlation and processing. The current flow is: webhook event generated → ObjectReferenceId identifies the affected entity → webhook payload includes reference → consumers process based on entity type. The ObjectReferenceId enum (crates/api_models/src/webhooks.rs::ObjectReferenceId) must include SubscriptionId because it's the type-safe identifier used throughout the webhook payload structure to specify which subscription triggered the event. Without SubscriptionId, webhook events related to subscriptions couldn't properly reference the subscription entity, making it impossible for consumers to correlate events with specific subscriptions. This would break webhook consumers that need to update their local state or trigger business logic based on subscription events.\n\n##SELECT\ncrates/api_models/src/webhooks.rs::ObjectReferenceId\n<EOS>"}
trainer-kit/SFT-14b/requirements.txt ADDED
@@ -0,0 +1,23 @@
1
+ # Core
2
+ torch>=2.1.0
3
+ transformers>=4.41.0
4
+ datasets>=2.18.0
5
+ accelerate>=0.30.0
6
+
7
+ # PEFT / QLoRA
8
+ peft>=0.11.1
9
+ bitsandbytes>=0.43.1
10
+
11
+ # Hugging Face Hub (local + download support)
12
+ huggingface_hub>=0.23.0
13
+
14
+ # Config + utilities
15
+ pyyaml>=6.0
16
+ tqdm>=4.66.0
17
+
18
+ # Optional but recommended (tokenizers speed)
19
+ tokenizers>=0.15.0
20
+ safetensors>=0.4.2
21
+
22
+ # Experiment tracking
23
+ wandb>=0.16.0
trainer-kit/SFT-14b/run_instruct.py ADDED
@@ -0,0 +1,844 @@
1
+ import argparse
2
+ import json
3
+ import inspect # Added for Transformers version compatibility
4
+ import math
5
+ import time
6
+ from pathlib import Path
7
+ from typing import Any, Dict, Optional, Tuple, List
8
+
9
+ import torch
10
+ import yaml
11
+ from datasets import load_dataset, DatasetDict
12
+ from huggingface_hub import snapshot_download
13
+ from transformers import (
14
+ AutoTokenizer,
15
+ AutoModelForCausalLM,
16
+ BitsAndBytesConfig,
17
+ TrainingArguments,
18
+ Trainer,
19
+ TrainerCallback,
20
+ EarlyStoppingCallback,
21
+ default_data_collator,
22
+ set_seed,
23
+ )
24
+ from transformers.trainer_utils import get_last_checkpoint
25
+ from peft import (
26
+ LoraConfig,
27
+ get_peft_model,
28
+ prepare_model_for_kbit_training,
29
+ PeftModel,
30
+ )
31
+
32
+ try:
33
+ import wandb
34
+ WANDB_AVAILABLE = True
35
+ except ImportError:
36
+ WANDB_AVAILABLE = False
37
+ wandb = None
38
+
39
+
40
+ # --------------------------
41
+ # Helpers
42
+ # --------------------------
43
+
44
+
45
+ def _dtype_from_str(s: str) -> torch.dtype:
46
+ s = (s or "").lower()
47
+ if s in ("float16", "fp16"):
48
+ return torch.float16
49
+ if s in ("bfloat16", "bf16"):
50
+ return torch.bfloat16
51
+ if s in ("float32", "fp32"):
52
+ return torch.float32
53
+ raise ValueError(f"Unknown torch_dtype: {s}")
54
+
55
+
56
+ def _now_iso() -> str:
57
+ return time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime())
58
+
59
+
60
+ def _safe_exp(x: float) -> float:
61
+ x = min(float(x), 50.0)
62
+ return float(math.exp(x))
63
+
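`_safe_exp` caps the exponent at 50 so the perplexity column in the JSONL logs stays finite even if the eval loss diverges; the same behavior in isolation:

```python
import math

def safe_exp(x: float) -> float:
    # Clamp the exponent at 50 so a diverging eval loss maps to a large
    # but finite perplexity instead of overflowing to inf.
    return math.exp(min(float(x), 50.0))
```

For any reasonable loss (well below 50) this is a plain `exp`, so logged perplexities are exact in the normal regime.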
64
+
65
+ def _ensure_dir(p: Path) -> Path:
66
+ p.mkdir(parents=True, exist_ok=True)
67
+ return p
68
+
69
+
70
+ def _looks_like_model_dir(p: Path) -> bool:
71
+ if not p.exists() or not p.is_dir():
72
+ return False
73
+ if (p / "config.json").exists():
74
+ return True
75
+ if any(p.glob("*.safetensors")) or any(p.glob("pytorch_model*.bin")):
76
+ return True
77
+ return False
78
+
79
+
80
+ def _infer_target_modules(model) -> List[str]:
81
+ names = set()
82
+ for n, _ in model.named_modules():
83
+ names.add(n.split(".")[-1])
84
+
85
+ for group in [
86
+ ["q_proj", "k_proj", "v_proj", "o_proj"],
87
+ ["Wqkv", "out_proj"],
88
+ ["query_key_value", "dense"],
89
+ ["c_attn", "c_proj"],
90
+ ]:
91
+ if all(x in names for x in group):
92
+ return group
93
+
94
+ fallback = [
95
+ x
96
+ for x in [
97
+ "q_proj",
98
+ "k_proj",
99
+ "v_proj",
100
+ "o_proj",
101
+ "c_attn",
102
+ "c_proj",
103
+ "out_proj",
104
+ "dense",
105
+ ]
106
+ if x in names
107
+ ]
108
+ if fallback:
109
+ return fallback
110
+
111
+ raise ValueError(
112
+ "Could not auto-infer target_modules. Set peft.target_modules explicitly."
113
+ )
114
+
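The grouping logic in `_infer_target_modules` reduces to a first-match lookup over the model's leaf module names; a torch-free sketch of that lookup (the fallback list is omitted here for brevity):

```python
from typing import List, Set

def pick_lora_targets(leaf_names: Set[str]) -> List[str]:
    # Known attention projection groups, checked in order:
    # LLaMA/Qwen-style, MPT-style, GPT-NeoX-style, GPT-2-style.
    for group in [
        ["q_proj", "k_proj", "v_proj", "o_proj"],
        ["Wqkv", "out_proj"],
        ["query_key_value", "dense"],
        ["c_attn", "c_proj"],
    ]:
        if all(x in leaf_names for x in group):
            return group
    raise ValueError("Set peft.target_modules explicitly.")
```

Because the first fully-present group wins, a model exposing both LLaMA-style and GPT-2-style names would get the LLaMA-style projections.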
115
+
116
+ def _choose_attn_impl(cfg: Dict[str, Any]) -> Optional[str]:
117
+ return cfg.get("model", {}).get("attn_implementation", None)
118
+
119
+
120
+ # --------------------------
121
+ # Wandb Integration
122
+ # --------------------------
123
+
124
+ def setup_wandb(cfg: Dict[str, Any], run_dir: Path):
125
+ """Initialize Wandb if enabled in configuration."""
126
+ wandb_cfg = cfg.get("wandb", {})
127
+
128
+ if not wandb_cfg.get("enabled", False):
129
+ print("Wandb logging disabled")
130
+ return None
131
+
132
+ if not WANDB_AVAILABLE:
133
+ print("Wandb not available. Install with: pip install wandb")
134
+ return None
135
+
136
+ # Extract wandb configuration
137
+ project = wandb_cfg.get("project", "sft-training")
138
+ entity = wandb_cfg.get("entity", None)
139
+ name = wandb_cfg.get("name", None)
140
+ tags = wandb_cfg.get("tags", [])
141
+ notes = wandb_cfg.get("notes", None)
142
+
143
+ # Initialize wandb
144
+ try:
145
+ wandb.init(
146
+ project=project,
147
+ entity=entity,
148
+ name=name,
149
+ tags=tags,
150
+ notes=notes,
151
+ dir=str(run_dir),
152
+ config={
153
+ "model": cfg.get("model", {}),
154
+ "data": cfg.get("data", {}),
155
+ "peft": cfg.get("peft", {}),
156
+ "train": cfg.get("train", {}),
157
+ "run_dir": str(run_dir),
158
+ }
159
+ )
160
+ print(f"Wandb initialized: project='{project}', name='{name or 'auto-generated'}'")
161
+ return wandb
162
+ except Exception as e:
163
+ print(f"Failed to initialize Wandb: {e}")
164
+ return None
165
+
166
+
167
+ def finish_wandb():
168
+ """Finish Wandb run if active."""
169
+ if WANDB_AVAILABLE and wandb.run is not None:
170
+ wandb.finish()
171
+ print("Wandb run finished")
172
+
173
+
174
+ # --------------------------
175
+ # JSONL Logger Callback
176
+ # --------------------------
177
+
178
+
179
+ class JsonlLoggerCallback(TrainerCallback):
180
+ def __init__(self, run_dir: Path):
181
+ self.run_dir = run_dir
182
+ self.train_log_path = _ensure_dir(run_dir / "logs") / "train.jsonl"
183
+ self.eval_log_path = _ensure_dir(run_dir / "logs") / "eval.jsonl"
184
+ self.start_time = None
185
+
186
+ def _eta(self, global_step: int, max_steps: int) -> Optional[str]:
187
+ if self.start_time is None or global_step <= 0 or max_steps <= 0:
188
+ return None
189
+ elapsed = time.time() - self.start_time
190
+ sec_per_step = elapsed / global_step
191
+ remaining = max(0, max_steps - global_step) * sec_per_step
192
+ h = int(remaining // 3600)
193
+ m = int((remaining % 3600) // 60)
194
+ s = int(remaining % 60)
195
+ return f"{h:02d}:{m:02d}:{s:02d}"
196
+
197
+ def on_train_begin(self, args, state, control, **kwargs):
198
+ self.start_time = time.time()
199
+
200
+ def on_log(self, args, state, control, logs=None, **kwargs):
201
+ if not logs:
202
+ return
203
+
204
+ max_steps = int(state.max_steps) if getattr(state, "max_steps", None) else 0
205
+ progress_pct = (
206
+ (100.0 * state.global_step / max_steps) if max_steps > 0 else None
207
+ )
208
+ epoch_pct = None
209
+ if (
210
+ state.epoch is not None
211
+ and args.num_train_epochs
212
+ and args.num_train_epochs > 0
213
+ ):
214
+ epoch_pct = 100.0 * (float(state.epoch) / float(args.num_train_epochs))
215
+
216
+ payload = {
217
+ "ts": _now_iso(),
218
+ "event": "train_log",
219
+ "step": int(state.global_step),
220
+ "epoch": round(float(state.epoch), 4) if state.epoch is not None else None,
221
+ "progress_pct": (
222
+ round(progress_pct, 2) if progress_pct is not None else None
223
+ ),
224
+ "epoch_pct": round(epoch_pct, 2) if epoch_pct is not None else None,
225
+ "eta": self._eta(int(state.global_step), max_steps),
226
+ "max_grad_norm": getattr(args, "max_grad_norm", None),
227
+ **logs,
228
+ }
229
+
230
+ with self.train_log_path.open("a", encoding="utf-8") as f:
231
+ f.write(json.dumps(payload, ensure_ascii=False) + "\n")
232
+
233
+ def on_evaluate(self, args, state, control, metrics=None, **kwargs):
234
+ if not metrics:
235
+ return
236
+ eval_loss = metrics.get("eval_loss", None)
237
+ ppl = _safe_exp(eval_loss) if eval_loss is not None else None
238
+
239
+ payload = {
240
+ "ts": _now_iso(),
241
+ "event": "eval",
242
+ "step": int(state.global_step),
243
+ "epoch": float(state.epoch) if state.epoch is not None else None,
244
+ **metrics,
245
+ "perplexity": ppl,
246
+ }
247
+ with self.eval_log_path.open("a", encoding="utf-8") as f:
248
+ f.write(json.dumps(payload, ensure_ascii=False) + "\n")
249
+
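The ETA string written by `JsonlLoggerCallback._eta` is a linear projection from the average step time observed so far; isolated as a plain function:

```python
def eta_string(elapsed_s: float, global_step: int, max_steps: int) -> str:
    # Project remaining wall time from the mean seconds-per-step so far.
    sec_per_step = elapsed_s / global_step
    remaining = max(0, max_steps - global_step) * sec_per_step
    h = int(remaining // 3600)
    m = int((remaining % 3600) // 60)
    s = int(remaining % 60)
    return f"{h:02d}:{m:02d}:{s:02d}"
```

For example, 600 s elapsed over 100 of 1000 steps averages 6 s/step, so the 900 remaining steps project to 5400 s, i.e. `"01:30:00"`.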
250
+
251
+ # --------------------------
252
+ # Data Pipeline (Instruction Formatting)
253
+ # --------------------------
254
+
255
+
256
+ def format_instruction(
257
+ example: Dict[str, Any], cfg: Dict[str, Any], tokenizer
258
+ ) -> Dict[str, Any]:
259
+ """
260
+ Format instruction data for training.
261
+ Supports multiple formats: chatml, alpaca, custom templates.
262
+ Returns both formatted text and the response start position for loss masking.
263
+ """
264
+ data_cfg = cfg["data"]
265
+ format_type = data_cfg.get("format_type", "chatml")
266
+
267
+ # Get field names from config
268
+ input_field = data_cfg.get("input_field", "input")
269
+ output_field = data_cfg.get("output_field", "output")
270
+ instruction_field = data_cfg.get("instruction_field", "instruction")
271
+
272
+ # Extract text from example
273
+ instruction = example.get(instruction_field, "")
274
+ input_text = example.get(input_field, "")
275
+ output_text = example.get(output_field, "")
276
+
277
+ if format_type == "chatml":
278
+ # ChatML format with special tokens
279
+ system_prompt = data_cfg.get("system_prompt", "You are a helpful assistant.")
280
+
281
+ messages = []
282
+ if system_prompt:
283
+ messages.append({"role": "system", "content": system_prompt})
284
+
285
+ user_content = instruction
286
+ if input_text:
287
+ user_content = f"{instruction}\n\n{input_text}"
288
+ messages.append({"role": "user", "content": user_content})
289
+ messages.append({"role": "assistant", "content": output_text})
290
+
291
+ # Apply chat template
292
+ formatted_text = tokenizer.apply_chat_template(
293
+ messages, tokenize=False, add_generation_prompt=False
294
+ )
295
+
296
+ # Add EOS token if not present
297
+ if tokenizer.eos_token and not formatted_text.endswith(tokenizer.eos_token):
298
+ formatted_text += tokenizer.eos_token
299
+
300
+ # Find where the assistant response starts for loss masking
301
+ # Try multiple possible markers for robustness
302
+ markers = ["<|im_start|>assistant", "<|assistant|>", "Assistant:", "assistant\n"]
303
+ response_start_pos = -1
304
+
305
+ for marker in markers:
306
+ idx = formatted_text.find(marker)
307
+ if idx != -1:
308
+ # Find the newline after the marker
309
+ newline_idx = formatted_text.find("\n", idx)
310
+ if newline_idx != -1:
311
+ response_start_pos = newline_idx + 1
312
+ break
313
+
314
+ # Fallback: find where the actual output starts
315
+ if response_start_pos == -1:
316
+ output_idx = formatted_text.find(output_text)
317
+ if output_idx != -1:
318
+ response_start_pos = output_idx
319
+ else:
320
+ # Last resort: split at last occurrence of newline before end
321
+ response_start_pos = formatted_text.rfind("\n", 0, len(formatted_text) - len(output_text)) + 1
322
+
323
+ elif format_type == "alpaca":
324
+ # Alpaca format
325
+ if input_text:
326
+ prefix = f"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
327
+ else:
328
+ prefix = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n"
329
+
330
+ formatted_text = prefix + output_text
331
+
332
+ # Add EOS token
333
+ if tokenizer.eos_token:
334
+ formatted_text += tokenizer.eos_token
335
+
336
+ # Response starts after the prefix
337
+ response_start_pos = len(prefix)
338
+
339
+ elif format_type == "custom":
340
+ # Custom template from config
341
+ template = data_cfg.get("custom_template", "{instruction}\n{input}\n{output}")
342
+
343
+ # For custom format, use system_prompt as instruction if instruction field is empty
344
+ if not instruction:
345
+ instruction = data_cfg.get("system_prompt", "")
346
+
347
+ # For custom templates, we need to find where {output} starts
348
+ template_parts = template.split("{output}")
349
+ prefix = template_parts[0].format(instruction=instruction, input=input_text)
350
+ formatted_text = prefix + output_text
351
+
352
+ # Add EOS token if not already in template
353
+ if tokenizer.eos_token and not formatted_text.endswith(tokenizer.eos_token):
354
+ formatted_text += tokenizer.eos_token
355
+
356
+ # Response starts after the prefix
357
+ response_start_pos = len(prefix)
358
+ else:
359
+ raise ValueError(f"Unsupported format_type: {format_type}")
360
+
361
+ return {"text": formatted_text, "response_start_pos": response_start_pos}
362
+
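A standalone sketch mirroring the `alpaca` branch above: `response_start_pos` marks where loss-bearing tokens begin, so everything before it is prompt. The `"</s>"` EOS string here is a placeholder assumption; the real value comes from the tokenizer.

```python
from typing import Tuple

def alpaca_format(instruction: str, input_text: str, output_text: str,
                  eos: str = "</s>") -> Tuple[str, int]:
    # Build the Alpaca-style prompt prefix; the response starts right after it.
    if input_text:
        prefix = (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n### Instruction:\n"
            f"{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
        )
    else:
        prefix = (
            "Below is an instruction that describes a task. Write a response "
            "that appropriately completes the request.\n\n### Instruction:\n"
            f"{instruction}\n\n### Response:\n"
        )
    return prefix + output_text + eos, len(prefix)
```

Slicing the returned text at the returned position yields exactly the response plus EOS, which is the span that keeps its labels after masking.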
363
+
364
+ def build_datasets(cfg: Dict[str, Any], tokenizer) -> Tuple[Any, Any]:
365
+ """
366
+ Build datasets for instruction fine-tuning.
367
+ """
368
+ data_cfg = cfg["data"]
369
+ train_path = data_cfg["train_jsonl"]
370
+ eval_path = data_cfg.get("eval_jsonl", None)
371
+ split_ratio = float(data_cfg.get("eval_split_ratio", 0.0))
372
+ max_length = int(data_cfg.get("max_length", 2048))
373
+ shuffle = bool(data_cfg.get("shuffle", True))
374
+ num_proc = int(data_cfg.get("num_proc", 4))
375
+
376
+ # Ensure tokenizer has pad token
377
+ if tokenizer.pad_token is None:
378
+ tokenizer.pad_token = tokenizer.eos_token
379
+
380
+ # Load datasets
381
+ ds = load_dataset("json", data_files={"train": train_path})
382
+
383
+ if eval_path:
384
+ ds_eval = load_dataset("json", data_files={"eval": eval_path})
385
+ dsd = DatasetDict({"train": ds["train"], "eval": ds_eval["eval"]})
386
+ else:
387
+ if 0.0 < split_ratio < 1.0:
388
+ split = ds["train"].train_test_split(
389
+ test_size=split_ratio, seed=int(cfg["run"].get("seed", 42))
390
+ )
391
+ dsd = DatasetDict({"train": split["train"], "eval": split["test"]})
392
+ else:
393
+ dsd = DatasetDict({"train": ds["train"], "eval": None})
394
+
395
+ # Format instructions and track response start positions
396
+ def format_fn(examples):
397
+ formatted_examples = []
398
+ response_start_positions = []
399
+ for i in range(len(examples[list(examples.keys())[0]])):
400
+ example = {k: examples[k][i] for k in examples.keys()}
401
+ formatted = format_instruction(example, cfg, tokenizer)
402
+ formatted_examples.append(formatted["text"])
403
+ response_start_positions.append(formatted["response_start_pos"])
404
+ return {
405
+ "text": formatted_examples,
406
+ "response_start_pos": response_start_positions
407
+ }
408
+
409
+ formatted_train = dsd["train"].map(
410
+ format_fn,
411
+ batched=True,
412
+ num_proc=num_proc,
413
+ remove_columns=dsd["train"].column_names,
414
+ desc="Formatting train instructions",
415
+ )
416
+
417
+ formatted_eval = None
418
+ if dsd["eval"] is not None:
419
+ formatted_eval = dsd["eval"].map(
420
+ format_fn,
421
+ batched=True,
422
+ num_proc=num_proc,
423
+ remove_columns=dsd["eval"].column_names,
424
+ desc="Formatting eval instructions",
425
+ )
426
+
427
+ # Tokenize and apply loss masking
428
+ def tokenize_and_mask_fn(examples):
429
+ tokenized = tokenizer(
430
+ examples["text"],
431
+ truncation=True,
432
+ padding=False,
433
+ max_length=max_length,
434
+ return_overflowing_tokens=False,
435
+ )
436
+
437
+ # Apply loss masking - CRITICAL for SFT
438
+ labels = []
439
+ attention_masks = []
440
+
441
+ for i in range(len(tokenized["input_ids"])):
442
+ input_ids = tokenized["input_ids"][i]
443
+ response_start_pos = examples["response_start_pos"][i]
444
+
445
+ # Get the instruction part (before response)
446
+ full_text = examples["text"][i]
447
+ instruction_text = full_text[:response_start_pos]
448
+
449
+ # Create labels masked by default
450
+ label_ids = [-100] * len(input_ids)
451
+
452
+ # Approximate where the response starts using a character-offset ratio.
453
+ # This avoids tokenizing the prefix separately (which can insert
454
+ # different special tokens) but only approximates the token boundary.
455
+ char_ratio = response_start_pos / max(len(full_text), 1)
456
+ response_start_idx = int(len(input_ids) * char_ratio)
457
+
458
+ # Ensure we have valid bounds (at least position 1, at most len-1)
459
+ response_start_idx = max(1, min(response_start_idx, len(input_ids) - 1))
460
+
461
+ # Unmask response tokens (including EOS)
462
+ for j in range(response_start_idx, len(input_ids)):
463
+ label_ids[j] = input_ids[j]
464
+
465
+ # Create attention mask (1 for real tokens, 0 for padding)
466
+ attention_mask = [1] * len(input_ids)
467
+
468
+ labels.append(label_ids)
469
+ attention_masks.append(attention_mask)
470
+
471
+ tokenized["labels"] = labels
472
+ tokenized["attention_mask"] = attention_masks
473
+ return tokenized
474
+
475
+ tokenized_train = formatted_train.map(
476
+ tokenize_and_mask_fn,
477
+ batched=True,
478
+ num_proc=num_proc,
+ remove_columns=formatted_train.column_names,
479
+ desc="Tokenizing and masking train",
480
+ )
481
+
482
+ tokenized_eval = None
483
+ if formatted_eval is not None:
484
+ tokenized_eval = formatted_eval.map(
485
+ tokenize_and_mask_fn,
486
+ batched=True,
487
+ num_proc=num_proc,
+ remove_columns=formatted_eval.column_names,
488
+ desc="Tokenizing and masking eval",
489
+ )
490
+
491
+ if shuffle:
492
+ tokenized_train = tokenized_train.shuffle(seed=int(cfg["run"].get("seed", 42)))
493
+
494
+ return tokenized_train, tokenized_eval
495
+
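The character-ratio heuristic inside `tokenize_and_mask_fn` can be exercised in isolation. A minimal sketch with toy token IDs (all values hypothetical): the first response token index is approximated from the character offset, clamped to valid bounds, and everything before it is masked with `-100` so it contributes no loss.

```python
from typing import List

def mask_labels(input_ids: List[int], full_text_len: int,
                response_start_pos: int) -> List[int]:
    # Approximate the first response token by the character-offset ratio.
    char_ratio = response_start_pos / max(full_text_len, 1)
    start = int(len(input_ids) * char_ratio)
    # Clamp to [1, len-1] so at least one token is masked and one is kept.
    start = max(1, min(start, len(input_ids) - 1))
    return [-100] * start + input_ids[start:]
```

With 10 tokens and a response starting 60% of the way through the text, the first 6 labels are masked; the clamp guarantees the mask never covers the whole sequence.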
496
+
497
+ # --------------------------
498
+ # Model Loading + PEFT
499
+ # --------------------------
500
+
501
+
502
+ def load_base_model_and_tokenizer(cfg: Dict[str, Any], base_dir: Path):
503
+ model_cfg = cfg["model"]
504
+ trust_remote_code = bool(model_cfg.get("trust_remote_code", True))
505
+ use_fast = bool(model_cfg.get("tokenizer_use_fast", True))
506
+ device_map = model_cfg.get("device_map", "auto")
507
+
508
+ tokenizer = AutoTokenizer.from_pretrained(
509
+ str(base_dir),
510
+ use_fast=use_fast,
511
+ trust_remote_code=trust_remote_code,
512
+ )
513
+ if tokenizer.pad_token is None:
514
+ tokenizer.pad_token = tokenizer.eos_token
515
+
516
+ torch_dtype = _dtype_from_str(model_cfg.get("torch_dtype", "bfloat16"))
517
+ use_4bit = bool(model_cfg.get("use_4bit", False))
518
+
519
+ quant_cfg = None
520
+ if use_4bit:
521
+ quant_cfg = BitsAndBytesConfig(
522
+ load_in_4bit=True,
523
+ bnb_4bit_quant_type=str(model_cfg.get("bnb_4bit_quant_type", "nf4")),
524
+ bnb_4bit_use_double_quant=bool(
525
+ model_cfg.get("bnb_4bit_use_double_quant", True)
526
+ ),
527
+ bnb_4bit_compute_dtype=_dtype_from_str(
528
+ model_cfg.get("bnb_4bit_compute_dtype", "bfloat16")
529
+ ),
530
+ )
531
+
532
+ attn_impl = _choose_attn_impl(cfg)
533
+
534
+ try:
535
+ model = AutoModelForCausalLM.from_pretrained(
536
+ str(base_dir),
537
+ device_map=device_map,
538
+ trust_remote_code=trust_remote_code,
539
+ low_cpu_mem_usage=True,
540
+ torch_dtype=(torch_dtype if not use_4bit else None),
541
+ quantization_config=quant_cfg,
542
+ attn_implementation=attn_impl,
543
+ )
544
+ except Exception as e:
545
+ if attn_impl is not None:
546
+ print(f"[warn] attn_implementation='{attn_impl}' failed: {e}")
547
+ print("[warn] Falling back to default attention implementation.")
548
+ model = AutoModelForCausalLM.from_pretrained(
549
+ str(base_dir),
550
+ device_map=device_map,
551
+ trust_remote_code=trust_remote_code,
552
+ low_cpu_mem_usage=True,
553
+ torch_dtype=(torch_dtype if not use_4bit else None),
554
+ quantization_config=quant_cfg,
555
+ )
556
+
557
+ return model, tokenizer
558
+
559
+
560
+ def apply_peft(cfg: Dict[str, Any], model):
561
+ peft_cfg = cfg["peft"]
562
+ model_cfg = cfg["model"]
563
+ tr_cfg = cfg["train"]
564
+
565
+ if not bool(peft_cfg.get("enabled", True)):
566
+ return model, None
567
+
568
+ use_4bit = bool(model_cfg.get("use_4bit", False))
569
+ gradient_checkpointing = bool(tr_cfg.get("gradient_checkpointing", True))
570
+
571
+ if gradient_checkpointing and hasattr(model, "gradient_checkpointing_enable"):
572
+ model.gradient_checkpointing_enable()
573
+ if hasattr(model, "config"):
574
+ model.config.use_cache = False
575
+
576
+ if use_4bit:
577
+ model = prepare_model_for_kbit_training(
578
+ model,
579
+ use_gradient_checkpointing=gradient_checkpointing,
580
+ )
581
+
582
+ target_modules = peft_cfg.get("target_modules", "auto")
583
+ if target_modules == "auto":
584
+ target_modules = _infer_target_modules(model)
585
+
586
+ lora_config = LoraConfig(
587
+ r=int(peft_cfg.get("r", 16)),
588
+ lora_alpha=int(peft_cfg.get("lora_alpha", 32)),
589
+ lora_dropout=float(peft_cfg.get("lora_dropout", 0.05)),
590
+ bias=str(peft_cfg.get("bias", "none")),
591
+ task_type="CAUSAL_LM",
592
+ target_modules=target_modules,
593
+ )
594
+ model = get_peft_model(model, lora_config)
595
+ return model, lora_config
596
+
597
+
598
+ # --------------------------
599
+ # Merge Logic
600
+ # --------------------------
601
+
602
+
603
+ def merge_adapter(
604
+ cfg: Dict[str, Any], base_dir: Path, adapter_dir: Path, final_dir: Path
605
+ ):
606
+ print(f"--- Merge: {adapter_dir} + {base_dir} -> {final_dir} ---")
607
+
608
+ model_cfg = cfg["model"]
609
+ merge_cfg = cfg.get("merge", {})
610
+ trust_remote_code = bool(model_cfg.get("trust_remote_code", True))
611
+
612
+ merged_dtype = _dtype_from_str(merge_cfg.get("merged_dtype", "float16"))
613
+ max_shard_size = str(merge_cfg.get("max_shard_size", "2GB"))
614
+
615
+ base = AutoModelForCausalLM.from_pretrained(
616
+ str(base_dir),
617
+ torch_dtype=merged_dtype,
618
+ device_map="cpu",
619
+ low_cpu_mem_usage=True,
620
+ trust_remote_code=trust_remote_code,
621
+ )
622
+
623
+ merged = PeftModel.from_pretrained(base, str(adapter_dir))
624
+ merged = merged.merge_and_unload()
625
+
626
+ _ensure_dir(final_dir)
627
+ merged.save_pretrained(
628
+ str(final_dir), safe_serialization=True, max_shard_size=max_shard_size
629
+ )
630
+
631
+ tok = AutoTokenizer.from_pretrained(
632
+ str(base_dir), trust_remote_code=trust_remote_code
633
+ )
634
+ if tok.pad_token is None:
635
+ tok.pad_token = tok.eos_token
636
+ tok.save_pretrained(str(final_dir))
637
+
638
+ print("--- Merge complete ---")
639
+
640
+
641
+ # --------------------------
642
+ # Main
643
+ # --------------------------
644
+
645
+
646
+ def main():
647
+ ap = argparse.ArgumentParser()
648
+ ap.add_argument("--config", required=True, help="Path to YAML config")
649
+ ap.add_argument(
650
+ "--merge-only", action="store_true", help="Skip training, just merge adapter"
651
+ )
652
+ args = ap.parse_args()
653
+
654
+ with open(args.config, "r", encoding="utf-8") as f:
655
+ cfg = yaml.safe_load(f)
656
+
657
+ run_dir = _ensure_dir(Path(cfg["run"]["run_dir"]))
658
+ _ensure_dir(run_dir / "logs")
659
+
660
+ with (run_dir / "config_resolved.yaml").open("w", encoding="utf-8") as f:
661
+ yaml.safe_dump(cfg, f, sort_keys=False)
662
+
663
+ model_cfg = cfg["model"]
664
+ repo_id = str(model_cfg["repo_id"]).strip()
665
+ repo_path = Path(repo_id)
666
+
667
+ # ✅ Local model path -> load directly; no download
668
+ if repo_path.exists() and repo_path.is_dir() and _looks_like_model_dir(repo_path):
669
+ base_dir = repo_path
670
+ print(f"Using local model at: {base_dir}")
671
+ elif repo_path.exists() and repo_path.is_dir():
672
+ raise ValueError(
673
+ f"model.repo_id points to a directory, but it doesn't look like a HF model dir: {base_dir}"
674
+ )
675
+ else:
676
+ # HF repo_id -> download into run_dir/base_local_dir
677
+ base_dir = _ensure_dir(run_dir / model_cfg.get("base_local_dir", "base_model"))
678
+ if not _looks_like_model_dir(base_dir):
679
+ print(f"Base model not found at {base_dir}, downloading from {repo_id} ...")
680
+ snapshot_download(
681
+ repo_id=repo_id,
682
+ revision=model_cfg.get("revision", None),
683
+ local_dir=str(base_dir),
684
+ local_dir_use_symlinks=False,
685
+ )
686
+
687
+ ckpt_dir = _ensure_dir(run_dir / "checkpoints")
688
+ best_adapter_dir = _ensure_dir(run_dir / "best_adapter")
689
+
690
+ merge_cfg = cfg.get("merge", {}) or {}
691
+ if merge_cfg.get("output_dir"):
692
+ od = Path(str(merge_cfg["output_dir"]))
693
+ final_dir = od if od.is_absolute() else (run_dir / od)
694
+ else:
695
+ final_dir = run_dir / "final_model"
696
+
697
+ # Merge-only
698
+ if args.merge_only:
699
+ if not _looks_like_model_dir(best_adapter_dir):
700
+ raise FileNotFoundError(f"Adapter not found at {best_adapter_dir}")
701
+ merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
702
+ return
703
+
704
+ # Initialize Wandb
705
+ wandb_run = setup_wandb(cfg, run_dir)
706
+
707
+ # Training
708
+ set_seed(int(cfg["run"].get("seed", 42)))
709
+
710
+ model, tokenizer = load_base_model_and_tokenizer(cfg, base_dir)
711
+ model, _ = apply_peft(cfg, model)
712
+
713
+ train_ds, eval_ds = build_datasets(cfg, tokenizer)
714
+
715
+ tr_cfg = cfg["train"]
716
+
717
+ dtype = _dtype_from_str(model_cfg.get("torch_dtype", "bfloat16"))
718
+ use_fp16 = dtype == torch.float16
719
+ use_bf16 = dtype == torch.bfloat16
720
+
721
+ max_steps = int(tr_cfg.get("max_steps", 0))
722
+ num_train_epochs = float(tr_cfg.get("num_train_epochs", 1))
723
+
724
+ # --- Dynamic evaluation strategy parameter handling ---
725
+ ta_params = inspect.signature(TrainingArguments.__init__).parameters
726
+ eval_key = (
727
+ "eval_strategy" if "eval_strategy" in ta_params else "evaluation_strategy"
728
+ )
729
+
730
+ # Setup reporting based on wandb availability
731
+ report_to = []
732
+ if wandb_run is not None:
733
+ report_to.append("wandb")
734
+
735
+ ta_kwargs = dict(
736
+ output_dir=str(ckpt_dir),
737
+ max_steps=max_steps if max_steps > 0 else -1,
738
+ num_train_epochs=num_train_epochs,
739
+ per_device_train_batch_size=int(tr_cfg.get("per_device_train_batch_size", 1)),
740
+ per_device_eval_batch_size=int(
741
+ tr_cfg.get(
742
+ "per_device_eval_batch_size",
743
+ tr_cfg.get("per_device_train_batch_size", 1),
744
+ )
745
+ ),
746
+ gradient_accumulation_steps=int(tr_cfg.get("gradient_accumulation_steps", 1)),
747
+ learning_rate=float(tr_cfg.get("learning_rate", 2e-5)),
748
+ weight_decay=float(tr_cfg.get("weight_decay", 0.0)),
749
+ warmup_ratio=float(tr_cfg.get("warmup_ratio", 0.0)),
750
+ lr_scheduler_type=str(tr_cfg.get("lr_scheduler_type", "cosine")),
751
+ optim=str(
752
+ tr_cfg.get(
753
+ "optim",
754
+ (
755
+ "paged_adamw_8bit"
756
+ if bool(model_cfg.get("use_4bit", False))
757
+ else "adamw_torch"
758
+ ),
759
+ )
760
+ ),
761
+ max_grad_norm=float(tr_cfg.get("max_grad_norm", 1.0)),
762
+ logging_steps=int(tr_cfg.get("logging_steps", 10)),
763
+ save_strategy=str(tr_cfg.get("save_strategy", "steps")),
764
+ save_steps=int(tr_cfg.get("save_steps", 200)),
765
+ save_total_limit=int(tr_cfg.get("save_total_limit", 3)),
766
+ eval_steps=int(tr_cfg.get("eval_steps", 200)),
767
+ load_best_model_at_end=(
768
+ bool(tr_cfg.get("load_best_model_at_end", True))
769
+ if eval_ds is not None
770
+ else False
771
+ ),
772
+ metric_for_best_model="eval_loss",
773
+ greater_is_better=False,
774
+ fp16=use_fp16,
775
+ bf16=use_bf16,
776
+ report_to=report_to,
777
+ remove_unused_columns=False,
778
+ )
779
+
780
+ # Set the correct argument name for this transformers version
781
+ ta_kwargs[eval_key] = str(
782
+ tr_cfg.get("evaluation_strategy", "steps" if eval_ds is not None else "no")
783
+ )
784
+
785
+ training_args = TrainingArguments(**ta_kwargs)
786
+
787
+ # Setup callbacks
788
+ callbacks = [JsonlLoggerCallback(run_dir)]
789
+
790
+ # Add early stopping callback if enabled
791
+ early_stopping_cfg = tr_cfg.get("early_stopping", {})
792
+ if early_stopping_cfg.get("enabled", False) and eval_ds is not None:
793
+ early_stopping_callback = EarlyStoppingCallback(
794
+ early_stopping_patience=int(early_stopping_cfg.get("patience", 3)),
795
+ early_stopping_threshold=float(early_stopping_cfg.get("min_delta", 0.001)),
796
+ )
797
+ callbacks.append(early_stopping_callback)
798
+ print(f"Early stopping enabled: patience={early_stopping_cfg.get('patience', 3)}, "
799
+ f"min_delta={early_stopping_cfg.get('min_delta', 0.001)}")
800
+
801
+ trainer = Trainer(
802
+ model=model,
803
+ args=training_args,
804
+ train_dataset=train_ds,
805
+ eval_dataset=eval_ds,
806
+ data_collator=default_data_collator,  # sequences are unpadded, so per-device batch size must stay 1
807
+ callbacks=callbacks,
808
+ )
809
+
810
+ # Resume
811
+ resume_from = tr_cfg.get("resume_from_checkpoint", None)
812
+ if resume_from == "auto":
813
+ last = get_last_checkpoint(str(ckpt_dir))
814
+ resume_from = last if last else None
815
+ if resume_from:
816
+ print(f"Resuming from {resume_from}")
817
+
818
+ print("Starting instruction fine-tuning...")
819
+ trainer.train(resume_from_checkpoint=resume_from)
820
+
821
+ trainer.save_model(str(best_adapter_dir))
822
+ print(f"Saved best adapter -> {best_adapter_dir}")
823
+
824
+ if eval_ds is not None:
825
+ metrics = trainer.evaluate()
826
+ eval_loss = metrics.get("eval_loss", None)
827
+ metrics["perplexity"] = _safe_exp(eval_loss) if eval_loss is not None else None
828
+ with (run_dir / "eval_final.json").open("w", encoding="utf-8") as f:
829
+ json.dump(metrics, f, indent=2)
830
+ print(f"Final eval_loss={eval_loss}, ppl={metrics['perplexity']}")
831
+
832
+ if bool(cfg.get("merge", {}).get("enabled", False)):
833
+ del trainer, model
834
+ torch.cuda.empty_cache()
835
+ merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
836
+ else:
837
+ print("Merge disabled. Run with --merge-only later if needed.")
838
+
839
+ # Finish Wandb run
840
+ finish_wandb()
841
+
842
+
843
+ if __name__ == "__main__":
844
+ main()
trainer-kit/SFT/.DS_Store ADDED
Binary file (6.15 kB).
 
trainer-kit/SFT/config_instruct.yaml ADDED
@@ -0,0 +1,144 @@
1
+ run:
2
+ run_dir: "./runs/instruct_run_24b"
3
+ seed: 42
4
+
5
+ # WandB integration for experiment tracking
6
+ wandb:
7
+ enabled: true # Set to true to enable wandb logging
8
+ project: "sft-training" # WandB project name
9
+ entity: null # WandB entity/team (optional)
10
+ name: null # Run name (optional, will auto-generate if null)
11
+ tags: ["sft-lora", "24b-Devstral"] # List of tags for the run (e.g., ["lora", "qlora", "experiment-1"])
12
+ notes: null # Run description/notes (optional)
13
+
14
+ model:
15
+ # Use the locally merged 24B CPT model as the SFT base
16
+ repo_id: "./CPT/runs/cpt_run_v1/merged_24b_cpt_lora"
17
+ revision: null
18
+
19
+ # Used only when repo_id is a HF repo (not a local path)
20
+ base_local_dir: "base_model"
21
+
22
+ trust_remote_code: true
23
+ tokenizer_use_fast: true
24
+ device_map: "auto"
25
+
26
+ torch_dtype: "bfloat16" # "float16" | "bfloat16" | "float32"
27
+
28
+ # QLoRA
29
+ use_4bit: false
30
+ bnb_4bit_quant_type: "nf4"
31
+ bnb_4bit_use_double_quant: false
32
+ bnb_4bit_compute_dtype: "bfloat16"
33
+
34
+ # optional: "flash_attention_2" | "sdpa" | null
35
+ attn_implementation: null
36
+
37
+ data:
38
+ train_jsonl: "../sft_dataset.jsonl"
39
+ eval_jsonl: null
40
+ eval_split_ratio: 0.1
41
+
42
+ # Field names in your JSONL data
43
+ instruction_field: "instruction" # This will be the system prompt
44
+ input_field: "input" # This is the task description
45
+ output_field: "output" # This is the analysis + selection
46
+
47
+ # Formatting options
48
+ format_type: "custom" # "chatml" | "alpaca" | "custom"
49
+
50
+ # For chatml format
51
+ system_prompt: |
52
+ You are a Hyperswitch Rust code analyzer. Identify functions/structs that need modification for a given task.
53
+
54
+ ## Output Format
55
+
56
+ ##OUTPUT
57
+ Explain the data flow and why each component must change:
58
+ - Flow: [Input → Processing → Output with arrows]
59
+ - For each component: "The [ComponentName] ([path]) must [action] because [reason]—without this, [consequence]"
60
+ - Explain coupling between components
61
+
62
+ ##SELECT
63
+ modify::crates/path/to/file.rs::impl::ComponentName
64
+ add::crates/another/file.rs::function::AnotherComponent
65
+ <EOS>
66
+
67
+ ## Rules
68
+
69
+ 1. Use full paths: `remove::crates/folder/file.rs::Type::Name`
70
+ 2. Use `::` for nested items: `status::StructName::Type::Name`
71
+ 3. Always explain "must change because" and "without this"
72
+ 4. Types of components: function, struct, enum, impl, trait
73
+ 5. If there is extra information (e.g., enum variants), include that too.
74
+ 6. Start with ##OUTPUT, end with ##SELECT, terminate with <EOS>
75
+
76
+ ## Example
77
+
78
+ ##TASK
79
+ Add webhook subscription support
80
+
81
+ ##OUTPUT
82
+ The webhook system routes events via EventClass enum. Flow: webhook → EventClass → handler → processing. The EventClass enum (crates/common_enums/src/enums.rs::EventClass) must add Subscriptions variant because it defines event routing—without this, subscription events cannot be processed. The SubscriptionStatus impl (crates/common_enums/src/transformers.rs::SubscriptionStatus) must map to EventType because it converts status to events—without this, status changes don't trigger webhooks. These are coupled: EventClass routes to handlers that use SubscriptionStatus mappings.
83
+
84
+ ##SELECT
85
+ crates/common_enums/src/enums.rs::EventClass
86
+ crates/common_enums/src/transformers.rs::SubscriptionStatus
87
+ <EOS>
88
+
89
+ # For custom format (only used when format_type="custom")
90
+ custom_template: "##INSTRUCTION\n{instruction}<|im_end|>\n##TASK\n{input}<|im_end|>\n##OUTPUT\n{output}<|im_end|>"
91
+
92
+ max_length: 2048
93
+ shuffle: true
94
+ num_proc: 4
95
+
96
+ peft:
97
+ enabled: true
98
+ r: 8
99
+ lora_alpha: 16
100
+ lora_dropout: 0.05
101
+ bias: "none"
102
+ target_modules: "auto"
103
+
104
+ train:
105
+ # max_steps: 10
106
+ num_train_epochs: 6
107
+
108
+ per_device_train_batch_size: 1
109
+ per_device_eval_batch_size: 1
110
+ gradient_accumulation_steps: 8
111
+
112
+ learning_rate: 1e-4
113
+ weight_decay: 0.0
114
+ warmup_ratio: 0.08
115
+ lr_scheduler_type: "cosine"
116
+
117
+ optim: "adamw_torch" # ✅ Changed from paged_adamw_8bit (requires use_4bit=true)
118
+ max_grad_norm: 0.8
119
+ gradient_checkpointing: true
120
+
121
+ logging_steps: 2
122
+ save_strategy: "steps"
123
+ save_steps: 500
124
+ save_total_limit: 20
125
+
126
+ evaluation_strategy: "steps"
127
+ eval_steps: 100
128
+ load_best_model_at_end: true
129
+
130
+ # Early stopping
131
+ early_stopping:
132
+ enabled: true
133
+ patience: 3 # Number of evaluations with no improvement before stopping
134
+ min_delta: 0.001 # Minimum change to qualify as improvement
135
+ metric: "eval_loss" # Metric to monitor
136
+ mode: "min" # "min" for loss, "max" for accuracy/etc.
137
+
138
+ resume_from_checkpoint: "auto"
139
+
140
+ merge:
141
+ enabled: true
142
+ merged_dtype: "float16"
143
+ max_shard_size: "2GB"
144
+ output_dir: "./merged_24b_instruct_lora"
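The `early_stopping` keys above do not map one-to-one onto `transformers` argument names. A minimal sketch of how such a config block could be translated into the standard `EarlyStoppingCallback` and `TrainingArguments` kwargs (the mapping function and variable names here are illustrative, not part of the kit):

```python
def early_stopping_kwargs(cfg: dict) -> dict:
    """Translate the YAML early_stopping block into transformers-style kwargs.

    Returns an empty dict when early stopping is disabled.
    """
    es = cfg.get("early_stopping", {})
    if not es.get("enabled", False):
        return {}
    return {
        # EarlyStoppingCallback arguments
        "early_stopping_patience": int(es.get("patience", 3)),
        "early_stopping_threshold": float(es.get("min_delta", 0.0)),
        # TrainingArguments: which metric to track and its direction
        "metric_for_best_model": es.get("metric", "eval_loss"),
        "greater_is_better": es.get("mode", "min") == "max",
    }


cfg = {"early_stopping": {"enabled": True, "patience": 3,
                          "min_delta": 0.001, "metric": "eval_loss",
                          "mode": "min"}}
```

With the config shown above, `early_stopping_kwargs(cfg)` yields patience 3, threshold 0.001, and `greater_is_better=False` (because the monitored metric is a loss).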
trainer-kit/SFT/instruct_data.jsonl ADDED
@@ -0,0 +1,4 @@
1
+ {"input": "##TASK\nAdd subscription support to webhook event metadata and outgoing webhook content systems to enable proper subscription lifecycle event handling and webhook delivery for subscription-related operations.", "output": "##OUTPUT\nThe webhook system requires subscription event support to handle subscription lifecycle events (creation, payment, cancellation) and deliver them to merchants via outgoing webhooks. This affects the event metadata enumeration and webhook content transformation layers. The complete flow is: subscription event occurs → EventMetadata::Subscription variant created with subscription_id, invoice_id, payment_id → stored in database → retrieved for webhook delivery → OutgoingWebhookEventContent::foreign_from converts EventMetadata to OutgoingWebhookEventContent::Subscription → StripeWebhookObject::From transforms to StripeWebhookObject::Subscriptions for Stripe compatibility. The OutgoingWebhookEventContent::foreign_from function (router/src/core/webhooks/outgoing_v2.rs::OutgoingWebhookEventContent::foreign_from) must add the Subscription variant match arm because it's the bridge from storage EventMetadata to webhook content representation—without this, subscription events cannot be converted to webhook content and would be silently dropped during webhook processing. The StripeWebhookObject::From implementation (router/src/compatibility/stripe/webhooks.rs::StripeWebhookObject::From) must handle OutgoingWebhookContent::SubscriptionDetails because it provides Stripe-compatible webhook formatting—without this, merchants using Stripe webhook endpoints would receive malformed or missing subscription events. The EventMetadata enum (implicitly referenced in foreign_from) requires the Subscription variant to exist in the diesel_models layer, which enables persistent storage of subscription event references. These components form a tightly coupled webhook pipeline where each depends on the previous: EventMetadata → OutgoingWebhookEventContent → StripeWebhookObject. If foreign_from lacks the Subscription match arm, subscription events stored as EventMetadata::Subscription would cause panic during webhook delivery. If StripeWebhookObject::From lacks SubscriptionDetails handling, Stripe webhook endpoints would receive unparseable responses, breaking merchant integrations that rely on Stripe-compatible subscription webhooks for revenue recognition and customer management systems.\n\n##SELECT\ncrates/router/src/core/webhooks/outgoing_v2.rs::OutgoingWebhookEventContent::foreign_from\ncrates/router/src/compatibility/stripe/webhooks.rs::StripeWebhookObject::From\n<EOS>"}
2
+ {"input": "##TASK\nUpdate subscription response mapping to use common_enums::SubscriptionStatus instead of subscription_types::SubscriptionStatus for consistent enum usage across the codebase.", "output": "##OUTPUT\nThe subscription system needs to standardize on a single SubscriptionStatus enum type to avoid type inconsistencies and import conflicts. The current flow is: subscription handler creates SubscriptionResponse → calls to_subscription_response method → converts string status to subscription_types::SubscriptionStatus → returns response. The to_subscription_response method (crates/subscriptions/src/core/subscription_handler.rs::SubscriptionWithHandler::to_subscription_response) must change because it's the conversion point where the status enum is instantiated. Using common_enums::SubscriptionStatus ensures consistency with other parts of the system that already use this common enum type. Without this change, the subscription module would maintain its own duplicate enum, creating potential mismatches and requiring conversions throughout the codebase when interacting with other modules that expect the common enum type.\n\n##SELECT\ncrates/subscriptions/src/core/subscription_handler.rs::SubscriptionWithHandler::to_subscription_response\n<EOS>"}
3
+ {"input": "##TASK\nAdd subscription support to webhook event system by including Subscriptions in EventClass enum and implementing SubscriptionStatus to EventType mapping for proper webhook routing.", "output": "##OUTPUT\nThe webhook system needs to handle subscription-related events for proper event classification and routing. The current flow is: webhook event received → EventClass determines routing category → EventType specifies the exact event type → handlers process based on event type. The EventClass enum (crates/common_enums/src/enums.rs::EventClass) must include Subscriptions because it's the primary classification that determines which subsystem handles the event. The SubscriptionStatus transformer (crates/common_enums/src/transformers.rs::impl::Option<EventType>) must map subscription status changes to appropriate EventType values because it converts domain state changes into webhook events. Without EventClass::Subscriptions, subscription events would be misclassified or unhandled. Without the SubscriptionStatus mapping, subscription status changes wouldn't trigger appropriate webhook notifications, breaking integration with external systems that depend on subscription lifecycle events.\n\n##SELECT\ncrates/common_enums/src/enums.rs::EventClass\ncrates/common_enums/src/transformers.rs::impl::Option<EventType>\n<EOS>"}
4
+ {"input": "##TASK\nAdd subscription ID support to webhook object reference system for proper webhook payload identification and routing.", "output": "##OUTPUT\nThe webhook system needs to reference subscription entities in event payloads for proper event correlation and processing. The current flow is: webhook event generated → ObjectReferenceId identifies the affected entity → webhook payload includes reference → consumers process based on entity type. The ObjectReferenceId enum (crates/api_models/src/webhooks.rs::ObjectReferenceId) must include SubscriptionId because it's the type-safe identifier used throughout the webhook payload structure to specify which subscription triggered the event. Without SubscriptionId, webhook events related to subscriptions couldn't properly reference the subscription entity, making it impossible for consumers to correlate events with specific subscriptions. This would break webhook consumers that need to update their local state or trigger business logic based on subscription events.\n\n##SELECT\ncrates/api_models/src/webhooks.rs::ObjectReferenceId\n<EOS>"}
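Each record above packs the task into `input` and the `##OUTPUT`/`##SELECT`/`<EOS>` structure into `output`. A small stdlib-only validator (a sketch; the marker strings follow the format shown in these examples, and the function name is illustrative) can catch malformed records before training:

```python
import json


def validate_record(line: str) -> list:
    """Return a list of problems found in one instruct JSONL line."""
    problems = []
    try:
        rec = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    if "##TASK" not in rec.get("input", ""):
        problems.append("input missing ##TASK")
    out = rec.get("output", "")
    for marker in ("##OUTPUT", "##SELECT", "<EOS>"):
        if marker not in out:
            problems.append(f"output missing {marker}")
    # If both markers are present, ##OUTPUT must come first
    if out.find("##OUTPUT") > out.find("##SELECT") >= 0:
        problems.append("##SELECT appears before ##OUTPUT")
    return problems


sample = json.dumps({"input": "##TASK\nAdd webhook support",
                     "output": "##OUTPUT\n...\n##SELECT\npath::Item\n<EOS>"})
```

Running such a check over the whole file before `load_dataset` gives a much clearer error than a tokenization failure mid-run.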
trainer-kit/SFT/requirements.txt ADDED
@@ -0,0 +1,23 @@
1
+ # Core
2
+ torch>=2.1.0
3
+ transformers>=4.41.0
4
+ datasets>=2.18.0
5
+ accelerate>=0.30.0
6
+
7
+ # PEFT / QLoRA
8
+ peft>=0.11.1
9
+ bitsandbytes>=0.43.1
10
+
11
+ # Hugging Face Hub (local + download support)
12
+ huggingface_hub>=0.23.0
13
+
14
+ # Config + utilities
15
+ pyyaml>=6.0
16
+ tqdm>=4.66.0
17
+
18
+ # Optional but recommended (tokenizers speed)
19
+ tokenizers>=0.15.0
20
+ safetensors>=0.4.2
21
+
22
+ # Experiment tracking
23
+ wandb>=0.16.0
trainer-kit/SFT/run_instruct.py ADDED
@@ -0,0 +1,921 @@
1
+ import argparse
2
+ import json
3
+ import inspect # Added for Transformers version compatibility
4
+ import math
5
+ import time
6
+ from pathlib import Path
7
+ from typing import Any, Dict, Optional, Tuple, List
8
+
9
+ import torch
10
+ import yaml
11
+ from datasets import load_dataset, DatasetDict
12
+ from huggingface_hub import snapshot_download
13
+ from transformers import (
14
+ AutoTokenizer,
15
+ AutoModelForCausalLM,
16
+ AutoModel,
17
+ AutoConfig,
18
+ BitsAndBytesConfig,
19
+ TrainingArguments,
20
+ Trainer,
21
+ TrainerCallback,
22
+ EarlyStoppingCallback,
23
+ default_data_collator,
24
+ set_seed,
25
+ )
26
+ from transformers.trainer_utils import get_last_checkpoint
27
+ from peft import (
28
+ LoraConfig,
29
+ get_peft_model,
30
+ prepare_model_for_kbit_training,
31
+ PeftModel,
32
+ )
33
+
34
+ try:
35
+ import wandb
36
+ WANDB_AVAILABLE = True
37
+ except ImportError:
38
+ WANDB_AVAILABLE = False
39
+ wandb = None
40
+
41
+
42
+ # --------------------------
43
+ # Helpers
44
+ # --------------------------
45
+
46
+
47
+ def _dtype_from_str(s: str) -> torch.dtype:
48
+ s = (s or "").lower()
49
+ if s in ("float16", "fp16"):
50
+ return torch.float16
51
+ if s in ("bfloat16", "bf16"):
52
+ return torch.bfloat16
53
+ if s in ("float32", "fp32"):
54
+ return torch.float32
55
+ raise ValueError(f"Unknown torch_dtype: {s}")
56
+
57
+
58
+ def _now_iso() -> str:
59
+ return time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime())
60
+
61
+
62
+ def _safe_exp(x: float) -> float:
63
+ x = min(float(x), 50.0)
64
+ return float(math.exp(x))
65
+
66
+
67
+ def _ensure_dir(p: Path) -> Path:
68
+ p.mkdir(parents=True, exist_ok=True)
69
+ return p
70
+
71
+
72
+ def _looks_like_model_dir(p: Path) -> bool:
73
+ if not p.exists() or not p.is_dir():
74
+ return False
75
+ if (p / "config.json").exists():
76
+ return True
77
+ if any(p.glob("*.safetensors")) or any(p.glob("pytorch_model*.bin")):
78
+ return True
79
+ return False
80
+
81
+
82
+ def _infer_target_modules(model) -> List[str]:
83
+ names = set()
84
+ for n, _ in model.named_modules():
85
+ names.add(n.split(".")[-1])
86
+
87
+ for group in [
88
+ ["q_proj", "k_proj", "v_proj", "o_proj"],
89
+ ["Wqkv", "out_proj"],
90
+ ["query_key_value", "dense"],
91
+ ["c_attn", "c_proj"],
92
+ ]:
93
+ if all(x in names for x in group):
94
+ return group
95
+
96
+ fallback = [
97
+ x
98
+ for x in [
99
+ "q_proj",
100
+ "k_proj",
101
+ "v_proj",
102
+ "o_proj",
103
+ "c_attn",
104
+ "c_proj",
105
+ "out_proj",
106
+ "dense",
107
+ ]
108
+ if x in names
109
+ ]
110
+ if fallback:
111
+ return fallback
112
+
113
+ raise ValueError(
114
+ "Could not auto-infer target_modules. Set peft.target_modules explicitly."
115
+ )
116
+
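`_infer_target_modules` above picks the first known attention-projection group whose members all appear among the model's leaf module names. The matching logic in isolation (the module-name sets below are illustrative, not taken from a real checkpoint):

```python
KNOWN_GROUPS = [
    ["q_proj", "k_proj", "v_proj", "o_proj"],  # LLaMA/Mistral-style
    ["Wqkv", "out_proj"],                      # fused-QKV style
    ["query_key_value", "dense"],              # GPT-NeoX-style
    ["c_attn", "c_proj"],                      # GPT-2-style
]


def pick_target_modules(leaf_names: set) -> list:
    """Return the first known projection group fully present in leaf_names."""
    for group in KNOWN_GROUPS:
        if all(name in leaf_names for name in group):
            return group
    raise ValueError("no known attention projection group found")


llama_names = {"embed_tokens", "q_proj", "k_proj", "v_proj", "o_proj",
               "gate_proj", "up_proj", "down_proj", "lm_head"}
```

The group list is ordered from most to least specific, so a LLaMA-style model resolves to the four `*_proj` attention projections even though `dense`-style names might also be present.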
117
+
118
+ def _choose_attn_impl(cfg: Dict[str, Any]) -> Optional[str]:
119
+ return cfg.get("model", {}).get("attn_implementation", None)
120
+
121
+
122
+ # --------------------------
123
+ # Wandb Integration
124
+ # --------------------------
125
+
126
+ def setup_wandb(cfg: Dict[str, Any], run_dir: Path):
127
+ """Initialize Wandb if enabled in configuration."""
128
+ wandb_cfg = cfg.get("wandb", {})
129
+
130
+ if not wandb_cfg.get("enabled", False):
131
+ print("Wandb logging disabled")
132
+ return None
133
+
134
+ if not WANDB_AVAILABLE:
135
+ print("Wandb not available. Install with: pip install wandb")
136
+ return None
137
+
138
+ # Extract wandb configuration
139
+ project = wandb_cfg.get("project", "sft-training")
140
+ entity = wandb_cfg.get("entity", None)
141
+ name = wandb_cfg.get("name", None)
142
+ tags = wandb_cfg.get("tags", [])
143
+ notes = wandb_cfg.get("notes", None)
144
+
145
+ # Initialize wandb
146
+ try:
147
+ wandb.init(
148
+ project=project,
149
+ entity=entity,
150
+ name=name,
151
+ tags=tags,
152
+ notes=notes,
153
+ dir=str(run_dir),
154
+ config={
155
+ "model": cfg.get("model", {}),
156
+ "data": cfg.get("data", {}),
157
+ "peft": cfg.get("peft", {}),
158
+ "train": cfg.get("train", {}),
159
+ "run_dir": str(run_dir),
160
+ }
161
+ )
162
+ print(f"Wandb initialized: project='{project}', name='{name or 'auto-generated'}'")
163
+ return wandb
164
+ except Exception as e:
165
+ print(f"Failed to initialize Wandb: {e}")
166
+ return None
167
+
168
+
169
+ def finish_wandb():
170
+ """Finish Wandb run if active."""
171
+ if WANDB_AVAILABLE and wandb.run is not None:
172
+ wandb.finish()
173
+ print("Wandb run finished")
174
+
175
+
176
+ # --------------------------
177
+ # JSONL Logger Callback
178
+ # --------------------------
179
+
180
+
181
+ class JsonlLoggerCallback(TrainerCallback):
182
+ def __init__(self, run_dir: Path):
183
+ self.run_dir = run_dir
184
+ self.train_log_path = _ensure_dir(run_dir / "logs") / "train.jsonl"
185
+ self.eval_log_path = _ensure_dir(run_dir / "logs") / "eval.jsonl"
186
+ self.start_time = None
187
+
188
+ def _eta(self, global_step: int, max_steps: int) -> Optional[str]:
189
+ if self.start_time is None or global_step <= 0 or max_steps <= 0:
190
+ return None
191
+ elapsed = time.time() - self.start_time
192
+ sec_per_step = elapsed / global_step
193
+ remaining = max(0, max_steps - global_step) * sec_per_step
194
+ h = int(remaining // 3600)
195
+ m = int((remaining % 3600) // 60)
196
+ s = int(remaining % 60)
197
+ return f"{h:02d}:{m:02d}:{s:02d}"
198
+
199
+ def on_train_begin(self, args, state, control, **kwargs):
200
+ self.start_time = time.time()
201
+
202
+ def on_log(self, args, state, control, logs=None, **kwargs):
203
+ if not logs:
204
+ return
205
+
206
+ max_steps = int(state.max_steps) if getattr(state, "max_steps", None) else 0
207
+ progress_pct = (
208
+ (100.0 * state.global_step / max_steps) if max_steps > 0 else None
209
+ )
210
+ epoch_pct = None
211
+ if (
212
+ state.epoch is not None
213
+ and args.num_train_epochs
214
+ and args.num_train_epochs > 0
215
+ ):
216
+ epoch_pct = 100.0 * (float(state.epoch) / float(args.num_train_epochs))
217
+
218
+ payload = {
219
+ "ts": _now_iso(),
220
+ "event": "train_log",
221
+ "step": int(state.global_step),
222
+ "epoch": round(float(state.epoch), 4) if state.epoch is not None else None,
223
+ "progress_pct": (
224
+ round(progress_pct, 2) if progress_pct is not None else None
225
+ ),
226
+ "epoch_pct": round(epoch_pct, 2) if epoch_pct is not None else None,
227
+ "eta": self._eta(int(state.global_step), max_steps),
228
+ "max_grad_norm": getattr(args, "max_grad_norm", None),
229
+ **logs,
230
+ }
231
+
232
+ with self.train_log_path.open("a", encoding="utf-8") as f:
233
+ f.write(json.dumps(payload, ensure_ascii=False) + "\n")
234
+
235
+ def on_evaluate(self, args, state, control, metrics=None, **kwargs):
236
+ if not metrics:
237
+ return
238
+ eval_loss = metrics.get("eval_loss", None)
239
+ ppl = _safe_exp(eval_loss) if eval_loss is not None else None
240
+
241
+ payload = {
242
+ "ts": _now_iso(),
243
+ "event": "eval",
244
+ "step": int(state.global_step),
245
+ "epoch": float(state.epoch) if state.epoch is not None else None,
246
+ **metrics,
247
+ "perplexity": ppl,
248
+ }
249
+ with self.eval_log_path.open("a", encoding="utf-8") as f:
250
+ f.write(json.dumps(payload, ensure_ascii=False) + "\n")
251
+
252
+
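The callback's ETA estimate assumes a constant seconds-per-step rate over the run. The arithmetic in isolation (the helper name is illustrative):

```python
from typing import Optional


def eta_hms(elapsed_s: float, global_step: int, max_steps: int) -> Optional[str]:
    """Estimate remaining wall time as HH:MM:SS from the average step time."""
    if global_step <= 0 or max_steps <= 0:
        return None
    remaining = max(0, max_steps - global_step) * (elapsed_s / global_step)
    h, rem = divmod(int(remaining), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"
```

For example, 10 steps in 100 seconds with 100 steps left gives an ETA of "00:16:40". Because early steps include warmup and compilation overhead, the estimate typically shrinks as training progresses.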
253
+ # --------------------------
254
+ # Data Pipeline (Instruction Formatting)
255
+ # --------------------------
256
+
257
+
258
+ def format_instruction(
259
+ example: Dict[str, Any], cfg: Dict[str, Any], tokenizer
260
+ ) -> Dict[str, Any]:
261
+ """
262
+ Format instruction data for training.
263
+ Supports multiple formats: chatml, alpaca, custom templates.
264
+ Returns both formatted text and the response start position for loss masking.
265
+ """
266
+ data_cfg = cfg["data"]
267
+ format_type = data_cfg.get("format_type", "chatml")
268
+
269
+ # Get field names from config
270
+ input_field = data_cfg.get("input_field", "input")
271
+ output_field = data_cfg.get("output_field", "output")
272
+ instruction_field = data_cfg.get("instruction_field", "instruction")
273
+
274
+ # Extract text from example
275
+ instruction = example.get(instruction_field, "")
276
+ input_text = example.get(input_field, "")
277
+ output_text = example.get(output_field, "")
278
+
279
+ if format_type == "chatml":
280
+ # ChatML format with special tokens
281
+ system_prompt = data_cfg.get("system_prompt", "You are a helpful assistant.")
282
+
283
+ messages = []
284
+ if system_prompt:
285
+ messages.append({"role": "system", "content": system_prompt})
286
+
287
+ user_content = instruction
288
+ if input_text:
289
+ user_content = f"{instruction}\n\n{input_text}"
290
+ messages.append({"role": "user", "content": user_content})
291
+ messages.append({"role": "assistant", "content": output_text})
292
+
293
+ # Apply chat template
294
+ formatted_text = tokenizer.apply_chat_template(
295
+ messages, tokenize=False, add_generation_prompt=False
296
+ )
297
+
298
+ # Add EOS token if not present
299
+ if tokenizer.eos_token and not formatted_text.endswith(tokenizer.eos_token):
300
+ formatted_text += tokenizer.eos_token
301
+
302
+ # Find where the assistant response starts for loss masking
303
+ # Try multiple possible markers for robustness
304
+ markers = ["<|im_start|>assistant", "<|assistant|>", "Assistant:", "assistant\n"]
305
+ response_start_pos = -1
306
+
307
+ for marker in markers:
308
+ idx = formatted_text.find(marker)
309
+ if idx != -1:
310
+ # Find the newline after the marker
311
+ newline_idx = formatted_text.find("\n", idx)
312
+ if newline_idx != -1:
313
+ response_start_pos = newline_idx + 1
314
+ break
315
+
316
+ # Fallback: find where the actual output starts
317
+ if response_start_pos == -1:
318
+ output_idx = formatted_text.find(output_text)
319
+ if output_idx != -1:
320
+ response_start_pos = output_idx
321
+ else:
322
+ # Last resort: split at last occurrence of newline before end
323
+ response_start_pos = formatted_text.rfind("\n", 0, len(formatted_text) - len(output_text)) + 1
324
+
325
+ elif format_type == "alpaca":
326
+ # Alpaca format
327
+ if input_text:
328
+ prefix = f"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
329
+ else:
330
+ prefix = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n"
331
+
332
+ formatted_text = prefix + output_text
333
+
334
+ # Add EOS token
335
+ if tokenizer.eos_token:
336
+ formatted_text += tokenizer.eos_token
337
+
338
+ # Response starts after the prefix
339
+ response_start_pos = len(prefix)
340
+
341
+ elif format_type == "custom":
342
+ # Custom template from config
343
+ template = data_cfg.get("custom_template", "{instruction}\n{input}\n{output}")
344
+
345
+ # For custom format, use system_prompt as instruction if instruction field is empty
346
+ if not instruction:
347
+ instruction = data_cfg.get("system_prompt", "")
348
+
349
+ # For custom templates, we need to find where {output} starts
350
+ template_parts = template.split("{output}")
351
+ prefix = template_parts[0].format(instruction=instruction, input=input_text)
352
+ formatted_text = prefix + output_text
353
+
354
+ # Add EOS token if not already in template
355
+ if tokenizer.eos_token and not formatted_text.endswith(tokenizer.eos_token):
356
+ formatted_text += tokenizer.eos_token
357
+
358
+ # Response starts after the prefix
359
+ response_start_pos = len(prefix)
360
+ else:
361
+ raise ValueError(f"Unsupported format_type: {format_type}")
362
+
363
+ return {"text": formatted_text, "response_start_pos": response_start_pos}
364
+
365
+
366
+ def build_datasets(cfg: Dict[str, Any], tokenizer) -> Tuple[Any, Any]:
367
+ """
368
+ Build datasets for instruction fine-tuning.
369
+ """
370
+ data_cfg = cfg["data"]
371
+ train_path = data_cfg["train_jsonl"]
372
+ eval_path = data_cfg.get("eval_jsonl", None)
373
+ split_ratio = float(data_cfg.get("eval_split_ratio", 0.0))
374
+ max_length = int(data_cfg.get("max_length", 2048))
375
+ shuffle = bool(data_cfg.get("shuffle", True))
376
+ num_proc = int(data_cfg.get("num_proc", 4))
377
+
378
+ # Ensure tokenizer has pad token
379
+ if tokenizer.pad_token is None:
380
+ tokenizer.pad_token = tokenizer.eos_token
381
+
382
+ # Load datasets
383
+ ds = load_dataset("json", data_files={"train": train_path})
384
+
385
+ if eval_path:
386
+ ds_eval = load_dataset("json", data_files={"eval": eval_path})
387
+ dsd = DatasetDict({"train": ds["train"], "eval": ds_eval["eval"]})
388
+ else:
389
+ if 0.0 < split_ratio < 1.0:
390
+ split = ds["train"].train_test_split(
391
+ test_size=split_ratio, seed=int(cfg["run"].get("seed", 42))
392
+ )
393
+ dsd = DatasetDict({"train": split["train"], "eval": split["test"]})
394
+ else:
395
+ dsd = DatasetDict({"train": ds["train"], "eval": None})
396
+
397
+ # Format instructions and track response start positions
398
+ def format_fn(examples):
399
+ formatted_examples = []
400
+ response_start_positions = []
401
+ for i in range(len(examples[list(examples.keys())[0]])):
402
+ example = {k: examples[k][i] for k in examples.keys()}
403
+ formatted = format_instruction(example, cfg, tokenizer)
404
+ formatted_examples.append(formatted["text"])
405
+ response_start_positions.append(formatted["response_start_pos"])
406
+ return {
407
+ "text": formatted_examples,
408
+ "response_start_pos": response_start_positions
409
+ }
410
+
411
+ formatted_train = dsd["train"].map(
412
+ format_fn,
413
+ batched=True,
414
+ num_proc=num_proc,
415
+ remove_columns=dsd["train"].column_names,
416
+ desc="Formatting train instructions",
417
+ )
418
+
419
+ formatted_eval = None
420
+ if dsd["eval"] is not None:
421
+ formatted_eval = dsd["eval"].map(
422
+ format_fn,
423
+ batched=True,
424
+ num_proc=num_proc,
425
+ remove_columns=dsd["eval"].column_names,
426
+ desc="Formatting eval instructions",
427
+ )
428
+
429
+ # Tokenize and apply loss masking
430
+ def tokenize_and_mask_fn(examples):
431
+ tokenized = tokenizer(
432
+ examples["text"],
433
+ truncation=True,
434
+ padding=False,
435
+ max_length=max_length,
436
+ return_overflowing_tokens=False,
437
+ )
438
+
439
+ # Apply loss masking - CRITICAL for SFT
440
+ labels = []
441
+ attention_masks = []
442
+
443
+ for i in range(len(tokenized["input_ids"])):
444
+ input_ids = tokenized["input_ids"][i]
445
+ response_start_pos = examples["response_start_pos"][i]
446
+
447
+ # Full formatted text; the response begins at response_start_pos
448
+ full_text = examples["text"][i]
450
+
451
+ # Create labels masked by default
452
+ label_ids = [-100] * len(input_ids)
453
+
454
+ # Approximate the token index where the response starts from the
455
+ # character-position ratio. Tokenizing the prefix separately can insert
456
+ # different special tokens; this avoids that, but the boundary is approximate.
457
+ char_ratio = response_start_pos / max(len(full_text), 1)
458
+ response_start_idx = int(len(input_ids) * char_ratio)
459
+
460
+ # Ensure we have valid bounds (at least position 1, at most len-1)
461
+ response_start_idx = max(1, min(response_start_idx, len(input_ids) - 1))
462
+
463
+ # Unmask response tokens (including EOS)
464
+ for j in range(response_start_idx, len(input_ids)):
465
+ label_ids[j] = input_ids[j]
466
+
467
+ # No padding at this stage (padding=False), so the attention mask is all ones
468
+ attention_mask = [1] * len(input_ids)
469
+
470
+ labels.append(label_ids)
471
+ attention_masks.append(attention_mask)
472
+
473
+ tokenized["labels"] = labels
474
+ tokenized["attention_mask"] = attention_masks
475
+ return tokenized
476
+
477
+ tokenized_train = formatted_train.map(
478
+ tokenize_and_mask_fn,
479
+ batched=True,
480
+ num_proc=num_proc,
481
+ desc="Tokenizing and masking train",
482
+ )
483
+
484
+ tokenized_eval = None
485
+ if formatted_eval is not None:
486
+ tokenized_eval = formatted_eval.map(
487
+ tokenize_and_mask_fn,
488
+ batched=True,
489
+ num_proc=num_proc,
490
+ desc="Tokenizing and masking eval",
491
+ )
492
+
493
+ if shuffle:
494
+ tokenized_train = tokenized_train.shuffle(seed=int(cfg["run"].get("seed", 42)))
495
+
496
+ return tokenized_train, tokenized_eval
497
+
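The loss masking inside `tokenize_and_mask_fn` approximates the token index of the response boundary from its character position, then masks everything before it with -100. The core of that masking in isolation (the token lists below are toy values, not real tokenizer output):

```python
def mask_labels(input_ids: list, full_text: str, response_start_pos: int) -> list:
    """Mask instruction tokens with -100 so loss covers only the response.

    The token boundary is derived from the character-position ratio, so it is
    only as accurate as the chars-per-token ratio is uniform across the text.
    """
    char_ratio = response_start_pos / max(len(full_text), 1)
    boundary = int(len(input_ids) * char_ratio)
    boundary = max(1, min(boundary, len(input_ids) - 1))  # keep valid bounds
    return [-100] * boundary + list(input_ids[boundary:])


text = "INSTRUCTION:" + "x" * 38 + "RESPONSE:" + "y" * 41  # 100 chars total
ids = list(range(20))                # toy "tokens", 20 of them
labels = mask_labels(ids, text, 50)  # response starts at character 50
```

Here the response starts halfway through the text, so the first half of the toy token list is masked and the second half is trained on.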
498
+
499
+ # --------------------------
500
+ # Model Loading + PEFT
501
+ # --------------------------
502
+
503
+
504
+ def load_base_model_and_tokenizer(cfg: Dict[str, Any], base_dir: Path):
505
+ model_cfg = cfg["model"]
506
+ trust_remote_code = bool(model_cfg.get("trust_remote_code", True))
507
+ use_fast = bool(model_cfg.get("tokenizer_use_fast", True))
508
+ device_map = model_cfg.get("device_map", "auto")
509
+
510
+ tokenizer = AutoTokenizer.from_pretrained(
511
+ str(base_dir),
512
+ use_fast=use_fast,
513
+ trust_remote_code=trust_remote_code,
514
+ )
515
+ if tokenizer.pad_token is None:
516
+ tokenizer.pad_token = tokenizer.eos_token
517
+
518
+ torch_dtype = _dtype_from_str(model_cfg.get("torch_dtype", "bfloat16"))
519
+ use_4bit = bool(model_cfg.get("use_4bit", False))
520
+
521
+ quant_cfg = None
522
+ if use_4bit:
523
+ quant_cfg = BitsAndBytesConfig(
524
+ load_in_4bit=True,
525
+ bnb_4bit_quant_type=str(model_cfg.get("bnb_4bit_quant_type", "nf4")),
526
+ bnb_4bit_use_double_quant=bool(
527
+ model_cfg.get("bnb_4bit_use_double_quant", True)
528
+ ),
529
+ bnb_4bit_compute_dtype=_dtype_from_str(
530
+ model_cfg.get("bnb_4bit_compute_dtype", "bfloat16")
531
+ ),
532
+ )
533
+
534
+ attn_impl = _choose_attn_impl(cfg)
535
+
536
+ # First check the model type to determine loading strategy
537
+ try:
538
+ config = AutoConfig.from_pretrained(str(base_dir), trust_remote_code=True)
539
+ model_type = config.model_type
540
+ architectures = getattr(config, 'architectures', [])
541
+
542
+ # Handle Mistral3 (multimodal) models
543
+ if model_type == "mistral3" or (architectures and "Mistral3" in architectures[0]):
544
+ print(f"[info] Detected Mistral3 model architecture, loading with specific class")
545
+ from transformers.models.mistral3.modeling_mistral3 import Mistral3ForConditionalGeneration
546
+
547
+ try:
548
+ model = Mistral3ForConditionalGeneration.from_pretrained(
549
+ str(base_dir),
550
+ config=config,
551
+ device_map=device_map,
552
+ low_cpu_mem_usage=True,
553
+ torch_dtype=(torch_dtype if not use_4bit else None),
554
+ quantization_config=quant_cfg,
555
+ attn_implementation=attn_impl,
556
+ )
557
+ except Exception as e:
558
+ if attn_impl is not None:
559
+ print(f"[warn] attn_implementation='{attn_impl}' failed: {e}")
560
+ print("[warn] Falling back to default attention implementation.")
561
+ model = Mistral3ForConditionalGeneration.from_pretrained(
562
+ str(base_dir),
563
+ config=config,
564
+ device_map=device_map,
565
+ low_cpu_mem_usage=True,
566
+ torch_dtype=(torch_dtype if not use_4bit else None),
567
+ quantization_config=quant_cfg,
568
+ )
569
+ else:
570
+ raise e
571
+ else:
572
+ # Standard AutoModelForCausalLM loading for other models
573
+ try:
574
+ model = AutoModelForCausalLM.from_pretrained(
575
+ str(base_dir),
576
+ device_map=device_map,
577
+ trust_remote_code=True,
578
+ low_cpu_mem_usage=True,
579
+ torch_dtype=(torch_dtype if not use_4bit else None),
580
+ quantization_config=quant_cfg,
581
+ attn_implementation=attn_impl,
582
+ )
583
+ except Exception as e:
584
+ if attn_impl is not None:
585
+ print(f"[warn] attn_implementation='{attn_impl}' failed: {e}")
586
+ print("[warn] Falling back to default attention implementation.")
587
+ model = AutoModelForCausalLM.from_pretrained(
588
+ str(base_dir),
589
+ device_map=device_map,
590
+ trust_remote_code=True,
591
+ low_cpu_mem_usage=True,
592
+ torch_dtype=(torch_dtype if not use_4bit else None),
593
+ quantization_config=quant_cfg,
594
+ )
595
+ else:
596
+ raise e
597
+ except Exception as e:
598
+ print(f"[error] Failed to load model: {e}")
599
+ raise e
600
+
601
+ # Ensure all parameters are off meta device
602
+ print("[info] Ensuring all parameters are materialized...")
603
+ meta_params = []
604
+ for name, param in model.named_parameters():
605
+ if param.device.type == 'meta':
606
+ meta_params.append(name)
607
+
608
+ if meta_params:
609
+ print(f"[warn] Found {len(meta_params)} parameters on meta device")
610
+ # For multimodal models, freeze vision components if doing text-only training
611
+ if hasattr(model, 'vision_tower'):
612
        print("[info] Freezing vision tower for text-only training")
        for param in model.vision_tower.parameters():
            param.requires_grad = False

    return model, tokenizer


def apply_peft(cfg: Dict[str, Any], model):
    peft_cfg = cfg["peft"]
    model_cfg = cfg["model"]
    tr_cfg = cfg["train"]

    if not bool(peft_cfg.get("enabled", True)):
        return model, None

    use_4bit = bool(model_cfg.get("use_4bit", False))
    gradient_checkpointing = bool(tr_cfg.get("gradient_checkpointing", True))

    # For multimodal models, ensure the vision tower doesn't use gradient checkpointing
    if gradient_checkpointing and hasattr(model, "gradient_checkpointing_enable"):
        if hasattr(model, "vision_tower"):
            print("[info] Disabling gradient checkpointing for vision tower")
            # Only enable gradient checkpointing on the language model
            if hasattr(model, "language_model"):
                model.language_model.gradient_checkpointing_enable()
            elif hasattr(model, "lm_head"):
                model.gradient_checkpointing_enable()
        else:
            model.gradient_checkpointing_enable()

    if hasattr(model, "config"):
        model.config.use_cache = False

    if use_4bit:
        model = prepare_model_for_kbit_training(
            model,
            use_gradient_checkpointing=gradient_checkpointing,
        )

    target_modules = peft_cfg.get("target_modules", "auto")
    if target_modules == "auto":
        target_modules = _infer_target_modules(model)

    # For multimodal models, only target language-model modules
    if hasattr(model, "vision_tower") and isinstance(target_modules, list):
        print("[info] Filtering target modules to exclude vision tower")
        target_modules = [m for m in target_modules if "vision" not in m.lower()]
        print(f"[info] LoRA target modules: {target_modules}")

    lora_config = LoraConfig(
        r=int(peft_cfg.get("r", 16)),
        lora_alpha=int(peft_cfg.get("lora_alpha", 32)),
        lora_dropout=float(peft_cfg.get("lora_dropout", 0.05)),
        bias=str(peft_cfg.get("bias", "none")),
        task_type="CAUSAL_LM",
        target_modules=target_modules,
        modules_to_save=None,  # don't make any additional modules trainable
    )
    model = get_peft_model(model, lora_config)
    return model, lora_config

# --------------------------
# Merge Logic
# --------------------------


def merge_adapter(
    cfg: Dict[str, Any], base_dir: Path, adapter_dir: Path, final_dir: Path
):
    print(f"--- Merge: {adapter_dir} + {base_dir} -> {final_dir} ---")

    model_cfg = cfg["model"]
    merge_cfg = cfg.get("merge", {})
    trust_remote_code = bool(model_cfg.get("trust_remote_code", True))

    merged_dtype = _dtype_from_str(merge_cfg.get("merged_dtype", "float16"))
    max_shard_size = str(merge_cfg.get("max_shard_size", "2GB"))

    base = AutoModelForCausalLM.from_pretrained(
        str(base_dir),
        torch_dtype=merged_dtype,
        device_map="cpu",
        low_cpu_mem_usage=True,
        trust_remote_code=trust_remote_code,
    )

    merged = PeftModel.from_pretrained(base, str(adapter_dir))
    merged = merged.merge_and_unload()

    _ensure_dir(final_dir)
    merged.save_pretrained(
        str(final_dir), safe_serialization=True, max_shard_size=max_shard_size
    )

    tok = AutoTokenizer.from_pretrained(
        str(base_dir), trust_remote_code=trust_remote_code
    )
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    tok.save_pretrained(str(final_dir))

    print("--- Merge complete ---")


# --------------------------
# Main
# --------------------------


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--config", required=True, help="Path to YAML config")
    ap.add_argument(
        "--merge-only", action="store_true", help="Skip training, just merge adapter"
    )
    args = ap.parse_args()

    with open(args.config, "r", encoding="utf-8") as f:
        cfg = yaml.safe_load(f)

    run_dir = _ensure_dir(Path(cfg["run"]["run_dir"]))
    _ensure_dir(run_dir / "logs")

    with (run_dir / "config_resolved.yaml").open("w", encoding="utf-8") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)

    model_cfg = cfg["model"]
    repo_id = str(model_cfg["repo_id"]).strip()
    repo_path = Path(repo_id)

    # ✅ Local model path -> load directly; no download
    if repo_path.exists() and repo_path.is_dir() and _looks_like_model_dir(repo_path):
        base_dir = repo_path
        print(f"Using local model at: {base_dir}")
    elif repo_path.exists() and repo_path.is_dir():
        raise ValueError(
            f"model.repo_id points to a directory, but it doesn't look like a HF model dir: {repo_path}"
        )
    else:
        # HF repo_id -> download into run_dir/base_local_dir
        base_dir = _ensure_dir(run_dir / model_cfg.get("base_local_dir", "base_model"))
        if not _looks_like_model_dir(base_dir):
            print(f"Base model not found at {base_dir}, downloading from {repo_id} ...")
            snapshot_download(
                repo_id=repo_id,
                revision=model_cfg.get("revision", None),
                local_dir=str(base_dir),
                local_dir_use_symlinks=False,
            )

    ckpt_dir = _ensure_dir(run_dir / "checkpoints")
    best_adapter_dir = _ensure_dir(run_dir / "best_adapter")

    merge_cfg = cfg.get("merge", {}) or {}
    if merge_cfg.get("output_dir"):
        od = Path(str(merge_cfg["output_dir"]))
        final_dir = od if od.is_absolute() else (run_dir / od)
    else:
        final_dir = run_dir / "final_model"

    # Merge-only
    if args.merge_only:
        if not _looks_like_model_dir(best_adapter_dir):
            raise FileNotFoundError(f"Adapter not found at {best_adapter_dir}")
        merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
        return

    # Initialize Wandb
    wandb_run = setup_wandb(cfg, run_dir)

    # Training
    set_seed(int(cfg["run"].get("seed", 42)))

    model, tokenizer = load_base_model_and_tokenizer(cfg, base_dir)
    model, _ = apply_peft(cfg, model)

    train_ds, eval_ds = build_datasets(cfg, tokenizer)

    tr_cfg = cfg["train"]

    dtype = _dtype_from_str(model_cfg.get("torch_dtype", "bfloat16"))
    use_fp16 = dtype == torch.float16
    use_bf16 = dtype == torch.bfloat16

    max_steps = int(tr_cfg.get("max_steps", 0))
    num_train_epochs = float(tr_cfg.get("num_train_epochs", 1))

    # --- Dynamic evaluation strategy parameter handling ---
    ta_params = inspect.signature(TrainingArguments.__init__).parameters
    eval_key = (
        "eval_strategy" if "eval_strategy" in ta_params else "evaluation_strategy"
    )

    # Set up reporting based on wandb availability
    report_to = []
    if wandb_run is not None:
        report_to.append("wandb")

    ta_kwargs = dict(
        output_dir=str(ckpt_dir),
        max_steps=max_steps if max_steps > 0 else -1,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=int(tr_cfg.get("per_device_train_batch_size", 1)),
        per_device_eval_batch_size=int(
            tr_cfg.get(
                "per_device_eval_batch_size",
                tr_cfg.get("per_device_train_batch_size", 1),
            )
        ),
        gradient_accumulation_steps=int(tr_cfg.get("gradient_accumulation_steps", 1)),
        learning_rate=float(tr_cfg.get("learning_rate", 2e-5)),
        weight_decay=float(tr_cfg.get("weight_decay", 0.0)),
        warmup_ratio=float(tr_cfg.get("warmup_ratio", 0.0)),
        lr_scheduler_type=str(tr_cfg.get("lr_scheduler_type", "cosine")),
        optim=str(
            tr_cfg.get(
                "optim",
                (
                    "paged_adamw_8bit"
                    if bool(model_cfg.get("use_4bit", False))
                    else "adamw_torch"
                ),
            )
        ),
        max_grad_norm=float(tr_cfg.get("max_grad_norm", 1.0)),
        logging_steps=int(tr_cfg.get("logging_steps", 10)),
        save_strategy=str(tr_cfg.get("save_strategy", "steps")),
        save_steps=int(tr_cfg.get("save_steps", 200)),
        save_total_limit=int(tr_cfg.get("save_total_limit", 3)),
        eval_steps=int(tr_cfg.get("eval_steps", 200)),
        load_best_model_at_end=(
            bool(tr_cfg.get("load_best_model_at_end", True))
            if eval_ds is not None
            else False
        ),
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        fp16=use_fp16,
        bf16=use_bf16,
        report_to=report_to,
        remove_unused_columns=False,
    )

    # Set the correct argument name for this transformers version
    ta_kwargs[eval_key] = str(
        tr_cfg.get("evaluation_strategy", "steps" if eval_ds is not None else "no")
    )

    training_args = TrainingArguments(**ta_kwargs)

    # Set up callbacks
    callbacks = [JsonlLoggerCallback(run_dir)]

    # Add early stopping callback if enabled
    early_stopping_cfg = tr_cfg.get("early_stopping", {})
    if early_stopping_cfg.get("enabled", False) and eval_ds is not None:
        early_stopping_callback = EarlyStoppingCallback(
            early_stopping_patience=int(early_stopping_cfg.get("patience", 3)),
            early_stopping_threshold=float(early_stopping_cfg.get("min_delta", 0.001)),
        )
        callbacks.append(early_stopping_callback)
        print(
            f"Early stopping enabled: patience={early_stopping_cfg.get('patience', 3)}, "
            f"min_delta={early_stopping_cfg.get('min_delta', 0.001)}"
        )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        data_collator=default_data_collator,
        callbacks=callbacks,
    )

    # Resume
    resume_from = tr_cfg.get("resume_from_checkpoint", None)
    if resume_from == "auto":
        last = get_last_checkpoint(str(ckpt_dir))
        resume_from = last if last else None
    if resume_from:
        print(f"Resuming from {resume_from}")

    print("Starting continued pretraining...")
    trainer.train(resume_from_checkpoint=resume_from)

    trainer.save_model(str(best_adapter_dir))
    print(f"Saved best adapter -> {best_adapter_dir}")

    if eval_ds is not None:
        metrics = trainer.evaluate()
        eval_loss = metrics.get("eval_loss", None)
        metrics["perplexity"] = _safe_exp(eval_loss) if eval_loss is not None else None
        with (run_dir / "eval_final.json").open("w", encoding="utf-8") as f:
            json.dump(metrics, f, indent=2)
        print(f"Final eval_loss={eval_loss}, ppl={metrics['perplexity']}")

    if bool(cfg.get("merge", {}).get("enabled", False)):
        del trainer, model
        torch.cuda.empty_cache()
        merge_adapter(cfg, base_dir, best_adapter_dir, final_dir)
    else:
        print("Merge disabled. Run with --merge-only later if needed.")

    # Finish Wandb run
    finish_wandb()


if __name__ == "__main__":
    main()
trainer-kit/documentation.md ADDED
@@ -0,0 +1,296 @@
# CPT Training Different Modules Guide

## Overview

By default, the CPT (Continued Pre-Training) configuration in `/workspace/Trainer-kit/CPT/config.yaml` trains only **attention projection layers** using LoRA adapters. This guide explains how to modify the configuration to train other modules.

## Current Default Configuration

```yaml
peft:
  enabled: true
  target_modules: "auto"
```

When `target_modules: "auto"` is set, the script automatically detects and trains these attention layers:
- `q_proj` - Query projection
- `k_proj` - Key projection
- `v_proj` - Value projection
- `o_proj` - Output projection
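
A rough sense of what this default costs: each linear layer of shape `(d_out, d_in)` that LoRA wraps adds `r * (d_in + d_out)` trainable parameters. The sketch below is a back-of-the-envelope count; the hidden size, layer count, and rank are illustrative assumptions, not values read from any config.

```python
# Back-of-the-envelope LoRA parameter count for attention-only targeting.
# All concrete numbers here are assumed for illustration.
r = 16               # assumed LoRA rank
hidden = 5120        # assumed model hidden size
num_layers = 48      # assumed number of transformer layers

# Treat q/k/v/o as hidden x hidden linears (ignoring grouped-query head splits),
# so each of the four adds r * (hidden + hidden) parameters.
per_layer = 4 * r * (hidden + hidden)
total = per_layer * num_layers
print(f"{total:,} trainable LoRA parameters")  # -> 31,457,280 trainable LoRA parameters
```

Even with all four attention projections wrapped, this is a small fraction of the base model's parameters, which is why the default configuration is cheap to train.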

## How to Train Other Modules

### Method 1: Explicit Target Modules

Replace `"auto"` with a list of the specific module names you want to train:

```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.down_proj"   # Add MLP down projection
    - "mlp.gate_proj"   # Add MLP gate projection
    - "mlp.up_proj"     # Add MLP up projection
```

### Method 2: Custom Module Lists

For different model architectures, here are common modules you can train:

#### LLaMA-style Models
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```

#### Qwen-style Models
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```

#### Mixtral-style (MoE) Models
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.experts.*.w1"   # Expert layer 1
    - "mlp.experts.*.w2"   # Expert layer 2
    - "mlp.experts.*.w3"   # Expert layer 3
```

## Module Types You Can Train

### 1. Attention Layers
- `q_proj` - Query projections
- `k_proj` - Key projections
- `v_proj` - Value projections
- `o_proj` - Output projections
- `qkv_proj` - Combined QKV projection (in some models)
- `c_attn` - Attention projection in older GPT-style models

### 2. MLP/Feed-Forward Layers
- `mlp.gate_proj` - Gate projection
- `mlp.up_proj` - Up projection
- `mlp.down_proj` - Down projection
- `mlp.fc1` - First layer
- `mlp.fc2` - Second layer
- `w1`, `w2`, `w3` - Alternative naming

### 3. Embedding Layers
```yaml
peft:
  enabled: true
  target_modules:
    - "model.embed_tokens"   # Token embeddings
    - "lm_head"              # Language model head
```

### 4. Normalization Layers
Note that LoRA only wraps linear and embedding modules, so listing normalization layers in `target_modules` will typically fail; they generally need to be made fully trainable instead (e.g. via PEFT's `modules_to_save`, which this script currently pins to `None`).
```yaml
peft:
  enabled: true
  target_modules:
    - "input_layernorm"            # Input normalization
    - "post_attention_layernorm"   # Post-attention norm
    - "final_layernorm"            # Final normalization
```

### 5. MoE (Mixture of Experts) Layers
```yaml
peft:
  enabled: true
  target_modules:
    - "mlp.experts.*.w1"   # Expert layer 1
    - "mlp.experts.*.w2"   # Expert layer 2
    - "mlp.experts.*.w3"   # Expert layer 3
    - "mlp.gate"           # Expert routing gate
```

## Advanced Configuration Examples

### Train Multiple Layer Types
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"
    - "post_attention_layernorm"
```

### Conservative Approach (Only MLPs)
```yaml
peft:
  enabled: true
  target_modules:
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```

### Comprehensive Approach (All Main Layers)
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"
    - "post_attention_layernorm"
```

## How to Find Module Names for Your Model

### Method 1: Automatic Detection
Run the script once with `target_modules: "auto"`; it will log which modules it found:

```
Using auto-inferred target_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj']
```

### Method 2: Manual Inspection
Inspect your model's structure directly:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("/workspace/Models/YourModel")

# Print all module names
for name, module in model.named_modules():
    print(name)
```

### Method 3: Use PEFT's Built-in Function
The script includes an `_infer_target_modules()` helper that can identify the available modules.
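
The helper itself is not reproduced here, but the idea behind this kind of inference can be sketched as a suffix scan over the names from `named_modules()`. The candidate set and function below are illustrative, not the script's actual implementation:

```python
# Illustrative sketch of "auto" target-module inference: keep the trailing
# name component of modules that look like attention/MLP projections.
CANDIDATES = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj", "qkv_proj", "c_attn",
}

def infer_target_modules(module_names):
    found = set()
    for name in module_names:
        # "model.layers.0.self_attn.q_proj" -> "q_proj"
        suffix = name.rsplit(".", 1)[-1]
        if suffix in CANDIDATES:
            found.add(suffix)
    return sorted(found)

# In practice the names would come from: [n for n, _ in model.named_modules()]
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
]
print(infer_target_modules(names))  # -> ['k_proj', 'o_proj', 'q_proj', 'v_proj']
```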

## Considerations

### 1. Memory Usage
- **More modules = more memory**: Training additional layers requires more GPU memory
- **Monitor VRAM usage**: Use `nvidia-smi` to monitor memory consumption
- **Adjust batch size**: You may need to reduce `per_device_train_batch_size`

### 2. Training Time
- **More modules = longer training**: Each additional layer increases computation time
- **Learning rate adjustments**: You might need to reduce `learning_rate` when training more layers

### 3. Performance Trade-offs
- **Attention only**: Fast training, good for language understanding
- **MLP only**: Fast training, good for knowledge storage
- **Both attention + MLP**: Slower but potentially better performance
- **All layers**: Slowest but most comprehensive adaptation

### 4. Model Architecture Differences
Different model families use different module naming conventions:
- **LLaMA**: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- **Qwen**: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- **Gemma**: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- **Mixtral**: `mlp.experts.*.w1`, etc.

## Best Practices

### 1. Start Conservative
Begin with just the attention layers, then gradually add more modules if needed:
```yaml
# Phase 1: Start here
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]

# Phase 2: Add MLPs
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.down_proj"]

# Phase 3: Add more if needed
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]
```

### 2. Monitor Overfitting
- Use the evaluation split to monitor performance
- Reduce `learning_rate` if overfitting occurs
- Consider raising `lora_dropout` to reduce overfitting

### 3. Resource Management
- Start with a small LoRA rank (`r: 16`) if training many modules
- Increase `gradient_accumulation_steps` if you reduce the batch size
- Monitor GPU memory usage throughout training
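
The interaction between the last two bullets can be checked with simple arithmetic: the effective batch size the optimizer sees is the product of the per-device batch size, the accumulation steps, and the GPU count (the numbers below are illustrative):

```python
# Effective batch size = per-device batch * gradient accumulation steps * GPU count.
# Illustrative values for a single-GPU, memory-constrained run.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_gpus = 1

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # -> 16
```

If you halve the per-device batch size to save memory, doubling `gradient_accumulation_steps` keeps the effective batch size (and therefore the optimization behavior) roughly unchanged.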

### 4. Model-Specific Tuning
Different models may benefit from different module combinations:
- **Code models**: Focus on attention + MLP layers
- **Chat models**: Attention layers are most important
- **Reasoning models**: All layers might be beneficial

## Example: Training Custom Modules

### Complete Configuration Example
```yaml
model:
  repo_id: "/workspace/Models/Devstral-Small-2-24B-Instruct-2512"
  torch_dtype: "bfloat16"

peft:
  enabled: true
  r: 64
  lora_alpha: 128
  lora_dropout: 0.05
  bias: "none"
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"

train:
  num_train_epochs: 2
  learning_rate: 1e-5   # Reduced due to more modules
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
```

This configuration will train:
- All attention projection layers
- All MLP projection layers
- The input normalization layers

It also uses a reduced learning rate to accommodate the additional trainable parameters.

Remember to always test with a small number of steps first to ensure your configuration works correctly before running full training.
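
One way to run that smoke test, using `train` keys this script already reads, is to temporarily cap the run (the specific values here are just a suggestion):

```yaml
train:
  max_steps: 20        # a value > 0 overrides num_train_epochs
  logging_steps: 1
  save_steps: 10
  eval_steps: 10
```

Once logging, checkpointing, and evaluation all look sane, remove `max_steps` (or set it back to 0) and launch the full run.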