DPO Training - Quick Start Guide 🚀

Status: ✅ Ready for Training

All critical code review fixes have been applied and verified. The DPO trainer is production-ready.

Prerequisites Checklist

  • Base model available: Models/Qwen2.5-Coder-14B-CPT-SFT
  • Training data generated: dpo_pairs_generated.jsonl (7,612 pairs)
  • Config file updated: config_dpo.yaml
  • Virtual environment activated: llm_finetuning_env
  • WandB logged in: API key configured
  • All critical fixes applied and verified

Start Training

Option 1: Standard Training (Recommended)

cd /workspace/trainer-kit/DPO-14b
python run_dpo.py --config config_dpo.yaml

Option 2: Background Training (for long runs)

cd /workspace/trainer-kit/DPO-14b
nohup python run_dpo.py --config config_dpo.yaml > training.log 2>&1 &

# Monitor progress
tail -f training.log

# Or check WandB dashboard

Option 3: Merge Only (if already trained)

python run_dpo.py --config config_dpo.yaml --merge-only

What to Expect

Training Configuration

  • Base Model: Qwen2.5-Coder-14B-CPT-SFT (14B parameters)
  • Method: Direct Preference Optimization (DPO)
  • Loss: Sigmoid loss with beta=0.1
  • Data: 7,612 preference pairs
    • Train: 6,850 examples
    • Eval: 762 examples
  • Epochs: 3
  • Batch Size: Effective batch size = 8 (1 per device × 8 gradient accumulation steps)
  • Learning Rate: 5e-5 with cosine schedule
  • LoRA Config: r=64, alpha=16, dropout=0.1
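
These numbers pin down the optimizer-step budget for the run; a quick sanity check (plain arithmetic, assuming a single GPU and the splits listed above):

```python
# Step math implied by the configuration above (single-GPU assumption).
train_examples = 6850        # train split size from this guide
per_device_batch = 1
grad_accum = 8
epochs = 3

effective_batch = per_device_batch * grad_accum       # effective batch size
steps_per_epoch = train_examples // effective_batch   # optimizer steps per epoch
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 8 856 2568
```

With evaluation every 100 steps and checkpoints every 500, that works out to roughly 8 evals and 1 checkpoint per epoch.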

Training Metrics to Monitor

  1. Loss Metrics

    • loss: Overall DPO loss (should decrease)
    • eval_loss: Validation loss (monitor for overfitting)
  2. Reward Metrics (Most Important)

    • rewards/chosen: Reward for chosen (preferred) responses
    • rewards/rejected: Reward for rejected responses
    • Gap: rewards/chosen should be > rewards/rejected
    • rewards/accuracies: % of times chosen > rejected (target: >50%, ideally >70%)
    • rewards/margins: Average difference (chosen - rejected)
  3. Training Dynamics

    • learning_rate: Should decay with cosine schedule
    • grad_norm: Should be < max_grad_norm (1.0)
    • epoch: Progress through dataset
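
To make the reward metrics concrete, here is a minimal scalar sketch of the sigmoid DPO loss and the logged reward terms (simplified for illustration; the names like `policy_chosen_logp` are not identifiers from `run_dpo.py`):

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example sigmoid DPO loss plus the reward terms logged above.
    Rewards are beta-scaled log-prob ratios of policy vs. reference model."""
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
    return loss, reward_chosen, reward_rejected, margin

# At step 0 the policy equals the reference, so margin = 0 and
# loss = ln(2) ≈ 0.693 -- which is why training loss starts near 0.69.
loss, rc, rr, m = dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0)
print(round(loss, 3))  # 0.693
```

As the policy learns to prefer chosen over rejected responses, the margin grows and the loss drops below ln(2).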

Expected Timeline

  • Setup: ~2-5 minutes (model loading, data formatting)
  • Training: ~2-4 hours per epoch (depends on GPU)
    • 3 epochs total
    • Evaluation every 100 steps
    • Checkpoints saved every 500 steps
  • Merging: ~5-10 minutes (LoRA adapter → full model)
  • Total: ~6-12 hours for complete run

Output Structure

runs/dpo_run_14b_v1/
├── logs/
│   ├── train.jsonl            # Training logs (step-by-step)
│   └── eval.jsonl             # Evaluation logs
├── checkpoints/
│   ├── checkpoint-500/        # Periodic checkpoints
│   ├── checkpoint-1000/
│   └── checkpoint-best/       # Best model by eval_loss
├── adapter_14b_dpo_lora/      # Final LoRA adapter
└── merged_14b_dpo_lora/       # Merged full model (if merge enabled)
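
When scripting against this layout, checkpoint selection can be automated; a small helper (a sketch assuming the directory names above; `best_checkpoint` is not part of the kit):

```python
from pathlib import Path

def best_checkpoint(run_dir="runs/dpo_run_14b_v1"):
    """Prefer checkpoint-best if present, else the highest-numbered checkpoint."""
    ckpts = Path(run_dir) / "checkpoints"
    best = ckpts / "checkpoint-best"
    if best.is_dir():
        return best
    numbered = sorted(
        (p for p in ckpts.glob("checkpoint-*") if p.name.split("-")[1].isdigit()),
        key=lambda p: int(p.name.split("-")[1]),
    )
    return numbered[-1] if numbered else None
```

This is handy for resuming training or pointing evaluation at the latest state without hard-coding step numbers.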

Monitoring Progress

1. Real-time Logs

# Terminal output shows progress
cd /workspace/trainer-kit/DPO-14b
tail -f runs/dpo_run_14b_v1/logs/train.jsonl | jq '.'
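
The same JSONL log can be read programmatically; a minimal sketch (assumes each log line is a single JSON object, as the tail/jq pipeline above implies):

```python
import json

def latest_metrics(log_path="runs/dpo_run_14b_v1/logs/train.jsonl"):
    """Return the metrics dict from the last non-empty line of a JSONL log."""
    last = None
    with open(log_path) as f:
        for line in f:
            line = line.strip()
            if line:
                last = json.loads(line)
    return last

# Example: latest_metrics() -> current step, loss, rewards/accuracies, etc.
```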

2. WandB Dashboard

  • Project: qwen-14b-dpo
  • Run name: dpo_qwen14b_[timestamp]
  • URL: Will be printed at training start
  • Metrics refreshed every logging step (default: 10 steps)

3. Check GPU Usage

# Monitor GPU memory and utilization
watch -n 1 nvidia-smi

4. Quick Status Check

# Count checkpoints
ls -l runs/dpo_run_14b_v1/checkpoints/

# Check latest log
tail runs/dpo_run_14b_v1/logs/train.jsonl

Troubleshooting

Out of Memory (OOM)

# In config_dpo.yaml, reduce memory pressure. Note that
# per_device_train_batch_size is already minimal, and lowering
# gradient_accumulation_steps mainly shrinks the effective batch
# rather than per-step memory:
training:
  per_device_train_batch_size: 1  # Already minimal
  gradient_accumulation_steps: 4  # Reduced from 8 (smaller effective batch)

# Or confirm gradient checkpointing (already enabled):
model:
  gradient_checkpointing: true

Training Divergence (Loss → NaN)

  • Check learning rate: Reduce from 5e-5 to 2e-5
  • Increase beta: Change from 0.1 to 0.2 (more conservative)
  • Check max_grad_norm: Ensure = 1.0 (clip gradients)

Slow Training

  • Verify GPU utilization: Should be >80%
  • Check num_proc in data loading: Default = 4
  • Ensure bf16/fp16 enabled (already configured)

Data Formatting Errors

  • Check logs for "Failed to format example" warnings
  • Verify data format: {"prompt": "...", "chosen": "...", "rejected": "..."}
  • Run validation: Already happens automatically
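
For a quick manual check before training, a standalone validator along these lines can be used (a sketch; the trainer's built-in validation is separate):

```python
import json

REQUIRED = ("prompt", "chosen", "rejected")

def validate_pairs(path):
    """Count well-formed preference pairs; return (count, bad line numbers)."""
    good, bad = 0, []
    with open(path) as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                ex = json.loads(line)
            except json.JSONDecodeError:
                bad.append(i)
                continue
            if all(isinstance(ex.get(k), str) and ex[k] for k in REQUIRED):
                good += 1
            else:
                bad.append(i)
    return good, bad
```

Running this over dpo_pairs_generated.jsonl should report 7,612 good pairs and no bad lines.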

WandB Connection Issues

# Re-login to WandB (use your own API key; never commit a real key)
wandb login <your-api-key>

# Or disable WandB in config:
wandb:
  enabled: false

Success Criteria

Training is successful if:

  1. ✅ Training Completes: All 3 epochs finish without crashes
  2. ✅ Loss Decreases: Training loss drops from ~0.69 to <0.50
  3. ✅ Reward Gap: rewards/chosen consistently > rewards/rejected
  4. ✅ Accuracy: rewards/accuracies > 60% (ideally 70-80%)
  5. ✅ No Overfitting: Eval loss doesn't diverge from train loss
  6. ✅ Model Saves: Final checkpoint and merged model created
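
Criteria 2-5 can be checked mechanically from the final logged metrics; a rough helper (thresholds taken from the list above, except the 0.10 eval/train gap, which is an assumed heuristic, not from this guide):

```python
def check_success(final_train_loss, final_eval_loss, accuracy, mean_margin):
    """Rough pass/fail for criteria 2-5 above.
    The 0.10 eval-train gap for 'no overfitting' is an assumed heuristic."""
    checks = {
        "loss_decreased": final_train_loss < 0.50,
        "reward_gap": mean_margin > 0,
        "accuracy": accuracy > 0.60,
        "no_overfitting": (final_eval_loss - final_train_loss) < 0.10,
    }
    return all(checks.values()), checks

ok, detail = check_success(0.42, 0.47, 0.72, 0.15)
print(ok)  # True
```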

After Training

1. Evaluate Model

# Test on held-out data
python evaluate_dpo_model.py \
  --model runs/dpo_run_14b_v1/merged_14b_dpo_lora \
  --test_data ../task2file/sft_qwen_14B/test.jsonl

2. Run Inference

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "runs/dpo_run_14b_v1/merged_14b_dpo_lora",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("runs/dpo_run_14b_v1/merged_14b_dpo_lora")

# Generate responses
messages = [{"role": "user", "content": "Write a Python function to sort a list"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens (skip the echoed prompt)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

3. Compare with Base Model

# Generate responses from both models on same prompts
# Compare quality, helpfulness, safety

4. Proceed to GRPO (Optional)

# If DPO results are good, train GRPO on top
cd ../GRPO-14b
# Update config to use DPO model as base
python run_grpo.py --config config_grpo.yaml

Files Reference

  • run_dpo.py - Main training script (954 lines, all fixes applied)
  • config_dpo.yaml - Training configuration
  • dpo_pairs_generated.jsonl - Training data (7,612 pairs)
  • f1_score_utils.py - F1 scoring utilities
  • create_synthetic_pairs.py - Data generation script
  • FIXES_APPLIED.md - Documentation of all fixes
  • test_fixes.py - Verification script
  • README.md - Detailed documentation

Support

For issues:

  1. Check logs: runs/dpo_run_14b_v1/logs/train.jsonl
  2. Review errors: Look for "ERROR" or "WARNING" in output
  3. Verify fixes: Run python test_fixes.py
  4. Check documentation: FIXES_APPLIED.md, README.md

Status: ✅ All systems ready
Ready to Start: YES

Command to run:

cd /workspace/trainer-kit/DPO-14b && python run_dpo.py --config config_dpo.yaml