# DPO Training - Quick Start Guide 🚀

**Status:** ✅ Ready for Training

All critical code review fixes have been applied and verified. The DPO trainer is production-ready.
## Prerequisites Checklist

- Base model available: `Models/Qwen2.5-Coder-14B-CPT-SFT`
- Training data generated: `dpo_pairs_generated.jsonl` (7,612 pairs)
- Config file updated: `config_dpo.yaml`
- Virtual environment activated: `llm_finetuning_env`
- WandB logged in: API key configured
- All critical fixes applied and verified
## Start Training

### Option 1: Standard Training (Recommended)

```bash
cd /workspace/trainer-kit/DPO-14b
python run_dpo.py --config config_dpo.yaml
```

### Option 2: Background Training (for long runs)

```bash
cd /workspace/trainer-kit/DPO-14b
nohup python run_dpo.py --config config_dpo.yaml > training.log 2>&1 &

# Monitor progress
tail -f training.log
# Or check the WandB dashboard
```

### Option 3: Merge Only (if already trained)

```bash
python run_dpo.py --config config_dpo.yaml --merge-only
```
## What to Expect

### Training Configuration

- **Base Model:** Qwen2.5-Coder-14B-CPT-SFT (14B parameters)
- **Method:** Direct Preference Optimization (DPO)
- **Loss:** Sigmoid loss with beta=0.1
- **Data:** 7,612 preference pairs
  - Train: 6,850 examples
  - Eval: 762 examples
- **Duration:** ~3 epochs
- **Batch Size:** effective batch size = 8 (1 per device × 8 gradient accumulation steps)
- **Learning Rate:** 5e-5 with cosine schedule
- **LoRA Config:** r=64, alpha=16, dropout=0.1
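The sigmoid DPO loss named above can be sketched in plain Python for a single preference pair. This is a minimal illustration of the formula, not the trainer's actual implementation; the log-probability arguments are hypothetical values you would obtain from the policy and the frozen reference model.

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair.

    Each argument is the summed log-probability of a response under
    the policy or the frozen reference model (hypothetical inputs).
    """
    # Implicit rewards: beta-scaled log-prob ratios vs. the reference
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)); at margin = 0 this is log(2) ~ 0.693,
    # which is why an untrained run starts near loss ~0.69
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss, chosen_reward, rejected_reward
```

At the start of training the policy equals the reference, so both rewards are zero and the loss sits at log(2); it falls as the policy separates chosen from rejected responses.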
## Training Metrics to Monitor

### Loss Metrics

- `loss`: overall DPO loss (should decrease)
- `eval_loss`: validation loss (monitor for overfitting)

### Reward Metrics (Most Important)

- `rewards/chosen`: reward for chosen (preferred) responses
- `rewards/rejected`: reward for rejected responses
- Gap: `rewards/chosen` should be > `rewards/rejected`
- `rewards/accuracies`: % of pairs where chosen > rejected (target: >50%, ideally >70%)
- `rewards/margins`: average difference (chosen - rejected)

### Training Dynamics

- `learning_rate`: should decay with the cosine schedule
- `grad_norm`: should stay below `max_grad_norm` (1.0)
- `epoch`: progress through the dataset
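As a rough sketch of how the reward metrics above are aggregated over a batch (a hypothetical helper, not part of `run_dpo.py`; it assumes you have per-example chosen and rejected reward lists):

```python
def reward_metrics(chosen_rewards, rejected_rewards):
    """Aggregate batch-level DPO reward metrics, mirroring the
    rewards/accuracies and rewards/margins a trainer typically logs."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    # Fraction of pairs where the chosen response out-scores the rejected one
    accuracies = sum(c > r for c, r in pairs) / len(pairs)
    # Mean (chosen - rejected) gap
    margins = sum(c - r for c, r in pairs) / len(pairs)
    return {"rewards/accuracies": accuracies, "rewards/margins": margins}
```

A healthy run shows `rewards/accuracies` climbing well past 0.5 and `rewards/margins` growing steadily positive.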
### Expected Timeline

- **Setup:** ~2-5 minutes (model loading, data formatting)
- **Training:** ~2-4 hours per epoch (depends on GPU)
  - 3 epochs total
  - Evaluation every 100 steps
  - Checkpoints saved every 500 steps
- **Merging:** ~5-10 minutes (LoRA adapter → full model)
- **Total:** ~6-12 hours for a complete run
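A back-of-envelope step count for the configuration above; this is a sketch, and the exact count depends on the trainer's last-batch handling:

```python
import math

def training_steps(train_examples=6850, per_device_bs=1,
                   grad_accum=8, epochs=3):
    """Rough optimizer-step estimate for the run described above."""
    effective_bs = per_device_bs * grad_accum          # 1 x 8 = 8
    steps_per_epoch = math.ceil(train_examples / effective_bs)
    return effective_bs, steps_per_epoch, steps_per_epoch * epochs

# ~857 steps per epoch, ~2571 total, so expect roughly
# 5 periodic checkpoints with checkpoints saved every 500 steps
```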
## Output Structure

```
runs/dpo_run_14b_v1/
├── logs/
│   ├── train.jsonl          # Training logs (step-by-step)
│   └── eval.jsonl           # Evaluation logs
├── checkpoints/
│   ├── checkpoint-500/      # Periodic checkpoints
│   ├── checkpoint-1000/
│   └── checkpoint-best/     # Best model by eval_loss
├── adapter_14b_dpo_lora/    # Final LoRA adapter
└── merged_14b_dpo_lora/     # Merged full model (if merge enabled)
```
## Monitoring Progress

### 1. Real-time Logs

```bash
# Terminal output shows progress
cd /workspace/trainer-kit/DPO-14b
tail -f runs/dpo_run_14b_v1/logs/train.jsonl | jq '.'
```

### 2. WandB Dashboard

- Project: `qwen-14b-dpo`
- Run name: `dpo_qwen14b_[timestamp]`
- URL: printed at training start
- Metrics refresh every logging step (default: 10 steps)

### 3. Check GPU Usage

```bash
# Monitor GPU memory and utilization
watch -n 1 nvidia-smi
```

### 4. Quick Status Check

```bash
# Count checkpoints
ls -l runs/dpo_run_14b_v1/checkpoints/

# Check latest log
tail runs/dpo_run_14b_v1/logs/train.jsonl
```
## Troubleshooting

### Out of Memory (OOM)

In `config_dpo.yaml`, reduce the batch size:

```yaml
training:
  per_device_train_batch_size: 1  # Already minimal
  gradient_accumulation_steps: 4  # Reduce from 8
```

Or enable gradient checkpointing (already enabled):

```yaml
model:
  gradient_checkpointing: true
```
### Training Divergence (Loss → NaN)

- Check learning rate: reduce from 5e-5 to 2e-5
- Increase beta: change from 0.1 to 0.2 (more conservative; stays closer to the reference model)
- Check `max_grad_norm`: ensure it is 1.0 (clips gradients)
### Slow Training

- Verify GPU utilization: should be >80%
- Check `num_proc` in data loading (default: 4)
- Ensure bf16/fp16 is enabled (already configured)
### Data Formatting Errors

- Check logs for "Failed to format example" warnings
- Verify the data format: `{"prompt": "...", "chosen": "...", "rejected": "..."}`
- Run validation: happens automatically
### WandB Connection Issues

Re-login to WandB with your own API key (never commit a real key to documentation):

```bash
wandb login <YOUR_WANDB_API_KEY>
```

Or disable WandB in the config:

```yaml
wandb:
  enabled: false
```
## Success Criteria

Training is successful if:

- ✅ **Training Completes:** all 3 epochs finish without crashes
- ✅ **Loss Decreases:** training loss drops from ~0.69 to <0.50
- ✅ **Reward Gap:** `rewards/chosen` consistently > `rewards/rejected`
- ✅ **Accuracy:** `rewards/accuracies` > 60% (ideally 70-80%)
- ✅ **No Overfitting:** eval loss doesn't diverge from train loss
- ✅ **Model Saves:** final checkpoint and merged model created
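The measurable criteria above can be expressed as a quick programmatic check. This is a sketch: the metric names are assumed from the logged metrics, and the 0.1 eval/train gap used for "no overfitting" is an arbitrary illustrative threshold:

```python
def meets_success_criteria(metrics, max_eval_gap=0.1):
    """Check final-run metrics against the success criteria above.

    `max_eval_gap` is an assumed overfitting threshold, not a value
    prescribed by the training kit.
    """
    checks = {
        "loss_decreases": metrics["train_loss"] < 0.50,
        "reward_gap": metrics["rewards/chosen"] > metrics["rewards/rejected"],
        "accuracy": metrics["rewards/accuracies"] > 0.60,
        "no_overfitting": metrics["eval_loss"] - metrics["train_loss"] < max_eval_gap,
    }
    return all(checks.values()), checks
```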
## After Training

### 1. Evaluate Model

```bash
# Test on held-out data
python evaluate_dpo_model.py \
    --model runs/dpo_run_14b_v1/merged_14b_dpo_lora \
    --test_data ../task2file/sft_qwen_14B/test.jsonl
```

### 2. Run Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "runs/dpo_run_14b_v1/merged_14b_dpo_lora",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("runs/dpo_run_14b_v1/merged_14b_dpo_lora")

# Generate responses
messages = [{"role": "user", "content": "Write a Python function to sort a list"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### 3. Compare with Base Model

Generate responses from both models on the same prompts and compare quality, helpfulness, and safety.

### 4. Proceed to GRPO (Optional)

```bash
# If DPO results are good, train GRPO on top
cd ../GRPO-14b
# Update the config to use the DPO model as base
python run_grpo.py --config config_grpo.yaml
```
## Files Reference

- `run_dpo.py` - main training script (954 lines, all fixes applied)
- `config_dpo.yaml` - training configuration
- `dpo_pairs_generated.jsonl` - training data (7,612 pairs)
- `f1_score_utils.py` - F1 scoring utilities
- `create_synthetic_pairs.py` - data generation script
- `FIXES_APPLIED.md` - documentation of all fixes
- `test_fixes.py` - verification script
- `README.md` - detailed documentation
## Support

For issues:

- Check logs: `runs/dpo_run_14b_v1/logs/train.jsonl`
- Review errors: look for "ERROR" or "WARNING" in the output
- Verify fixes: run `python test_fixes.py`
- Check documentation: `FIXES_APPLIED.md`, `README.md`
**Status:** ✅ All systems ready
**Last Verified:** $(date)
**Ready to Start:** YES

Command to run:

```bash
cd /workspace/trainer-kit/DPO-14b && python run_dpo.py --config config_dpo.yaml
```