# DPO Training - Quick Start Guide 🚀

## Status: ✅ Ready for Training

All critical code review fixes have been applied and verified. The DPO trainer is production-ready.

## Prerequisites Checklist

- [x] Base model available: `Models/Qwen2.5-Coder-14B-CPT-SFT`
- [x] Training data generated: `dpo_pairs_generated.jsonl` (7,612 pairs)
- [x] Config file updated: `config_dpo.yaml`
- [x] Virtual environment activated: `llm_finetuning_env`
- [x] WandB logged in: API key configured
- [x] All critical fixes applied and verified

## Start Training

### Option 1: Standard Training (Recommended)

```bash
cd /workspace/trainer-kit/DPO-14b
python run_dpo.py --config config_dpo.yaml
```

### Option 2: Background Training (for long runs)

```bash
cd /workspace/trainer-kit/DPO-14b
nohup python run_dpo.py --config config_dpo.yaml > training.log 2>&1 &

# Monitor progress
tail -f training.log
# Or check the WandB dashboard
```

### Option 3: Merge Only (if already trained)

```bash
python run_dpo.py --config config_dpo.yaml --merge-only
```

## What to Expect

### Training Configuration

- **Base Model**: Qwen2.5-Coder-14B-CPT-SFT (14B parameters)
- **Method**: Direct Preference Optimization (DPO)
- **Loss**: Sigmoid loss with beta=0.1
- **Data**: 7,612 preference pairs
  - Train: 6,850 examples
  - Eval: 762 examples
- **Duration**: ~3 epochs
- **Batch Size**: Effective batch size = 8 (1 per device × 8 gradient accumulation steps)
- **Learning Rate**: 5e-5 with cosine schedule
- **LoRA Config**: r=64, alpha=16, dropout=0.1

### Training Metrics to Monitor

1. **Loss Metrics**
   - `loss`: Overall DPO loss (should decrease)
   - `eval_loss`: Validation loss (monitor for overfitting)
2. **Reward Metrics** (most important)
   - `rewards/chosen`: Reward for chosen (preferred) responses
   - `rewards/rejected`: Reward for rejected responses
   - **Gap**: `rewards/chosen` should be > `rewards/rejected`
   - `rewards/accuracies`: % of times chosen > rejected (target: >50%, ideally >70%)
   - `rewards/margins`: Average difference (chosen − rejected)
3. **Training Dynamics**
   - `learning_rate`: Should decay with the cosine schedule
   - `grad_norm`: Should stay below max_grad_norm (1.0)
   - `epoch`: Progress through the dataset

### Expected Timeline

- **Setup**: ~2-5 minutes (model loading, data formatting)
- **Training**: ~2-4 hours per epoch (depends on GPU)
  - 3 epochs total
  - Evaluation every 100 steps
  - Checkpoints saved every 500 steps
- **Merging**: ~5-10 minutes (LoRA adapter → full model)
- **Total**: ~6-12 hours for a complete run

### Output Structure

```
runs/dpo_run_14b_v1/
├── logs/
│   ├── train.jsonl            # Training logs (step-by-step)
│   └── eval.jsonl             # Evaluation logs
├── checkpoints/
│   ├── checkpoint-500/        # Periodic checkpoints
│   ├── checkpoint-1000/
│   └── checkpoint-best/       # Best model by eval_loss
├── adapter_14b_dpo_lora/      # Final LoRA adapter
└── merged_14b_dpo_lora/       # Merged full model (if merge enabled)
```

## Monitoring Progress

### 1. Real-time Logs

```bash
# Terminal output shows progress
cd /workspace/trainer-kit/DPO-14b
tail -f runs/dpo_run_14b_v1/logs/train.jsonl | jq '.'
```

### 2. WandB Dashboard

- Project: `qwen-14b-dpo`
- Run name: `dpo_qwen14b_[timestamp]`
- URL: printed at training start
- Metrics refreshed every logging step (default: 10 steps)

### 3. Check GPU Usage

```bash
# Monitor GPU memory and utilization
watch -n 1 nvidia-smi
```
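As a complement to tailing the raw log, a short script can average the reward metrics over recent steps. This is a sketch, not part of the kit: it assumes `train.jsonl` holds one JSON object per logging step with TRL-style keys (`loss`, `rewards/accuracies`, `rewards/margins`), and the `summarize_dpo_log` helper is hypothetical.

```python
import json

def summarize_dpo_log(path, last_n=50):
    """Average DPO metrics over the last `last_n` logged steps.

    Assumes each line of the log is a JSON dict with TRL-style keys
    such as 'loss', 'rewards/accuracies', and 'rewards/margins'.
    """
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    recent = records[-last_n:]

    def mean(key):
        # Skip records that don't carry this key (e.g. eval-only rows)
        vals = [r[key] for r in recent if key in r]
        return sum(vals) / len(vals) if vals else float("nan")

    return {
        "steps_seen": len(records),
        "loss": mean("loss"),
        "rewards/accuracies": mean("rewards/accuracies"),
        "rewards/margins": mean("rewards/margins"),
    }
```

For example, `summarize_dpo_log("runs/dpo_run_14b_v1/logs/train.jsonl")` gives a quick read on whether `rewards/accuracies` is trending above the 50% chance level.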
### 4. Quick Status Check

```bash
# Count checkpoints
ls -l runs/dpo_run_14b_v1/checkpoints/

# Check latest log
tail runs/dpo_run_14b_v1/logs/train.jsonl
```

## Troubleshooting

### Out of Memory (OOM)

```yaml
# In config_dpo.yaml, reduce the effective batch size:
training:
  per_device_train_batch_size: 1   # Already minimal
  gradient_accumulation_steps: 4   # Reduce from 8

# Gradient checkpointing is already enabled:
model:
  gradient_checkpointing: true
```

### Training Divergence (Loss → NaN)

- Reduce the learning rate from 5e-5 to 2e-5
- Increase beta from 0.1 to 0.2 (more conservative updates)
- Check max_grad_norm: ensure it is 1.0 (gradient clipping)

### Slow Training

- Verify GPU utilization: should be >80%
- Check `num_proc` in data loading (default: 4)
- Ensure bf16/fp16 is enabled (already configured)

### Data Formatting Errors

- Check logs for "Failed to format example" warnings
- Verify the data format: `{"prompt": "...", "chosen": "...", "rejected": "..."}`
- Validation runs automatically at startup

### WandB Connection Issues

```bash
# Re-login to WandB (never commit your API key to this file)
wandb login <YOUR_WANDB_API_KEY>
```

Or disable WandB in `config_dpo.yaml`:

```yaml
wandb:
  enabled: false
```

## Success Criteria

Training is successful if:

1. ✅ **Training Completes**: All 3 epochs finish without crashes
2. ✅ **Loss Decreases**: Training loss drops from ~0.69 to <0.50
3. ✅ **Reward Gap**: `rewards/chosen` consistently > `rewards/rejected`
4. ✅ **Accuracy**: `rewards/accuracies` > 60% (ideally 70-80%)
5. ✅ **No Overfitting**: Eval loss does not diverge from train loss
6. ✅ **Model Saves**: Final checkpoint and merged model are created

## After Training

### 1. Evaluate Model

```bash
# Test on held-out data
python evaluate_dpo_model.py \
  --model runs/dpo_run_14b_v1/merged_14b_dpo_lora \
  --test_data ../task2file/sft_qwen_14B/test.jsonl
```
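For intuition about the loss numbers in the success criteria: the DPO sigmoid loss is −log σ(β·(chosen log-ratio − rejected log-ratio)), where each log-ratio is log(π_θ/π_ref) for that response. At initialization the policy equals the reference, the margin is zero, and the loss is −log σ(0) = ln 2 ≈ 0.69, which is why training should start near 0.69 and fall as the reward margin grows. A standalone sketch of this arithmetic (pure Python, independent of the trainer code):

```python
import math

def dpo_sigmoid_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """DPO sigmoid loss for a single preference pair.

    chosen_logratio / rejected_logratio are log(pi_theta / pi_ref)
    for the chosen and rejected responses; beta scales the margin.
    """
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# At initialization both log-ratios are 0, so the loss is ln(2) ~ 0.693,
# matching the ~0.69 starting value in the success criteria.
print(dpo_sigmoid_loss(0.0, 0.0))             # ≈ 0.693

# As the policy learns to prefer chosen over rejected, the margin grows
# and the loss falls below the 0.50 target.
print(dpo_sigmoid_loss(3.0, -3.0, beta=0.1))  # margin 0.6 → ≈ 0.44
```

This also shows why the divergence fix above raises beta: a larger beta amplifies the margin, penalizing preference reversals more sharply.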
### 2. Run Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "runs/dpo_run_14b_v1/merged_14b_dpo_lora",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("runs/dpo_run_14b_v1/merged_14b_dpo_lora")

# Generate a response
messages = [{"role": "user", "content": "Write a Python function to sort a list"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### 3. Compare with Base Model

```bash
# Generate responses from both models on the same prompts
# Compare quality, helpfulness, safety
```

### 4. Proceed to GRPO (Optional)

```bash
# If DPO results are good, train GRPO on top
cd ../GRPO-14b
# Update the config to use the DPO model as the base, then:
python run_grpo.py --config config_grpo.yaml
```

## Files Reference

- `run_dpo.py` - Main training script (954 lines, all fixes applied)
- `config_dpo.yaml` - Training configuration
- `dpo_pairs_generated.jsonl` - Training data (7,612 pairs)
- `f1_score_utils.py` - F1 scoring utilities
- `create_synthetic_pairs.py` - Data generation script
- `FIXES_APPLIED.md` - Documentation of all fixes
- `test_fixes.py` - Verification script
- `README.md` - Detailed documentation

## Support

For issues:

1. Check logs: `runs/dpo_run_14b_v1/logs/train.jsonl`
2. Review errors: look for "ERROR" or "WARNING" in the output
3. Verify fixes: run `python test_fixes.py`
4. Check documentation: `FIXES_APPLIED.md`, `README.md`

---

**Status**: ✅ All systems ready
**Last Verified**: $(date)
**Ready to Start**: YES

**Command to run:**

```bash
cd /workspace/trainer-kit/DPO-14b && python run_dpo.py --config config_dpo.yaml
```
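If the "Failed to format example" warnings from the Troubleshooting section appear, each line of `dpo_pairs_generated.jsonl` can be checked against the expected `{"prompt", "chosen", "rejected"}` schema before launching. A hypothetical standalone validator (the `validate_dpo_pairs` helper is not part of the kit):

```python
import json

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

def validate_dpo_pairs(path):
    """Return a list of (line_number, reason) for malformed lines."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((i, f"invalid JSON: {e}"))
                continue
            if not isinstance(record, dict):
                problems.append((i, "not a JSON object"))
                continue
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                problems.append((i, f"missing keys: {sorted(missing)}"))
            elif not all(isinstance(record[k], str) and record[k].strip()
                         for k in REQUIRED_KEYS):
                problems.append((i, "empty or non-string field"))
    return problems
```

`validate_dpo_pairs("dpo_pairs_generated.jsonl")` should return an empty list when every pair is well formed; otherwise each entry points at the offending line.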