| # DPO Training - Quick Start Guide | |
| ## Status: ✅ Ready for Training | |
| All critical code review fixes have been applied and verified. The DPO trainer is production-ready. | |
| ## Prerequisites Checklist | |
| - [x] Base model available: `Models/Qwen2.5-Coder-14B-CPT-SFT` | |
| - [x] Training data generated: `dpo_pairs_generated.jsonl` (7,612 pairs) | |
| - [x] Config file updated: `config_dpo.yaml` | |
| - [x] Virtual environment activated: `llm_finetuning_env` | |
| - [x] WandB logged in: API key configured | |
| - [x] All critical fixes applied and verified | |
| ## Start Training | |
| ### Option 1: Standard Training (Recommended) | |
| ```bash | |
| cd /workspace/trainer-kit/DPO-14b | |
| python run_dpo.py --config config_dpo.yaml | |
| ``` | |
| ### Option 2: Background Training (for long runs) | |
| ```bash | |
| cd /workspace/trainer-kit/DPO-14b | |
| nohup python run_dpo.py --config config_dpo.yaml > training.log 2>&1 & | |
| # Monitor progress | |
| tail -f training.log | |
| # Or check WandB dashboard | |
| ``` | |
| ### Option 3: Merge Only (if already trained) | |
| ```bash | |
| python run_dpo.py --config config_dpo.yaml --merge-only | |
| ``` | |
| ## What to Expect | |
| ### Training Configuration | |
| - **Base Model**: Qwen2.5-Coder-14B-CPT-SFT (14B parameters) | |
| - **Method**: Direct Preference Optimization (DPO) | |
| - **Loss**: Sigmoid loss with beta=0.1 | |
| - **Data**: 7,612 preference pairs | |
| - Train: 6,850 examples | |
| - Eval: 762 examples | |
| - **Duration**: 3 epochs | |
| - **Batch Size**: Effective batch size = 8 (1 per device × 8 gradient accumulation steps) | |
| - **Learning Rate**: 5e-5 with cosine schedule | |
| - **LoRA Config**: r=64, alpha=16, dropout=0.1 | |
| ### Training Metrics to Monitor | |
| 1. **Loss Metrics** | |
| - `loss`: Overall DPO loss (should decrease) | |
| - `eval_loss`: Validation loss (monitor for overfitting) | |
| 2. **Reward Metrics** (Most Important) | |
| - `rewards/chosen`: Reward for chosen (preferred) responses | |
| - `rewards/rejected`: Reward for rejected responses | |
| - **Gap**: `rewards/chosen` should be > `rewards/rejected` | |
| - `rewards/accuracies`: Fraction of pairs where the chosen reward exceeds the rejected reward (50% is chance level; target >60%, ideally >70%) | |
| - `rewards/margins`: Average difference (chosen - rejected) | |
| 3. **Training Dynamics** | |
| - `learning_rate`: Should decay with cosine schedule | |
| - `grad_norm`: Should mostly stay at or below `max_grad_norm` (1.0); larger values are clipped, and sustained spikes suggest instability | |
| - `epoch`: Progress through dataset | |
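
To make the reward metrics above concrete, here is a minimal sketch of how the sigmoid DPO loss and the implicit rewards are commonly computed from sequence log-probabilities. It assumes the standard DPO formulation; the function name `dpo_metrics`, its arguments, and the example log-probabilities are illustrative and are not taken from `run_dpo.py`.

```python
import math


def dpo_metrics(policy_chosen_logp, policy_rejected_logp,
                ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair sigmoid DPO loss and implicit rewards (illustrative only)."""
    # Implicit reward: beta-scaled log-ratio between policy and reference model.
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # Sigmoid loss: -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    loss = math.log1p(math.exp(-margin))
    return {
        "loss": loss,
        "rewards/chosen": reward_chosen,
        "rewards/rejected": reward_rejected,
        "rewards/margins": margin,
        "rewards/accuracies": 1.0 if margin > 0 else 0.0,
    }


# Before any update the policy matches the reference: margin 0, loss ln(2).
start = dpo_metrics(-120.0, -130.0, -120.0, -130.0)
# Later the policy favors the chosen response more than the reference does.
later = dpo_metrics(-110.0, -140.0, -120.0, -130.0)
print(round(start["loss"], 4), round(later["loss"], 4))  # 0.6931 0.1269
```

Under this formulation the loss starts near ln(2) ≈ 0.69 and falls as the margin between chosen and rejected rewards grows.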
| ### Expected Timeline | |
| - **Setup**: ~2-5 minutes (model loading, data formatting) | |
| - **Training**: ~2-4 hours per epoch (depends on GPU) | |
| - 3 epochs total | |
| - Evaluation every 100 steps | |
| - Checkpoints saved every 500 steps | |
| - **Merging**: ~5-10 minutes (LoRA adapter → full model) | |
| - **Total**: ~6-12 hours for complete run | |
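
The timeline above can be turned into rough step counts. This is a back-of-the-envelope sketch that assumes a single GPU and that a partial accumulation window at the end of an epoch is dropped; the exact totals reported by the trainer may differ by a few steps.

```python
train_examples = 6850
per_device_batch = 1
grad_accum = 8
num_devices = 1   # assumption: single GPU
epochs = 3
eval_every = 100
save_every = 500

effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = train_examples // effective_batch  # floor: partial window dropped
total_steps = steps_per_epoch * epochs
eval_runs = total_steps // eval_every
checkpoints_saved = total_steps // save_every

print(effective_batch, steps_per_epoch, total_steps)  # 8 856 2568
print(eval_runs, checkpoints_saved)                   # 25 5
```

With these settings, expect roughly 25 evaluation passes and 5 periodic checkpoints over the full run.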
| ### Output Structure | |
| ``` | |
| runs/dpo_run_14b_v1/ | |
| ├── logs/ | |
| │   ├── train.jsonl # Training logs (step-by-step) | |
| │   └── eval.jsonl # Evaluation logs | |
| ├── checkpoints/ | |
| │   ├── checkpoint-500/ # Periodic checkpoints | |
| │   ├── checkpoint-1000/ | |
| │   └── checkpoint-best/ # Best model by eval_loss | |
| ├── adapter_14b_dpo_lora/ # Final LoRA adapter | |
| └── merged_14b_dpo_lora/ # Merged full model (if merge enabled) | |
| ``` | |
| ## Monitoring Progress | |
| ### 1. Real-time Logs | |
| ```bash | |
| # Terminal output shows progress | |
| cd /workspace/trainer-kit/DPO-14b | |
| tail -f runs/dpo_run_14b_v1/logs/train.jsonl | jq '.' | |
| ``` | |
| ### 2. WandB Dashboard | |
| - Project: `qwen-14b-dpo` | |
| - Run name: `dpo_qwen14b_[timestamp]` | |
| - URL: Will be printed at training start | |
| - Metrics refreshed every logging step (default: 10 steps) | |
| ### 3. Check GPU Usage | |
| ```bash | |
| # Monitor GPU memory and utilization | |
| watch -n 1 nvidia-smi | |
| ``` | |
| ### 4. Quick Status Check | |
| ```bash | |
| # List saved checkpoints | |
| ls -l runs/dpo_run_14b_v1/checkpoints/ | |
| # Show the most recent log entries | |
| tail -n 5 runs/dpo_run_14b_v1/logs/train.jsonl | |
| ``` | |
| ## Troubleshooting | |
| ### Out of Memory (OOM) | |
| ```yaml | |
| # In config_dpo.yaml, the per-device batch size drives peak memory: | |
| training: | |
|   per_device_train_batch_size: 1 # Already minimal | |
|   gradient_accumulation_steps: 8 # Has little effect on peak memory; lowering it only shrinks the effective batch | |
| # Gradient checkpointing trades compute for memory (already enabled): | |
| model: | |
|   gradient_checkpointing: true | |
| ``` | |
| ### Training Divergence (Loss → NaN) | |
| - Check learning rate: Reduce from 5e-5 to 2e-5 | |
| - Increase beta: Change from 0.1 to 0.2 (more conservative) | |
| - Check max_grad_norm: Ensure = 1.0 (clip gradients) | |
| ### Slow Training | |
| - Verify GPU utilization: Should be >80% | |
| - Check `num_proc` in data loading: Default = 4 | |
| - Ensure bf16/fp16 enabled (already configured) | |
| ### Data Formatting Errors | |
| - Check logs for "Failed to format example" warnings | |
| - Verify data format: `{"prompt": "...", "chosen": "...", "rejected": "..."}` | |
| - Run validation: Already happens automatically | |
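
If you want to check the data yourself before launching a run, the sketch below validates the `prompt` / `chosen` / `rejected` format described above. The function `validate_pairs` is a hypothetical helper written for this guide, not part of `run_dpo.py`, and the check for identical responses is an extra assumption about what counts as a bad pair.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_pairs(lines):
    """Return (line_number, problem) tuples for an iterable of JSONL lines."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((line_number, f"missing or empty '{key}'"))
        chosen = record.get("chosen")
        if isinstance(chosen, str) and chosen == record.get("rejected"):
            problems.append((line_number, "chosen and rejected are identical"))
    return problems


sample = [
    '{"prompt": "p", "chosen": "a", "rejected": "b"}',
    '{"prompt": "p", "chosen": "a"}',
    "not json",
    '{"prompt": "p", "chosen": "x", "rejected": "x"}',
]
for line_number, problem in validate_pairs(sample):
    print(line_number, problem)
```

To run it on the real file, pass an open file handle, for example `validate_pairs(open("dpo_pairs_generated.jsonl", encoding="utf-8"))`.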
| ### WandB Connection Issues | |
| ```bash | |
| # Re-login to WandB (enter your API key at the prompt; never paste it into docs or scripts) | |
| wandb login --relogin | |
| # Or disable WandB in config_dpo.yaml: | |
| #   wandb: | |
| #     enabled: false | |
| ``` | |
| ## Success Criteria | |
| Training is successful if: | |
| 1. ✅ **Training Completes**: All 3 epochs finish without crashes | |
| 2. ✅ **Loss Decreases**: Training loss drops from ~0.69 to <0.50 | |
| 3. ✅ **Reward Gap**: `rewards/chosen` consistently > `rewards/rejected` | |
| 4. ✅ **Accuracy**: `rewards/accuracies` > 60% (ideally 70-80%) | |
| 5. ✅ **No Overfitting**: Eval loss doesn't diverge from train loss | |
| 6. ✅ **Model Saves**: Final checkpoint and merged model created | |
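
Several of these criteria can be checked mechanically from the training log. The sketch below assumes each line of `train.jsonl` is a JSON object that uses the metric names listed in this guide (`loss`, `rewards/accuracies`, `rewards/margins`); the actual log schema is not documented here, so treat the field names and the helper `summarize_run` as assumptions.

```python
import json


def _mean(values):
    return sum(values) / len(values)


def summarize_run(log_lines, window=20, loss_target=0.50, accuracy_target=0.60):
    """Average the last `window` logged steps and compare them to the targets."""
    records = [json.loads(line) for line in log_lines if line.strip()]
    records = [r for r in records if "loss" in r]  # skip entries without a loss
    if not records:
        raise ValueError("no records with a 'loss' field")
    tail = records[-window:]
    mean_loss = _mean([r["loss"] for r in tail])
    mean_accuracy = _mean([r.get("rewards/accuracies", 0.0) for r in tail])
    mean_margin = _mean([r.get("rewards/margins", 0.0) for r in tail])
    return {
        "first_loss": records[0]["loss"],
        "mean_loss": mean_loss,
        "loss_ok": mean_loss < loss_target,
        "accuracy_ok": mean_accuracy > accuracy_target,
        "gap_ok": mean_margin > 0.0,
    }
```

Averaging over a window rather than reading the last step alone smooths out per-step noise. A typical call would be `summarize_run(open("runs/dpo_run_14b_v1/logs/train.jsonl"))`.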
| ## After Training | |
| ### 1. Evaluate Model | |
| ```bash | |
| # Test on held-out data | |
| python evaluate_dpo_model.py \ | |
| --model runs/dpo_run_14b_v1/merged_14b_dpo_lora \ | |
| --test_data ../task2file/sft_qwen_14B/test.jsonl | |
| ``` | |
| ### 2. Run Inference | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| model = AutoModelForCausalLM.from_pretrained( | |
| "runs/dpo_run_14b_v1/merged_14b_dpo_lora", | |
| torch_dtype="auto", | |
| device_map="auto" | |
| ) | |
| tokenizer = AutoTokenizer.from_pretrained("runs/dpo_run_14b_v1/merged_14b_dpo_lora") | |
| # Generate responses | |
| messages = [{"role": "user", "content": "Write a Python function to sort a list"}] | |
| prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) | |
| inputs = tokenizer(prompt, return_tensors="pt").to(model.device) | |
| outputs = model.generate(**inputs, max_new_tokens=512) | |
| print(tokenizer.decode(outputs[0], skip_special_tokens=True)) | |
| ``` | |
| ### 3. Compare with Base Model | |
| ```bash | |
| # Generate responses from both models on same prompts | |
| # Compare quality, helpfulness, safety | |
| ``` | |
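
One way to structure the comparison is to run both models over the same prompts and store the outputs side by side for review. The helper below is a hypothetical sketch: `compare_models` and the two generator callables are not part of this repository, and each callable is assumed to wrap something like the inference snippet in the previous step.

```python
def compare_models(prompts, generate_base, generate_dpo):
    """Run both generators on the same prompts and pair up the outputs."""
    rows = []
    for prompt in prompts:
        rows.append({
            "prompt": prompt,
            "base": generate_base(prompt),  # response from the SFT base model
            "dpo": generate_dpo(prompt),    # response from the DPO-trained model
        })
    return rows


# Stand-in generators so the sketch runs without loading any model.
rows = compare_models(
    ["Write a Python function to sort a list"],
    generate_base=lambda p: "base answer to: " + p,
    generate_dpo=lambda p: "dpo answer to: " + p,
)
for row in rows:
    print(row["prompt"])
    print("  base:", row["base"])
    print("  dpo: ", row["dpo"])
```

Keeping generation settings (sampling, `max_new_tokens`) identical for both models makes the comparison fair.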
| ### 4. Proceed to GRPO (Optional) | |
| ```bash | |
| # If DPO results are good, train GRPO on top | |
| cd ../GRPO-14b | |
| # Update config to use DPO model as base | |
| python run_grpo.py --config config_grpo.yaml | |
| ``` | |
| ## Files Reference | |
| - `run_dpo.py` - Main training script (954 lines, all fixes applied) | |
| - `config_dpo.yaml` - Training configuration | |
| - `dpo_pairs_generated.jsonl` - Training data (7,612 pairs) | |
| - `f1_score_utils.py` - F1 scoring utilities | |
| - `create_synthetic_pairs.py` - Data generation script | |
| - `FIXES_APPLIED.md` - Documentation of all fixes | |
| - `test_fixes.py` - Verification script | |
| - `README.md` - Detailed documentation | |
| ## Support | |
| For issues: | |
| 1. Check logs: `runs/dpo_run_14b_v1/logs/train.jsonl` | |
| 2. Review errors: Look for "ERROR" or "WARNING" in output | |
| 3. Verify fixes: Run `python test_fixes.py` | |
| 4. Check documentation: `FIXES_APPLIED.md`, `README.md` | |
| --- | |
| **Status**: ✅ All systems ready | |
| **Last Verified**: not recorded | |
| **Ready to Start**: YES | |
| **Command to run:** | |
| ```bash | |
| cd /workspace/trainer-kit/DPO-14b && python run_dpo.py --config config_dpo.yaml | |
| ``` | |