# DPO Training - Quick Start Guide 🚀
## Status: ✅ Ready for Training
All critical code review fixes have been applied and verified. The DPO trainer is production-ready.
## Prerequisites Checklist
- [x] Base model available: `Models/Qwen2.5-Coder-14B-CPT-SFT`
- [x] Training data generated: `dpo_pairs_generated.jsonl` (7,612 pairs)
- [x] Config file updated: `config_dpo.yaml`
- [x] Virtual environment activated: `llm_finetuning_env`
- [x] WandB logged in: API key configured
- [x] All critical fixes applied and verified
## Start Training
### Option 1: Standard Training (Recommended)
```bash
cd /workspace/trainer-kit/DPO-14b
python run_dpo.py --config config_dpo.yaml
```
### Option 2: Background Training (for long runs)
```bash
cd /workspace/trainer-kit/DPO-14b
nohup python run_dpo.py --config config_dpo.yaml > training.log 2>&1 &
# Monitor progress
tail -f training.log
# Or check WandB dashboard
```
### Option 3: Merge Only (if already trained)
```bash
python run_dpo.py --config config_dpo.yaml --merge-only
```
## What to Expect
### Training Configuration
- **Base Model**: Qwen2.5-Coder-14B-CPT-SFT (14B parameters)
- **Method**: Direct Preference Optimization (DPO)
- **Loss**: Sigmoid loss with beta=0.1
- **Data**: 7,612 preference pairs
- Train: 6,850 examples
- Eval: 762 examples
- **Duration**: ~3 epochs
- **Batch Size**: Effective batch size = 8 (1 per device × 8 gradient accumulation)
- **Learning Rate**: 5e-5 with cosine schedule
- **LoRA Config**: r=64, alpha=16, dropout=0.1
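The sigmoid DPO loss listed above can be sketched in plain Python to show what the trainer optimizes. The log-probability values below are illustrative placeholders, not outputs of the actual model:

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# At initialization the policy equals the reference, so margin = 0 and
# the loss starts at -log(0.5) ~= 0.693 (the "~0.69" starting loss noted below).
print(dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0))  # ~0.693
```

As the policy learns to prefer chosen over rejected responses, the margin grows and the loss falls below 0.693.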
### Training Metrics to Monitor
1. **Loss Metrics**
- `loss`: Overall DPO loss (should decrease)
- `eval_loss`: Validation loss (monitor for overfitting)
2. **Reward Metrics** (Most Important)
- `rewards/chosen`: Reward for chosen (preferred) responses
- `rewards/rejected`: Reward for rejected responses
- **Gap**: `rewards/chosen` should be > `rewards/rejected`
- `rewards/accuracies`: % of times chosen > rejected (target: >50%, ideally >70%)
- `rewards/margins`: Average difference (chosen - rejected)
3. **Training Dynamics**
- `learning_rate`: Should decay with cosine schedule
- `grad_norm`: Should be < max_grad_norm (1.0)
- `epoch`: Progress through dataset
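As a sanity check, the reward metrics above can be reproduced from per-example chosen/rejected rewards. The numbers here are made up for illustration, not taken from a real run:

```python
# Hypothetical per-example rewards (beta * policy/reference log-ratio)
# for one logging window
chosen_rewards = [0.8, 0.5, 1.2, -0.1, 0.9]
rejected_rewards = [0.2, 0.6, 0.4, -0.5, 0.1]

margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
accuracy = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards)) / len(chosen_rewards)
mean_margin = sum(margins) / len(margins)

print(f"rewards/accuracies = {accuracy:.2f}")   # fraction where chosen > rejected
print(f"rewards/margins    = {mean_margin:.2f}")  # average (chosen - rejected)
```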
### Expected Timeline
- **Setup**: ~2-5 minutes (model loading, data formatting)
- **Training**: ~2-4 hours per epoch (depends on GPU)
- 3 epochs total
- Evaluation every 100 steps
- Checkpoints saved every 500 steps
- **Merging**: ~5-10 minutes (LoRA adapter → full model)
- **Total**: ~6-12 hours for complete run
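The step counts implied by these numbers can be checked with quick arithmetic, assuming a single GPU and the effective batch size of 8 listed earlier:

```python
train_examples = 6850
effective_batch = 1 * 8          # per-device batch x gradient accumulation
epochs = 3

steps_per_epoch = -(-train_examples // effective_batch)  # ceiling division
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 857 steps/epoch, 2571 total
# With eval every 100 steps and checkpoints every 500, expect roughly
# 25 evaluations and 5 periodic checkpoints over the full run.
```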
### Output Structure
```
runs/dpo_run_14b_v1/
β”œβ”€β”€ logs/
β”‚   β”œβ”€β”€ train.jsonl          # Training logs (step-by-step)
β”‚   └── eval.jsonl           # Evaluation logs
β”œβ”€β”€ checkpoints/
β”‚   β”œβ”€β”€ checkpoint-500/      # Periodic checkpoints
β”‚   β”œβ”€β”€ checkpoint-1000/
β”‚   └── checkpoint-best/     # Best model by eval_loss
β”œβ”€β”€ adapter_14b_dpo_lora/    # Final LoRA adapter
└── merged_14b_dpo_lora/     # Merged full model (if merge enabled)
```
## Monitoring Progress
### 1. Real-time Logs
```bash
# Terminal output shows progress
cd /workspace/trainer-kit/DPO-14b
tail -f runs/dpo_run_14b_v1/logs/train.jsonl | jq '.'
```
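If `jq` is not installed, the latest log record can be pulled out with a few lines of Python. The field names (`step`, `loss`) are assumptions about the JSONL schema; adjust to whatever keys appear in your log:

```python
import json

def last_metrics(path="runs/dpo_run_14b_v1/logs/train.jsonl"):
    """Return the most recent metrics record from a JSONL training log."""
    last = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                last = json.loads(line)
    return last

# Example with an in-memory log instead of the real file:
sample = ['{"step": 100, "loss": 0.61}', '{"step": 110, "loss": 0.58}']
print(json.loads(sample[-1])["loss"])  # 0.58
```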
### 2. WandB Dashboard
- Project: `qwen-14b-dpo`
- Run name: `dpo_qwen14b_[timestamp]`
- URL: Will be printed at training start
- Metrics refreshed every logging step (default: 10 steps)
### 3. Check GPU Usage
```bash
# Monitor GPU memory and utilization
watch -n 1 nvidia-smi
```
### 4. Quick Status Check
```bash
# Count checkpoints
ls -l runs/dpo_run_14b_v1/checkpoints/
# Check latest log
tail runs/dpo_run_14b_v1/logs/train.jsonl
```
## Troubleshooting
### Out of Memory (OOM)
```yaml
# In config_dpo.yaml, reduce the effective batch size:
training:
  per_device_train_batch_size: 1   # Already minimal
  gradient_accumulation_steps: 4   # Reduce from 8

# Or enable gradient checkpointing (already enabled):
model:
  gradient_checkpointing: true
```
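Before editing the config, it can help to check what a change does to the effective batch size; a small sketch of the usual formula:

```python
def effective_batch(per_device, grad_accum, num_gpus=1):
    """Effective batch size = per-device batch x accumulation steps x GPU count."""
    return per_device * grad_accum * num_gpus

print(effective_batch(1, 8))   # 8  (current config)
print(effective_batch(1, 4))   # 4  (reduced, as in the snippet above)
```

Note that a smaller effective batch also changes optimization dynamics, so consider lowering the learning rate if you halve it.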
### Training Divergence (Loss → NaN)
- Check learning rate: Reduce from 5e-5 to 2e-5
- Increase beta: Change from 0.1 to 0.2 (more conservative)
- Check max_grad_norm: Ensure = 1.0 (clip gradients)
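Clipping by global norm (the behavior behind `max_grad_norm: 1.0`) can be sketched in plain Python to show why it tames divergence. This mirrors the standard algorithm, not the trainer's internals:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Uniformly scale gradients down if their global L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

print(clip_by_global_norm([3.0, 4.0]))  # norm 5.0 -> scaled to [0.6, 0.8]
```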
### Slow Training
- Verify GPU utilization: Should be >80%
- Check `num_proc` in data loading: Default = 4
- Ensure bf16/fp16 enabled (already configured)
### Data Formatting Errors
- Check logs for "Failed to format example" warnings
- Verify data format: `{"prompt": "...", "chosen": "...", "rejected": "..."}`
- Run validation: Already happens automatically
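A minimal standalone validator for the pair format (mirroring the check the trainer reportedly runs automatically) might look like:

```python
import json

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

def validate_pair(line):
    """Return True if a JSONL line is a well-formed preference pair."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    return (isinstance(record, dict)
            and REQUIRED_KEYS <= record.keys()
            and all(isinstance(record[k], str) and record[k] for k in REQUIRED_KEYS))

print(validate_pair('{"prompt": "p", "chosen": "a", "rejected": "b"}'))  # True
print(validate_pair('{"prompt": "p", "chosen": "a"}'))                   # False
```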
### WandB Connection Issues
```bash
# Re-login to WandB (use your own API key; never commit a real key to docs)
wandb login <your-wandb-api-key>
```

Or disable WandB in `config_dpo.yaml`:

```yaml
wandb:
  enabled: false
```
## Success Criteria
Training is successful if:
1. ✅ **Training Completes**: All 3 epochs finish without crashes
2. ✅ **Loss Decreases**: Training loss drops from ~0.69 to <0.50
3. ✅ **Reward Gap**: `rewards/chosen` consistently > `rewards/rejected`
4. ✅ **Accuracy**: `rewards/accuracies` > 60% (ideally 70-80%)
5. ✅ **No Overfitting**: Eval loss doesn't diverge from train loss
6. ✅ **Model Saves**: Final checkpoint and merged model created
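The numeric criteria above can be checked programmatically against the final eval record. The field names follow the metric names listed earlier; the sample values are hypothetical:

```python
def training_succeeded(metrics):
    """Apply the numeric success criteria to a final metrics dict."""
    return (metrics["loss"] < 0.50
            and metrics["rewards/accuracies"] > 0.60
            and metrics["rewards/margins"] > 0.0)

final = {"loss": 0.42, "rewards/accuracies": 0.74, "rewards/margins": 1.3}
print(training_succeeded(final))  # True
```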
## After Training
### 1. Evaluate Model
```bash
# Test on held-out data
python evaluate_dpo_model.py \
    --model runs/dpo_run_14b_v1/merged_14b_dpo_lora \
    --test_data ../task2file/sft_qwen_14B/test.jsonl
```
### 2. Run Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "runs/dpo_run_14b_v1/merged_14b_dpo_lora",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("runs/dpo_run_14b_v1/merged_14b_dpo_lora")
# Generate responses
messages = [{"role": "user", "content": "Write a Python function to sort a list"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### 3. Compare with Base Model
```bash
# Generate responses from both models on same prompts
# Compare quality, helpfulness, safety
```
### 4. Proceed to GRPO (Optional)
```bash
# If DPO results are good, train GRPO on top
cd ../GRPO-14b
# Update config to use DPO model as base
python run_grpo.py --config config_grpo.yaml
```
## Files Reference
- `run_dpo.py` - Main training script (954 lines, all fixes applied)
- `config_dpo.yaml` - Training configuration
- `dpo_pairs_generated.jsonl` - Training data (7,612 pairs)
- `f1_score_utils.py` - F1 scoring utilities
- `create_synthetic_pairs.py` - Data generation script
- `FIXES_APPLIED.md` - Documentation of all fixes
- `test_fixes.py` - Verification script
- `README.md` - Detailed documentation
## Support
For issues:
1. Check logs: `runs/dpo_run_14b_v1/logs/train.jsonl`
2. Review errors: Look for "ERROR" or "WARNING" in output
3. Verify fixes: Run `python test_fixes.py`
4. Check documentation: `FIXES_APPLIED.md`, `README.md`
---
**Status**: ✅ All systems ready
**Ready to Start**: YES
**Command to run:**
```bash
cd /workspace/trainer-kit/DPO-14b && python run_dpo.py --config config_dpo.yaml
```