	# DPO Training - Quick Start Guide 🚀

	## Status: ✅ Ready for Training

	All critical code review fixes have been applied and verified. The DPO trainer is production-ready.

	## Prerequisites Checklist

	- [x] Base model available: `Models/Qwen2.5-Coder-14B-CPT-SFT`
	- [x] Training data generated: `dpo_pairs_generated.jsonl` (7,612 pairs)
	- [x] Config file updated: `config_dpo.yaml`
	- [x] Virtual environment activated: `llm_finetuning_env`
	- [x] WandB logged in: API key configured
	- [x] All critical fixes applied and verified

	## Start Training

	### Option 1: Standard Training (Recommended)
	```bash
	cd /workspace/trainer-kit/DPO-14b
	python run_dpo.py --config config_dpo.yaml
	```

	### Option 2: Background Training (for long runs)
	```bash
	cd /workspace/trainer-kit/DPO-14b
	nohup python run_dpo.py --config config_dpo.yaml > training.log 2>&1 &

	# Monitor progress
	tail -f training.log

	# Or check WandB dashboard
	```

	### Option 3: Merge Only (if already trained)
	```bash
	python run_dpo.py --config config_dpo.yaml --merge-only
	```

	## What to Expect

	### Training Configuration
	- Base Model: Qwen2.5-Coder-14B-CPT-SFT (14B parameters)
	- Method: Direct Preference Optimization (DPO)
	- Loss: Sigmoid loss with beta=0.1
	- Data: 7,612 preference pairs
	  - Train: 6,850 examples
	  - Eval: 762 examples
	- Duration: 3 epochs
	- Batch Size: Effective batch size = 8 (1 per device × 8 grad accumulation)
	- Learning Rate: 5e-5 with cosine schedule
	- LoRA Config: r=64, alpha=16, dropout=0.1

	### Training Metrics to Monitor

	1. Loss Metrics
	- `loss`: Overall DPO loss (should decrease)
	- `eval_loss`: Validation loss (monitor for overfitting)

	2. Reward Metrics (Most Important)
	- `rewards/chosen`: Reward for chosen (preferred) responses
	- `rewards/rejected`: Reward for rejected responses
	- Gap: `rewards/chosen` should be > `rewards/rejected`
	- `rewards/accuracies`: Fraction of pairs where the chosen reward exceeds the rejected reward (50% is chance level; aim for >60%, ideally 70-80%)
	- `rewards/margins`: Average difference (chosen - rejected)

	3. Training Dynamics
	- `learning_rate`: Should decay with cosine schedule
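	- `grad_norm`: Typically logged before clipping, so it can exceed max_grad_norm (1.0); occasional spikes are normal, but sustained growth signals instability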
	- `epoch`: Progress through dataset
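The reward metrics above can be reproduced by hand, which helps when sanity-checking logged values. The sketch below assumes the common DPO convention (the one used by TRL's `DPOTrainer`), where a reward is beta times the policy-minus-reference sequence log-probability; `run_dpo.py` may compute these differently. The function name `dpo_sigmoid_stats` and its argument names are hypothetical.

```python
import math


def dpo_sigmoid_stats(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Compute the DPO sigmoid loss and reward metrics for one batch.

    Each argument is a list of summed sequence log-probabilities, one per pair.
    """
    # Implicit rewards: beta * (policy log-prob - reference log-prob)
    chosen_rewards = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected_rewards = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]

    # Sigmoid loss: -log(sigmoid(margin)) == log(1 + exp(-margin))
    losses = [math.log1p(math.exp(-m)) for m in margins]

    n = len(margins)
    return {
        "loss": sum(losses) / n,
        "rewards/chosen": sum(chosen_rewards) / n,
        "rewards/rejected": sum(rejected_rewards) / n,
        "rewards/margins": sum(margins) / n,
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
    }


# At step 0 the policy equals the reference, so every margin is 0
# and the loss is ln(2), which is where the "~0.69" starting value comes from.
start = dpo_sigmoid_stats([-10.0, -12.0], [-11.0, -9.0], [-10.0, -12.0], [-11.0, -9.0])
print(round(start["loss"], 4))
```

Under this convention the loss can only fall below ln(2) by widening the margin, so `loss` and `rewards/margins` should move in opposite directions during a healthy run.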

	### Expected Timeline

	- Setup: ~2-5 minutes (model loading, data formatting)
	- Training: ~2-4 hours per epoch (depends on GPU)
	  - 3 epochs total
	  - Evaluation every 100 steps
	  - Checkpoints saved every 500 steps
	- Merging: ~5-10 minutes (LoRA adapter → full model)
	- Total: ~6-12 hours for complete run
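The step counts implied by these settings follow from simple arithmetic. The sketch below assumes a single GPU (consistent with the effective batch size of 8 stated earlier) and that the trainer rounds a partial final accumulation up to a full optimizer step; the exact total may differ by a few steps depending on the trainer version. `schedule_summary` is a hypothetical helper, not part of `run_dpo.py`.

```python
import math


def schedule_summary(train_examples, per_device_batch, grad_accum, epochs,
                     eval_every, save_every, num_devices=1):
    """Estimate optimizer steps, evaluations, and periodic checkpoints for a run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(train_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "evaluations": total_steps // eval_every,
        "periodic_checkpoints": total_steps // save_every,
    }


summary = schedule_summary(
    train_examples=6850, per_device_batch=1, grad_accum=8,
    epochs=3, eval_every=100, save_every=500,
)
print(summary)
# effective_batch=8, steps_per_epoch=857, total_steps=2571,
# evaluations=25, periodic_checkpoints=5
```

With roughly 857 steps per epoch, the 2-4 hours per epoch quoted above corresponds to about 8-17 seconds per optimizer step, which is a useful figure to compare against the live progress bar.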

	### Output Structure

	```
	runs/dpo_run_14b_v1/
	├── logs/
	│   ├── train.jsonl              # Training logs (step-by-step)
	│   └── eval.jsonl               # Evaluation logs
	├── checkpoints/
	│   ├── checkpoint-500/          # Periodic checkpoints
	│   ├── checkpoint-1000/
	│   └── checkpoint-best/         # Best model by eval_loss
	├── adapter_14b_dpo_lora/        # Final LoRA adapter
	└── merged_14b_dpo_lora/         # Merged full model (if merge enabled)
	```

	## Monitoring Progress

	### 1. Real-time Logs
	```bash
	# Terminal output shows progress
	cd /workspace/trainer-kit/DPO-14b
	tail -f runs/dpo_run_14b_v1/logs/train.jsonl | jq '.'
	```

	### 2. WandB Dashboard
	- Project: `qwen-14b-dpo`
	- Run name: `dpo_qwen14b_[timestamp]`
	- URL: Will be printed at training start
	- Metrics refreshed every logging step (default: 10 steps)

	### 3. Check GPU Usage
	```bash
	# Monitor GPU memory and utilization
	watch -n 1 nvidia-smi
	```

	### 4. Quick Status Check
	```bash
	# Count checkpoints
	ls -l runs/dpo_run_14b_v1/checkpoints/

	# Check latest log
	tail runs/dpo_run_14b_v1/logs/train.jsonl
	```

	## Troubleshooting

	### Out of Memory (OOM)
	```yaml
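	# In config_dpo.yaml, the per-device batch size is already at its minimum:
	training:
	  per_device_train_batch_size: 1  # Already minimal
	  gradient_accumulation_steps: 8  # Lowering this changes the effective batch size, not peak memory

	# Gradient checkpointing is the main memory saver (already enabled).
	# If OOM persists, shorten the maximum prompt/response length.
	model:
	  gradient_checkpointing: true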
	```

	### Training Divergence (Loss → NaN)
	- Check learning rate: Reduce from 5e-5 to 2e-5
	- Increase beta: Change from 0.1 to 0.2 (more conservative)
	- Check max_grad_norm: Ensure = 1.0 (clip gradients)
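Divergence is easier to act on when caught at the first bad step. The sketch below scans log lines for a non-finite loss or an unusually large gradient norm. It assumes each line of `train.jsonl` is a JSON object with `step`, `loss`, and `grad_norm` keys, which this guide does not confirm; the function name and the limit of 10.0 are illustrative choices.

```python
import json
import math


def find_divergence(lines, grad_norm_limit=10.0):
    """Return (step, reason) for the first suspicious log record, or None."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)  # Python's json accepts NaN/Infinity literals
        loss = record.get("loss")
        grad_norm = record.get("grad_norm")
        if isinstance(loss, (int, float)) and not math.isfinite(loss):
            return record.get("step"), "non-finite loss"
        if isinstance(grad_norm, (int, float)):
            if not math.isfinite(grad_norm):
                return record.get("step"), "non-finite grad_norm"
            if grad_norm > grad_norm_limit:
                return record.get("step"), "grad_norm above limit"
    return None


sample = [
    '{"step": 10, "loss": 0.69, "grad_norm": 0.8}',
    '{"step": 20, "loss": NaN, "grad_norm": 0.9}',
]
print(find_divergence(sample))
```

To run it against a real log, pass an open file handle for `runs/dpo_run_14b_v1/logs/train.jsonl` as `lines`, then resume from the last checkpoint before the reported step with a lower learning rate.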

	### Slow Training
	- Verify GPU utilization: Should be >80%
	- Check `num_proc` in data loading: Default = 4
	- Ensure bf16/fp16 enabled (already configured)

	### Data Formatting Errors
	- Check logs for "Failed to format example" warnings
	- Verify data format: `{"prompt": "...", "chosen": "...", "rejected": "..."}`
	- Run validation: Already happens automatically
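If formatting warnings appear, the data file can also be checked independently of the trainer. The following is a minimal sketch that validates the `prompt`/`chosen`/`rejected` format shown above; it is not the validation that `run_dpo.py` performs, and `validate_pairs` is a hypothetical name.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_pairs(lines):
    """Return a list of (line_number, problem) tuples for malformed preference pairs."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "record is not a JSON object"))
            continue
        # Every required field must be a non-empty string
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((lineno, f"missing or empty '{key}'"))
        # A pair with identical responses carries no preference signal
        chosen = record.get("chosen")
        if isinstance(chosen, str) and chosen == record.get("rejected"):
            problems.append((lineno, "chosen and rejected are identical"))
    return problems


sample = ['{"prompt": "p", "chosen": "a", "rejected": "a"}']
print(validate_pairs(sample))
```

To check the real dataset, open `dpo_pairs_generated.jsonl` with `encoding="utf-8"` and pass the file handle as `lines`; an empty result means every pair has the expected shape.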

	### WandB Connection Issues
	```bash
	# Re-login to WandB (prompts for the API key; do not hard-code keys in docs or scripts)
	wandb login --relogin

	# Or disable WandB in config_dpo.yaml:
	#   wandb:
	#     enabled: false
	```

	## Success Criteria

	Training is successful if:

	1. ✅ Training Completes: All 3 epochs finish without crashes
	2. ✅ Loss Decreases: Training loss drops from ~0.69 to <0.50
	3. ✅ Reward Gap: `rewards/chosen` consistently > `rewards/rejected`
	4. ✅ Accuracy: `rewards/accuracies` > 60% (ideally 70-80%)
	5. ✅ No Overfitting: Eval loss doesn't diverge from train loss
	6. ✅ Model Saves: Final checkpoint and merged model created
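The metric-based criteria (2 through 5) can be checked programmatically once the logs are parsed. The sketch below assumes log records are dicts that use the metric names listed in this guide, that `rewards/accuracies` is stored as a fraction between 0 and 1, and that a train/eval loss gap of 0.15 is an acceptable overfitting threshold. All three are assumptions, and `check_success` is a hypothetical helper.

```python
def check_success(train_records, eval_records, max_eval_gap=0.15):
    """Evaluate the metric-based success criteria on the final log records.

    train_records and eval_records are lists of dicts, ordered by step.
    """
    final_train = train_records[-1]
    final_eval = eval_records[-1]
    return {
        "loss_below_0.50": final_train["loss"] < 0.50,
        "reward_gap_positive": final_train["rewards/chosen"] > final_train["rewards/rejected"],
        "accuracy_above_0.60": final_train["rewards/accuracies"] > 0.60,
        # Hypothetical overfitting test: eval loss stays close to train loss
        "eval_tracks_train": final_eval["eval_loss"] - final_train["loss"] <= max_eval_gap,
    }


train = [
    {"loss": 0.69, "rewards/chosen": 0.0, "rewards/rejected": 0.0, "rewards/accuracies": 0.0},
    {"loss": 0.42, "rewards/chosen": 0.8, "rewards/rejected": -0.3, "rewards/accuracies": 0.74},
]
evals = [{"eval_loss": 0.45}]
print(check_success(train, evals))
```

Checking only the final record is a simplification; for criterion 3 ("consistently"), it would be more faithful to test the reward gap over the last several logged steps.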

	## After Training

	### 1. Evaluate Model
	```bash
	# Test on held-out data
	python evaluate_dpo_model.py \
	    --model runs/dpo_run_14b_v1/merged_14b_dpo_lora \
	    --test_data ../task2file/sft_qwen_14B/test.jsonl
	```

	### 2. Run Inference
	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model = AutoModelForCausalLM.from_pretrained(
	    "runs/dpo_run_14b_v1/merged_14b_dpo_lora",
	    torch_dtype="auto",
	    device_map="auto",
	)
	tokenizer = AutoTokenizer.from_pretrained("runs/dpo_run_14b_v1/merged_14b_dpo_lora")

	# Generate responses
	messages = [{"role": "user", "content": "Write a Python function to sort a list"}]
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=512)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	### 3. Compare with Base Model
	```bash
	# Generate responses from both models on same prompts
	# Compare quality, helpfulness, safety
	```
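One way to carry out this comparison is to generate from both models on the same prompts and collect the outputs side by side. The sketch below reuses the `transformers` calls from the inference example above; `make_generator` and `compare_models` are hypothetical helpers, and greedy decoding (`do_sample=False`) is an assumption made so that outputs are repeatable.

```python
def make_generator(model_path, max_new_tokens=512):
    """Load a model once and return a function mapping a user prompt to a completion."""
    # Imported lazily so compare_models can be used without transformers installed
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype="auto", device_map="auto"
    )

    def generate(user_prompt):
        messages = [{"role": "user", "content": user_prompt}]
        text = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        # Keep only the newly generated tokens, not the echoed prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)

    return generate


def compare_models(prompts, generate_base, generate_dpo):
    """Return one row per prompt with the base and DPO completions side by side."""
    return [
        {"prompt": p, "base": generate_base(p), "dpo": generate_dpo(p)}
        for p in prompts
    ]


# Example wiring (uncomment on a machine with the models available):
# base = make_generator("Models/Qwen2.5-Coder-14B-CPT-SFT")
# dpo = make_generator("runs/dpo_run_14b_v1/merged_14b_dpo_lora")
# rows = compare_models(["Write a Python function to sort a list"], base, dpo)
```

Two 14B models may not fit in GPU memory at the same time; in that case, generate and save the base outputs first, release the model, and then run the DPO model over the same prompts.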

	### 4. Proceed to GRPO (Optional)
	```bash
	# If DPO results are good, train GRPO on top
	cd ../GRPO-14b
	# Update config to use DPO model as base
	python run_grpo.py --config config_grpo.yaml
	```

	## Files Reference

	- `run_dpo.py` - Main training script (954 lines, all fixes applied)
	- `config_dpo.yaml` - Training configuration
	- `dpo_pairs_generated.jsonl` - Training data (7,612 pairs)
	- `f1_score_utils.py` - F1 scoring utilities
	- `create_synthetic_pairs.py` - Data generation script
	- `FIXES_APPLIED.md` - Documentation of all fixes
	- `test_fixes.py` - Verification script
	- `README.md` - Detailed documentation

	## Support

	For issues:
	1. Check logs: `runs/dpo_run_14b_v1/logs/train.jsonl`
	2. Review errors: Look for "ERROR" or "WARNING" in output
	3. Verify fixes: Run `python test_fixes.py`
	4. Check documentation: `FIXES_APPLIED.md`, `README.md`

	---

	Status: ✅ All systems ready
	Last Verified: (date not recorded)
	Ready to Start: YES

	Command to run:
	```bash
	cd /workspace/trainer-kit/DPO-14b && python run_dpo.py --config config_dpo.yaml
	```