# DPO Training - Quick Start Guide 🚀

## Status: ✅ Ready for Training

All critical code review fixes have been applied and verified. The DPO trainer is production-ready.

## Prerequisites Checklist

- [x] Base model available: `Models/Qwen2.5-Coder-14B-CPT-SFT`
- [x] Training data generated: `dpo_pairs_generated.jsonl` (7,612 pairs)
- [x] Config file updated: `config_dpo.yaml`
- [x] Virtual environment activated: `llm_finetuning_env`
- [x] WandB logged in: API key configured
- [x] All critical fixes applied and verified

## Start Training

### Option 1: Standard Training (Recommended)
```bash
cd /workspace/trainer-kit/DPO-14b
python run_dpo.py --config config_dpo.yaml
```

### Option 2: Background Training (for long runs)
```bash
cd /workspace/trainer-kit/DPO-14b
nohup python run_dpo.py --config config_dpo.yaml > training.log 2>&1 &

# Monitor progress
tail -f training.log

# Or check WandB dashboard
```

### Option 3: Merge Only (if already trained)
```bash
python run_dpo.py --config config_dpo.yaml --merge-only
```

## What to Expect

### Training Configuration
- **Base Model**: Qwen2.5-Coder-14B-CPT-SFT (14B parameters)
- **Method**: Direct Preference Optimization (DPO)
- **Loss**: Sigmoid loss with beta=0.1
- **Data**: 7,612 preference pairs
  - Train: 6,850 examples
  - Eval: 762 examples
- **Duration**: ~3 epochs
- **Batch Size**: Effective batch size = 8 (1 per device × 8 gradient accumulation steps)
- **Learning Rate**: 5e-5 with cosine schedule
- **LoRA Config**: r=64, alpha=16, dropout=0.1
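The batch and dataset numbers above determine how many optimizer steps the run takes, which in turn determines how many evaluations and checkpoints to expect. A minimal sketch of that arithmetic follows. The function name `training_schedule` is hypothetical, a single GPU is assumed, and the trainer's own rounding of the last partial batch may differ by a step.

```python
import math

def training_schedule(train_examples, per_device_batch, grad_accum, epochs,
                      num_devices=1, eval_every=100, save_every=500):
    """Estimate optimizer steps, evaluations, and checkpoints for a run.

    Assumes the last partial batch of each epoch still produces an
    optimizer step (ceil); the actual trainer may round differently.
    """
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(train_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "evaluations": total_steps // eval_every,
        "checkpoints": total_steps // save_every,
    }

# Values taken from the configuration listed above (single GPU assumed)
schedule = training_schedule(train_examples=6850, per_device_batch=1,
                             grad_accum=8, epochs=3)
print(schedule)
```

With these inputs the estimate is roughly 857 steps per epoch and about 2,571 steps in total, so periodic checkpoints up to `checkpoint-2500` would be expected.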

### Training Metrics to Monitor

1. **Loss Metrics**
   - `loss`: Overall DPO loss (should decrease)
   - `eval_loss`: Validation loss (monitor for overfitting)

2. **Reward Metrics** (Most Important)
   - `rewards/chosen`: Reward for chosen (preferred) responses
   - `rewards/rejected`: Reward for rejected responses
   - **Gap**: `rewards/chosen` should be > `rewards/rejected`
   - `rewards/accuracies`: % of times chosen > rejected (target: >50%, ideally >70%)
   - `rewards/margins`: Average difference (chosen - rejected)

3. **Training Dynamics**
   - `learning_rate`: Should decay with cosine schedule
   - `grad_norm`: Typically logged before clipping, so it can exceed `max_grad_norm` (1.0); watch for sustained spikes rather than isolated ones
   - `epoch`: Progress through dataset
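To make the reward metrics above concrete, here is a small pure-Python sketch of how they relate to sequence log-probabilities under the sigmoid DPO loss. It is an illustration under stated assumptions, not the code in `run_dpo.py`: the function name `dpo_metrics` and the example log-probabilities are made up, and the implicit reward is taken to be `beta * (policy_logp - reference_logp)`.

```python
import math

def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Compute sigmoid-DPO loss and reward metrics from per-example log-probs.

    Each argument is a list of summed sequence log-probabilities,
    one entry per preference pair.
    """
    chosen_rewards = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected_rewards = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    # -log(sigmoid(m)) is equal to log(1 + exp(-m))
    losses = [math.log1p(math.exp(-m)) for m in margins]
    n = len(margins)
    return {
        "loss": sum(losses) / n,
        "rewards/chosen": sum(chosen_rewards) / n,
        "rewards/rejected": sum(rejected_rewards) / n,
        "rewards/margins": sum(margins) / n,
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
    }

# Hypothetical reference log-probs for two preference pairs
ref_c, ref_r = [-10.0, -12.0], [-11.0, -9.0]

# At step 0 the policy equals the reference: margins are 0 and loss is ln 2
start = dpo_metrics(ref_c, ref_r, ref_c, ref_r)

# Later the policy raises chosen log-probs and lowers rejected ones
later = dpo_metrics([-8.0, -11.0], [-13.0, -9.5], ref_c, ref_r)

print(start)
print(later)
```

In this sketch the loss falls below ln 2 only once the margins turn positive, which is why the margin and accuracy metrics are the ones to watch.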

### Expected Timeline

- **Setup**: ~2-5 minutes (model loading, data formatting)
- **Training**: ~2-4 hours per epoch (depends on GPU)
  - 3 epochs total
  - Evaluation every 100 steps
  - Checkpoints saved every 500 steps
- **Merging**: ~5-10 minutes (LoRA adapter → full model)
- **Total**: ~6-12 hours for complete run

### Output Structure

```
runs/dpo_run_14b_v1/
├── logs/
│   ├── train.jsonl            # Training logs (step-by-step)
│   └── eval.jsonl             # Evaluation logs
├── checkpoints/
│   ├── checkpoint-500/        # Periodic checkpoints
│   ├── checkpoint-1000/
│   └── checkpoint-best/       # Best model by eval_loss
├── adapter_14b_dpo_lora/      # Final LoRA adapter
└── merged_14b_dpo_lora/       # Merged full model (if merge enabled)
```

## Monitoring Progress

### 1. Real-time Logs
```bash
# Stream the structured training log as it is written (requires jq)
cd /workspace/trainer-kit/DPO-14b
tail -f runs/dpo_run_14b_v1/logs/train.jsonl | jq '.'
```

### 2. WandB Dashboard
- Project: `qwen-14b-dpo`
- Run name: `dpo_qwen14b_[timestamp]`
- URL: Will be printed at training start
- Metrics refreshed every logging step (default: 10 steps)

### 3. Check GPU Usage
```bash
# Monitor GPU memory and utilization
watch -n 1 nvidia-smi
```

### 4. Quick Status Check
```bash
# List saved checkpoints
ls -l runs/dpo_run_14b_v1/checkpoints/

# Check latest log
tail runs/dpo_run_14b_v1/logs/train.jsonl
```
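For a more readable status check than `tail`, the last log record can be parsed directly. The sketch below assumes that each line of `train.jsonl` is a JSON object and that it uses keys such as `step`, `loss`, and `rewards/accuracies`; those key names are assumptions based on the metrics listed earlier, so adjust them to whatever the log actually contains. The helper name `latest_metrics` is hypothetical.

```python
import json
from pathlib import Path

def latest_metrics(log_path,
                   keys=("step", "loss", "rewards/accuracies", "rewards/margins")):
    """Return the requested keys from the last parseable JSON line in the log."""
    last = None
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a line that is still being written
            if isinstance(record, dict):
                last = record
    if last is None:
        return {}
    return {key: last[key] for key in keys if key in last}

if __name__ == "__main__":
    log_file = Path("runs/dpo_run_14b_v1/logs/train.jsonl")
    if log_file.exists():
        print(latest_metrics(log_file))
```

Skipping unparseable lines keeps the check usable while training is still appending to the file.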

## Troubleshooting

### Out of Memory (OOM)
```yaml
# In config_dpo.yaml, the per-device batch size is already at its minimum:
training:
  per_device_train_batch_size: 1  # Already minimal
  gradient_accumulation_steps: 8  # Lowering this changes the effective batch size, not peak memory
  
# Gradient checkpointing trades compute for memory (already enabled):
model:
  gradient_checkpointing: true
```

### Training Divergence (Loss → NaN)
- Check learning rate: Reduce from 5e-5 to 2e-5
- Increase beta: Change from 0.1 to 0.2 (more conservative)
- Check max_grad_norm: Ensure = 1.0 (clip gradients)

### Slow Training
- Verify GPU utilization: Should be >80%
- Check `num_proc` in data loading: Default = 4
- Ensure bf16/fp16 enabled (already configured)

### Data Formatting Errors
- Check logs for "Failed to format example" warnings
- Verify data format: `{"prompt": "...", "chosen": "...", "rejected": "..."}`
- Validation: runs automatically, so no manual step is required

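To check a data file by hand before a long run, a small validator along these lines can help. It is a sketch under assumptions, not the validation built into `run_dpo.py`: the helper names are hypothetical, and it only checks the three fields shown above plus the case where `chosen` and `rejected` are identical.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_pair(record):
    """Return a list of problems found in one preference pair (empty if valid)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' is missing or empty")
    # A pair with identical responses carries no preference signal
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("'chosen' and 'rejected' are identical")
    return problems

def validate_jsonl(path):
    """Yield (line_number, problems) for every invalid line in a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                yield line_number, [f"invalid JSON: {exc.msg}"]
                continue
            problems = validate_pair(record)
            if problems:
                yield line_number, problems
```

For example, `for n, issues in validate_jsonl("dpo_pairs_generated.jsonl"): print(n, issues)` would list each offending line number with its problems.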
### WandB Connection Issues
```bash
# Re-login to WandB (you will be prompted for your API key)
wandb login --relogin

# Or disable WandB in config_dpo.yaml:
#   wandb:
#     enabled: false
```

## Success Criteria

Training is successful if:

1. ✅ **Training Completes**: All 3 epochs finish without crashes
2. ✅ **Loss Decreases**: Training loss drops from ~0.69 to <0.50
3. ✅ **Reward Gap**: `rewards/chosen` consistently > `rewards/rejected`
4. ✅ **Accuracy**: `rewards/accuracies` > 60% (ideally 70-80%)
5. ✅ **No Overfitting**: Eval loss doesn't diverge from train loss
6. ✅ **Model Saves**: Final checkpoint and merged model created
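
The ~0.69 starting point and the <0.50 target can be read off the sigmoid DPO loss. The short derivation below applies to a single preference pair and assumes the standard formulation, where the reward margin $m$ is the difference between the implicit rewards of the chosen response $y_w$ and the rejected response $y_l$:

$$
m = \beta\left[\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right],
\qquad
\mathcal{L}(m) = -\log\sigma(m) = \log\left(1 + e^{-m}\right)
$$

$$
\pi_\theta = \pi_{\text{ref}} \;\Rightarrow\; m = 0 \;\Rightarrow\; \mathcal{L} = \log 2 \approx 0.693
$$

$$
\mathcal{L}(m) < 0.50 \;\Longleftrightarrow\; m > -\log\left(e^{0.5} - 1\right) \approx 0.43
$$

So a per-pair loss below 0.50 corresponds to a reward margin above roughly 0.43. Because the logged loss is an average over pairs, the logged mean margin will not match this threshold exactly.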

## After Training

### 1. Evaluate Model
```bash
# Test on held-out data
python evaluate_dpo_model.py \
  --model runs/dpo_run_14b_v1/merged_14b_dpo_lora \
  --test_data ../task2file/sft_qwen_14B/test.jsonl
```

### 2. Run Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "runs/dpo_run_14b_v1/merged_14b_dpo_lora",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("runs/dpo_run_14b_v1/merged_14b_dpo_lora")

# Generate responses
messages = [{"role": "user", "content": "Write a Python function to sort a list"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### 3. Compare with Base Model
```bash
# Generate responses from both models on same prompts
# Compare quality, helpfulness, safety
```
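One way to make the comparison concrete is to run the same prompts through both models and collect the outputs side by side. The sketch below is illustrative only: `compare_models` and `make_hf_generator` are hypothetical helper names, the generation call mirrors the inference snippet above, and greedy decoding is an assumption chosen to make outputs repeatable.

```python
def compare_models(prompts, generators):
    """Run every prompt through every generator and collect outputs side by side.

    `generators` maps a label (e.g. "base", "dpo") to a callable prompt -> text.
    """
    rows = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for name, generate in generators.items():
            row[name] = generate(prompt)
        rows.append(row)
    return rows

def make_hf_generator(model_path, max_new_tokens=512):
    """Build a prompt -> text callable backed by a Hugging Face causal LM."""
    # Imported lazily so compare_models can be used without transformers installed
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype="auto", device_map="auto"
    )

    def generate(prompt):
        messages = [{"role": "user", "content": prompt}]
        text = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)

    return generate
```

A call such as `compare_models(prompts, {"base": make_hf_generator("Models/Qwen2.5-Coder-14B-CPT-SFT"), "dpo": make_hf_generator("runs/dpo_run_14b_v1/merged_14b_dpo_lora")})` would produce one row per prompt. Holding two 14B models in memory at once may not fit on a single GPU, in which case generating with one model at a time and saving the outputs is the safer route.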

### 4. Proceed to GRPO (Optional)
```bash
# If DPO results are good, train GRPO on top
cd ../GRPO-14b
# Update config to use DPO model as base
python run_grpo.py --config config_grpo.yaml
```

## Files Reference

- `run_dpo.py` - Main training script (954 lines, all fixes applied)
- `config_dpo.yaml` - Training configuration
- `dpo_pairs_generated.jsonl` - Training data (7,612 pairs)
- `f1_score_utils.py` - F1 scoring utilities
- `create_synthetic_pairs.py` - Data generation script
- `FIXES_APPLIED.md` - Documentation of all fixes
- `test_fixes.py` - Verification script
- `README.md` - Detailed documentation

## Support

For issues:
1. Check logs: `runs/dpo_run_14b_v1/logs/train.jsonl`
2. Review errors: Look for "ERROR" or "WARNING" in output
3. Verify fixes: Run `python test_fixes.py`
4. Check documentation: `FIXES_APPLIED.md`, `README.md`

---

**Status**: ✅ All systems ready  
**Last Verified**: $(date)  
**Ready to Start**: YES

**Command to run:**
```bash
cd /workspace/trainer-kit/DPO-14b && python run_dpo.py --config config_dpo.yaml
```