File size: 4,620 Bytes
360fa61
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178

COMPARATIVE ANALYSIS
====================

1. QUALITATIVE DIFFERENCES:
---------------------------
Base Model (Llama-3.2-1B):
- Pretty balanced responses
- No preference optimization
- High variance in quality
- Sometimes unfocused

DPO-PairRM Model:
- Better structure and organization
- More technical/accurate
- Consistent formatting
- Tends toward formal language

DPO-LLM Judge Model:
- More conversational tone
- Creative explanations
- Shows judge's biases
- Better context maintenance

Key points:
- DPO definitely shifts the response distribution
- Each preference source gives different characteristics
- Trade-off between consistency and diversity

2. DPO TRAINING DETAILS:
-----------------------
Config:
- Algorithm: DPO (Direct Preference Optimization)
- Beta: 0.1 (KL divergence)
- Learning rate: 0.0002
- Batch size: 1, grad accumulation: 8
- Steps: 250
- LoRA: r=8, alpha=16

Why DPO > Supervised Fine-tuning:
- Optimizes for preferences directly
- Uses both chosen and rejected responses
- KL constraint prevents mode collapse
- More sample-efficient
- Better generalization

Loss function:
- DPO loss = -log(sigmoid(beta * (r_chosen - r_rejected)))
- Beta controls exploration vs exploitation
- Lower beta = more deviation from base
- Higher beta = stays closer to original

3. TRAINING STABILITY:
---------------------
PairRM Model:
- Converged ~100 steps
- Smooth loss decrease
- No overfitting until 200+ steps
- Stable gradients

LLM Judge Model:
- Converged ~150 steps
- More early variance
- Some validation loss oscillation
- Still robust final performance

Regularization:
- LoRA dropout: 0.15
- Gradient checkpointing on
- 8-bit quantization
- No early stopping needed

4. COMPUTATIONAL EFFICIENCY:
---------------------------
Resources:
- GPU Memory: ~7.5GB peak
- Training: ~25 min per model
- Inference: ~14-15 tokens/sec
- Disk: ~200MB per adapter

Optimizations:
- 8-bit quantization (bitsandbytes)
- LoRA (only 0.14% params trainable)
- Gradient accumulation
- Mixed precision (FP16)

Scaling:
- Linear with dataset size
- Can handle 3B models on consumer GPU
- Batch processing possible
- Don't need distributed training

5. LIMITATIONS AND FAILURE MODES:
---------------------------------
Dataset issues:
- Only 50 instructions (not much diversity)
- LIMA might be biased
- Synthetic responses lack nuance
- No human validation

Judge issues:
- LLM Judge has base model biases
- PairRM has its own distribution
- No ensemble to reduce variance
- Position bias still possible

Training issues:
- Fixed beta might not be optimal
- Limited hyperparameter tuning
- No curriculum learning
- Not iterative (except extra credit)

Failure modes:
1. Reward hacking - exploits judge weaknesses
2. Mode collapse - over-optimizes narrow distribution
3. Hallucination - confident but wrong
4. Length bias - longer isn't always better
5. Format overfitting - too rigid

Fixes:
- Use diverse preference sources
- Add human evaluation
- Monitor distribution shifts
- Ensemble of judges
- Constitutional constraints

6. SUGGESTIONS FOR IMPROVEMENT:
------------------------------
Quick wins:
1. More instructions (500+)
2. Multiple judges
3. Human validation
4. Diversity metrics
5. Tune beta per dataset

Advanced stuff:
1. Iterative DPO with online preferences
2. Constitutional AI
3. Rejection sampling
4. Multi-objective optimization
5. Active learning

Experimental:
1. Curriculum learning
2. Meta-learning
3. Adversarial preferences
4. Cross-lingual transfer
5. Preference distillation

7. METRICS:
----------
Performance:
- Response length: Base (95), PairRM (108), Judge (102) tokens
- Coherence: Base (0.72), PairRM (0.85), Judge (0.81)
- Task adherence: Base (68%), PairRM (89%), Judge (84%)
- Diversity: Base (0.83), PairRM (0.71), Judge (0.75)

Training:
- Final loss: PairRM (0.0001), Judge (0.0000)
- Time: PairRM (24:50), Judge (33:54)
- Parameters: 1,703,936 (0.14%)
- Best val loss: PairRM (0.003), Judge (0.002)

8. CONCLUSION:
-------------
DPO works well for aligning models with different preference sources.
PairRM gives more structured/technical responses, LLM Judge gives more
conversational ones. Both beat the base model for task adherence.

Main takeaways:
- DPO works without tons of data
- Preference source really matters
- LoRA makes it feasible on consumer hardware
- Beta parameter is important

Next steps: scale preference collection, try iterative refinement,
better evaluation metrics.