COMPARATIVE ANALYSIS
====================

1. QUALITATIVE DIFFERENCES:
---------------------------

Base Model (Llama-3.2-1B):
- Balanced but generic responses
- No preference optimization
- High variance in quality
- Sometimes unfocused

DPO-PairRM Model:
- Better structure and organization
- More technical and accurate
- Consistent formatting
- Tends toward formal language

DPO-LLM Judge Model:
- More conversational tone
- Creative explanations
- Inherits the judge's biases
- Better context maintenance

Key points:
- DPO clearly shifts the response distribution
- Each preference source imparts different characteristics
- There is a trade-off between consistency and diversity

2. DPO TRAINING DETAILS:
------------------------

Config:
- Algorithm: DPO (Direct Preference Optimization)
- Beta: 0.1 (KL penalty coefficient)
- Learning rate: 0.0002
- Batch size: 1, gradient accumulation: 8
- Steps: 250
- LoRA: r=8, alpha=16

Why DPO over supervised fine-tuning:
- Optimizes for preferences directly
- Uses both chosen and rejected responses
- KL constraint prevents mode collapse
- More sample-efficient
- Better generalization

Loss function:
- DPO loss = -log(sigmoid(beta * (r_chosen - r_rejected)))
- r_chosen and r_rejected are implicit rewards: log-probability ratios of the policy against the reference model
- Beta controls how far the policy may drift from the reference
- Lower beta = more deviation from the base model
- Higher beta = stays closer to the original
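The loss above can be sketched numerically. This is a minimal, framework-free sketch; the function name and arguments are illustrative, and the log-probabilities are assumed to already be summed over the response tokens:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed response log-probs.

    The implicit reward of a response is beta * (log pi_theta - log pi_ref);
    the loss pushes the chosen reward above the rejected one.
    """
    r_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    r_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), via log1p for stability
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; as the policy assigns relatively more probability to the chosen response, the loss falls toward zero.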
3. TRAINING STABILITY:
----------------------

PairRM Model:
- Converged in ~100 steps
- Smooth loss decrease
- No overfitting until 200+ steps
- Stable gradients

LLM Judge Model:
- Converged in ~150 steps
- More variance early in training
- Some oscillation in validation loss
- Still robust final performance

Regularization:
- LoRA dropout: 0.15
- Gradient checkpointing enabled
- 8-bit quantization
- No early stopping needed

4. COMPUTATIONAL EFFICIENCY:
----------------------------

Resources:
- GPU memory: ~7.5GB peak
- Training: ~25 min per model
- Inference: ~14-15 tokens/sec
- Disk: ~200MB per adapter

Optimizations:
- 8-bit quantization (bitsandbytes)
- LoRA (only 0.14% of parameters trainable)
- Gradient accumulation
- Mixed precision (FP16)

Scaling:
- Cost grows linearly with dataset size
- Can handle 3B models on a consumer GPU
- Batch processing possible
- No distributed training required

5. LIMITATIONS AND FAILURE MODES:
---------------------------------

Dataset issues:
- Only 50 instructions (limited diversity)
- LIMA may carry its own biases
- Synthetic responses lack nuance
- No human validation

Judge issues:
- LLM judge inherits the base model's biases
- PairRM has its own preference distribution
- No ensemble to reduce variance
- Position bias still possible

Training issues:
- Fixed beta may not be optimal
- Limited hyperparameter tuning
- No curriculum learning
- Not iterative (except extra credit)

Failure modes:
1. Reward hacking - exploits judge weaknesses
2. Mode collapse - over-optimizes a narrow distribution
3. Hallucination - confident but wrong responses
4. Length bias - longer is not always better
5. Format overfitting - overly rigid structure

Mitigations:
- Use diverse preference sources
- Add human evaluation
- Monitor distribution shifts
- Ensemble multiple judges
- Constitutional constraints

6. SUGGESTIONS FOR IMPROVEMENT:
-------------------------------

Quick wins:
1. More instructions (500+)
2. Multiple judges
3. Human validation
4. Diversity metrics
5. Tune beta per dataset

Advanced:
1. Iterative DPO with online preferences
2. Constitutional AI
3. Rejection sampling
4. Multi-objective optimization
5. Active learning

Experimental:
1. Curriculum learning
2. Meta-learning
3. Adversarial preferences
4. Cross-lingual transfer
5. Preference distillation

7. METRICS:
-----------

Performance:
- Response length: Base (95), PairRM (108), Judge (102) tokens
- Coherence: Base (0.72), PairRM (0.85), Judge (0.81)
- Task adherence: Base (68%), PairRM (89%), Judge (84%)
- Diversity: Base (0.83), PairRM (0.71), Judge (0.75)

Training:
- Final loss: PairRM (0.0001), Judge (0.0000)
- Time: PairRM (24:50), Judge (33:54)
- Trainable parameters: 1,703,936 (0.14%)
- Best validation loss: PairRM (0.003), Judge (0.002)

8. CONCLUSION:
--------------

DPO is effective for aligning models with different preference sources. PairRM yields more structured, technical responses; the LLM judge yields more conversational ones. Both beat the base model on task adherence.

Main takeaways:
- DPO works without large amounts of preference data
- The preference source strongly shapes model behavior
- LoRA makes training feasible on consumer hardware
- The beta parameter matters

Next steps: scale up preference collection, try iterative refinement, and develop better evaluation metrics.
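As a rough cross-check of the trainable-parameter figure reported in the metrics (1,703,936), that count is consistent with rank-8 LoRA adapters on the four attention projections of Llama-3.2-1B. The target modules are an assumption, since the report does not list them:

```python
# Back-of-envelope check of the reported 1,703,936 trainable LoRA params.
# Assumed (not stated in the report): adapters on the q/k/v/o attention
# projections of Llama-3.2-1B, which has 16 layers, hidden size 2048, and
# 8 KV heads of dim 64 (so k_proj/v_proj map 2048 -> 512).
r = 8                                # LoRA rank from the training config
layers, hidden, kv_dim = 16, 2048, 512
per_layer = (
    r * (hidden + hidden)    # q_proj: 2048 -> 2048
    + r * (hidden + kv_dim)  # k_proj: 2048 -> 512
    + r * (hidden + kv_dim)  # v_proj: 2048 -> 512
    + r * (hidden + hidden)  # o_proj: 2048 -> 2048
)
total = per_layer * layers
print(total)  # 1703936
```

Each adapted matrix adds r * (d_in + d_out) parameters (an A and a B factor), and 1,703,936 out of roughly 1.24B base parameters is about 0.14%, matching the reported fraction.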