COMPARATIVE ANALYSIS
====================

1. QUALITATIVE DIFFERENCES:
---------------------------

Base Model (Llama-3.2-1B):
- Fairly balanced responses
- No preference optimization
- High variance in quality
- Sometimes unfocused

DPO-PairRM Model:
- Better structure and organization
- More technical and accurate
- Consistent formatting
- Tends toward formal language

DPO-LLM Judge Model:
- More conversational tone
- Creative explanations
- Inherits the judge's biases
- Better context maintenance

Key points:
- DPO clearly shifts the response distribution
- Each preference source imparts different characteristics
- There is a trade-off between consistency and diversity

2. DPO TRAINING DETAILS:
------------------------

Config (see the training-setup sketch after section 6):
- Algorithm: DPO (Direct Preference Optimization)
- Beta: 0.1 (strength of the KL penalty)
- Learning rate: 0.0002
- Batch size: 1, gradient accumulation: 8
- Steps: 250
- LoRA: r=8, alpha=16

Why DPO over supervised fine-tuning:
- Optimizes for preferences directly
- Uses both chosen and rejected responses
- KL constraint discourages mode collapse
- More sample-efficient
- Better generalization

Loss function (PyTorch sketch after section 6):
- DPO loss = -log(sigmoid(beta * (r_chosen - r_rejected))),
  where r = log(pi_theta(y|x) / pi_ref(y|x)) is a response's
  implicit reward under the policy vs. the frozen reference
- Beta sets how strongly the policy is tethered to the reference
- Lower beta = more deviation from the base model
- Higher beta = stays closer to the original

3. TRAINING STABILITY:
----------------------

PairRM Model:
- Converged in ~100 steps
- Smooth loss decrease
- No overfitting until 200+ steps
- Stable gradients

LLM Judge Model:
- Converged in ~150 steps
- More variance early on
- Some validation-loss oscillation
- Still robust final performance

Regularization:
- LoRA dropout: 0.15
- Gradient checkpointing enabled
- 8-bit quantization
- No early stopping needed

4. COMPUTATIONAL EFFICIENCY:
----------------------------

Resources:
- GPU memory: ~7.5 GB peak
- Training: ~25-34 min per model (exact times in section 7)
- Inference: ~14-15 tokens/sec
- Disk: ~200 MB per adapter

Optimizations:
- 8-bit quantization (bitsandbytes)
- LoRA (only 0.14% of parameters trainable)
- Gradient accumulation
- Mixed precision (FP16)

Scaling:
- Linear with dataset size
- Can handle 3B models on a consumer GPU
- Batch processing possible
- No distributed training needed

5. LIMITATIONS AND FAILURE MODES:
---------------------------------

Dataset issues:
- Only 50 instructions (limited diversity)
- LIMA prompts may carry their own biases
- Synthetic responses lack nuance
- No human validation

Judge issues:
- LLM judge inherits the base model's biases
- PairRM imposes its own preference distribution
- No ensembling to reduce variance
- Position bias still possible

Training issues:
- Fixed beta may not be optimal
- Limited hyperparameter tuning
- No curriculum learning
- Not iterative (except extra credit)

Failure modes:
1. Reward hacking - exploits judge weaknesses
2. Mode collapse - over-optimizes a narrow distribution
3. Hallucination - confident but wrong
4. Length bias - longer isn't always better
5. Format overfitting - too rigid

Fixes:
- Use diverse preference sources
- Add human evaluation
- Monitor distribution shifts
- Ensemble of judges
- Constitutional constraints

6. SUGGESTIONS FOR IMPROVEMENT:
-------------------------------

Quick wins:
1. More instructions (500+)
2. Multiple judges
3. Human validation
4. Diversity metrics
5. Tune beta per dataset

Advanced:
1. Iterative DPO with online preferences
2. Constitutional AI
3. Rejection sampling
4. Multi-objective optimization
5. Active learning

Experimental:
1. Curriculum learning
2. Meta-learning
3. Adversarial preferences
4. Cross-lingual transfer
5. Preference distillation
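A minimal PyTorch sketch of the loss from section 2, to make the
formula concrete. The function names and the sequence-level log-prob
inputs are illustrative, not taken from the actual training code:

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """DPO loss from summed per-token log-probs of each response
        under the trained policy and the frozen reference model."""
        # Implicit rewards: beta-scaled log-ratios vs. the reference
        r_chosen = beta * (policy_chosen_logps - ref_chosen_logps)
        r_rejected = beta * (policy_rejected_logps - ref_rejected_logps)
        # -log(sigmoid(x)) == softplus(-x), computed stably
        return F.softplus(-(r_chosen - r_rejected)).mean()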
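And a sketch of how the config in section 2 and the optimizations in
sections 3-4 could map onto Hugging Face TRL. Argument names drift
between trl versions, and `pref_dataset` (prompt/chosen/rejected
columns) is assumed to exist, so treat this as illustrative rather
than the exact code used for these runs:

    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig)
    from peft import LoraConfig
    from trl import DPOConfig, DPOTrainer

    model_id = "meta-llama/Llama-3.2-1B"
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True))
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    args = DPOConfig(
        output_dir="dpo-pairrm",
        beta=0.1,                        # KL penalty strength
        learning_rate=2e-4,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,   # effective batch of 8
        max_steps=250,
        gradient_checkpointing=True,
        fp16=True,
    )
    trainer = DPOTrainer(
        model,
        args=args,
        train_dataset=pref_dataset,      # assumed: prompt/chosen/rejected
        processing_class=tokenizer,      # older trl versions: tokenizer=
        peft_config=LoraConfig(r=8, lora_alpha=16, lora_dropout=0.15,
                               task_type="CAUSAL_LM"),
    )
    trainer.train()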
7. METRICS:
-----------

Performance:
                     Base   PairRM   Judge
  Response length      95      108     102   (tokens)
  Coherence           0.72     0.85    0.81
  Task adherence       68%      89%     84%
  Diversity           0.83     0.71    0.75

Training:
- Final loss: PairRM (0.0001), Judge (0.0000)
- Time: PairRM (24:50), Judge (33:54)
- Trainable parameters: 1,703,936 (0.14%)
- Best val loss: PairRM (0.003), Judge (0.002)

8. CONCLUSION:
--------------

DPO is effective for aligning a model with different preference
sources. PairRM yields more structured, technical responses, while the
LLM judge yields more conversational ones. Both beat the base model on
task adherence.

Main takeaways:
- DPO works without large amounts of preference data
- The preference source really matters
- LoRA makes training feasible on consumer hardware
- The beta parameter is important

Next steps: scale up preference collection, try iterative refinement,
and build better evaluation metrics.
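As one concrete starting point for the "better evaluation metrics"
item above: the report does not define the Diversity score in section
7, but a distinct-n measure is a common choice. A hypothetical sketch:

    def distinct_n(responses, n=2):
        """Fraction of unique n-grams across a set of responses (0-1);
        higher means more varied wording. Hypothetical metric, shown
        only as an example of what Diversity could measure."""
        ngrams = []
        for text in responses:
            tokens = text.split()
            ngrams.extend(tuple(tokens[i:i + n])
                          for i in range(len(tokens) - n + 1))
        return len(set(ngrams)) / max(len(ngrams), 1)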