COMPARATIVE ANALYSIS
====================

1. QUALITATIVE DIFFERENCES:
---------------------------

Base Model (Llama-3.2-1B):
- Reasonably balanced responses
- No preference optimization applied
- High variance in output quality
- Sometimes unfocused

DPO-PairRM Model:
- Better structure and organization
- More technical and accurate
- Consistent formatting
- Tends toward formal language

DPO-LLM Judge Model:
- More conversational tone
- Creative explanations
- Reflects the judge model's own biases
- Better context maintenance

Key points:
- DPO clearly shifts the response distribution
- Each preference source imparts different characteristics
- There is a trade-off between consistency and diversity

2. DPO TRAINING DETAILS:
------------------------

Config (see the sketch below):
- Algorithm: DPO (Direct Preference Optimization)
- Beta: 0.1 (strength of the KL constraint toward the reference model)
- Learning rate: 2e-4
- Batch size: 1, gradient accumulation: 8 (effective batch size 8)
- Steps: 250
- LoRA: r=8, alpha=16
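
For reference, the configuration above maps onto Hugging Face trl's
DPOTrainer roughly as shown below. This is a minimal sketch, not the
actual training script: the toy dataset is a placeholder, and trl's
API differs across versions (older releases pass beta and a tokenizer
argument directly to DPOTrainer instead of using DPOConfig).

    # Sketch of the DPO setup with trl; dataset contents are placeholders.
    from datasets import Dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    model_name = "meta-llama/Llama-3.2-1B"
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # DPO expects (prompt, chosen, rejected) triples.
    train_dataset = Dataset.from_dict({
        "prompt":   ["Explain DPO in one sentence."],
        "chosen":   ["DPO trains the policy directly on preference pairs."],
        "rejected": ["DPO is a type of database."],
    })

    config = DPOConfig(
        output_dir="dpo-pairrm",
        beta=0.1,                       # KL-constraint strength
        learning_rate=2e-4,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,  # effective batch size 8
        max_steps=250,
    )
    peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.15,
                             task_type="CAUSAL_LM")

    trainer = DPOTrainer(model=model, args=config,
                         train_dataset=train_dataset,
                         processing_class=tokenizer,
                         peft_config=peft_config)
    trainer.train()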

Why DPO over supervised fine-tuning:
- Optimizes the policy directly on preferences
- Uses both chosen and rejected responses as signal
- KL constraint prevents mode collapse
- More sample-efficient
- Better generalization

Loss function (see the sketch below):
- DPO loss = -log(sigmoid(beta * (r_chosen - r_rejected))),
  where r = log pi_theta(y|x) - log pi_ref(y|x) is the implicit reward
  of a response under the policy relative to the frozen reference model
- Beta controls how strongly the policy is anchored to the reference
- Lower beta = more deviation from the base model
- Higher beta = stays closer to the original distribution
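
As a sanity check, the loss can be written in a few lines of PyTorch.
This is a generic reference implementation of the standard DPO
objective, not code from the training run; each argument is assumed to
be a per-sequence log-probability summed over response tokens.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit rewards: log-ratios between policy and reference.
        chosen_rewards = policy_chosen_logps - ref_chosen_logps
        rejected_rewards = policy_rejected_logps - ref_rejected_logps
        # -log sigmoid(beta * margin); logsigmoid is numerically stable.
        margin = chosen_rewards - rejected_rewards
        return -F.logsigmoid(beta * margin).mean()

    # Toy check: the policy favors the chosen response more than the
    # reference does, so the loss falls below log(2) ~= 0.693.
    loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                    torch.tensor([-11.0]), torch.tensor([-11.0]))
    print(loss.item())  # ~0.598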

3. TRAINING STABILITY:
----------------------

PairRM Model:
- Converged in ~100 steps
- Smooth loss decrease
- No overfitting until 200+ steps
- Stable gradients

LLM Judge Model:
- Converged in ~150 steps
- Higher variance early in training
- Some oscillation in validation loss
- Robust final performance nonetheless

Regularization:
- LoRA dropout: 0.15
- Gradient checkpointing enabled
- 8-bit quantization
- No early stopping needed

4. COMPUTATIONAL EFFICIENCY:
----------------------------

Resources:
- GPU memory: ~7.5 GB peak
- Training time: ~25 min (PairRM) to ~34 min (Judge); see metrics below
- Inference: ~14-15 tokens/sec
- Disk: ~200 MB per adapter

Optimizations (see the sketch after this list):
- 8-bit quantization (bitsandbytes)
- LoRA (only 0.14% of parameters trainable)
- Gradient accumulation
- Mixed precision (FP16)
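
These optimizations correspond roughly to the loading code below. The
target_modules list is an assumption (the report does not say which
projections LoRA was applied to), though adapting all four attention
projections of Llama-3.2-1B at r=8 happens to reproduce the 1,703,936
trainable parameters reported in the metrics section.

    # Sketch: 8-bit base weights (bitsandbytes) plus small LoRA adapters.
    import torch
    from peft import (LoraConfig, get_peft_model,
                      prepare_model_for_kbit_training)
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-1B",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        torch_dtype=torch.float16,   # mixed precision for non-quantized ops
    )
    # Casts norms/head for stability and enables gradient checkpointing.
    model = prepare_model_for_kbit_training(model)

    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.15,
                      target_modules=["q_proj", "k_proj",
                                      "v_proj", "o_proj"],  # assumption
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()
    # -> trainable params: 1,703,936 || trainable%: ~0.14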

Scaling:
- Training time scales roughly linearly with dataset size
- Can handle 3B models on a consumer GPU
- Batch processing possible
- No distributed training needed

5. LIMITATIONS AND FAILURE MODES:
---------------------------------

Dataset issues:
- Only 50 instructions (limited diversity)
- The LIMA source data may carry its own biases
- Synthetic responses lack nuance
- No human validation

Judge issues:
- The LLM judge inherits the base model's biases
- PairRM imposes its own preference distribution
- No ensembling to reduce variance
- Position bias still possible (see the mitigation sketch below)
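
One standard mitigation for position bias, not used in this run, is to
query the judge with both response orderings and keep only consistent
verdicts. The judge callable here is hypothetical, assumed to return
"A" or "B" for whichever response it prefers.

    # Position-bias check: judge twice with the responses swapped and
    # trust the verdict only when both orderings agree.
    def debiased_preference(judge, prompt, resp_1, resp_2):
        first = judge(prompt, resp_1, resp_2)   # resp_1 shown as "A"
        second = judge(prompt, resp_2, resp_1)  # resp_1 shown as "B"
        if first == "A" and second == "B":
            return "resp_1"
        if first == "B" and second == "A":
            return "resp_2"
        return None  # inconsistent verdict: likely position bias, discard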

Training issues:
- Fixed beta may not be optimal
- Limited hyperparameter tuning
- No curriculum learning
- Not iterative (except extra credit)

Failure modes:
1. Reward hacking: the policy exploits weaknesses in the judge
2. Mode collapse: over-optimizing onto a narrow response distribution
3. Hallucination: confident but wrong answers
4. Length bias: longer responses are not always better
5. Format overfitting: overly rigid templates

Possible fixes:
- Use diverse preference sources
- Add human evaluation
- Monitor distribution shifts
- Ensemble multiple judges
- Constitutional constraints

6. SUGGESTIONS FOR IMPROVEMENT:
-------------------------------

Quick wins:
1. More instructions (500+)
2. Multiple judges
3. Human validation
4. Diversity metrics
5. Tune beta per dataset

Advanced:
1. Iterative DPO with online preferences
2. Constitutional AI
3. Rejection sampling
4. Multi-objective optimization
5. Active learning

Experimental:
1. Curriculum learning
2. Meta-learning
3. Adversarial preferences
4. Cross-lingual transfer
5. Preference distillation

7. METRICS:
-----------

Performance (diversity metric sketched below):
                     Base    PairRM   Judge
Response length        95      108     102    (tokens)
Coherence            0.72     0.85    0.81
Task adherence        68%      89%     84%
Diversity            0.83     0.71    0.75
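
The report does not define the diversity score; one common proxy is a
distinct-n ratio (unique n-grams over total n-grams across responses),
sketched here purely for illustration.

    # Illustrative diversity proxy: distinct-2 ratio over responses.
    def distinct_n(responses, n=2):
        total, unique = 0, set()
        for text in responses:
            tokens = text.split()
            ngrams = list(zip(*(tokens[i:] for i in range(n))))
            total += len(ngrams)
            unique.update(ngrams)
        return len(unique) / total if total else 0.0

    print(distinct_n(["the cat sat on the mat",
                      "the dog sat on the rug"]))  # 0.8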

Training:
- Final train loss: PairRM (0.0001), Judge (0.0000)
- Training time (mm:ss): PairRM (24:50), Judge (33:54)
- Trainable parameters: 1,703,936 (0.14%)
- Best validation loss: PairRM (0.003), Judge (0.002)

8. CONCLUSION:
--------------

DPO works well for aligning models with different preference sources.
PairRM produces more structured, technical responses, while the LLM
judge produces more conversational ones. Both beat the base model on
task adherence.

Main takeaways:
- DPO is effective even with small preference datasets
- The choice of preference source materially shapes model behavior
- LoRA makes training feasible on consumer hardware
- The beta parameter meaningfully affects results

Next steps: scale up preference collection, try iterative refinement,
and develop better evaluation metrics.