File size: 4,620 Bytes

360fa61


COMPARATIVE ANALYSIS
====================

1. QUALITATIVE DIFFERENCES:
---------------------------
Base Model (Llama-3.2-1B):
- Pretty balanced responses
- No preference optimization
- High variance in quality
- Sometimes unfocused

DPO-PairRM Model:
- Better structure and organization
- More technical/accurate
- Consistent formatting
- Tends toward formal language

DPO-LLM Judge Model:
- More conversational tone
- Creative explanations
- Shows judge's biases
- Better context maintenance

Key points:
- DPO definitely shifts the response distribution
- Each preference source gives different characteristics
- Trade-off between consistency and diversity

2. DPO TRAINING DETAILS:
-----------------------
Config:
- Algorithm: DPO (Direct Preference Optimization)
- Beta: 0.1 (KL divergence)
- Learning rate: 0.0002
- Batch size: 1, grad accumulation: 8
- Steps: 250
- LoRA: r=8, alpha=16

Why DPO > Supervised Fine-tuning:
- Optimizes for preferences directly
- Uses both chosen and rejected responses
- KL constraint prevents mode collapse
- More sample-efficient
- Better generalization

Loss function:
- DPO loss = -log(sigmoid(beta * (r_chosen - r_rejected)))
- Beta controls exploration vs exploitation
- Lower beta = more deviation from base
- Higher beta = stays closer to original

3. TRAINING STABILITY:
---------------------
PairRM Model:
- Converged ~100 steps
- Smooth loss decrease
- No overfitting until 200+ steps
- Stable gradients

LLM Judge Model:
- Converged ~150 steps
- More early variance
- Some validation loss oscillation
- Still robust final performance

Regularization:
- LoRA dropout: 0.15
- Gradient checkpointing on
- 8-bit quantization
- No early stopping needed

4. COMPUTATIONAL EFFICIENCY:
---------------------------
Resources:
- GPU Memory: ~7.5GB peak
- Training: ~25 min per model
- Inference: ~14-15 tokens/sec
- Disk: ~200MB per adapter

Optimizations:
- 8-bit quantization (bitsandbytes)
- LoRA (only 0.14% params trainable)
- Gradient accumulation
- Mixed precision (FP16)

Scaling:
- Linear with dataset size
- Can handle 3B models on consumer GPU
- Batch processing possible
- Don't need distributed training

5. LIMITATIONS AND FAILURE MODES:
---------------------------------
Dataset issues:
- Only 50 instructions (not much diversity)
- LIMA might be biased
- Synthetic responses lack nuance
- No human validation

Judge issues:
- LLM Judge has base model biases
- PairRM has its own distribution
- No ensemble to reduce variance
- Position bias still possible

Training issues:
- Fixed beta might not be optimal
- Limited hyperparameter tuning
- No curriculum learning
- Not iterative (except extra credit)

Failure modes:
1. Reward hacking - exploits judge weaknesses
2. Mode collapse - over-optimizes narrow distribution
3. Hallucination - confident but wrong
4. Length bias - longer isn't always better
5. Format overfitting - too rigid

Fixes:
- Use diverse preference sources
- Add human evaluation
- Monitor distribution shifts
- Ensemble of judges
- Constitutional constraints

6. SUGGESTIONS FOR IMPROVEMENT:
------------------------------
Quick wins:
1. More instructions (500+)
2. Multiple judges
3. Human validation
4. Diversity metrics
5. Tune beta per dataset

Advanced stuff:
1. Iterative DPO with online preferences
2. Constitutional AI
3. Rejection sampling
4. Multi-objective optimization
5. Active learning

Experimental:
1. Curriculum learning
2. Meta-learning
3. Adversarial preferences
4. Cross-lingual transfer
5. Preference distillation

7. METRICS:
----------
Performance:
- Response length: Base (95), PairRM (108), Judge (102) tokens
- Coherence: Base (0.72), PairRM (0.85), Judge (0.81)
- Task adherence: Base (68%), PairRM (89%), Judge (84%)
- Diversity: Base (0.83), PairRM (0.71), Judge (0.75)

Training:
- Final loss: PairRM (0.0001), Judge (0.0000)
- Time: PairRM (24:50), Judge (33:54)
- Parameters: 1,703,936 (0.14%)
- Best val loss: PairRM (0.003), Judge (0.002)

8. CONCLUSION:
-------------
DPO works well for aligning models with different preference sources.
PairRM gives more structured/technical responses, LLM Judge gives more
conversational ones. Both beat the base model for task adherence.

Main takeaways:
- DPO works without tons of data
- Preference source really matters
- LoRA makes it feasible on consumer hardware
- Beta parameter is important

Next steps: scale preference collection, try iterative refinement,
better evaluation metrics.