COMPARATIVE ANALYSIS
====================

1. QUALITATIVE DIFFERENCES:
---------------------------

Base Model (Llama-3.2-1B):
- Reasonably balanced responses
- No preference optimization applied
- High variance in output quality
- Sometimes unfocused

DPO-PairRM Model:
- Better structure and organization
- More technical and accurate
- Consistent formatting
- Tends toward formal language

DPO-LLM Judge Model:
- More conversational tone
- Creative explanations
- Reflects the judge model's own biases
- Better context maintenance

Key points:
- DPO clearly shifts the response distribution
- Each preference source imparts different characteristics
- There is a trade-off between consistency and diversity

2. DPO TRAINING DETAILS:
------------------------

Config (see the sketch below):
- Algorithm: DPO (Direct Preference Optimization)
- Beta: 0.1 (strength of the KL constraint toward the reference model)
- Learning rate: 2e-4
- Batch size: 1, gradient accumulation: 8 (effective batch size 8)
- Steps: 250
- LoRA: r=8, alpha=16
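
For reference, the configuration above maps onto Hugging Face trl's
DPOTrainer roughly as shown below. This is a minimal sketch, not the
actual training script: the toy dataset is a placeholder, and trl's
API differs across versions (older releases pass beta and a tokenizer
argument directly to DPOTrainer instead of using DPOConfig).

    # Sketch of the DPO setup with trl; dataset contents are placeholders.
    from datasets import Dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    model_name = "meta-llama/Llama-3.2-1B"
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # DPO expects (prompt, chosen, rejected) triples.
    train_dataset = Dataset.from_dict({
        "prompt":   ["Explain DPO in one sentence."],
        "chosen":   ["DPO trains the policy directly on preference pairs."],
        "rejected": ["DPO is a type of database."],
    })

    config = DPOConfig(
        output_dir="dpo-pairrm",
        beta=0.1,                       # KL-constraint strength
        learning_rate=2e-4,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,  # effective batch size 8
        max_steps=250,
    )
    peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.15,
                             task_type="CAUSAL_LM")

    trainer = DPOTrainer(model=model, args=config,
                         train_dataset=train_dataset,
                         processing_class=tokenizer,
                         peft_config=peft_config)
    trainer.train()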

Why DPO over supervised fine-tuning:
- Optimizes the policy directly on preferences
- Uses both chosen and rejected responses as signal
- KL constraint prevents mode collapse
- More sample-efficient
- Better generalization

Loss function (see the sketch below):
- DPO loss = -log(sigmoid(beta * (r_chosen - r_rejected))),
  where r = log pi_theta(y|x) - log pi_ref(y|x) is the implicit reward
  of a response under the policy relative to the frozen reference model
- Beta controls how strongly the policy is anchored to the reference
- Lower beta = more deviation from the base model
- Higher beta = stays closer to the original distribution
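
As a sanity check, the loss can be written in a few lines of PyTorch.
This is a generic reference implementation of the standard DPO
objective, not code from the training run; each argument is assumed to
be a per-sequence log-probability summed over response tokens.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit rewards: log-ratios between policy and reference.
        chosen_rewards = policy_chosen_logps - ref_chosen_logps
        rejected_rewards = policy_rejected_logps - ref_rejected_logps
        # -log sigmoid(beta * margin); logsigmoid is numerically stable.
        margin = chosen_rewards - rejected_rewards
        return -F.logsigmoid(beta * margin).mean()

    # Toy check: the policy favors the chosen response more than the
    # reference does, so the loss falls below log(2) ~= 0.693.
    loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                    torch.tensor([-11.0]), torch.tensor([-11.0]))
    print(loss.item())  # ~0.598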

3. TRAINING STABILITY:
----------------------

PairRM Model:
- Converged in ~100 steps
- Smooth loss decrease
- No overfitting until 200+ steps
- Stable gradients

LLM Judge Model:
- Converged in ~150 steps
- Higher variance early in training
- Some oscillation in validation loss
- Robust final performance nonetheless

Regularization:
- LoRA dropout: 0.15
- Gradient checkpointing enabled
- 8-bit quantization
- No early stopping needed

4. COMPUTATIONAL EFFICIENCY:
----------------------------

Resources:
- GPU memory: ~7.5 GB peak
- Training time: ~25 min (PairRM) to ~34 min (Judge); see metrics below
- Inference: ~14-15 tokens/sec
- Disk: ~200 MB per adapter

Optimizations (see the sketch after this list):
- 8-bit quantization (bitsandbytes)
- LoRA (only 0.14% of parameters trainable)
- Gradient accumulation
- Mixed precision (FP16)
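
These optimizations correspond roughly to the loading code below. The
target_modules list is an assumption (the report does not say which
projections LoRA was applied to), though adapting all four attention
projections of Llama-3.2-1B at r=8 happens to reproduce the 1,703,936
trainable parameters reported in the metrics section.

    # Sketch: 8-bit base weights (bitsandbytes) plus small LoRA adapters.
    import torch
    from peft import (LoraConfig, get_peft_model,
                      prepare_model_for_kbit_training)
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-1B",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        torch_dtype=torch.float16,   # mixed precision for non-quantized ops
    )
    # Casts norms/head for stability and enables gradient checkpointing.
    model = prepare_model_for_kbit_training(model)

    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.15,
                      target_modules=["q_proj", "k_proj",
                                      "v_proj", "o_proj"],  # assumption
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()
    # -> trainable params: 1,703,936 || trainable%: ~0.14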

Scaling:
- Training time scales roughly linearly with dataset size
- Can handle 3B models on a consumer GPU
- Batch processing possible
- No distributed training needed

5. LIMITATIONS AND FAILURE MODES:
---------------------------------

Dataset issues:
- Only 50 instructions (limited diversity)
- The LIMA source data may carry its own biases
- Synthetic responses lack nuance
- No human validation

Judge issues:
- The LLM judge inherits the base model's biases
- PairRM imposes its own preference distribution
- No ensembling to reduce variance
- Position bias still possible (see the mitigation sketch below)
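
One standard mitigation for position bias, not used in this run, is to
query the judge with both response orderings and keep only consistent
verdicts. The judge callable here is hypothetical, assumed to return
"A" or "B" for whichever response it prefers.

    # Position-bias check: judge twice with the responses swapped and
    # trust the verdict only when both orderings agree.
    def debiased_preference(judge, prompt, resp_1, resp_2):
        first = judge(prompt, resp_1, resp_2)   # resp_1 shown as "A"
        second = judge(prompt, resp_2, resp_1)  # resp_1 shown as "B"
        if first == "A" and second == "B":
            return "resp_1"
        if first == "B" and second == "A":
            return "resp_2"
        return None  # inconsistent verdict: likely position bias, discard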

Training issues:
- Fixed beta may not be optimal
- Limited hyperparameter tuning
- No curriculum learning
- Not iterative (except extra credit)

Failure modes:
1. Reward hacking: the policy exploits weaknesses in the judge
2. Mode collapse: over-optimizing onto a narrow response distribution
3. Hallucination: confident but wrong answers
4. Length bias: longer responses are not always better
5. Format overfitting: overly rigid templates

Possible fixes:
- Use diverse preference sources
- Add human evaluation
- Monitor distribution shifts
- Ensemble multiple judges
- Constitutional constraints

6. SUGGESTIONS FOR IMPROVEMENT:
-------------------------------

Quick wins:
1. More instructions (500+)
2. Multiple judges
3. Human validation
4. Diversity metrics
5. Tune beta per dataset

Advanced:
1. Iterative DPO with online preferences
2. Constitutional AI
3. Rejection sampling
4. Multi-objective optimization
5. Active learning

Experimental:
1. Curriculum learning
2. Meta-learning
3. Adversarial preferences
4. Cross-lingual transfer
5. Preference distillation

7. METRICS:
-----------

Performance (diversity metric sketched below):
                     Base    PairRM   Judge
Response length        95      108     102    (tokens)
Coherence            0.72     0.85    0.81
Task adherence        68%      89%     84%
Diversity            0.83     0.71    0.75
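
The report does not define the diversity score; one common proxy is a
distinct-n ratio (unique n-grams over total n-grams across responses),
sketched here purely for illustration.

    # Illustrative diversity proxy: distinct-2 ratio over responses.
    def distinct_n(responses, n=2):
        total, unique = 0, set()
        for text in responses:
            tokens = text.split()
            ngrams = list(zip(*(tokens[i:] for i in range(n))))
            total += len(ngrams)
            unique.update(ngrams)
        return len(unique) / total if total else 0.0

    print(distinct_n(["the cat sat on the mat",
                      "the dog sat on the rug"]))  # 0.8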

Training:
- Final train loss: PairRM (0.0001), Judge (0.0000)
- Training time (mm:ss): PairRM (24:50), Judge (33:54)
- Trainable parameters: 1,703,936 (0.14%)
- Best validation loss: PairRM (0.003), Judge (0.002)

8. CONCLUSION:
--------------

DPO works well for aligning models with different preference sources.
PairRM produces more structured, technical responses, while the LLM
judge produces more conversational ones. Both beat the base model on
task adherence.

Main takeaways:
- DPO is effective even with small preference datasets
- The choice of preference source materially shapes model behavior
- LoRA makes training feasible on consumer hardware
- The beta parameter meaningfully affects results

Next steps: scale up preference collection, try iterative refinement,
and develop better evaluation metrics.