LLaMA 3 DPO Fine-Tuned Model
Model Overview
This model is a fine-tuned version of Meta LLaMA 3 8B, trained using Direct Preference Optimization (DPO) to improve response quality, alignment, and helpfulness.
The model learns to prefer stronger responses over weaker ones using a human-feedback-style preference dataset of chosen vs. rejected responses.
Model Details
Base Model: meta-llama/Meta-Llama-3-8B
Fine-Tuning Method: DPO (Direct Preference Optimization)
Adapter Type: LoRA (Parameter-Efficient Fine-Tuning)
Framework: Hugging Face Transformers + TRL + PEFT
Language: English
License: Same as base model (LLaMA 3 license)
Use Case: Chat, instruction following, aligned responses
Intended Use
Direct Use:
- Chatbots
- AI assistants
- Instruction-following tasks
- Educational tools
Downstream Use:
- Further fine-tuning (SFT / RLHF)
- Integration into applications (RAG, agents)
- Domain-specific assistants
Out-of-Scope Use:
- Harmful or malicious content generation
- Legal/medical advice without verification
- Fully autonomous decision-making systems
Limitations and Risks
- May generate hallucinated or incorrect responses
- Biases present in training data may persist
- Performance depends on prompt quality
- Not as thoroughly aligned as production-grade assistant models
How to Use
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "your-username/llama3-dpo-final"
# Load the tokenizer and the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
# Prompt in the Human/Assistant format used during training
prompt = "Human: Explain machine learning\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
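If the repository contains only the LoRA adapter weights rather than merged full weights, the model may need to be loaded through PEFT instead. A minimal sketch, assuming the adapter config stored in the repo points at the correct base model:

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_path = "your-username/llama3-dpo-final"
# Reads the adapter config, loads the base model it references,
# and attaches the LoRA weights on top of it.
model = AutoPeftModelForCausalLM.from_pretrained(adapter_path)
tokenizer = AutoTokenizer.from_pretrained(adapter_path)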
Training Details
Dataset:
- Anthropic HH-RLHF dataset
- Each example contains three fields (see the example record below):
  - prompt
  - chosen (the preferred response)
  - rejected (the less preferred response)
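For illustration, a single record in this format might look like the following; the text is made up, and only the field names come from the dataset description above:

example_record = {
    "prompt": "Human: Explain machine learning\nAssistant:",
    "chosen": " Machine learning is a branch of AI where models learn patterns from data rather than being explicitly programmed.",
    "rejected": " It's just computers doing stuff automatically.",
}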
Training Method:
- Direct Preference Optimization (DPO)
- No separate reward model is required
- Preferences are learned directly from chosen vs. rejected comparisons (see the loss sketch below)
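The underlying objective (from the DPO paper referenced below) can be sketched in a few lines of Python. Sequence log-probabilities are assumed to be computed elsewhere, and beta=0.1 is an illustrative default rather than a value stated in this card:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected log-ratios;
    # no separate reward model is needed.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy per-sequence log-probabilities, just to show the call shape:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))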
Hyperparameters:
- Batch size: 1
- Gradient accumulation: 4
- Learning rate: 5e-5
- Epochs: 1
- Precision: FP32
- LoRA rank: 8
- LoRA alpha: 16
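The hyperparameters above can be wired into a PEFT + TRL training sketch roughly as follows. This is an approximation rather than the exact training script: argument names differ across TRL versions, the toy dataset is illustrative, and the DPO beta and output directory are assumptions.

from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# One toy record in the prompt/chosen/rejected format described above.
train_dataset = Dataset.from_list([{
    "prompt": "Human: Explain machine learning\nAssistant:",
    "chosen": " Machine learning lets models learn patterns from data.",
    "rejected": " It's when computers do stuff.",
}])

lora_config = LoraConfig(
    r=8,                                  # LoRA rank
    lora_alpha=16,                        # LoRA alpha
    target_modules=["q_proj", "v_proj"],  # see Technical Details below
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    num_train_epochs=1,
    output_dir="llama3-dpo-final",
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use `tokenizer=` instead
    peft_config=lora_config,
)
trainer.train()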
Evaluation
Method:
- Manual qualitative evaluation
- Comparison of generated outputs vs chosen responses
Example prompt: "What is machine learning?"
Output: The fine-tuned model produces a more structured and helpful explanation than the base model.
Model Behavior
The model is optimized to:
- Prefer helpful, safe, and informative responses
- Avoid vague or low-quality answers
- Follow conversational structure (Human/Assistant)
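For reference, a multi-turn prompt in this structure might look like the following; the exact turn separators are an assumption, so match them to the format used in your copy of the training data:

prompt = (
    "Human: What is overfitting?\n"
    "Assistant: Overfitting is when a model memorizes its training data and generalizes poorly.\n"
    "Human: How can I reduce it?\n"
    "Assistant:"
)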
Technical Details
Architecture:
- LLaMA 3 Transformer
- Decoder-only causal language model
Fine-Tuning:
- LoRA adapters applied on:
q_proj
v_proj
Quantization:
- 4-bit (NF4) using BitsAndBytes
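A minimal loading sketch with bitsandbytes; the compute dtype is an assumption, since the card only states that 4-bit NF4 quantization was used:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # assumption: compute dtype not stated in the card
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
)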
Environmental Impact
- Hardware: NVIDIA T4 GPU
- Training Time: ~30–60 minutes
- Framework: Kaggle Notebook Environment
- Estimated Impact: Low (due to PEFT + quantization)
References
- Rafailov et al. (2023), "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", https://arxiv.org/abs/2305.18290
- Hugging Face TRL Library
- Anthropic HH-RLHF Dataset
Acknowledgements
- Meta AI for LLaMA 3
- Hugging Face for Transformers ecosystem
- Anthropic for RLHF dataset
Contact
For questions or improvements, open an issue or discussion in the repository.