LLaMA 3 DPO Fine-Tuned Model
Model Overview
This model is a fine-tuned version of Meta LLaMA 3 8B, trained using Direct Preference Optimization (DPO) to improve response quality, alignment, and helpfulness.
The model learns to prefer stronger responses over weaker ones using a human-feedback-style preference dataset of chosen vs. rejected responses.
Model Details
Base Model: meta-llama/Meta-Llama-3-8B
Fine-Tuning Method: DPO (Direct Preference Optimization)
Adapter Type: LoRA (Parameter-Efficient Fine-Tuning)
Framework: Hugging Face Transformers + TRL + PEFT
Language: English
License: Same as base model (LLaMA 3 license)
Use Case: Chat, instruction following, aligned responses
Intended Use
Direct Use:
- Chatbots
- AI assistants
- Instruction-following tasks
- Educational tools
Downstream Use:
- Further fine-tuning (SFT / RLHF)
- Integration into applications (RAG, agents)
- Domain-specific assistants
Out-of-Scope Use:
- Harmful or malicious content generation
- Legal/medical advice without verification
- Fully autonomous decision-making systems
Limitations and Risks
- May generate hallucinated or incorrect responses
- Biases present in training data may persist
- Performance depends on prompt quality
- Not as thoroughly aligned as production-grade assistant models
How to Use
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "your-username/llama3-dpo-final"
# Load the tokenizer and the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
# Prompt in the Human/Assistant format used during training
prompt = "Human: Explain machine learning\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
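If the repository contains only the LoRA adapter weights rather than merged full weights, the model may need to be loaded through PEFT instead. A minimal sketch, assuming the adapter config stored in the repo points at the correct base model:

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_path = "your-username/llama3-dpo-final"
# Reads the adapter config, loads the base model it references,
# and attaches the LoRA weights on top of it.
model = AutoPeftModelForCausalLM.from_pretrained(adapter_path)
tokenizer = AutoTokenizer.from_pretrained(adapter_path)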
Training Details
Dataset:
- Anthropic HH-RLHF dataset
- Each example contains three fields (see the example record below):
  - prompt
  - chosen (the preferred response)
  - rejected (the less preferred response)
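For illustration, a single record in this format might look like the following; the text is made up, and only the field names come from the dataset description above:

example_record = {
    "prompt": "Human: Explain machine learning\nAssistant:",
    "chosen": " Machine learning is a branch of AI where models learn patterns from data rather than being explicitly programmed.",
    "rejected": " It's just computers doing stuff automatically.",
}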
Training Method:
- Direct Preference Optimization (DPO)
- No separate reward model is required
- Preferences are learned directly from chosen vs. rejected comparisons (see the loss sketch below)
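The underlying objective (from the DPO paper referenced below) can be sketched in a few lines of Python. Sequence log-probabilities are assumed to be computed elsewhere, and beta=0.1 is an illustrative default rather than a value stated in this card:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected log-ratios;
    # no separate reward model is needed.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy per-sequence log-probabilities, just to show the call shape:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))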
Hyperparameters:
- Batch size: 1
- Gradient accumulation: 4
- Learning rate: 5e-5
- Epochs: 1
- Precision: FP32
- LoRA rank: 8
- LoRA alpha: 16
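The hyperparameters above can be wired into a PEFT + TRL training sketch roughly as follows. This is an approximation rather than the exact training script: argument names differ across TRL versions, the toy dataset is illustrative, and the DPO beta and output directory are assumptions.

from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# One toy record in the prompt/chosen/rejected format described above.
train_dataset = Dataset.from_list([{
    "prompt": "Human: Explain machine learning\nAssistant:",
    "chosen": " Machine learning lets models learn patterns from data.",
    "rejected": " It's when computers do stuff.",
}])

lora_config = LoraConfig(
    r=8,                                  # LoRA rank
    lora_alpha=16,                        # LoRA alpha
    target_modules=["q_proj", "v_proj"],  # see Technical Details below
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    num_train_epochs=1,
    output_dir="llama3-dpo-final",
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use `tokenizer=` instead
    peft_config=lora_config,
)
trainer.train()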
Evaluation
Method:
- Manual qualitative evaluation
- Comparison of generated outputs vs chosen responses
Example prompt: "What is machine learning?"
Output: The fine-tuned model produces a more structured and helpful explanation than the base model.
Model Behavior
The model is optimized to:
- Prefer helpful, safe, and informative responses
- Avoid vague or low-quality answers
- Follow conversational structure (Human/Assistant)
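For reference, a multi-turn prompt in this structure might look like the following; the exact turn separators are an assumption, so match them to the format used in your copy of the training data:

prompt = (
    "Human: What is overfitting?\n"
    "Assistant: Overfitting is when a model memorizes its training data and generalizes poorly.\n"
    "Human: How can I reduce it?\n"
    "Assistant:"
)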
Technical Details
Architecture:
- LLaMA 3 Transformer
- Decoder-only causal language model
Fine-Tuning:
- LoRA adapters applied on:
q_proj
v_proj
Quantization:
- 4-bit (NF4) using BitsAndBytes
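A minimal loading sketch with bitsandbytes; the compute dtype is an assumption, since the card only states that 4-bit NF4 quantization was used:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # assumption: compute dtype not stated in the card
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
)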
Environmental Impact
- Hardware: NVIDIA T4 GPU
- Training Time: ~30–60 minutes
- Framework: Kaggle Notebook Environment
- Estimated Impact: Low (due to PEFT + quantization)
References
- Rafailov et al. (2023), "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", https://arxiv.org/abs/2305.18290
- Hugging Face TRL Library
- Anthropic HH-RLHF Dataset
Acknowledgements
- Meta AI for LLaMA 3
- Hugging Face for Transformers ecosystem
- Anthropic for RLHF dataset
Contact
For questions or improvements, open an issue or discussion in the repository.