LLaMA 3 DPO Fine-Tuned Model

Model Overview

This model is a fine-tuned version of Meta LLaMA 3 8B, trained using Direct Preference Optimization (DPO) to improve response quality, alignment, and helpfulness.

During training, the model learns to prefer stronger responses over weaker ones from a human-preference dataset of chosen/rejected response pairs.


Model Details

Base Model: meta-llama/Meta-Llama-3-8B
Fine-Tuning Method: DPO (Direct Preference Optimization)
Adapter Type: LoRA (Parameter-Efficient Fine-Tuning)
Framework: Hugging Face Transformers + TRL + PEFT
Language: English
License: Same as base model (LLaMA 3 license)
Use Case: Chat, instruction following, aligned responses


Intended Use

Direct Use:

  • Chatbots
  • AI assistants
  • Instruction-following tasks
  • Educational tools

Downstream Use:

  • Further fine-tuning (SFT / RLHF)
  • Integration into applications (RAG, agents)
  • Domain-specific assistants

Out-of-Scope Use:

  • Harmful or malicious content generation
  • Legal/medical advice without verification
  • Fully autonomous decision-making systems

Limitations and Risks

  • May generate hallucinated or incorrect responses
  • Biases present in training data may persist
  • Performance depends on prompt quality
  • Not aligned to the same degree as production-grade LLMs

How to Use

from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "your-username/llama3-dpo-final"

# Load the fine-tuned model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# The model expects the Human/Assistant conversational format
prompt = "Human: Explain machine learning\nAssistant:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
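
If the repository hosts only the LoRA adapter weights rather than a merged checkpoint, the adapter can instead be loaded on top of the base model with peft. A minimal sketch, assuming the adapter was not merged before upload:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base model, then attach the DPO-trained LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base_model, "your-username/llama3-dpo-final")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")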


Training Details

Dataset:

  • Anthropic HH-RLHF dataset
  • Each example contains: a prompt, a chosen (preferred) response, and a rejected (less preferred) response (see the loading sketch below)
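
A minimal sketch of loading the dataset and splitting each record into that triplet, assuming the standard Anthropic/hh-rlhf layout on the Hugging Face Hub (chosen and rejected share the same dialogue prefix and differ only in the final Assistant turn):

from datasets import load_dataset

ds = load_dataset("Anthropic/hh-rlhf", split="train")

def to_triplet(example):
    # The shared prompt runs up to and including the last "Assistant:" tag;
    # chosen/rejected are the two alternative completions after it.
    marker = "\n\nAssistant:"
    cut = example["chosen"].rfind(marker) + len(marker)
    return {
        "prompt": example["chosen"][:cut],
        "chosen": example["chosen"][cut:],
        "rejected": example["rejected"][example["rejected"].rfind(marker) + len(marker):],
    }

triplets = ds.map(to_triplet)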

Training Method:

  • Direct Preference Optimization (DPO)
  • No reward model required
  • Learns preferences directly from pairwise comparisons (see the loss sketch below)
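
The objective can be written directly from sequence log-probabilities under the policy and a frozen reference model, with no reward model in the loop. A sketch of the core loss (beta, the DPO temperature, is an assumption; its value is not stated in this card):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much more likely the policy makes each response
    # than the reference model does, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()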

Hyperparameters:

  • Batch size: 1
  • Gradient accumulation: 4
  • Learning rate: 5e-5
  • Epochs: 1
  • Precision: FP32
  • LoRA rank: 8
  • LoRA alpha: 16
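
A minimal sketch of how these hyperparameters map onto TRL's DPOTrainer; model, tokenizer, and triplets stand in for objects built as in the sections above, output_dir is hypothetical, and exact argument names vary between trl releases:

from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="llama3-dpo-final",   # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size of 4
    learning_rate=5e-5,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                 # policy model with LoRA adapters attached
    ref_model=None,              # with PEFT, TRL derives the frozen reference from the base weights
    args=training_args,
    train_dataset=triplets,      # prompt/chosen/rejected triplets (see Dataset above)
    processing_class=tokenizer,  # called `tokenizer` in older trl releases
)
trainer.train()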

Evaluation

Method:

  • Manual qualitative evaluation
  • Comparison of generated outputs vs chosen responses

Example:

Prompt: What is machine learning?

Output: The fine-tuned model produces a more structured and helpful explanation than the base model.


Model Behavior

The model is optimized to:

  • Prefer helpful, safe, and informative responses
  • Avoid vague or low-quality answers
  • Follow the Human/Assistant conversational structure (a formatting helper is sketched below)
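
A hypothetical helper for building prompts in that format (format_prompt is not part of this repository; it mirrors the convention in the usage example above):

def format_prompt(user_message, history=()):
    # history: iterable of (human_text, assistant_text) pairs from earlier turns
    parts = [f"Human: {h}\nAssistant: {a}" for h, a in history]
    parts.append(f"Human: {user_message}\nAssistant:")
    return "\n".join(parts)

prompt = format_prompt("Explain machine learning")
# -> "Human: Explain machine learning\nAssistant:"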

Technical Details

Architecture:

  • LLaMA 3 Transformer
  • Decoder-only causal language model

Fine-Tuning:

  • LoRA adapters applied to the attention projections q_proj and v_proj
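
A minimal peft configuration matching the adapter settings in this card (the dropout value is an assumption; it is not stated here):

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                  # LoRA rank (see Hyperparameters)
    lora_alpha=16,                        # LoRA alpha (see Hyperparameters)
    target_modules=["q_proj", "v_proj"],  # attention projections listed above
    lora_dropout=0.05,                    # assumption: not stated in this card
    task_type="CAUSAL_LM",
)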

Quantization:

  • 4-bit (NF4) using BitsAndBytes
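
A minimal sketch of loading the base model in 4-bit NF4 with bitsandbytes; the compute dtype is an assumption, as it is not stated in this card:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 quantization, as noted above
    bnb_4bit_compute_dtype=torch.float16,  # assumption: compute dtype not stated in this card
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
)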

Environmental Impact

  • Hardware: NVIDIA T4 GPU
  • Training Time: ~30–60 minutes
  • Compute Environment: Kaggle Notebook
  • Estimated Impact: Low (due to PEFT + quantization)

Acknowledgements

  • Meta AI for LLaMA 3
  • Hugging Face for Transformers ecosystem
  • Anthropic for the HH-RLHF dataset

Contact

For questions or improvements, open an issue or discussion in the repository.
