# 🦙 Llama-3 DPO Alignment Pipeline
A professional, modular pipeline to fine-tune Llama-3 8B using Direct Preference Optimization (DPO), covering training, inference, and benchmarking.
## 📖 What is This Project?
This repository contains a complete, production-grade pipeline for aligning a large language model using DPO. Instead of manually patching notebooks together, this project is structured as reusable Python scripts with YAML configurations so anyone can reproduce the training with a single command.
## ⚡ The Core Idea
| Concept | What We Did |
|---|---|
| Base Model | unsloth/llama-3-8b-Instruct-bnb-4bit |
| Alignment Technique | Direct Preference Optimization (DPO) |
| Dataset | Intel/orca_dpo_pairs (1,000 samples) |
| Speed Optimization | Unsloth (2x faster training) |
| Memory Optimization | 4-bit quantization + Gradient Checkpointing |
| Environment | Kaggle T4 x2 GPU (Free Tier) |
| Trained Adapter | 🤗 Karan6124/llama3-8b-dpo-orca-adapter |
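For intuition, DPO reduces to a simple per-pair loss: the negative log-sigmoid of `beta` times the difference between the policy's and the reference model's log-probability ratios for the chosen vs. rejected response. A minimal scalar sketch (function name and inputs are illustrative, not from the repo; the real training uses TRL's `DPOTrainer`):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    logits = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy and reference agree, the loss sits at `log 2`; as the policy favors the chosen response more than the reference does, the loss falls below that, which is exactly the preference signal DPO optimizes.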
## 🏗️ Project Structure
```
llama3-dpo-alignment-pipeline/
├── 📁 configs/
│   ├── dpo_config.yaml           # All DPO training hyperparameters
│   └── benchmark_config.yaml     # Test prompts & generation settings
├── 📁 scripts/
│   └── train_dpo.py              # The main training engine (reads from configs/)
├── 📁 inference/
│   └── inference.py              # Load adapter and run interactive inference
├── 📁 evaluation/
│   └── benchmark.py              # Compare Base vs. Aligned model side-by-side
├── 📁 training/
│   └── training-llama3-dpo.ipynb # The original Kaggle notebook
├── 📁 models/                    # (gitignored) Local adapter weights live here
├── pyproject.toml                # Dependency management with uv
└── README.md
```
## 🛠️ Setup & Installation
This project uses uv, an extremely fast Python package manager. Everything is managed in one command.
### 1. Install uv
```powershell
# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
### 2. Clone the Repo & Sync Dependencies
```bash
git clone https://github.com/Edge-Explorer/llama3-dpo-alignment-pipeline.git
cd llama3-dpo-alignment-pipeline
uv sync
```
> **Note:** Unsloth requires an NVIDIA GPU to import. For local development, you can write and review code without a GPU. Use Kaggle or Google Colab to actually run the scripts.
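Because of that import constraint, a cheap environment check before attempting to import Unsloth can save a confusing traceback. A small sketch (the helper name is ours, not part of the repo):

```python
import importlib.util

def gpu_stack_available() -> bool:
    """Rough proxy for a GPU-ready environment: both torch and unsloth are installed.

    find_spec only checks installability; it does not actually import unsloth,
    so it won't crash on a machine without an NVIDIA GPU.
    """
    return all(
        importlib.util.find_spec(mod) is not None
        for mod in ("torch", "unsloth")
    )
```

On a CPU-only laptop this returns `False` (or `True` without guaranteeing the import will succeed), so treat it as a hint, not a hard guarantee.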
## 🏋️ Training
All training parameters are in configs/dpo_config.yaml. You can tweak learning rates, batch sizes, and sequence lengths without touching any Python code.
```bash
# On Kaggle or a GPU machine:
uv run python scripts/train_dpo.py
```
**Key hyperparameters (optimized for the T4 GPU):**

- `beta: 0.1` (DPO temperature)
- `learning_rate: 5e-6`
- `per_device_train_batch_size: 1`
- `gradient_accumulation_steps: 8`
- `max_length: 768`
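The corresponding section of `configs/dpo_config.yaml` might look roughly like this (the exact schema of the repo's file is assumed; field names follow the hyperparameters listed above):

```yaml
# DPO training hyperparameters (T4-friendly)
beta: 0.1                          # DPO temperature
learning_rate: 5.0e-6
per_device_train_batch_size: 1     # effective batch = 1 x 8 accumulation steps = 8
gradient_accumulation_steps: 8
max_length: 768
```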
## 🤖 Inference
Download the trained adapter from Hugging Face and load it locally.
```bash
# Make sure the adapter is in models/llama3_dpo_adapter/
uv run python inference/inference.py
```
Or use it directly in your own code:
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Karan6124/llama3-8b-dpo-orca-adapter",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Why use DPO over SFT?"}]
# apply_chat_template with return_tensors="pt" returns a tensor of token ids
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(input_ids=input_ids, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
## 📊 Benchmarking
Run the benchmark script on Kaggle to generate a comparison report:
```bash
uv run python evaluation/benchmark.py
```
Results are saved automatically to evaluation/benchmark_report.txt.
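The report is plain text, one side-by-side section per prompt. A minimal sketch of such a layout (function and label names are illustrative; the real logic lives in `evaluation/benchmark.py`):

```python
def format_comparison(prompt: str, base_answer: str, dpo_answer: str) -> str:
    """Render one prompt's base-vs-aligned outputs as a plain-text report section."""
    return "\n".join([
        "=" * 60,
        f"PROMPT: {prompt}",
        "-" * 60,
        f"[BASE MODEL]\n{base_answer}",
        "-" * 60,
        f"[DPO-ALIGNED]\n{dpo_answer}",
    ])
```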
## 📚 Key Lessons Learned
| Problem | Solution |
|---|---|
| `transformers` 5.0.0 broke `trl` in Colab | Switched to Kaggle's stable environment |
| `DPOConfig` not found in old `trl` | Pinned to `trl>=0.12.0` |
| `OutOfMemoryError` on the T4 GPU | Reduced batch size to 1, enabled gradient checkpointing |
| Slow training | Unsloth's `PatchDPOTrainer` gave ~2x speedup |
| Messy notebook workflow | Refactored into reusable scripts + YAML configs |
## 📄 License
This project is licensed under the MIT License; see LICENSE for details.
## 🙏 Acknowledgements
- Unsloth for making LLM fine-tuning incredibly fast
- TRL by HuggingFace for the DPO implementation
- Intel/orca_dpo_pairs for the training dataset
- Kaggle for the free GPUs that made this possible