How to use Wan1302/lab22-dpo-adapter-adapter with PEFT:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the 4-bit base model (needs bitsandbytes), then attach the DPO LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-3B-bnb-4bit")
model = PeftModel.from_pretrained(base_model, "Wan1302/lab22-dpo-adapter-adapter")
```

DPO LoRA adapter trained on top of an SFT-mini Qwen2.5 base for the VinUni AICB Day 22 alignment lab. Stack: Unsloth + TRL DPOTrainer.

| Field | Value |
|---|---|
| Base model | unsloth/Qwen2.5-3B-bnb-4bit |
| Compute tier | T4 |
| SFT predecessor | 1k VN Alpaca (5CD-AI/Vietnamese-alpaca-cleaned) |
| Preference dataset | argilla/ultrafeedback-binarized-preferences-cleaned (2k slice) |
| DPO β | 0.1 |
| DPO learning rate | 5e-07 |
| Epochs | 1 |
| LoRA r / alpha | 16 / 32 |
| Max sequence length | 512 |
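
The settings in the table can be collected into one typed record, which also makes the effective LoRA scaling explicit. This is only a minimal sketch: the `Lab22DPOSettings` class and its field names are hypothetical and not taken from the lab code, and it assumes the standard LoRA scaling of `alpha / r` (rank-stabilized LoRA would scale differently).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lab22DPOSettings:
    """Hypothetical container mirroring the table above; names are illustrative."""

    base_model: str = "unsloth/Qwen2.5-3B-bnb-4bit"
    preference_dataset: str = "argilla/ultrafeedback-binarized-preferences-cleaned"
    beta: float = 0.1
    learning_rate: float = 5e-7
    epochs: int = 1
    lora_r: int = 16
    lora_alpha: int = 32
    max_seq_length: int = 512

    @property
    def lora_scaling(self) -> float:
        # Assumption: standard LoRA, where the low-rank update is scaled by alpha / r.
        return self.lora_alpha / self.lora_r


settings = Lab22DPOSettings()
print(f"LoRA scaling (alpha / r): {settings.lora_scaling}")
```

With r = 16 and alpha = 32, the adapter update is applied with a scaling factor of 2 under that assumption.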

Training results:

| Metric | Value |
|---|---|
| Final training loss | 0.8086 |
| End chosen reward | -0.873 |
| End rejected reward | -0.948 |
| End reward gap | +0.075 |
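
The reward gap in the table is the chosen reward minus the rejected reward. Assuming these are the β-scaled implicit rewards that TRL logs, and assuming the standard sigmoid DPO loss, the loss for a single pair is the negative log-sigmoid of that gap. The sketch below checks the arithmetic; the helper name `dpo_pair_loss` is hypothetical and used only for illustration.

```python
import math


def dpo_pair_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Sigmoid DPO loss for one pair, given beta-scaled implicit rewards."""
    gap = chosen_reward - rejected_reward
    # -log(sigmoid(gap)) is equal to log(1 + exp(-gap)).
    return math.log1p(math.exp(-gap))


# End-of-training values from the table above.
chosen, rejected = -0.873, -0.948
gap = chosen - rejected

print(f"reward gap: {gap:+.3f}")
print(f"loss at that gap: {dpo_pair_loss(chosen, rejected):.3f}")
print(f"loss at zero gap: {dpo_pair_loss(0.0, 0.0):.3f}")
```

A gap of +0.075 gives a loss of about 0.656 at that gap, only slightly below the ln 2 ≈ 0.693 of a policy with no preference at all. The reported final training loss of 0.8086 need not match this figure: if it is an average of per-pair losses, the mean of a convex loss is at least the loss evaluated at the mean gap.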
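
How to load the adapter with Unsloth:

```python
# Import unsloth first so its patches apply before peft/transformers are loaded.
from unsloth import FastLanguageModel
from peft import PeftModel

# Load the 4-bit base model and tokenizer, then attach the DPO LoRA adapter.
model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-3B-bnb-4bit", load_in_4bit=True, max_seq_length=512
)
model = PeftModel.from_pretrained(model, "Wan1302/lab22-dpo-adapter-adapter")
```
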
VinUni AICB program · Track 3 Day 22 · A20 cohort 2026.