Tags: PEFT, Safetensors, Vietnamese, English, dpo, alignment, lora, qwen2.5, vinuni-lab22

Lab 22 DPO Adapter — lab22-dpo-adapter-adapter

DPO LoRA adapter trained on top of an SFT-mini Qwen2.5 base for the VinUni AICB Day 22 alignment lab. Stack: Unsloth + TRL DPOTrainer.

Training details

| Field | Value |
| --- | --- |
| Base model | unsloth/Qwen2.5-3B-bnb-4bit |
| Compute tier | T4 |
| SFT predecessor | 1k VN Alpaca (5CD-AI/Vietnamese-alpaca-cleaned) |
| Preference dataset | argilla/ultrafeedback-binarized-preferences-cleaned (2k slice) |
| DPO β | 0.1 |
| DPO learning rate | 5e-07 |
| Epochs | 1 |
| LoRA r / alpha | 16 / 32 |
| Max sequence length | 512 |
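
For anyone reproducing the run, the values in the table can be collected in one small config object. This is a minimal sketch and not the lab's training script: the `Lab22DPOHparams` class and its method names are hypothetical, and the dictionary keys assume the keyword names used by TRL's `DPOConfig` (`beta`, `learning_rate`, `num_train_epochs`, `max_length`) and PEFT's `LoraConfig` (`r`, `lora_alpha`). Check them against your installed versions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lab22DPOHparams:
    """Hypothetical container for the hyperparameters listed in the table."""

    base_model: str = "unsloth/Qwen2.5-3B-bnb-4bit"
    beta: float = 0.1
    learning_rate: float = 5e-7
    epochs: int = 1
    lora_r: int = 16
    lora_alpha: int = 32
    max_seq_length: int = 512

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r

    def dpo_kwargs(self) -> dict:
        # Assumed mapping onto trl.DPOConfig keyword arguments.
        return {
            "beta": self.beta,
            "learning_rate": self.learning_rate,
            "num_train_epochs": self.epochs,
            "max_length": self.max_seq_length,
        }

    def lora_kwargs(self) -> dict:
        # Assumed mapping onto peft.LoraConfig keyword arguments.
        return {"r": self.lora_r, "lora_alpha": self.lora_alpha}


hp = Lab22DPOHparams()
print("LoRA scaling (alpha / r):", hp.lora_scaling)
print("DPO kwargs:", hp.dpo_kwargs())
print("LoRA kwargs:", hp.lora_kwargs())
```

With r = 16 and alpha = 32, the adapter update is scaled by 2. The two dictionaries could be unpacked into `DPOConfig(**hp.dpo_kwargs(), ...)` and `LoraConfig(**hp.lora_kwargs(), ...)`, together with whatever output directory, batch size, and target modules the actual run used. Those are not stated in this card.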

Results

| Metric | Value |
| --- | --- |
| Final training loss | 0.8086 |
| End chosen reward | -0.873 |
| End rejected reward | -0.948 |
| End reward gap | +0.075 |
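
The reward gap is the chosen reward minus the rejected reward. The sketch below reproduces that arithmetic and relates it to the sigmoid DPO loss for a single pair. It assumes TRL's logging convention, in which a logged reward is β times the difference between policy and reference log-probabilities. The function and variable names are illustrative only.

```python
import math

BETA = 0.1               # DPO beta from the training table
chosen_reward = -0.873   # end chosen reward from the results table
rejected_reward = -0.948 # end rejected reward from the results table


def dpo_loss_from_margin(margin: float) -> float:
    """Sigmoid DPO loss for one pair: -log(sigmoid(margin)) = log(1 + exp(-margin))."""
    return math.log1p(math.exp(-margin))


# Reward gap (margin) between chosen and rejected responses.
reward_gap = chosen_reward - rejected_reward

# Under the assumed convention reward = beta * (log pi_policy - log pi_ref),
# dividing by beta recovers the implied log-probability shift in nats.
chosen_log_ratio = chosen_reward / BETA
rejected_log_ratio = rejected_reward / BETA

print(f"reward gap: {reward_gap:+.3f}")
print(f"implied log-ratio (chosen):   {chosen_log_ratio:.2f}")
print(f"implied log-ratio (rejected): {rejected_log_ratio:.2f}")
print(f"loss at this margin: {dpo_loss_from_margin(reward_gap):.4f}")
print(f"loss at zero margin: {dpo_loss_from_margin(0.0):.4f}")
```

A zero margin gives a loss of ln 2 (about 0.693), and a margin of +0.075 gives a slightly lower per-pair loss. The reported training loss of 0.8086 is an average over pairs, so it need not match the loss evaluated at the average margin. Because the per-pair loss is convex in the margin, the mean loss can sit above that value when margins vary widely across pairs. That reading is an interpretation, not something the logged metrics confirm.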

Usage

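# Import unsloth first so its patches are applied before peft/transformers load.
from unsloth import FastLanguageModel
from peft import PeftModel

# Load the 4-bit base model and its tokenizer.
model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-3B-bnb-4bit", load_in_4bit=True, max_seq_length=512
)
# Attach the DPO LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(model, "Wan1302/lab22-dpo-adapter-adapter")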

License & limitations

  • Base license: Qwen Research License (Qwen2.5-3B). Unlike most other Qwen2.5 sizes, the 3B model is not released under Apache-2.0.
  • This is an experimental research adapter. Not production-ready.
  • The DPO preference data is English-only (UltraFeedback), and safety alignment was tested on only 8 Vietnamese prompts (see NB4 in the lab repo). Use with care for deployment-critical workloads.

Citation

VinUni AICB program · Track 3 Day 22 · A20 cohort 2026.
