Day 22 · DPO Alignment Lab — DPO Adapter

This repository contains the DPO LoRA adapter trained in the Day 22 lab pipeline for Track 3.

Model summary

  • Base model: unsloth/Qwen2.5-3B-bnb-4bit
  • Training stack: Unsloth + TRL DPOTrainer
  • Adapter output: adapters/dpo/
  • Purpose: align a lightweight Qwen2.5-3B checkpoint with preference data while keeping the deployment footprint small

This model card describes the adapter that is pushed to the Hugging Face Hub in Option B (Professional). The notebook also builds an upstream SFT-mini checkpoint before running DPO.
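
For reference, a minimal sketch of that upload step with huggingface_hub, assuming the adapter folder is adapters/dpo/ and the target repo matches this page's id; the exact notebook code may differ:

```python
from huggingface_hub import HfApi

# Hypothetical Option B upload: push the local DPO adapter folder to the Hub.
# Assumes authentication via `huggingface-cli login` or the HF_TOKEN env var.
api = HfApi()
api.create_repo("datnguyennn/day22-dpo-alignment", exist_ok=True)
api.upload_folder(
    folder_path="adapters/dpo",
    repo_id="datnguyennn/day22-dpo-alignment",
    repo_type="model",
)
```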

Training data

Stage 1: SFT-mini

  • Dataset: 5CD-AI/Vietnamese-Multi-turn-Chat-Alpaca
  • Slice: 1000 samples
  • Epochs: 1
  • LoRA: r=16, lora_alpha=32
  • Sequence length: 512
  • Batch size: 1
  • Gradient accumulation: 8
  • Learning rate: 2e-4
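
A minimal sketch of this SFT-mini stage with Unsloth and TRL follows. The target_modules list, output paths, and the pre-flattened `text` column are assumptions, and exact argument names vary across TRL versions:

```python
from unsloth import FastLanguageModel  # import unsloth first so its patches apply
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Load the 4-bit base and attach a LoRA adapter with the settings above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-bnb-4bit",
    max_seq_length=512,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    # Assumed module list; the notebook may target a different set.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# 1,000-sample slice of the Vietnamese chat data. Assumes the multi-turn rows
# were flattened into a single `text` column (preprocessing omitted here).
dataset = load_dataset("5CD-AI/Vietnamese-Multi-turn-Chat-Alpaca", split="train[:1000]")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        max_seq_length=512,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs/sft-mini",
    ),
)
trainer.train()
```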

Stage 2: Preference data for DPO

  • Dataset: argilla/ultrafeedback-binarized-preferences-cleaned
  • Slice: 2000 preference pairs
  • Format: prompt, chosen, rejected
  • Sequence length: 512
  • Max prompt length: 256
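
A sketch of the slicing and flattening step, assuming the cleaned split stores chosen/rejected as chat-message lists whose final turn is the assistant reply (the helper name is illustrative):

```python
from datasets import load_dataset

# 2,000-pair slice used for the DPO stage.
ds = load_dataset(
    "argilla/ultrafeedback-binarized-preferences-cleaned",
    split="train[:2000]",
)

def to_dpo_format(row):
    # Keep the prompt string; reduce each side to its final assistant message.
    return {
        "prompt": row["prompt"],
        "chosen": row["chosen"][-1]["content"],
        "rejected": row["rejected"][-1]["content"],
    }

ds = ds.map(to_dpo_format, remove_columns=ds.column_names)
```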

Stage 3: DPO

  • Trainer: trl.DPOTrainer
  • Beta: 0.1
  • Learning rate: 5e-7
  • Epochs: 1
  • Batch size: 1
  • Gradient accumulation: 8
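
A sketch of how these settings wire into TRL, assuming model, tokenizer, and the ds preference slice carry over from the stages above (argument names vary slightly across TRL versions):

```python
from trl import DPOTrainer, DPOConfig

trainer = DPOTrainer(
    model=model,
    # With a PEFT adapter, TRL can use the frozen base weights as the
    # implicit reference model, so no separate copy is needed.
    ref_model=None,
    args=DPOConfig(
        beta=0.1,
        learning_rate=5e-7,
        num_train_epochs=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        max_length=512,
        max_prompt_length=256,
        output_dir="outputs/dpo",
    ),
    train_dataset=ds,
    tokenizer=tokenizer,
)
trainer.train()

# Save only the LoRA weights to the adapter output folder listed above.
model.save_pretrained("adapters/dpo")
```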

What is included

The uploaded adapter folder typically contains:

  • adapter_config.json
  • adapter_model.safetensors
  • tokenizer metadata files generated by the training stack

Evaluation

The notebook records the following confirmed results for the DPO stage:

  • Final chosen reward: -1.277
  • Final rejected reward: -1.525
  • Final reward gap: +0.249
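
The gap is the chosen reward minus the rejected reward: -1.277 - (-1.525) = +0.248, which matches the logged +0.249 up to rounding of the displayed values. These are DPO implicit rewards (beta-scaled log-probability ratios against the reference model), so negative absolute values are normal; the positive gap is what indicates the policy has learned to prefer the chosen responses.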

The notebook also includes:

  • side-by-side generation comparison for 8 prompts
  • merged FP16 export
  • GGUF Q4_K_M conversion
  • benchmark code for IFEval, GSM8K, MMLU, and AlpacaEval-lite
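
The merged export and GGUF conversion map onto Unsloth's save helpers; a minimal sketch with illustrative output paths, assuming model and tokenizer are the trained objects from the DPO stage:

```python
# Merge the LoRA weights into the base model and save as FP16.
model.save_pretrained_merged(
    "exports/merged-fp16", tokenizer, save_method="merged_16bit",
)

# Convert the merged model to GGUF with Q4_K_M quantization.
model.save_pretrained_gguf(
    "exports/gguf", tokenizer, quantization_method="q4_k_m",
)
```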

Note: the uploaded notebook snapshot does not include the final Stage 6 benchmark numbers in its saved outputs. Fill in the table below after running NB6 end-to-end.

| Benchmark       | SFT-only | SFT + DPO |
|-----------------|----------|-----------|
| IFEval          | TBD      | TBD       |
| GSM8K           | TBD      | TBD       |
| MMLU            | TBD      | TBD       |
| AlpacaEval-lite | TBD      | TBD       |

Intended use

This adapter is intended for:

  • instruction following
  • preference-aligned chat generation
  • lightweight experimentation on top of the Qwen2.5-3B family

Limitations

  • The preference data is English UltraFeedback, while the SFT warm start is Vietnamese chat data.
  • The notebook run shown here is a T4-tier configuration, so the training and evaluation footprint is intentionally small.
  • This adapter is not a full base model; it must be loaded on top of the corresponding Qwen2.5-3B base checkpoint or merged export.

Usage

Load the adapter with the same base model used during training, then generate with the standard chat template for Qwen2.5.
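
A minimal loading sketch with transformers and peft, assuming the 4-bit base listed above and the repo id shown on this page (bitsandbytes is required for the 4-bit checkpoint; the prompt is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the same 4-bit base used during training, then attach this adapter.
base = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen2.5-3B-bnb-4bit", device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-3B-bnb-4bit")
model = PeftModel.from_pretrained(base, "datnguyennn/day22-dpo-alignment")

# Generate with the standard Qwen2.5 chat template.
messages = [{"role": "user", "content": "Give me three tips for writing clear emails."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```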

License

Apache-2.0, following the upstream base model and common lab convention, unless your course repository specifies otherwise.
