# Day 22 · DPO Alignment Lab — DPO Adapter

This repository contains the DPO LoRA adapter trained in the Lab 22 pipeline for Track 3.
## Model summary

- Base model: `unsloth/Qwen2.5-3B-bnb-4bit`
- Training stack: Unsloth + TRL `DPOTrainer`
- Adapter output: `adapters/dpo/`
- Purpose: align a lightweight Qwen2.5-3B checkpoint with preference data while keeping the deployment footprint small
This model card describes the adapter pushed to the Hugging Face Hub in Option B (Professional). The notebook also builds an upstream SFT-mini checkpoint before running DPO.
## Training data

### Stage 1: SFT-mini

- Dataset: `5CD-AI/Vietnamese-Multi-turn-Chat-Alpaca`
- Slice: 1,000 samples
- Epochs: 1
- LoRA: `r=16`, `lora_alpha=32`
- Sequence length: 512
- Batch size: 1
- Gradient accumulation: 8
- Learning rate: 2e-4
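With a per-device batch size of 1 and 8 gradient-accumulation steps, the effective optimizer batch and step budget for this stage follow directly from the numbers above. A quick arithmetic sketch:

```python
# Stage 1 training budget implied by the hyperparameters above.
samples = 1000          # dataset slice
per_device_batch = 1    # batch size
grad_accum = 8          # gradient accumulation steps
epochs = 1

effective_batch = per_device_batch * grad_accum   # sequences per optimizer step
optimizer_steps = (samples // effective_batch) * epochs
print(effective_batch, optimizer_steps)  # 8 125
```

This is why the run fits a T4: only 125 optimizer steps over 1,000 short sequences.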
### Stage 2: Preference data for DPO

- Dataset: `argilla/ultrafeedback-binarized-preferences-cleaned`
- Slice: 2,000 preference pairs
- Format: `prompt`, `chosen`, `rejected`
- Sequence length: 512
- Max prompt length: 256
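The binarized UltraFeedback splits typically store `chosen` and `rejected` as chat-style message lists. A minimal sketch of flattening one row into the plain-text `prompt`/`chosen`/`rejected` fields `DPOTrainer` consumes; the exact column layout is an assumption about the dataset, and `to_dpo_example` is a hypothetical helper, not notebook code:

```python
def to_dpo_example(row):
    """Flatten one preference row into plain-text DPO fields.

    Assumes `chosen`/`rejected` are message lists whose last entry is the
    assistant reply (as in UltraFeedback binarized variants).
    """
    return {
        "prompt": row["prompt"],
        "chosen": row["chosen"][-1]["content"],
        "rejected": row["rejected"][-1]["content"],
    }

# Toy row standing in for one dataset record.
row = {
    "prompt": "What is DPO?",
    "chosen": [{"role": "assistant", "content": "A preference-tuning method."}],
    "rejected": [{"role": "assistant", "content": "No idea."}],
}
example = to_dpo_example(row)
print(example["chosen"])  # A preference-tuning method.
```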
### Stage 3: DPO

- Trainer: `trl.DPOTrainer`
- Beta: 0.1
- Learning rate: 5e-7
- Epochs: 1
- Batch size: 1
- Gradient accumulation: 8
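For reference, the objective behind these settings is the DPO loss `-log σ(β · margin)`, where the margin compares the policy's chosen-vs-rejected log-ratio against the frozen reference model's. A self-contained sketch with the lab's `beta=0.1` and made-up log-probabilities:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (policy log-ratio - ref log-ratio))."""
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With no change from the reference model the margin is 0 and the loss is log 2.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931

# Widening the chosen-vs-rejected gap relative to the reference lowers the loss.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0) < math.log(2))  # True
```

The low learning rate (5e-7) is typical for DPO, since the loss only needs to nudge the policy's log-ratios away from the reference, not relearn the task.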
## What is included

The uploaded adapter folder typically contains:

- `adapter_config.json`
- `adapter_model.safetensors`
- tokenizer metadata files generated by the training stack
## Evaluation

The notebook records the following confirmed results for the DPO stage:

- Final chosen reward: -1.277
- Final rejected reward: -1.525
- Final reward gap: +0.249
The notebook also includes:
- side-by-side generation comparison for 8 prompts
- merged FP16 export
- GGUF Q4_K_M conversion
- benchmark code for IFEval, GSM8K, MMLU, and AlpacaEval-lite
Note: the uploaded notebook snapshot does not include the final Stage 6 benchmark numbers in its saved outputs. Fill in the table below after running NB6 end-to-end.
| Benchmark | SFT-only | SFT + DPO |
|---|---|---|
| IFEval | TBD | TBD |
| GSM8K | TBD | TBD |
| MMLU | TBD | TBD |
| AlpacaEval-lite | TBD | TBD |
## Intended use
This adapter is intended for:
- instruction following
- preference-aligned chat generation
- lightweight experimentation on top of the Qwen2.5-3B family
## Limitations
- The preference data is English UltraFeedback while the SFT warm start is Vietnamese chat data, so the two stages optimize for different languages.
- The notebook run shown here is a T4-tier configuration, so the training and evaluation footprint is intentionally small.
- This adapter is not a full base model; it must be loaded on top of the corresponding Qwen2.5-3B base checkpoint or merged export.
## Usage
Load the adapter with the same base model used during training, then generate with the standard chat template for Qwen2.5.
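A minimal loading sketch, assuming `transformers` and `peft` are installed; the adapter repo id below is a placeholder for wherever you pushed the adapter:

```python
def load_dpo_chat_model(adapter_id="your-username/day22-dpo-adapter"):
    """Attach the DPO LoRA adapter to the same 4-bit base used in training."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_id = "unsloth/Qwen2.5-3B-bnb-4bit"  # base model from the card above
    model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
    model = PeftModel.from_pretrained(model, adapter_id)  # load the adapter weights
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    return model, tokenizer


def chat(model, tokenizer, user_message, max_new_tokens=128):
    """Generate one reply using Qwen2.5's chat template via the tokenizer."""
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_message}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
```

Alternatively, use the merged FP16 export or the GGUF Q4_K_M file if you do not want to manage the base-plus-adapter pair.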
## License
Apache-2.0, following the upstream model and common lab convention unless your course repository specifies otherwise.