Day 22 · DPO Alignment Lab — DPO Adapter

This repository contains the DPO LoRA adapter trained in the Day 22 lab pipeline for Track 3.

Model summary

  • Base model: unsloth/Qwen2.5-3B-bnb-4bit
  • Training stack: Unsloth + TRL DPOTrainer
  • Adapter output: adapters/dpo/
  • Purpose: align a lightweight Qwen2.5-3B checkpoint with preference data while keeping the deployment footprint small

This model card describes the adapter that is pushed to the Hugging Face Hub in Option B (Professional). The notebook also builds an upstream SFT-mini checkpoint before running DPO.
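
For reference, a minimal sketch of that upload step with huggingface_hub, assuming the adapter folder is adapters/dpo/ and the target repo matches this page's id; the exact notebook code may differ:

```python
from huggingface_hub import HfApi

# Hypothetical Option B upload: push the local DPO adapter folder to the Hub.
# Assumes authentication via `huggingface-cli login` or the HF_TOKEN env var.
api = HfApi()
api.create_repo("datnguyennn/day22-dpo-alignment", exist_ok=True)
api.upload_folder(
    folder_path="adapters/dpo",
    repo_id="datnguyennn/day22-dpo-alignment",
    repo_type="model",
)
```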

Training data

Stage 1: SFT-mini

  • Dataset: 5CD-AI/Vietnamese-Multi-turn-Chat-Alpaca
  • Slice: 1000 samples
  • Epochs: 1
  • LoRA: r=16, lora_alpha=32
  • Sequence length: 512
  • Batch size: 1
  • Gradient accumulation: 8
  • Learning rate: 2e-4
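
A minimal sketch of this SFT-mini stage with Unsloth and TRL follows. The target_modules list, output paths, and the pre-flattened `text` column are assumptions, and exact argument names vary across TRL versions:

```python
from unsloth import FastLanguageModel  # import unsloth first so its patches apply
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Load the 4-bit base and attach a LoRA adapter with the settings above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-bnb-4bit",
    max_seq_length=512,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    # Assumed module list; the notebook may target a different set.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# 1,000-sample slice of the Vietnamese chat data. Assumes the multi-turn rows
# were flattened into a single `text` column (preprocessing omitted here).
dataset = load_dataset("5CD-AI/Vietnamese-Multi-turn-Chat-Alpaca", split="train[:1000]")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        max_seq_length=512,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs/sft-mini",
    ),
)
trainer.train()
```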

Stage 2: Preference data for DPO

  • Dataset: argilla/ultrafeedback-binarized-preferences-cleaned
  • Slice: 2000 preference pairs
  • Format: prompt, chosen, rejected
  • Sequence length: 512
  • Max prompt length: 256
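
A sketch of the slicing and flattening step, assuming the cleaned split stores chosen/rejected as chat-message lists whose final turn is the assistant reply (the helper name is illustrative):

```python
from datasets import load_dataset

# 2,000-pair slice used for the DPO stage.
ds = load_dataset(
    "argilla/ultrafeedback-binarized-preferences-cleaned",
    split="train[:2000]",
)

def to_dpo_format(row):
    # Keep the prompt string; reduce each side to its final assistant message.
    return {
        "prompt": row["prompt"],
        "chosen": row["chosen"][-1]["content"],
        "rejected": row["rejected"][-1]["content"],
    }

ds = ds.map(to_dpo_format, remove_columns=ds.column_names)
```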

Stage 3: DPO

  • Trainer: trl.DPOTrainer
  • Beta: 0.1
  • Learning rate: 5e-7
  • Epochs: 1
  • Batch size: 1
  • Gradient accumulation: 8
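
A sketch of how these settings wire into TRL, assuming model, tokenizer, and the ds preference slice carry over from the stages above (argument names vary slightly across TRL versions):

```python
from trl import DPOTrainer, DPOConfig

trainer = DPOTrainer(
    model=model,
    # With a PEFT adapter, TRL can use the frozen base weights as the
    # implicit reference model, so no separate copy is needed.
    ref_model=None,
    args=DPOConfig(
        beta=0.1,
        learning_rate=5e-7,
        num_train_epochs=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        max_length=512,
        max_prompt_length=256,
        output_dir="outputs/dpo",
    ),
    train_dataset=ds,
    tokenizer=tokenizer,
)
trainer.train()

# Save only the LoRA weights to the adapter output folder listed above.
model.save_pretrained("adapters/dpo")
```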

What is included

The uploaded adapter folder typically contains:

  • adapter_config.json
  • adapter_model.safetensors
  • tokenizer metadata files generated by the training stack

Evaluation

The notebook records the following confirmed results for the DPO stage:

  • Final chosen reward: -1.277
  • Final rejected reward: -1.525
  • Final reward gap: +0.249
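
The gap is the chosen reward minus the rejected reward: -1.277 - (-1.525) = +0.248, which matches the logged +0.249 up to rounding of the displayed values. These are DPO implicit rewards (beta-scaled log-probability ratios against the reference model), so negative absolute values are normal; the positive gap is what indicates the policy has learned to prefer the chosen responses.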

The notebook also includes:

  • side-by-side generation comparison for 8 prompts
  • merged FP16 export
  • GGUF Q4_K_M conversion
  • benchmark code for IFEval, GSM8K, MMLU, and AlpacaEval-lite
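
The merged export and GGUF conversion map onto Unsloth's save helpers; a minimal sketch with illustrative output paths, assuming model and tokenizer are the trained objects from the DPO stage:

```python
# Merge the LoRA weights into the base model and save as FP16.
model.save_pretrained_merged(
    "exports/merged-fp16", tokenizer, save_method="merged_16bit",
)

# Convert the merged model to GGUF with Q4_K_M quantization.
model.save_pretrained_gguf(
    "exports/gguf", tokenizer, quantization_method="q4_k_m",
)
```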

Note: the uploaded notebook snapshot does not include the final Stage 6 benchmark numbers in its saved outputs. Fill in the table below after running NB6 end-to-end.

| Benchmark       | SFT-only | SFT + DPO |
|-----------------|----------|-----------|
| IFEval          | TBD      | TBD       |
| GSM8K           | TBD      | TBD       |
| MMLU            | TBD      | TBD       |
| AlpacaEval-lite | TBD      | TBD       |

Intended use

This adapter is intended for:

  • instruction following
  • preference-aligned chat generation
  • lightweight experimentation on top of the Qwen2.5-3B family

Limitations

  • The preference data is English UltraFeedback, while the SFT warm start is Vietnamese chat data.
  • The notebook run shown here is a T4-tier configuration, so the training and evaluation footprint is intentionally small.
  • This adapter is not a full base model; it must be loaded on top of the corresponding Qwen2.5-3B base checkpoint or merged export.

Usage

Load the adapter with the same base model used during training, then generate with the standard chat template for Qwen2.5.
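
A minimal loading sketch with transformers and peft, assuming the 4-bit base listed above and the repo id shown on this page (bitsandbytes is required for the 4-bit checkpoint; the prompt is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the same 4-bit base used during training, then attach this adapter.
base = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen2.5-3B-bnb-4bit", device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-3B-bnb-4bit")
model = PeftModel.from_pretrained(base, "datnguyennn/day22-dpo-alignment")

# Generate with the standard Qwen2.5 chat template.
messages = [{"role": "user", "content": "Give me three tips for writing clear emails."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```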

License

Apache-2.0, following the upstream base model and common lab convention, unless your course repository specifies otherwise.
