Tags: PEFT, Safetensors, dpo, lora, qwen2.5, vietnamese, alignment, kaggle-t4

Lab 22 DPO Vietnamese Alignment Adapter

This repository contains the DPO LoRA adapter trained for Lab 22 (DPO/ORPO Alignment). The run was executed on a Kaggle T4 GPU using Qwen2.5-3B as the base model.

Base Model

  • Base model: unsloth/Qwen2.5-3B-bnb-4bit
  • Compute tier: Kaggle T4
  • Training stack: Unsloth, PEFT LoRA, TRL DPOTrainer
  • Output type: LoRA/PEFT adapter
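Since the output is a standard PEFT adapter, it can be attached to the 4-bit base in the usual way. A minimal loading sketch, assuming transformers, peft, and bitsandbytes are installed and using this card's repo id; imports are kept inside the function so the sketch stays self-contained:

```python
def load_dpo_adapter(adapter_id: str = "StevenMup/lab22-dpo-vn"):
    """Load the 4-bit Qwen2.5-3B base and attach the DPO LoRA adapter.

    Imports live inside the function so the sketch can be defined without
    the heavyweight ML stack installed; actually calling it requires
    transformers, peft, and bitsandbytes (plus a GPU for the 4-bit base).
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(
        "unsloth/Qwen2.5-3B-bnb-4bit",  # pre-quantized 4-bit base from this card
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-3B-bnb-4bit")
    model = PeftModel.from_pretrained(base, adapter_id)  # attach LoRA weights
    return model, tokenizer
```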

Training Pipeline

The model was trained in two stages:

  1. SFT-mini adapter

    • Dataset: 5CD-AI/Vietnamese-alpaca-gpt4-gg-translated
    • Slice size: 1,000 samples
    • Epochs: 1
    • Learning rate: 2e-4
    • Output: adapters/sft-mini
  2. DPO adapter

    • Dataset: argilla/ultrafeedback-binarized-preferences-cleaned
    • Slice size: 2,000 preference pairs
    • Epochs: 1
    • DPO beta: 0.1
    • Learning rate: 5e-7
    • Loss type: sigmoid
    • Output: adapters/dpo
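The DPO stage above uses the sigmoid loss with beta = 0.1. As a reference for what TRL's DPOTrainer optimizes per preference pair, here is a minimal pure-Python sketch of that loss (log-probabilities are illustrative inputs, not values from this run):

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp: float,
                     policy_rejected_logp: float,
                     ref_chosen_logp: float,
                     ref_rejected_logp: float,
                     beta: float = 0.1) -> float:
    """Per-pair DPO loss with the sigmoid loss type used in this run.

    logits = beta * [(log pi(y_w) - log pi_ref(y_w))
                     - (log pi(y_l) - log pi_ref(y_l))]
    loss   = -log(sigmoid(logits))
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    logits = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy has not yet moved away from the reference model,
# the margin is 0 and the loss is -log(0.5) = log(2) ~= 0.693.
print(round(dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0), 3))  # → 0.693
```

The small beta of 0.1 keeps the implicit reward scale gentle, which matches the conservative, low-LR setup of this short run.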

LoRA Hyperparameters

  • LoRA rank: 16
  • LoRA alpha: 32
  • LoRA dropout: 0.0
  • Gradient checkpointing: unsloth
  • Quantized base: 4-bit
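With rank 16 and alpha 32, the LoRA update is applied as W + (alpha/r) · BA with a scaling factor of 2.0. A small sketch of what those two hyperparameters imply (the 2048×2048 projection size is hypothetical, for illustration only):

```python
def lora_param_count(d_in: int, d_out: int, rank: int = 16) -> int:
    """Trainable parameters one LoRA pair (A: rank x d_in, B: d_out x rank) adds."""
    return rank * d_in + d_out * rank

def lora_scaling(alpha: int = 32, rank: int = 16) -> float:
    """Effective update is W + (alpha / rank) * B @ A."""
    return alpha / rank

# Example: a single hypothetical 2048 x 2048 projection layer.
print(lora_param_count(2048, 2048))  # → 65536 extra trainable weights
print(lora_scaling())                # → 2.0 with this card's r=16, alpha=32
```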

Evaluation Results

NB4 Judge Evaluation

Eight fixed prompts were evaluated, four for helpfulness and four for safety.

| Category    | SFT-only | SFT+DPO | Tie |
|-------------|----------|---------|-----|
| Overall     | 1/8      | 1/8     | 6/8 |
| Helpfulness | 0/4      | 1/4     | 3/4 |
| Safety      | 1/4      | 0/4     | 3/4 |

The judge output showed DPO winning one helpfulness case and SFT-only winning one safety case. Most prompts were ties, suggesting the DPO adapter made limited but measurable behavioral changes in this short T4 run.

NB6 Benchmark Results

Small benchmark slices were used for the lab smoke evaluation.

| Benchmark       | SFT-only | SFT+DPO | Delta  |
|-----------------|----------|---------|--------|
| IFEval          | 0.000    | 0.000   | +0.000 |
| GSM8K           | 1.000    | 1.000   | +0.000 |
| MMLU            | 0.667    | 0.667   | +0.000 |
| AlpacaEval-lite | 0.500    | 0.500   | +0.000 |

The benchmark scores were flat in this run, which is expected for a small DPO experiment with limited preference data and sampled evaluation sets.

GGUF Export

The merged SFT+DPO model was exported to GGUF for local serving.

  • GGUF output: Qwen2.5-3B.Q4_K_M.gguf
  • Smoke test: llama-cpp-python generated a coherent Vietnamese response.
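A smoke test along these lines can be reproduced locally with llama-cpp-python. A minimal sketch, assuming the exported GGUF file sits in the working directory (the prompt is a hypothetical Vietnamese greeting, not the one used in the lab); the import is kept inside the function so the sketch defines cleanly without the library installed:

```python
def smoke_test_gguf(gguf_path: str = "Qwen2.5-3B.Q4_K_M.gguf") -> str:
    """Generate one short Vietnamese reply from the merged GGUF export.

    Calling this requires llama-cpp-python and the GGUF file on disk.
    """
    from llama_cpp import Llama

    llm = Llama(model_path=gguf_path, n_ctx=2048)
    out = llm(
        "Xin chào! Bạn có thể giới thiệu ngắn gọn về Việt Nam không?",
        max_tokens=128,
        temperature=0.7,
    )
    return out["choices"][0]["text"]
```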

Intended Use

This adapter is a course lab artifact for studying preference learning and alignment behavior. It is suitable for demonstration, qualitative comparison, and small-scale experimentation.

Limitations

  • Trained on a small 2k preference slice.
  • Preference data is English UltraFeedback, while the target behavior includes Vietnamese responses.
  • Evaluation uses small sampled benchmark subsets.
  • The adapter should not be used for production safety-critical applications.