---
license: mit
base_model: unsloth/Qwen2.5-3B-bnb-4bit
tags:
- dpo
- lora
- peft
- qwen2.5
- vietnamese
- alignment
- kaggle-t4
datasets:
- 5CD-AI/Vietnamese-alpaca-gpt4-gg-translated
- argilla/ultrafeedback-binarized-preferences-cleaned
---
# Lab22 DPO Vietnamese Alignment Adapter
This repository contains the DPO LoRA adapter trained for Lab 22 (DPO/ORPO Alignment). Training ran on a Kaggle T4 GPU with Qwen2.5-3B as the base model.
## Base Model
- Base model: `unsloth/Qwen2.5-3B-bnb-4bit`
- Compute tier: Kaggle T4
- Training stack: Unsloth, PEFT LoRA, TRL DPOTrainer
- Output type: LoRA/PEFT adapter
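
As a rough sketch of how an adapter of this kind is usually attached to its base model, the function below uses `transformers` and `peft`. The adapter repository id is left as a parameter because this card does not state it. The sketch also assumes that `bitsandbytes` and a CUDA GPU are available, since the base checkpoint is pre-quantized to 4-bit.

```python
def load_adapter(adapter_id: str, base_id: str = "unsloth/Qwen2.5-3B-bnb-4bit"):
    """Attach a LoRA adapter to the 4-bit base model and return (model, tokenizer)."""
    # Imported lazily so the heavy dependencies are only required when this is called.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    # Assumption: the pre-quantized checkpoint loads directly when bitsandbytes is installed.
    base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
    # Wrap the frozen base model with the LoRA weights stored under adapter_id.
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return model, tokenizer
```

Calling `load_adapter("<user>/<adapter-repo>")` with the real repository id (the value shown here is a placeholder) would download both the base model and the adapter.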
## Training Pipeline
The model was trained in two stages:
1. SFT-mini adapter
   - Dataset: `5CD-AI/Vietnamese-alpaca-gpt4-gg-translated`
   - Slice size: 1,000 samples
   - Epochs: 1
   - Learning rate: `2e-4`
   - Output: `adapters/sft-mini`
2. DPO adapter
   - Dataset: `argilla/ultrafeedback-binarized-preferences-cleaned`
   - Slice size: 2,000 preference pairs
   - Epochs: 1
   - DPO beta: `0.1`
   - Learning rate: `5e-7`
   - Loss type: sigmoid
   - Output: `adapters/dpo`
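
To make the DPO settings above concrete, here is a minimal pure-Python sketch of the sigmoid DPO loss for a single preference pair, using `beta = 0.1`. It is not the TRL implementation. The log-probabilities in the example are made-up numbers, and the sketch assumes the usual setup in which the reference model is the policy before DPO updates.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                     ref_chosen_logp: float, ref_rejected_logp: float,
                     beta: float = 0.1) -> float:
    """Per-pair DPO loss with the sigmoid objective: -log(sigmoid(margin))."""
    # Implicit rewards: scaled log-ratio of policy to reference for each response.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
loss_at_init = dpo_sigmoid_loss(-12.0, -12.0, -12.0, -12.0)

# Hypothetical log-probs where the policy now prefers the chosen response.
loss_after = dpo_sigmoid_loss(-10.0, -14.0, -12.0, -12.0)
```

The loss starts at `log(2)` when the policy equals the reference and falls as the policy raises the chosen response relative to the rejected one. A small `beta` such as `0.1` means the log-ratios must move substantially before the loss changes much.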
## LoRA Hyperparameters
- LoRA rank: `16`
- LoRA alpha: `32`
- LoRA dropout: `0.0`
- Gradient checkpointing: `unsloth`
- Quantized base: 4-bit
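
As a small worked example of what the rank and alpha values imply, the sketch below computes the standard LoRA scaling factor `alpha / rank` and the number of trainable parameters LoRA adds to one weight matrix. It assumes plain LoRA scaling (not rank-stabilized LoRA), and the 2048 x 2048 layer shape is a hypothetical value chosen only for illustration.

```python
def lora_scaling(rank: int, alpha: int) -> float:
    """Scale applied to the low-rank update B @ A under standard LoRA."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted matrix: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


scaling = lora_scaling(rank=16, alpha=32)             # 2.0
added = lora_param_count(2048, 2048, rank=16)         # hypothetical layer shape
fraction_of_full = added / (2048 * 2048)              # share of the full matrix size
```

With rank `16` and alpha `32`, the update is scaled by 2.0, and for the illustrative square layer the adapter adds about 1.6% as many parameters as the full matrix holds.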
## Evaluation Results
### NB4 Judge Evaluation
Eight fixed prompts were evaluated by a judge: four for helpfulness and four for safety.
| Category | SFT-only | SFT+DPO | Tie |
|---|---:|---:|---:|
| Overall | 1/8 | 1/8 | 6/8 |
| Helpfulness | 0/4 | 1/4 | 3/4 |
| Safety | 1/4 | 0/4 | 3/4 |
The judge preferred SFT+DPO on one helpfulness prompt and SFT-only on one safety prompt; the remaining six prompts were ties. With wins split evenly, the results suggest that the DPO adapter changed behavior only marginally in this short T4 run, and they do not show a net improvement.
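
The overall row of the table above can be recomputed from the two category rows. The following sketch does that tally and also reports the tie rate and the DPO share of decided (non-tie) prompts. The dictionary keys are illustrative names, not identifiers from the lab notebooks.

```python
# Counts copied from the Helpfulness and Safety rows of the judge table.
judge_results = {
    "Helpfulness": {"sft": 0, "dpo": 1, "tie": 3},
    "Safety": {"sft": 1, "dpo": 0, "tie": 3},
}


def summarize(results: dict) -> dict:
    """Aggregate per-category judge counts into overall totals and rates."""
    totals = {"sft": 0, "dpo": 0, "tie": 0}
    for counts in results.values():
        for outcome, n in counts.items():
            totals[outcome] += n
    n_prompts = sum(totals.values())
    decided = totals["sft"] + totals["dpo"]
    return {
        "totals": totals,
        "tie_rate": totals["tie"] / n_prompts,
        # Share of non-tie prompts won by SFT+DPO; None if every prompt tied.
        "dpo_share_of_decided": totals["dpo"] / decided if decided else None,
    }


summary = summarize(judge_results)
```

With only two decided prompts out of eight, a single judge flip would change the DPO share of decided prompts from 0.5 to either 0.0 or 1.0, so the comparison should be read as qualitative.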
### NB6 Benchmark Results
Small benchmark slices were used for the lab smoke evaluation.
| Benchmark | SFT-only | SFT+DPO | Delta |
|---|---:|---:|---:|
| IFEval | 0.000 | 0.000 | +0.000 |
| GSM8K | 1.000 | 1.000 | +0.000 |
| MMLU | 0.667 | 0.667 | +0.000 |
| AlpacaEval-lite | 0.500 | 0.500 | +0.000 |
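The benchmark scores were identical for SFT-only and SFT+DPO in this run. This is expected for a small DPO experiment with limited preference data, and the sampled evaluation sets are too small to resolve minor differences between the two adapters.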
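
This card does not state the slice sizes. To illustrate how coarse small slices are, the sketch below finds, for each reported score, the smallest sample count `n` at which some `k / n` rounds to that score. The result is only a lower bound on the slice size and not the actual number of samples used.

```python
# Scores copied from the benchmark table (identical for both adapters).
reported_scores = {
    "IFEval": "0.000",
    "GSM8K": "1.000",
    "MMLU": "0.667",
    "AlpacaEval-lite": "0.500",
}


def smallest_consistent_n(score: str, max_n: int = 50) -> int:
    """Smallest n such that k / n rounds (3 decimals) to the reported score for some k."""
    target = round(float(score), 3)
    for n in range(1, max_n + 1):
        for k in range(n + 1):
            if round(k / n, 3) == target:
                return n
    raise ValueError(f"no n <= {max_n} reproduces score {score}")


lower_bounds = {name: smallest_consistent_n(s) for name, s in reported_scores.items()}
```

On a slice of `n` samples, one changed answer moves the score by `1 / n`, so the smaller the slice, the less a zero delta says about whether the two adapters behave the same on the full benchmark.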
## GGUF Export
The merged SFT+DPO model was exported to GGUF for local serving.
- GGUF output: `Qwen2.5-3B.Q4_K_M.gguf`
- Smoke test: llama-cpp-python generated a coherent Vietnamese response.
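
A smoke test along these lines could look like the sketch below, which uses `llama-cpp-python`. The prompt, context size, and sampling settings are assumptions and not the values used in the lab, and the model path assumes the GGUF file sits in the working directory.

```python
def smoke_test(model_path: str = "Qwen2.5-3B.Q4_K_M.gguf",
               prompt: str = "Xin chào! Hãy giới thiệu ngắn gọn về Việt Nam.") -> str:
    """Load the GGUF file and return one short completion for a Vietnamese prompt."""
    # Imported lazily so llama-cpp-python is only required when the test is run.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    # Plain text completion; no chat template is applied in this sketch.
    output = llm(prompt, max_tokens=128, temperature=0.7)
    return output["choices"][0]["text"]
```

Calling `smoke_test()` returns the generated text. Whether the response is coherent still has to be checked by reading it.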
## Intended Use
This adapter is a course lab artifact for studying preference learning and alignment behavior. It is suitable for demonstration, qualitative comparison, and small-scale experimentation.
## Limitations
- Trained on a small 2k preference slice.
- Preference data is English UltraFeedback, while the target behavior includes Vietnamese responses.
- Evaluation uses small sampled benchmark subsets.
- The adapter should not be used in production or in safety-critical applications.