# Lab22 DPO Vietnamese Alignment Adapter
This repository contains the DPO LoRA adapter trained for Lab 22 (DPO/ORPO Alignment). The run was executed on a Kaggle T4 GPU with Qwen2.5-3B as the base model.
## Base Model

- Base model: `unsloth/Qwen2.5-3B-bnb-4bit`
- Compute tier: Kaggle T4
- Training stack: Unsloth, PEFT LoRA, TRL `DPOTrainer`
- Output type: LoRA/PEFT adapter
## Training Pipeline
The model was trained in two stages:
### SFT-mini adapter

- Dataset: `5CD-AI/Vietnamese-alpaca-gpt4-gg-translated`
- Slice size: 1,000 samples
- Epochs: 1
- Learning rate: 2e-4
- Output: `adapters/sft-mini`
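A minimal sketch of this stage, assuming the Unsloth notebook-style TRL API (newer TRL versions move `dataset_text_field` and `max_seq_length` into `SFTConfig`). The sequence length, batch sizes, and dataset column names (`instruction`, `output`) are illustrative assumptions, not values recorded in the lab config:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load the 4-bit base model and wrap it with the LoRA adapter
# (full adapter settings are listed under "LoRA Hyperparameters" below).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-bnb-4bit",
    max_seq_length=2048,  # assumption: not recorded in the lab config
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32, lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# 1,000-sample slice of the Vietnamese Alpaca dataset.
dataset = load_dataset("5CD-AI/Vietnamese-alpaca-gpt4-gg-translated",
                       split="train[:1000]")

def to_text(example):
    # Assumed column names; adjust to the dataset's actual schema.
    return {"text": f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Response:\n{example['output']}{tokenizer.eos_token}"}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="adapters/sft-mini",
        num_train_epochs=1,
        learning_rate=2e-4,
        per_device_train_batch_size=2,   # assumption: T4-friendly defaults
        gradient_accumulation_steps=4,
        fp16=True,
        logging_steps=10,
    ),
)
trainer.train()
model.save_pretrained("adapters/sft-mini")
```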
### DPO adapter

- Dataset: `argilla/ultrafeedback-binarized-preferences-cleaned`
- Slice size: 2,000 preference pairs
- Epochs: 1
- DPO beta: 0.1
- Learning rate: 5e-7
- Loss type: sigmoid
- Output: `adapters/dpo`
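With the sigmoid loss type, `DPOTrainer` optimizes −log σ(β · (chosen log-ratio − rejected log-ratio)) against the frozen reference policy, with β = 0.1 limiting how far the policy drifts from it. A minimal sketch, assuming the Unsloth-patched, older TRL signature (newer TRL moves `beta`, `loss_type`, and the length limits into `DPOConfig`); the sequence lengths, batch sizes, and dataset-schema handling are illustrative assumptions:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import DPOTrainer
from unsloth import FastLanguageModel, PatchDPOTrainer

PatchDPOTrainer()  # Unsloth's memory-efficiency patch for TRL's DPOTrainer

# Resume from the SFT-mini adapter; assumes Unsloth resolves the
# 4-bit base model from the adapter's config.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="adapters/sft-mini",
    max_seq_length=2048,
    load_in_4bit=True,
)

# 2,000 preference pairs from the cleaned UltraFeedback binarization.
dataset = load_dataset(
    "argilla/ultrafeedback-binarized-preferences-cleaned",
    split="train[:2000]",
)

def to_dpo_format(example):
    # Assumed schema: `chosen`/`rejected` are chat message lists;
    # keep the final assistant turn as the response string.
    return {
        "prompt": example["prompt"],
        "chosen": example["chosen"][-1]["content"],
        "rejected": example["rejected"][-1]["content"],
    }

dataset = dataset.map(to_dpo_format)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a LoRA adapter, the frozen base acts as the reference policy
    beta=0.1,
    loss_type="sigmoid",
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_length=1024,        # assumption: not recorded in the lab config
    max_prompt_length=512,  # assumption
    args=TrainingArguments(
        output_dir="adapters/dpo",
        num_train_epochs=1,
        learning_rate=5e-7,
        per_device_train_batch_size=1,  # assumption: T4-friendly
        gradient_accumulation_steps=8,
        fp16=True,
    ),
)
trainer.train()
model.save_pretrained("adapters/dpo")
```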
## LoRA Hyperparameters

- LoRA rank: 16
- LoRA alpha: 32
- LoRA dropout: 0.0
- Gradient checkpointing: `unsloth`
- Quantized base: 4-bit
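For reference, these settings correspond to the plain-PEFT `LoraConfig` below; the `target_modules` list is an assumption (the standard attention and MLP projections for Qwen2.5), since it is not recorded here:

```python
from peft import LoraConfig

# Equivalent plain-PEFT view of the adapter configuration used in both stages.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed
)
```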
## Evaluation Results

### NB4 Judge Evaluation

Eight fixed prompts (four helpfulness, four safety) were evaluated pairwise between the SFT-only and SFT+DPO models.
| Category | SFT-only | SFT+DPO | Tie |
|---|---|---|---|
| Overall | 1/8 | 1/8 | 6/8 |
| Helpfulness | 0/4 | 1/4 | 3/4 |
| Safety | 1/4 | 0/4 | 3/4 |
Judge output showed that DPO improved one helpfulness case, while SFT-only won one safety case. Most prompts were ties, suggesting the DPO adapter made limited but measurable behavioral changes in this short T4 run.
### NB6 Benchmark Results

Small benchmark slices were used for the lab's smoke evaluation.
| Benchmark | SFT-only | SFT+DPO | Delta |
|---|---|---|---|
| IFEval | 0.000 | 0.000 | +0.000 |
| GSM8K | 1.000 | 1.000 | +0.000 |
| MMLU | 0.667 | 0.667 | +0.000 |
| AlpacaEval-lite | 0.500 | 0.500 | +0.000 |
The benchmark scores were flat in this run, which is expected for a small DPO experiment with limited preference data and sampled evaluation sets.
## GGUF Export

The merged SFT+DPO model was exported to GGUF for local serving.

- GGUF output: `Qwen2.5-3B.Q4_K_M.gguf`
- Smoke test: llama-cpp-python generated a coherent Vietnamese response.
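A sketch of the export and smoke test, assuming Unsloth's `save_pretrained_gguf` helper and llama-cpp-python's chat API; the output directory and the Vietnamese prompt are illustrative, not the lab's exact values:

```python
from llama_cpp import Llama

# Merge the adapter into the base weights and quantize to Q4_K_M in one
# call (`model` and `tokenizer` come from the DPO training sketch above):
#   model.save_pretrained_gguf("gguf-out", tokenizer, quantization_method="q4_k_m")

# Smoke test the exported file with llama-cpp-python.
llm = Llama(model_path="Qwen2.5-3B.Q4_K_M.gguf", n_ctx=2048)
out = llm.create_chat_completion(
    messages=[{"role": "user",
               # Illustrative prompt: "Briefly introduce Vietnam."
               "content": "Hãy giới thiệu ngắn gọn về Việt Nam."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```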
## Intended Use
This adapter is a course lab artifact for studying preference learning and alignment behavior. It is suitable for demonstration, qualitative comparison, and small-scale experimentation.
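A minimal loading sketch for such experiments, using plain transformers + PEFT; the repository id is a placeholder and the prompt is illustrative:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 4-bit base model (requires bitsandbytes).
base = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen2.5-3B-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-3B-bnb-4bit")

# Placeholder: replace with this repository's Hub id or a local adapter path.
model = PeftModel.from_pretrained(base, "<this-repo-id>")

# Illustrative prompt: "Hello! Can you introduce yourself?"
inputs = tokenizer("Xin chào! Bạn có thể tự giới thiệu không?",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```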
## Limitations
- Trained on a small 2k preference slice.
- Preference data is English UltraFeedback, while the target behavior includes Vietnamese responses.
- Evaluation uses small sampled benchmark subsets.
- The adapter should not be used in production or safety-critical applications.