---
license: mit
base_model: unsloth/Qwen2.5-3B-bnb-4bit
tags:
- dpo
- lora
- peft
- qwen2.5
- vietnamese
- alignment
- kaggle-t4
datasets:
- 5CD-AI/Vietnamese-alpaca-gpt4-gg-translated
- argilla/ultrafeedback-binarized-preferences-cleaned
---

# Lab22 DPO Vietnamese Alignment Adapter

This repository contains the DPO LoRA adapter trained for Lab 22 (DPO/ORPO Alignment). Training ran on a Kaggle T4 GPU with the 4-bit Qwen2.5-3B checkpoint as the base model.

## Base Model

- Base model: `unsloth/Qwen2.5-3B-bnb-4bit`
- Compute tier: Kaggle T4
- Training stack: Unsloth, PEFT LoRA, TRL DPOTrainer
- Output type: LoRA/PEFT adapter
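
Because the output is a LoRA/PEFT adapter rather than a full checkpoint, it has to be attached to the base model before use. The sketch below shows one way to do that with plain `transformers` and `peft`. It is an assumption, not the lab's own loading code (the lab used Unsloth), and `adapter_id` is a placeholder for this repository's Hub id or a local adapter directory. It also assumes `bitsandbytes` and a CUDA GPU are available, since the base checkpoint is pre-quantized to 4-bit.

```python
BASE_MODEL_ID = "unsloth/Qwen2.5-3B-bnb-4bit"  # base model named in this card


def load_dpo_adapter(adapter_id: str):
    """Attach a LoRA adapter to the 4-bit base model.

    `adapter_id` is a placeholder: pass this repository's Hub id or a
    local path to the adapter directory.
    """
    # Imported lazily so this module can be imported without the heavy deps.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
    # Assumption: bitsandbytes and a CUDA GPU are present for the 4-bit weights.
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL_ID, device_map="auto")
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()  # inference mode: disables dropout
    return model, tokenizer
```

Calling `load_dpo_adapter("adapters/dpo")` would target the local output directory listed in the training pipeline, assuming it exists on disk.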

## Training Pipeline

The model was trained in two stages:

1. SFT-mini adapter
   - Dataset: `5CD-AI/Vietnamese-alpaca-gpt4-gg-translated`
   - Slice size: 1,000 samples
   - Epochs: 1
   - Learning rate: `2e-4`
   - Output: `adapters/sft-mini`

2. DPO adapter
   - Dataset: `argilla/ultrafeedback-binarized-preferences-cleaned`
   - Slice size: 2,000 preference pairs
   - Epochs: 1
   - DPO beta: `0.1`
   - Learning rate: `5e-7`
   - Loss type: sigmoid
   - Output: `adapters/dpo`
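
For reference, the sigmoid DPO loss with `beta = 0.1` can be written out for a single preference pair. This is a minimal pure-Python sketch of the standard formula, not the TRL implementation; the four log-probability arguments are hypothetical inputs standing in for summed token log-probabilities from the policy and the frozen reference model.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair."""
    # Log-ratios of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled preference margin; beta controls how sharply it is rewarded.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
start_loss = dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen log-prob and lowering the rejected one reduces the loss.
better_loss = dpo_sigmoid_loss(-9.0, -13.0, -10.0, -12.0)
print(start_loss, better_loss)
```

With a small `beta` such as 0.1, log-probability shifts translate into small margins, which limits how far the policy is pushed away from the SFT reference.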

## LoRA Hyperparameters

- LoRA rank: `16`
- LoRA alpha: `32`
- LoRA dropout: `0.0`
- Gradient checkpointing: `unsloth`
- Quantized base: 4-bit
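
As a quick illustration of what rank and alpha imply, the sketch below computes the LoRA update scale and the number of extra trainable parameters for one adapted weight matrix. It assumes the standard `alpha / rank` scaling (not rank-stabilized LoRA), and the 2048 x 2048 matrix shape is a hypothetical example; the adapted target modules are not listed in this card.

```python
LORA_RANK = 16   # from this card
LORA_ALPHA = 32  # from this card


def lora_scaling(alpha: int, rank: int) -> float:
    """Multiplier applied to the low-rank update B @ A (standard LoRA)."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added to one d_out x d_in weight matrix."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * (d_in + d_out)


scaling = lora_scaling(LORA_ALPHA, LORA_RANK)
# Hypothetical 2048 x 2048 projection, for illustration only.
extra_params = lora_param_count(2048, 2048, LORA_RANK)
print(scaling, extra_params)
```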

## Evaluation Results

### NB4 Judge Evaluation

A judge compared SFT-only and SFT+DPO responses on eight fixed prompts: four for helpfulness and four for safety.

| Category | SFT-only | SFT+DPO | Tie |
|---|---:|---:|---:|
| Overall | 1/8 | 1/8 | 6/8 |
| Helpfulness | 0/4 | 1/4 | 3/4 |
| Safety | 1/4 | 0/4 | 3/4 |

The judge preferred SFT+DPO on one helpfulness prompt and SFT-only on one safety prompt; the other six prompts were ties. The DPO adapter therefore changed behavior on only a few prompts in this short T4 run, with no net win for either model on this eight-prompt set.
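
The table can be tallied with a few lines of Python. This is a small sketch that re-derives the Overall row from the per-category counts; the dictionary layout and the `summarize` helper are illustrative and are not part of the lab notebooks.

```python
# Per-category judge outcomes copied from the table above.
judge_results = {
    "Helpfulness": {"sft": 0, "dpo": 1, "tie": 3},
    "Safety": {"sft": 1, "dpo": 0, "tie": 3},
}


def summarize(rows):
    """Aggregate per-category counts into overall totals."""
    total = {"sft": 0, "dpo": 0, "tie": 0}
    for counts in rows.values():
        for outcome, n in counts.items():
            total[outcome] += n
    n_prompts = sum(total.values())
    return {
        "n_prompts": n_prompts,
        **total,
        # Positive means DPO won more prompts than SFT-only.
        "net_dpo_wins": total["dpo"] - total["sft"],
        "tie_rate": total["tie"] / n_prompts,
    }


summary = summarize(judge_results)
print(summary)
```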

### NB6 Benchmark Results

Small benchmark slices were used for the lab smoke evaluation.

| Benchmark | SFT-only | SFT+DPO | Delta |
|---|---:|---:|---:|
| IFEval | 0.000 | 0.000 | +0.000 |
| GSM8K | 1.000 | 1.000 | +0.000 |
| MMLU | 0.667 | 0.667 | +0.000 |
| AlpacaEval-lite | 0.500 | 0.500 | +0.000 |

Scores were identical for both adapters on all four benchmarks. This is expected for a small DPO experiment with limited preference data, and the small sampled evaluation sets cannot resolve small differences between models.
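
To see why small slices limit what these scores can show, the sketch below computes a Wilson score interval for an accuracy estimate. The slice sizes are not stated in this card, so the `2 of 3` example is purely hypothetical and was chosen only because it matches a score of 0.667.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical slice: 2 correct out of 3 items (not a figure from the lab).
low, high = wilson_interval(2, 3)
print(f"{low:.3f} to {high:.3f}")
```

With so few items the interval spans most of the 0 to 1 range, so equal scores for the two adapters do not rule out a real difference.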

## GGUF Export

The merged SFT+DPO model was exported to GGUF for local serving.

- GGUF output: `Qwen2.5-3B.Q4_K_M.gguf`
- Smoke test: llama-cpp-python generated a coherent Vietnamese response.
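
A smoke test of this kind can be sketched with `llama-cpp-python` as follows. This is an assumption about how such a check might look, not the lab's actual script: the prompt format used in the lab is not documented here, so the sketch sends the prompt as a plain completion, and `n_ctx` and `max_tokens` are illustrative values.

```python
GGUF_PATH = "Qwen2.5-3B.Q4_K_M.gguf"  # file name from this card


def smoke_test(prompt: str, model_path: str = GGUF_PATH, max_tokens: int = 128) -> str:
    """Generate a short completion from the GGUF file and return its text."""
    # Imported lazily so the module loads even without llama-cpp-python.
    from llama_cpp import Llama

    # n_ctx is an illustrative context size, not a value from the lab.
    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    # Plain text completion; no chat template is assumed.
    out = llm(prompt, max_tokens=max_tokens)
    return out["choices"][0]["text"]
```

For example, `smoke_test("Xin chào, hãy giới thiệu ngắn gọn về Hà Nội.")` would return the generated continuation, provided the GGUF file is in the working directory.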

## Intended Use

This adapter is a course lab artifact for studying preference learning and alignment behavior. It is suitable for demonstration, qualitative comparison, and small-scale experimentation.

## Limitations

- Trained on a small slice of 2,000 preference pairs for one epoch.
- Preference data is English UltraFeedback, while the target behavior includes Vietnamese responses.
- Evaluation uses small sampled benchmark subsets.
- The adapter should not be used in production or in safety-critical applications.