StevenMup/lab22-dpo-vn


	---
	license: mit
	base_model: unsloth/Qwen2.5-3B-bnb-4bit
	tags:
	- dpo
	- lora
	- peft
	- qwen2.5
	- vietnamese
	- alignment
	- kaggle-t4
	datasets:
	- 5CD-AI/Vietnamese-alpaca-gpt4-gg-translated
	- argilla/ultrafeedback-binarized-preferences-cleaned
	---

	# Lab22 DPO Vietnamese Alignment Adapter

	This repository contains the DPO LoRA adapter trained for Lab 22 (DPO/ORPO Alignment). Training ran on a Kaggle T4 GPU with the 4-bit checkpoint `unsloth/Qwen2.5-3B-bnb-4bit` as the base model.

	## Base Model

	- Base model: `unsloth/Qwen2.5-3B-bnb-4bit`
	- Compute tier: Kaggle T4
	- Training stack: Unsloth, PEFT LoRA, TRL DPOTrainer
	- Output type: LoRA/PEFT adapter
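A minimal loading sketch for the adapter described above. It assumes the adapter files sit at the root of the `StevenMup/lab22-dpo-vn` repository (the id comes from the page header, the layout is not confirmed here), that `transformers`, `peft`, `accelerate`, and `bitsandbytes` are installed, and that a CUDA GPU is available. The function name `load_adapter` is illustrative and not part of this repository.

```python
BASE_MODEL = "unsloth/Qwen2.5-3B-bnb-4bit"  # base model named in this card
ADAPTER_REPO = "StevenMup/lab22-dpo-vn"     # assumption: adapter files are at the repo root


def load_adapter(base_model: str = BASE_MODEL, adapter_repo: str = ADAPTER_REPO):
    """Load the pre-quantized base model and attach the DPO LoRA adapter."""
    # Imported lazily so this module can be read without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    # The checkpoint is already 4-bit (bitsandbytes), so no extra quantization
    # config is passed here; device_map="auto" relies on accelerate.
    base = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
    model = PeftModel.from_pretrained(base, adapter_repo)
    model.eval()
    return model, tokenizer
```

Calling `load_adapter()` returns the adapted model and its tokenizer. If the adapter lives in a subfolder of the repository, the path passed to `PeftModel.from_pretrained` would need to change accordingly.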

	## Training Pipeline

	The model was trained in two stages:

	1. SFT-mini adapter
	   - Dataset: `5CD-AI/Vietnamese-alpaca-gpt4-gg-translated`
	   - Slice size: 1,000 samples
	   - Epochs: 1
	   - Learning rate: `2e-4`
	   - Output: `adapters/sft-mini`

	2. DPO adapter
	   - Dataset: `argilla/ultrafeedback-binarized-preferences-cleaned`
	   - Slice size: 2,000 preference pairs
	   - Epochs: 1
	   - DPO beta: `0.1`
	   - Learning rate: `5e-7`
	   - Loss type: sigmoid
	   - Output: `adapters/dpo`
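To make the `beta` and sigmoid loss settings above concrete, here is a small sketch of the standard sigmoid DPO objective for a single preference pair. It is a plain-Python illustration of the formula, not the TRL `DPOTrainer` code, and the log-probabilities in the example are made-up numbers.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one pair: -log(sigmoid(beta * margin))."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to stay numerically stable.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is ln(2).
print(round(dpo_sigmoid_loss(-12.0, -13.0, -12.0, -13.0), 4))  # 0.6931

# Illustrative numbers where the policy favors the chosen response more than
# the reference does; the loss drops below ln(2).
print(dpo_sigmoid_loss(-10.0, -15.0, -12.0, -13.0) < math.log(2))  # True
```

A smaller `beta` shrinks the margin, so the policy is pushed less strongly away from the reference model for the same log-probability gap.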

	## LoRA Hyperparameters

	- LoRA rank: `16`
	- LoRA alpha: `32`
	- LoRA dropout: `0.0`
	- Gradient checkpointing: `unsloth`
	- Quantized base: 4-bit
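As a rough illustration of what the rank and alpha values above imply, the sketch below computes the LoRA scaling factor and the number of trainable parameters added to one weight matrix. It assumes standard LoRA scaling (`alpha / rank`, not the rank-stabilized variant), and the 2048 x 2048 matrix size is a hypothetical example rather than a dimension taken from this card.

```python
LORA_RANK = 16   # from this model card
LORA_ALPHA = 32  # from this model card


def lora_scaling(rank: int, alpha: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in A (rank x d_in) plus B (d_out x rank) for one matrix."""
    return rank * (d_in + d_out)


print(lora_scaling(LORA_RANK, LORA_ALPHA))  # 2.0

# Hypothetical square projection, used only to show the size reduction.
d = 2048
print(lora_trainable_params(d, d, LORA_RANK))  # 65536
print(lora_trainable_params(d, d, LORA_RANK) / (d * d))
```

With these settings the low-rank update is multiplied by 2.0 before being added to the frozen 4-bit weights.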

	## Evaluation Results

	### NB4 Judge Evaluation

	Eight fixed prompts were evaluated by a judge: four for helpfulness and four for safety.

	| Category | SFT-only | SFT+DPO | Tie |
	|---|---:|---:|---:|
	| Overall | 1/8 | 1/8 | 6/8 |
	| Helpfulness | 0/4 | 1/4 | 3/4 |
	| Safety | 1/4 | 0/4 | 3/4 |

	The judge preferred SFT+DPO on one helpfulness prompt and SFT-only on one safety prompt; the remaining six prompts were ties. In this short T4 run the DPO adapter therefore changed behavior only slightly, and eight prompts are too few to say which variant is better.
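The tallies in the table can be checked with a few lines of Python. The sketch below computes the tie rate per category and an exact two-sided sign test on the non-tied prompts. The sign test is added here for illustration only and was not part of the lab's evaluation.

```python
from math import comb

# (SFT-only wins, SFT+DPO wins, ties), copied from the judge table.
JUDGE_TALLY = {
    "Overall": (1, 1, 6),
    "Helpfulness": (0, 1, 3),
    "Safety": (1, 0, 3),
}


def tie_rate(sft_wins: int, dpo_wins: int, ties: int) -> float:
    """Fraction of prompts where the judge saw no difference."""
    return ties / (sft_wins + dpo_wins + ties)


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test on non-tied comparisons (ties are dropped)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


for category, (sft, dpo, ties) in JUDGE_TALLY.items():
    print(category, tie_rate(sft, dpo, ties), sign_test_p_value(sft, dpo))
```

For the overall row the tie rate is 0.75 and the sign test gives a p-value of 1.0, which is consistent with reading the judge results as no detectable preference.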

	### NB6 Benchmark Results

	Small benchmark slices were used for the lab smoke evaluation.

	| Benchmark | SFT-only | SFT+DPO | Delta |
	|---|---:|---:|---:|
	| IFEval | 0.000 | 0.000 | +0.000 |
	| GSM8K | 1.000 | 1.000 | +0.000 |
	| MMLU | 0.667 | 0.667 | +0.000 |
	| AlpacaEval-lite | 0.500 | 0.500 | +0.000 |

	The benchmark scores were flat in this run, which is expected for a small DPO experiment with limited preference data and sampled evaluation sets.

	## GGUF Export

	The merged SFT+DPO model was exported to GGUF for local serving.

	- GGUF output: `Qwen2.5-3B.Q4_K_M.gguf`
	- Smoke test: llama-cpp-python generated a coherent Vietnamese response.
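A smoke test along these lines can be sketched with `llama-cpp-python`. The GGUF file name comes from this card; its local path, the context size, the sampling settings, and the function name `generate` are assumptions for illustration, and the exact prompt used in the lab is not recorded here.

```python
GGUF_PATH = "Qwen2.5-3B.Q4_K_M.gguf"  # file name from this card; local path is assumed


def generate(prompt: str, model_path: str = GGUF_PATH, max_tokens: int = 128) -> str:
    """Run one completion against the exported GGUF model."""
    # Imported lazily so the module loads without llama-cpp-python installed.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    # Calling the Llama object returns an OpenAI-style completion dict.
    output = llm(prompt, max_tokens=max_tokens, temperature=0.7)
    return output["choices"][0]["text"]
```

For example, `generate("Hãy giới thiệu ngắn gọn về Hà Nội.")` should return a short Vietnamese completion if the export worked.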

	## Intended Use

	This adapter is a course lab artifact for studying preference learning and alignment behavior. It is suitable for demonstration, qualitative comparison, and small-scale experimentation.

	## Limitations

	- Trained on a small 2k preference slice.
	- Preference data is English UltraFeedback, while the target behavior includes Vietnamese responses.
	- Evaluation uses small sampled benchmark subsets.
	- The adapter should not be used in production or in safety-critical applications.