---
license: mit
base_model: unsloth/Qwen2.5-3B-bnb-4bit
tags:
- dpo
- lora
- peft
- qwen2.5
- vietnamese
- alignment
- kaggle-t4
datasets:
- 5CD-AI/Vietnamese-alpaca-gpt4-gg-translated
- argilla/ultrafeedback-binarized-preferences-cleaned
---

# Lab22 DPO Vietnamese Alignment Adapter

This repository contains the DPO LoRA adapter trained for Lab 22 DPO/ORPO Alignment. The run was executed on Kaggle T4 using Qwen2.5-3B as the base model.

## Base Model

- Base model: `unsloth/Qwen2.5-3B-bnb-4bit`
- Compute tier: Kaggle T4
- Training stack: Unsloth, PEFT LoRA, TRL DPOTrainer
- Output type: LoRA/PEFT adapter

## Training Pipeline

The model was trained in two stages:

1. SFT-mini adapter
   - Dataset: `5CD-AI/Vietnamese-alpaca-gpt4-gg-translated`
   - Slice size: 1,000 samples
   - Epochs: 1
   - Learning rate: `2e-4`
   - Output: `adapters/sft-mini`
2. DPO adapter
   - Dataset: `argilla/ultrafeedback-binarized-preferences-cleaned`
   - Slice size: 2,000 preference pairs
   - Epochs: 1
   - DPO beta: `0.1`
   - Learning rate: `5e-7`
   - Loss type: sigmoid
   - Output: `adapters/dpo`

## LoRA Hyperparameters

- LoRA rank: `16`
- LoRA alpha: `32`
- LoRA dropout: `0.0`
- Gradient checkpointing: `unsloth`
- Quantized base: 4-bit

## Evaluation Results

### NB4 Judge Evaluation

8 fixed prompts were evaluated across helpfulness and safety.

| Category | SFT-only | SFT+DPO | Tie |
|---|---:|---:|---:|
| Overall | 1/8 | 1/8 | 6/8 |
| Helpfulness | 0/4 | 1/4 | 3/4 |
| Safety | 1/4 | 0/4 | 3/4 |

Judge output showed that DPO improved one helpfulness case, while SFT-only won one safety case. Most prompts were ties, suggesting the DPO adapter made limited but measurable behavioral changes in this short T4 run.

### NB6 Benchmark Results

Small benchmark slices were used for the lab smoke evaluation.

| Benchmark | SFT-only | SFT+DPO | Delta |
|---|---:|---:|---:|
| IFEval | 0.000 | 0.000 | +0.000 |
| GSM8K | 1.000 | 1.000 | +0.000 |
| MMLU | 0.667 | 0.667 | +0.000 |
| AlpacaEval-lite | 0.500 | 0.500 | +0.000 |

The benchmark scores were flat in this run, which is expected for a small DPO experiment with limited preference data and sampled evaluation sets.

## GGUF Export

The merged SFT+DPO model was exported to GGUF for local serving.

- GGUF output: `Qwen2.5-3B.Q4_K_M.gguf`
- Smoke test: llama-cpp-python generated a coherent Vietnamese response.

## Intended Use

This adapter is a course lab artifact for studying preference learning and alignment behavior. It is suitable for demonstration, qualitative comparison, and small-scale experimentation.

## Limitations

- Trained on a small 2k preference slice.
- Preference data is English UltraFeedback, while the target behavior includes Vietnamese responses.
- Evaluation uses small sampled benchmark subsets.
- The adapter should not be used for production safety-critical applications.
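
For readers who want to see what the `sigmoid` loss type and `beta = 0.1` in the DPO stage refer to, the following is a minimal sketch of the standard sigmoid DPO objective for a single preference pair. It is not the code used in this lab and not TRL's implementation; the function name `dpo_sigmoid_loss`, its argument names, and the example log-probabilities are illustrative assumptions. Each argument is assumed to be the summed log-probability of a full response under the policy or the frozen reference model.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair (illustrative sketch).

    loss = -log(sigmoid(beta * ((pi_c - pi_r) - (ref_c - ref_r))))
    """
    # How much more the policy prefers the chosen response than the rejected one.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    # The same margin under the frozen reference model.
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) equals softplus(-x); compute it in a numerically stable way.
    z = -logits
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical log-probabilities, chosen only to show the behaviour.
same_as_ref = dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0)  # policy == reference
improved = dpo_sigmoid_loss(-9.0, -13.0, -10.0, -12.0)      # policy widens the margin
print(same_as_ref, improved)
```

When the policy matches the reference, the loss equals log 2, and it drops as the policy separates the chosen response from the rejected one by more than the reference does. With a small `beta` such as 0.1, a given margin change moves the loss only slightly.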