How to use StevenMup/lab22-dpo-vn with PEFT:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the 4-bit quantized base model the adapter was trained on.
base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-3B-bnb-4bit")
# Attach the DPO LoRA adapter from this repository.
model = PeftModel.from_pretrained(base_model, "StevenMup/lab22-dpo-vn")
```

---
license: mit
base_model: unsloth/Qwen2.5-3B-bnb-4bit
tags:
- dpo
- lora
- peft
- qwen2.5
- vietnamese
- alignment
- kaggle-t4
datasets:
- 5CD-AI/Vietnamese-alpaca-gpt4-gg-translated
- argilla/ultrafeedback-binarized-preferences-cleaned
---

# Lab22 DPO Vietnamese Alignment Adapter

This repository contains the DPO LoRA adapter trained for Lab 22 (DPO/ORPO Alignment). The run was executed on a Kaggle T4 GPU with Qwen2.5-3B as the base model.

## Base Model

- Base model: `unsloth/Qwen2.5-3B-bnb-4bit`
- Compute tier: Kaggle T4
- Training stack: Unsloth, PEFT LoRA, TRL DPOTrainer
- Output type: LoRA/PEFT adapter

## Training Pipeline

The model was trained in two stages:

1. SFT-mini adapter
   - Dataset: `5CD-AI/Vietnamese-alpaca-gpt4-gg-translated`
   - Slice size: 1,000 samples
   - Epochs: 1
   - Learning rate: `2e-4`
   - Output: `adapters/sft-mini`
2. DPO adapter
   - Dataset: `argilla/ultrafeedback-binarized-preferences-cleaned`
   - Slice size: 2,000 preference pairs
   - Epochs: 1
   - DPO beta: `0.1`
   - Learning rate: `5e-7`
   - Loss type: sigmoid
   - Output: `adapters/dpo`

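
For reference, the `sigmoid` loss type with beta `0.1` corresponds to the standard DPO objective, which for one preference pair is the negative log-sigmoid of the beta-scaled difference between the chosen and rejected log-ratios. The following is a minimal pure-Python sketch of that per-pair loss, not the TRL implementation; the function name and the log-probability values are hypothetical and chosen only for illustration.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    # Log-ratio of policy vs. frozen reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical sequence log-probabilities (not taken from the lab run).
# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0))
# When the policy favors the chosen response more than the reference does,
# the loss drops below log(2).
print(dpo_sigmoid_loss(-12.0, -15.0, -13.0, -14.0))
```

A small beta such as `0.1` keeps the margin small for a given log-ratio gap, so each pair moves the loss only gently away from `log(2)`.
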
## LoRA Hyperparameters

- LoRA rank: `16`
- LoRA alpha: `32`
- LoRA dropout: `0.0`
- Gradient checkpointing: `unsloth`
- Quantized base: 4-bit

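
To make the rank and alpha values concrete, the sketch below computes the LoRA scaling factor and the number of trainable parameters the adapter adds to a single weight matrix. It assumes standard LoRA scaling (`alpha / rank`, not the rank-stabilized variant) and a hypothetical 2048 x 2048 projection; the card does not state which modules were targeted, so the layer shape is illustrative only.

```python
def lora_scaling(rank, alpha):
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for one adapted matrix: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


RANK, ALPHA = 16, 32          # values from the list above
D_IN = D_OUT = 2048           # hypothetical layer shape, for illustration only

scaling = lora_scaling(RANK, ALPHA)
adapter_params = lora_param_count(D_IN, D_OUT, RANK)
full_params = D_IN * D_OUT

print(f"scaling={scaling}")                               # scaling=2.0
print(f"adapter params={adapter_params}")                 # adapter params=65536
print(f"fraction={adapter_params / full_params:.4%}")     # fraction=1.5625%
```
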
## Evaluation Results

### NB4 Judge Evaluation

Eight fixed prompts (four helpfulness, four safety) were judged pairwise, comparing the SFT-only model against SFT+DPO.

| Category | SFT-only wins | SFT+DPO wins | Ties |
|---|---:|---:|---:|
| Overall | 1/8 | 1/8 | 6/8 |
| Helpfulness | 0/4 | 1/4 | 3/4 |
| Safety | 1/4 | 0/4 | 3/4 |

The judge preferred SFT+DPO on one helpfulness prompt and SFT-only on one safety prompt; the remaining six prompts were ties. This suggests the DPO adapter changed behavior only slightly in this short T4 run, and eight prompts are too few to establish a net improvement.

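
The overall row can be recomputed from the per-category counts. The sketch below tallies the judge outcomes reported in the table above; the dictionary layout and function name are hypothetical and are not taken from the lab notebook.

```python
# (SFT-only wins, SFT+DPO wins, ties) per category, copied from the judge table.
judge = {
    "Helpfulness": (0, 1, 3),
    "Safety": (1, 0, 3),
}


def summarize(rows):
    """Aggregate per-category judge outcomes into overall counts and rates."""
    sft = sum(r[0] for r in rows.values())
    dpo = sum(r[1] for r in rows.values())
    ties = sum(r[2] for r in rows.values())
    total = sft + dpo + ties
    decided = sft + dpo
    return {
        "sft_wins": sft,
        "dpo_wins": dpo,
        "ties": ties,
        "total": total,
        "tie_rate": ties / total,
        # Win rate among prompts where the judge expressed a preference.
        "dpo_win_rate_decided": dpo / decided if decided else None,
    }


summary = summarize(judge)
print(summary)
```

With one win on each side, the SFT+DPO win rate among decided prompts is 0.5, which is why the table alone does not favor either model.
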
### NB6 Benchmark Results

Small benchmark slices were used for the lab smoke evaluation.

| Benchmark | SFT-only | SFT+DPO | Delta (SFT+DPO minus SFT-only) |
|---|---:|---:|---:|
| IFEval | 0.000 | 0.000 | +0.000 |
| GSM8K | 1.000 | 1.000 | +0.000 |
| MMLU | 0.667 | 0.667 | +0.000 |
| AlpacaEval-lite | 0.500 | 0.500 | +0.000 |

All four scores were identical for the two models. This is expected for a small DPO experiment with limited preference data and small sampled evaluation sets; the numbers should be read as smoke-test outputs rather than capability estimates.

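
The Delta column is simply the SFT+DPO score minus the SFT-only score. A minimal sketch that reproduces it from the reported values follows; the variable and function names are hypothetical.

```python
# (SFT-only, SFT+DPO) scores copied from the benchmark table.
scores = {
    "IFEval": (0.000, 0.000),
    "GSM8K": (1.000, 1.000),
    "MMLU": (0.667, 0.667),
    "AlpacaEval-lite": (0.500, 0.500),
}


def delta_rows(table):
    """Return (benchmark, sft, dpo, delta) tuples with delta = dpo - sft."""
    return [(name, sft, dpo, dpo - sft) for name, (sft, dpo) in table.items()]


for name, sft, dpo, delta in delta_rows(scores):
    # Signed, three-decimal formatting matches the table, e.g. "+0.000".
    print(f"{name:<16} {sft:.3f} {dpo:.3f} {delta:+.3f}")
```
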
## GGUF Export

The merged SFT+DPO model was exported to GGUF for local serving.

- GGUF output: `Qwen2.5-3B.Q4_K_M.gguf`
- Smoke test: llama-cpp-python generated a coherent Vietnamese response.

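
A smoke test of this kind might look like the sketch below, which loads the GGUF file with llama-cpp-python and requests a short completion. This is not the lab's actual test: the function name, the context size, the sampling settings, and the example prompt are all assumptions, and the file is expected to sit in the working directory.

```python
from pathlib import Path


def smoke_test(model_path, prompt, max_tokens=64):
    """Load a GGUF model and return a short completion for `prompt`."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"GGUF file not found: {path}")

    # Imported lazily so the path check works without the package installed.
    from llama_cpp import Llama  # pip install llama-cpp-python

    # n_ctx=2048 and temperature=0.0 are assumed settings, not from the card.
    llm = Llama(model_path=str(path), n_ctx=2048, verbose=False)
    result = llm.create_completion(prompt, max_tokens=max_tokens, temperature=0.0)
    return result["choices"][0]["text"]


# Example call (hypothetical Vietnamese prompt):
# print(smoke_test("Qwen2.5-3B.Q4_K_M.gguf", "Việt Nam là một quốc gia"))
```

Checking the file path before importing `llama_cpp` gives a clear error when the export is missing, instead of a failure from inside the native loader.
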
## Intended Use

This adapter is a course lab artifact for studying preference learning and alignment behavior. It is suitable for demonstration, qualitative comparison, and small-scale experimentation.

## Limitations

- Trained on a small slice of 2,000 preference pairs for a single epoch.
- The preference data (UltraFeedback) is in English, while the target behavior includes Vietnamese responses, so the preference signal may transfer only partially.
- Evaluation uses small sampled benchmark subsets and eight judge prompts.
- The adapter should not be used in production or safety-critical applications.