Tags: PEFT, Safetensors, dpo, lora, qwen2.5, vietnamese, alignment, kaggle-t4

Lab 22 DPO Vietnamese Alignment Adapter

This repository contains the DPO LoRA adapter trained for Lab 22 (DPO/ORPO Alignment). The run was executed on a Kaggle T4 GPU using Qwen2.5-3B as the base model.

Base Model

  • Base model: unsloth/Qwen2.5-3B-bnb-4bit
  • Compute tier: Kaggle T4
  • Training stack: Unsloth, PEFT LoRA, TRL DPOTrainer
  • Output type: LoRA/PEFT adapter
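Since the output is a standard PEFT adapter, it can be attached to the 4-bit base in the usual way. A minimal loading sketch, assuming transformers, peft, and bitsandbytes are installed and using this card's repo id; imports are kept inside the function so the sketch stays self-contained:

```python
def load_dpo_adapter(adapter_id: str = "StevenMup/lab22-dpo-vn"):
    """Load the 4-bit Qwen2.5-3B base and attach the DPO LoRA adapter.

    Imports live inside the function so the sketch can be defined without
    the heavyweight ML stack installed; actually calling it requires
    transformers, peft, and bitsandbytes (plus a GPU for the 4-bit base).
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(
        "unsloth/Qwen2.5-3B-bnb-4bit",  # pre-quantized 4-bit base from this card
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-3B-bnb-4bit")
    model = PeftModel.from_pretrained(base, adapter_id)  # attach LoRA weights
    return model, tokenizer
```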

Training Pipeline

The model was trained in two stages:

  1. SFT-mini adapter

    • Dataset: 5CD-AI/Vietnamese-alpaca-gpt4-gg-translated
    • Slice size: 1,000 samples
    • Epochs: 1
    • Learning rate: 2e-4
    • Output: adapters/sft-mini
  2. DPO adapter

    • Dataset: argilla/ultrafeedback-binarized-preferences-cleaned
    • Slice size: 2,000 preference pairs
    • Epochs: 1
    • DPO beta: 0.1
    • Learning rate: 5e-7
    • Loss type: sigmoid
    • Output: adapters/dpo
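The DPO stage above uses the sigmoid loss with beta = 0.1. As a reference for what TRL's DPOTrainer optimizes per preference pair, here is a minimal pure-Python sketch of that loss (log-probabilities are illustrative inputs, not values from this run):

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp: float,
                     policy_rejected_logp: float,
                     ref_chosen_logp: float,
                     ref_rejected_logp: float,
                     beta: float = 0.1) -> float:
    """Per-pair DPO loss with the sigmoid loss type used in this run.

    logits = beta * [(log pi(y_w) - log pi_ref(y_w))
                     - (log pi(y_l) - log pi_ref(y_l))]
    loss   = -log(sigmoid(logits))
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    logits = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy has not yet moved away from the reference model,
# the margin is 0 and the loss is -log(0.5) = log(2) ~= 0.693.
print(round(dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0), 3))  # → 0.693
```

The small beta of 0.1 keeps the implicit reward scale gentle, which matches the conservative, low-LR setup of this short run.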

LoRA Hyperparameters

  • LoRA rank: 16
  • LoRA alpha: 32
  • LoRA dropout: 0.0
  • Gradient checkpointing: unsloth
  • Quantized base: 4-bit
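With rank 16 and alpha 32, the LoRA update is applied as W + (alpha/r) · BA with a scaling factor of 2.0. A small sketch of what those two hyperparameters imply (the 2048×2048 projection size is hypothetical, for illustration only):

```python
def lora_param_count(d_in: int, d_out: int, rank: int = 16) -> int:
    """Trainable parameters one LoRA pair (A: rank x d_in, B: d_out x rank) adds."""
    return rank * d_in + d_out * rank

def lora_scaling(alpha: int = 32, rank: int = 16) -> float:
    """Effective update is W + (alpha / rank) * B @ A."""
    return alpha / rank

# Example: a single hypothetical 2048 x 2048 projection layer.
print(lora_param_count(2048, 2048))  # → 65536 extra trainable weights
print(lora_scaling())                # → 2.0 with this card's r=16, alpha=32
```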

Evaluation Results

NB4 Judge Evaluation

Eight fixed prompts were evaluated, four for helpfulness and four for safety.

| Category    | SFT-only | SFT+DPO | Tie |
|-------------|----------|---------|-----|
| Overall     | 1/8      | 1/8     | 6/8 |
| Helpfulness | 0/4      | 1/4     | 3/4 |
| Safety      | 1/4      | 0/4     | 3/4 |

The judge output showed DPO winning one helpfulness case and SFT-only winning one safety case. Most prompts were ties, suggesting the DPO adapter made limited but measurable behavioral changes in this short T4 run.

NB6 Benchmark Results

Small benchmark slices were used for the lab smoke evaluation.

| Benchmark       | SFT-only | SFT+DPO | Delta  |
|-----------------|----------|---------|--------|
| IFEval          | 0.000    | 0.000   | +0.000 |
| GSM8K           | 1.000    | 1.000   | +0.000 |
| MMLU            | 0.667    | 0.667   | +0.000 |
| AlpacaEval-lite | 0.500    | 0.500   | +0.000 |

The benchmark scores were flat in this run, which is expected for a small DPO experiment with limited preference data and sampled evaluation sets.

GGUF Export

The merged SFT+DPO model was exported to GGUF for local serving.

  • GGUF output: Qwen2.5-3B.Q4_K_M.gguf
  • Smoke test: llama-cpp-python generated a coherent Vietnamese response.
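A smoke test along these lines can be reproduced locally with llama-cpp-python. A minimal sketch, assuming the exported GGUF file sits in the working directory (the prompt is a hypothetical Vietnamese greeting, not the one used in the lab); the import is kept inside the function so the sketch defines cleanly without the library installed:

```python
def smoke_test_gguf(gguf_path: str = "Qwen2.5-3B.Q4_K_M.gguf") -> str:
    """Generate one short Vietnamese reply from the merged GGUF export.

    Calling this requires llama-cpp-python and the GGUF file on disk.
    """
    from llama_cpp import Llama

    llm = Llama(model_path=gguf_path, n_ctx=2048)
    out = llm(
        "Xin chào! Bạn có thể giới thiệu ngắn gọn về Việt Nam không?",
        max_tokens=128,
        temperature=0.7,
    )
    return out["choices"][0]["text"]
```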

Intended Use

This adapter is a course lab artifact for studying preference learning and alignment behavior. It is suitable for demonstration, qualitative comparison, and small-scale experimentation.

Limitations

  • Trained on a small 2k preference slice.
  • Preference data is English UltraFeedback, while the target behavior includes Vietnamese responses.
  • Evaluation uses small sampled benchmark subsets.
  • The adapter should not be used for production safety-critical applications.