Instructions to use datnguyennn/day22-dpo-alignment with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use datnguyennn/day22-dpo-alignment with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="datnguyennn/day22-dpo-alignment")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("datnguyennn/day22-dpo-alignment", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use datnguyennn/day22-dpo-alignment with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "datnguyennn/day22-dpo-alignment" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "datnguyennn/day22-dpo-alignment", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/datnguyennn/day22-dpo-alignment
- SGLang
How to use datnguyennn/day22-dpo-alignment with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "datnguyennn/day22-dpo-alignment" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "datnguyennn/day22-dpo-alignment", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "datnguyennn/day22-dpo-alignment" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "datnguyennn/day22-dpo-alignment", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use datnguyennn/day22-dpo-alignment with Docker Model Runner:
docker model run hf.co/datnguyennn/day22-dpo-alignment
Day 22 · DPO Alignment Lab — DPO Adapter
This repository contains the DPO LoRA adapter trained in the Lab 22 pipeline for Track 3.
Model summary
- Base model:
unsloth/Qwen2.5-3B-bnb-4bit - Training stack: Unsloth + TRL
DPOTrainer - Adapter output:
adapters/dpo/ - Purpose: align a lightweight Qwen2.5-3B checkpoint with preference data while keeping the deployment footprint small
This model card describes the adapter that is pushed to Hugging Face Hub in Option B (Professional). The notebook also builds an upstream SFT-mini checkpoint before DPO.
Training data
Stage 1: SFT-mini
- Dataset:
5CD-AI/Vietnamese-Multi-turn-Chat-Alpaca - Slice:
1000samples - Epochs:
1 - LoRA:
r=16,lora_alpha=32 - Sequence length:
512 - Batch size:
1 - Gradient accumulation:
8 - Learning rate:
2e-4
Stage 2: Preference data for DPO
- Dataset:
argilla/ultrafeedback-binarized-preferences-cleaned - Slice:
2000preference pairs - Format:
prompt,chosen,rejected - Sequence length:
512 - Max prompt length:
256
Stage 3: DPO
- Trainer:
trl.DPOTrainer - Beta:
0.1 - Learning rate:
5e-7 - Epochs:
1 - Batch size:
1 - Gradient accumulation:
8
What is included
The uploaded adapter folder typically contains:
adapter_config.jsonadapter_model.safetensors- tokenizer metadata files generated by the training stack
Evaluation
The notebook records the following confirmed results for the DPO stage:
- Final chosen reward:
-1.277 - Final rejected reward:
-1.525 - Final reward gap:
+0.249
The notebook also includes:
- side-by-side generation comparison for 8 prompts
- merged FP16 export
- GGUF Q4_K_M conversion
- benchmark code for IFEval, GSM8K, MMLU, and AlpacaEval-lite
Note: the uploaded notebook snapshot does not include the final Stage 6 benchmark numbers in its saved outputs. Fill in the table below after running NB6 end-to-end.
| Benchmark | SFT-only | SFT + DPO |
|---|---|---|
| IFEval | TBD | TBD |
| GSM8K | TBD | TBD |
| MMLU | TBD | TBD |
| AlpacaEval-lite | TBD | TBD |
Intended use
This adapter is intended for:
- instruction following
- preference-aligned chat generation
- lightweight experimentation on top of the Qwen2.5-3B family
Limitations
- The preference data is English UltraFeedback, while the SFT warm start is Vietnamese chat data.
- The notebook run shown here is a T4-tier configuration, so the training and evaluation footprint is intentionally small.
- This adapter is not a full base model; it must be loaded on top of the corresponding Qwen2.5-3B base checkpoint or merged export.
Usage
Load the adapter with the same base model used during training, then generate with the standard chat template for Qwen2.5.
License
Apache-2.0, following the upstream model and common lab convention unless your course repository specifies otherwise.