# Qwen3-0.6B DPO Training

Direct Preference Optimization (DPO) training for the Qwen3-0.6B model.
## What is DPO?
DPO (Direct Preference Optimization) is a method to train language models using preference data. Instead of training a separate reward model, DPO directly optimizes the model to prefer "chosen" responses over "rejected" responses.
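The idea above can be sketched as a per-example loss: DPO maximizes the log-sigmoid of the scaled difference between the policy's and the reference model's chosen-vs-rejected log-probability ratios. This is a minimal plain-Python illustration of that formula, not the `trl` implementation:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (policy logratio - ref logratio))."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    logits = beta * (pi_logratio - ref_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy prefers the chosen response more strongly than the
# reference model does, logits > 0 and the loss drops below log(2).
loss = dpo_loss(-2.0, -5.0, -3.0, -4.0, beta=0.1)
```

`beta` controls how far the policy may drift from the reference model: larger values penalize deviation more, which is why the table below suggests sweeping it.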
## Quick Start
### 1. Install Dependencies

```bash
pip install torch transformers peft trl datasets
```
### 2. Run Training with Default Dataset

```bash
python train_dpo_qwen3.py --beta 0.1 --epochs 3 --lr 2e-5
```
### 3. Run Training with Custom Data

```bash
python train_dpo_qwen3.py --dataset your_data.jsonl --beta 0.05
```
## Dataset Format
Your JSONL file should contain one JSON object per line, in this format:

```jsonl
{"prompt": "Question?", "chosen": "Good answer", "rejected": "Bad answer"}
{"prompt": "Another question?", "chosen": "Good response", "rejected": "Bad response"}
```
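A quick sanity check before training can catch malformed rows early. This is a hypothetical helper (`validate_jsonl` is not part of the training script) that verifies each line parses and carries the three required keys:

```python
import json

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

def validate_jsonl(lines):
    """Parse JSONL lines and ensure each row has the keys DPO training expects."""
    rows = []
    for i, line in enumerate(lines, start=1):
        row = json.loads(line)  # raises json.JSONDecodeError on malformed lines
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            raise ValueError(f"line {i}: missing keys {sorted(missing)}")
        rows.append(row)
    return rows

sample = [
    '{"prompt": "Question?", "chosen": "Good answer", "rejected": "Bad answer"}',
]
rows = validate_jsonl(sample)
```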
## Training Parameters

| Parameter | Default | Description |
|---|---|---|
| `--beta` | 0.1 | KL penalty coefficient (0.05-0.2) |
| `--epochs` | 3 | Number of training epochs |
| `--lr` | 2e-5 | Learning rate |
| `--lora_r` | 16 | LoRA rank |
| `--lora_alpha` | 16 | LoRA alpha |
| `--batch_size` | 4 | Batch size per device |
| `--max_length` | 1024 | Maximum sequence length |
| `--max_samples` | 1000 | Maximum training samples |
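For reference, the flags above could be parsed like this; the snippet is a hypothetical sketch mirroring the table, and the real `train_dpo_qwen3.py` may define its arguments differently:

```python
import argparse

# Hypothetical argument parser matching the defaults in the table above.
parser = argparse.ArgumentParser(description="DPO training for Qwen3-0.6B")
parser.add_argument("--beta", type=float, default=0.1)
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--lr", type=float, default=2e-5)
parser.add_argument("--lora_r", type=int, default=16)
parser.add_argument("--lora_alpha", type=int, default=16)
parser.add_argument("--batch_size", type=int, default=4)
parser.add_argument("--max_length", type=int, default=1024)
parser.add_argument("--max_samples", type=int, default=1000)

# Unspecified flags fall back to the table defaults.
args = parser.parse_args(["--beta", "0.05", "--epochs", "2"])
```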
## Example Commands
### Basic Training

```bash
python train_dpo_qwen3.py
```
### Training with Beta Sweep

```bash
# Try different beta values
python train_dpo_qwen3.py --beta 0.05 --output_dir ./dpo-beta-0.05
python train_dpo_qwen3.py --beta 0.10 --output_dir ./dpo-beta-0.10
python train_dpo_qwen3.py --beta 0.15 --output_dir ./dpo-beta-0.15
```
### Training with More Data

```bash
python train_dpo_qwen3.py --max_samples 5000 --epochs 2
```
### Push to HuggingFace

```bash
python train_dpo_qwen3.py --push_to_hub "your-username/Qwen3-0.6B-DPO"
```
## Output

The trained model is saved to `./qwen3-0.6b-dpo/` (default):

- LoRA adapters that can be merged with the base model
- Ready for quantization and deployment
## Next Steps

After DPO training:

- Merge LoRA adapters: `python merge_lora.py`
- Quantize: use your existing quantization scripts
- Deploy: upload to HuggingFace or use locally