# Qwen3-0.6B DPO Training

Direct Preference Optimization (DPO) training for the Qwen3-0.6B model.
## What is DPO?
DPO (Direct Preference Optimization) is a method to train language models using preference data. Instead of training a separate reward model, DPO directly optimizes the model to prefer "chosen" responses over "rejected" responses.
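The idea above can be sketched as a per-example loss: DPO maximizes the log-sigmoid of the scaled difference between the policy's and the reference model's chosen-vs-rejected log-probability ratios. This is a minimal plain-Python illustration of that formula, not the `trl` implementation:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (policy logratio - ref logratio))."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    logits = beta * (pi_logratio - ref_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy prefers the chosen response more strongly than the
# reference model does, logits > 0 and the loss drops below log(2).
loss = dpo_loss(-2.0, -5.0, -3.0, -4.0, beta=0.1)
```

`beta` controls how far the policy may drift from the reference model: larger values penalize deviation more, which is why the table below suggests sweeping it.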
## Quick Start
### 1. Install Dependencies

```bash
pip install torch transformers peft trl datasets
```
### 2. Run Training with Default Dataset

```bash
python train_dpo_qwen3.py --beta 0.1 --epochs 3 --lr 2e-5
```
### 3. Run Training with Custom Data

```bash
python train_dpo_qwen3.py --dataset your_data.jsonl --beta 0.05
```
## Dataset Format
Your JSONL file should contain one JSON object per line, in this format:

```jsonl
{"prompt": "Question?", "chosen": "Good answer", "rejected": "Bad answer"}
{"prompt": "Another question?", "chosen": "Good response", "rejected": "Bad response"}
```
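A quick sanity check before training can catch malformed rows early. This is a hypothetical helper (`validate_jsonl` is not part of the training script) that verifies each line parses and carries the three required keys:

```python
import json

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

def validate_jsonl(lines):
    """Parse JSONL lines and ensure each row has the keys DPO training expects."""
    rows = []
    for i, line in enumerate(lines, start=1):
        row = json.loads(line)  # raises json.JSONDecodeError on malformed lines
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            raise ValueError(f"line {i}: missing keys {sorted(missing)}")
        rows.append(row)
    return rows

sample = [
    '{"prompt": "Question?", "chosen": "Good answer", "rejected": "Bad answer"}',
]
rows = validate_jsonl(sample)
```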
## Training Parameters

| Parameter | Default | Description |
|---|---|---|
| `--beta` | 0.1 | KL penalty coefficient (0.05-0.2) |
| `--epochs` | 3 | Number of training epochs |
| `--lr` | 2e-5 | Learning rate |
| `--lora_r` | 16 | LoRA rank |
| `--lora_alpha` | 16 | LoRA alpha |
| `--batch_size` | 4 | Batch size per device |
| `--max_length` | 1024 | Maximum sequence length |
| `--max_samples` | 1000 | Maximum training samples |
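For reference, the flags above could be parsed like this; the snippet is a hypothetical sketch mirroring the table, and the real `train_dpo_qwen3.py` may define its arguments differently:

```python
import argparse

# Hypothetical argument parser matching the defaults in the table above.
parser = argparse.ArgumentParser(description="DPO training for Qwen3-0.6B")
parser.add_argument("--beta", type=float, default=0.1)
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--lr", type=float, default=2e-5)
parser.add_argument("--lora_r", type=int, default=16)
parser.add_argument("--lora_alpha", type=int, default=16)
parser.add_argument("--batch_size", type=int, default=4)
parser.add_argument("--max_length", type=int, default=1024)
parser.add_argument("--max_samples", type=int, default=1000)

# Unspecified flags fall back to the table defaults.
args = parser.parse_args(["--beta", "0.05", "--epochs", "2"])
```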
## Example Commands
### Basic Training

```bash
python train_dpo_qwen3.py
```
### Training with Beta Sweep

```bash
# Try different beta values
python train_dpo_qwen3.py --beta 0.05 --output_dir ./dpo-beta-0.05
python train_dpo_qwen3.py --beta 0.10 --output_dir ./dpo-beta-0.10
python train_dpo_qwen3.py --beta 0.15 --output_dir ./dpo-beta-0.15
```
### Training with More Data

```bash
python train_dpo_qwen3.py --max_samples 5000 --epochs 2
```
### Push to HuggingFace

```bash
python train_dpo_qwen3.py --push_to_hub "your-username/Qwen3-0.6B-DPO"
```
## Output

The trained model is saved to `./qwen3-0.6b-dpo/` (default):

- LoRA adapters that can be merged with the base model
- Ready for quantization and deployment
## Next Steps

After DPO training:

- Merge LoRA adapters: `python merge_lora.py`
- Quantize: use your existing quantization scripts
- Deploy: upload to HuggingFace or use locally