
# Qwen3-0.6B DPO Training

Direct Preference Optimization (DPO) training for Qwen3-0.6B model.

## What is DPO?

DPO (Direct Preference Optimization) is a method to train language models using preference data. Instead of training a separate reward model, DPO directly optimizes the model to prefer "chosen" responses over "rejected" responses.
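Concretely, the per-pair DPO objective can be sketched in a few lines of plain Python (a simplified illustration, not the training script's actual implementation; the log-probability arguments here are stand-in scalars):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a response under the
    policy (pi_*) or the frozen reference model (ref_*). beta scales the
    implicit KL penalty: higher beta keeps the policy closer to the reference.
    """
    # How much more the policy favors "chosen" over "rejected",
    # relative to the reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written in a numerically stable form.
    return math.log1p(math.exp(-beta * margin))
```

With no preference learned yet (margin 0) the loss is log 2 ≈ 0.693; it falls toward 0 as the policy learns to favor the chosen response, and a larger `beta` amplifies whatever margin exists.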

## Quick Start

### 1. Install Dependencies

```bash
pip install torch transformers peft trl datasets
```

### 2. Run Training with the Default Dataset

```bash
python train_dpo_qwen3.py --beta 0.1 --epochs 3 --lr 2e-5
```

### 3. Run Training with Custom Data

```bash
python train_dpo_qwen3.py --dataset your_data.jsonl --beta 0.05
```

## Dataset Format

Your JSONL file should contain one preference record per line, in this format:

```jsonl
{"prompt": "Question?", "chosen": "Good answer", "rejected": "Bad answer"}
{"prompt": "Another question?", "chosen": "Good response", "rejected": "Bad response"}
```
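A quick way to sanity-check a dataset before training is to validate each record's keys while parsing (a minimal sketch; `load_preference_jsonl` is an illustrative helper, not part of the training script):

```python
import json

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

def load_preference_jsonl(lines):
    """Parse JSONL preference records, checking each has the required keys."""
    records = []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"line {i} is missing keys: {sorted(missing)}")
        records.append(rec)
    return records
```

Running this over a file (`load_preference_jsonl(open("your_data.jsonl"))`) fails fast with the offending line number instead of surfacing a confusing error mid-training.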

## Training Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--beta` | 0.1 | KL penalty coefficient (typically 0.05-0.2) |
| `--epochs` | 3 | Number of training epochs |
| `--lr` | 2e-5 | Learning rate |
| `--lora_r` | 16 | LoRA rank |
| `--lora_alpha` | 16 | LoRA alpha |
| `--batch_size` | 4 | Batch size per device |
| `--max_length` | 1024 | Maximum sequence length |
| `--max_samples` | 1000 | Maximum number of training samples |
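For orientation, here is roughly how these flags could map onto `trl` and `peft` configuration objects (an illustrative fragment only; the actual wiring lives in `train_dpo_qwen3.py`, and this assumes a recent `trl` release where `beta` is a `DPOConfig` field):

```python
from peft import LoraConfig
from trl import DPOConfig

# Hypothetical mapping of the CLI defaults above onto config objects.
peft_config = LoraConfig(
    r=16,                  # --lora_r
    lora_alpha=16,         # --lora_alpha
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="./qwen3-0.6b-dpo",
    beta=0.1,                       # --beta: KL penalty coefficient
    learning_rate=2e-5,             # --lr
    num_train_epochs=3,             # --epochs
    per_device_train_batch_size=4,  # --batch_size
    max_length=1024,                # --max_length
)
```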

## Example Commands

### Basic Training

```bash
python train_dpo_qwen3.py
```

### Training with a Beta Sweep

```bash
# Try different beta values
python train_dpo_qwen3.py --beta 0.05 --output_dir ./dpo-beta-0.05
python train_dpo_qwen3.py --beta 0.10 --output_dir ./dpo-beta-0.10
python train_dpo_qwen3.py --beta 0.15 --output_dir ./dpo-beta-0.15
```

### Training with More Data

```bash
python train_dpo_qwen3.py --max_samples 5000 --epochs 2
```

## Push to the Hugging Face Hub

```bash
python train_dpo_qwen3.py --push_to_hub "your-username/Qwen3-0.6B-DPO"
```

## Output

The trained model is saved to `./qwen3-0.6b-dpo/` by default. The output consists of:

- LoRA adapters that can be merged with the base model
- Weights ready for quantization and deployment

## Next Steps

After DPO training:

1. **Merge LoRA adapters:** `python merge_lora.py`
2. **Quantize:** use your existing quantization scripts
3. **Deploy:** upload to the Hugging Face Hub or use locally
