metadata
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- u-10bei/dpo-dataset-qwen-cot
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- dpo
- unsloth
- qwen
- alignment
exp007-dpo-fixed
DPO fine-tune trained on a preference dataset corrected for length bias by excluding the offending samples.
Training Configuration
- Base model: Qwen/Qwen3-4B-Instruct-2507
- Method: DPO (Direct Preference Optimization)
- Dataset fix: excluded samples where the rejected CoT is more than 1.5x the length of the chosen CoT
- Epochs: 1
- Learning rate: 1e-07
- Beta: 0.1
- Max sequence length: 1024
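
The sample exclusion rule in the configuration above can be sketched roughly as follows. This is a minimal illustration and not the script used for this experiment: the helper names `keep_sample` and `filter_length_bias` are hypothetical, the `chosen` / `rejected` field names are assumed, and length is measured in characters here only because the card does not say whether characters or tokens were used.

```python
from typing import Callable, Dict, List


def keep_sample(
    chosen: str,
    rejected: str,
    max_ratio: float = 1.5,
    length_fn: Callable[[str], int] = len,
) -> bool:
    """Return True if the rejected text is at most max_ratio times as long as the chosen text.

    length_fn defaults to character count (an assumption); pass a
    tokenizer-based counter to measure length in tokens instead.
    """
    return length_fn(rejected) <= max_ratio * length_fn(chosen)


def filter_length_bias(
    samples: List[Dict[str, str]],
    max_ratio: float = 1.5,
) -> List[Dict[str, str]]:
    """Drop preference pairs whose rejected text exceeds the length ratio."""
    return [
        s for s in samples
        if keep_sample(s["chosen"], s["rejected"], max_ratio)
    ]


if __name__ == "__main__":
    # Toy pairs with assumed field names, for illustration only.
    pairs = [
        {"chosen": "a" * 10, "rejected": "b" * 15},  # ratio 1.5, kept
        {"chosen": "a" * 10, "rejected": "b" * 16},  # ratio 1.6, excluded
    ]
    print(len(filter_length_bias(pairs)))
```

Excluding these pairs, rather than truncating them, keeps every remaining sample intact; the intent is that the preference signal does not simply reduce to "shorter is better".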