# exp007-dpo-fixed
DPO fine-tune trained on a dataset whose length bias was fixed via sample exclusion.
## Training Configuration
- Base model: Qwen/Qwen3-4B-Instruct-2507
- Method: DPO (Direct Preference Optimization)
- Dataset fix: excluded samples whose Rejected CoT was longer than 1.5× the Chosen CoT
- Epochs: 1
- Learning rate: 1e-07
- Beta: 0.1
- Max sequence length: 1024
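The dataset fix above can be sketched as a simple filter over preference pairs. This is a hypothetical illustration, not the actual preprocessing script: the field names (`chosen`, `rejected`) and the whitespace-token length proxy are assumptions.

```python
# Sketch of the length-bias fix: drop preference pairs whose rejected
# chain-of-thought is more than 1.5x as long as the chosen one.
# Field names and the length proxy (whitespace token count) are assumptions.

def cot_length(text: str) -> int:
    """Crude length proxy: whitespace token count."""
    return len(text.split())

def filter_length_bias(pairs: list[dict], ratio: float = 1.5) -> list[dict]:
    """Keep only pairs where len(rejected) <= ratio * len(chosen)."""
    return [
        p for p in pairs
        if cot_length(p["rejected"]) <= ratio * cot_length(p["chosen"])
    ]

pairs = [
    {"chosen": "a b c d", "rejected": "a b c d e"},    # 5 <= 1.5 * 4 -> kept
    {"chosen": "a b", "rejected": "a b c d e f"},      # 6 >  1.5 * 2 -> dropped
]
kept = filter_length_bias(pairs)
print(len(kept))  # -> 1
```

Filtering (rather than truncating) keeps the remaining pairs untouched, at the cost of a smaller training set.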
## Model tree for Chattso-GPT/exp007-dpo-fixed

Base model: Qwen/Qwen3-4B-Instruct-2507