exp007-dpo-fixed

DPO fine-tune trained on a preference dataset whose length bias was fixed by sample exclusion.
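As a rough illustration of the sample exclusion idea, the sketch below drops preference pairs whose rejected CoT is much longer than the chosen one, using the 1.5x threshold listed in the training configuration. The field names `chosen` and `rejected`, the helper names, and the use of character count as the length measure are assumptions made for this example; the card does not say how length was measured or how the dataset is structured.

```python
from typing import Callable, Dict, List


def keep_pair(
    chosen_cot: str,
    rejected_cot: str,
    max_ratio: float = 1.5,
    measure: Callable[[str], int] = len,
) -> bool:
    """Return True if the pair should stay in the dataset.

    A pair is excluded when the rejected CoT is longer than
    max_ratio times the chosen CoT. `measure` defaults to character
    count (an assumption); a tokenizer-based length could be passed instead.
    """
    return measure(rejected_cot) <= max_ratio * measure(chosen_cot)


def filter_pairs(pairs: List[Dict[str, str]], max_ratio: float = 1.5) -> List[Dict[str, str]]:
    """Keep only pairs that pass the length-ratio check.

    The keys "chosen" and "rejected" are hypothetical field names.
    """
    return [p for p in pairs if keep_pair(p["chosen"], p["rejected"], max_ratio)]


if __name__ == "__main__":
    demo = [
        {"chosen": "a" * 10, "rejected": "b" * 15},  # ratio 1.5, kept
        {"chosen": "a" * 10, "rejected": "b" * 16},  # ratio 1.6, excluded
    ]
    print(len(filter_pairs(demo)))
```

Passing a token-counting function as `measure` would make the ratio tokenizer-aware, which may be closer to what matters under a fixed maximum sequence length.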

Training Configuration

  • Base model: Qwen/Qwen3-4B-Instruct-2507
  • Method: DPO (Direct Preference Optimization)
  • Dataset fix: excluded samples whose rejected CoT is more than 1.5x as long as the chosen CoT
  • Epochs: 1
  • Learning rate: 1e-07
  • Beta: 0.1
  • Max sequence length: 1024
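
To show where the Beta value above enters, here is a minimal sketch of the standard per-example DPO loss, written in plain Python on scalar sequence log-probabilities. It is not this repository's training code, which the card does not include; the function and argument names are hypothetical.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    margin is the policy's log-ratio advantage for the chosen response
    over the rejected one, relative to the frozen reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); branch for numerical stability.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy favors the chosen response more than the reference does.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

A smaller beta lets the policy drift further from the reference model for the same loss reduction; a larger beta keeps it closer.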

Model size: 4B params (Safetensors, tensor type BF16)