---
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
  - u-10bei/dpo-dataset-qwen-cot
language:
  - en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
  - dpo
  - unsloth
  - qwen
  - alignment
---

# exp007-dpo-fixed

DPO training on a dataset whose length bias was fixed by sample exclusion.
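The sample exclusion method can be sketched as a simple filter over preference pairs: any pair whose rejected chain-of-thought is more than 1.5x the length of the chosen one is dropped, so DPO cannot learn to prefer answers merely for being shorter. This is a hypothetical illustration; the field names (`chosen`, `rejected`) and character-level length measure are assumptions, not details confirmed by this card.

```python
def exclude_length_biased(samples, ratio=1.5):
    """Keep only pairs where the rejected response is at most
    `ratio` times the length of the chosen response.
    Field names are assumed, not taken from the actual dataset schema."""
    return [
        s for s in samples
        if len(s["rejected"]) <= ratio * len(s["chosen"])
    ]

pairs = [
    # Rejected is far longer than chosen: a length-biased pair, dropped.
    {"chosen": "short answer", "rejected": "x" * 100},
    # Comparable lengths: kept for training.
    {"chosen": "a balanced answer", "rejected": "another balanced answer"},
]
kept = exclude_length_biased(pairs)
```

After filtering, only the second pair survives, since the first pair's rejected text exceeds 1.5x the chosen text's length.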

## Training Configuration

- Base model: Qwen/Qwen3-4B-Instruct-2507
- Method: DPO (Direct Preference Optimization)
- Dataset fix: excluded samples where the rejected CoT was longer than 1.5x the chosen CoT
- Epochs: 1
- Learning rate: 1e-07
- Beta: 0.1
- Max sequence length: 1024
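The hyperparameters above map naturally onto TRL's `DPOConfig`. The card does not state which training stack was used beyond its `dpo` and `unsloth` tags, so the following is a minimal sketch under that assumption, not the authors' actual script:

```python
from trl import DPOConfig

# Hypothetical mapping of the listed hyperparameters onto TRL's DPOConfig.
config = DPOConfig(
    beta=0.1,            # strength of the KL penalty toward the reference model
    learning_rate=1e-7,
    num_train_epochs=1,
    max_length=1024,     # max sequence length (prompt + completion)
)
# A DPOTrainer would then be built from this config together with the base
# model Qwen/Qwen3-4B-Instruct-2507 and the filtered preference dataset.
```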