base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
  - u-10bei/dpo-dataset-qwen-cot
language:
  - en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
  - dpo
  - unsloth
  - qwen
  - alignment

exp007-dpo-fixed

A DPO fine-tune of Qwen/Qwen3-4B-Instruct-2507, trained on a dataset corrected for length bias by excluding the affected samples.
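The sample exclusion step can be sketched roughly as follows, using the 1.5x ratio listed under Training Configuration. This is an illustration, not the script used for this experiment: the `chosen` and `rejected` field names, the helper names, and the use of character count as the length measure are all assumptions, since the card does not state how CoT length was measured.

```python
from typing import Callable, Dict, List


def keep_sample(
    chosen_cot: str,
    rejected_cot: str,
    max_ratio: float = 1.5,
    length_fn: Callable[[str], int] = len,  # assumption: character count
) -> bool:
    """Return True if the rejected CoT is at most max_ratio times the chosen CoT."""
    return length_fn(rejected_cot) <= max_ratio * length_fn(chosen_cot)


def filter_length_biased(
    samples: List[Dict[str, str]],
    max_ratio: float = 1.5,
) -> List[Dict[str, str]]:
    """Drop preference pairs whose rejected CoT is much longer than the chosen one.

    Assumes each sample is a dict with hypothetical 'chosen' and 'rejected'
    keys that hold the CoT text.
    """
    return [
        s for s in samples
        if keep_sample(s["chosen"], s["rejected"], max_ratio)
    ]


if __name__ == "__main__":
    demo = [
        {"chosen": "a" * 10, "rejected": "b" * 15},  # ratio 1.5, kept
        {"chosen": "a" * 10, "rejected": "b" * 40},  # ratio 4.0, excluded
    ]
    print(len(filter_length_biased(demo)))
```

Passing a tokenizer-based `length_fn` would switch the measure from characters to tokens without changing the rest of the filter.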

Training Configuration

Base model: Qwen/Qwen3-4B-Instruct-2507
Method: DPO (Direct Preference Optimization)
Dataset fix: excluded samples where the rejected CoT is more than 1.5x the length of the chosen CoT
Epochs: 1
Learning rate: 1e-07
Beta: 0.1
Max sequence length: 1024
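To show where the Beta value above enters, here is a minimal sketch of the standard per-example DPO loss in plain Python. It is not the training code for this model; the function and argument names are hypothetical, and the inputs are assumed to be summed log-probabilities of each response under the policy and the frozen reference model.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # With identical policy and reference the margin is 0 and the loss is log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

A smaller beta scales the margin down, so a given change in log-probability moves the loss less, which keeps the policy closer to the reference model.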