exp5a: DPO on exp2a SFT

This model is a DPO fine-tune of exp2a (SFT: u-10bei 1-5, 3 epochs, score 0.708).

Training

  • SFT base: kawatoshi3/exp2a-lora
  • DPO dataset: u-10bei/dpo-dataset-qwen-cot
  • DPO lr: 1e-7, beta: 0.1, epochs: 1
  • LoRA: r=8, alpha=16
  • This is a merged 16-bit model, so no adapter loading is needed
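
To make the role of beta concrete, here is a minimal sketch of the standard DPO objective for a single preference pair, using the hyperparameters listed above. It is not the training script used for this model: the names DPO_CONFIG and dpo_loss are hypothetical, and the log-probabilities passed in are made-up illustrative values.

```python
import math

# Hyperparameters copied from the Training list above.
# The dict name and keys are illustrative, not taken from the actual training script.
DPO_CONFIG = {
    "learning_rate": 1e-7,
    "beta": 0.1,
    "epochs": 1,
    "lora_r": 8,
    "lora_alpha": 16,
}


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=DPO_CONFIG["beta"]):
    """Standard DPO loss for one preference pair.

    Inputs are summed sequence log-probabilities under the policy being
    trained and under the frozen reference model (here, the SFT model).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: margin is 0, loss is log(2).
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: loss drops.
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

At the start of DPO training the policy equals the reference, so the loss is log 2 (about 0.693). A small beta such as 0.1 scales the margin down, so the policy has to move further from the reference before the loss falls noticeably.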

Model size: 4B params (Safetensors, tensor type F16)

Model repository: kawatoshi3/exp5a-dpo-lora