exp29-dpo-epoch2

This model is a fine-tuned version of ogwata/exp21-sft-dpo-lr7e7-beta02, trained with Direct Preference Optimization (DPO) via the Unsloth library.

This repository contains the fully merged 16-bit weights; no adapter loading is required.
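Because the LoRA weights are already merged, the checkpoint can be loaded like any ordinary causal-LM repository. A minimal sketch using 🤗 Transformers (the repo id comes from this card; everything else is standard library usage, not code from the author):

```python
MODEL_ID = "ogwata/exp29-dpo-epoch2"

def load_model(model_id: str = MODEL_ID):
    """Load tokenizer and merged BF16 model; no PEFT/adapter step is needed."""
    # Imported lazily so the snippet can be read without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # checkpoint is stored in BF16
        device_map="auto",           # requires the accelerate package
    )
    return tokenizer, model
```

Calling load_model() downloads the full BF16 weights (roughly 8 GB for a 4B-parameter model), so run it only on a machine with sufficient disk and memory.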

Training Configuration

  • Base model: ogwata/exp21-sft-dpo-lr7e7-beta02
  • Method: DPO (Direct Preference Optimization)
  • Epochs: 2
  • Learning rate: 7e-07
  • Beta: 0.2
  • Max sequence length: 1024
  • LoRA Config: r=8, alpha=16 (merged into base)
  • Model size: 4B parameters
  • Weights format: Safetensors (BF16)
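The configuration above can be expressed programmatically. A sketch mapping the card's hyperparameters onto TRL's DPOConfig (Unsloth's DPO path builds on TRL's DPOTrainer); only the hyperparameter values come from this card, while the output directory and the TRL mapping itself are illustrative assumptions:

```python
# Values taken from the Training Configuration list on this card.
HPARAMS = {
    "base_model": "ogwata/exp21-sft-dpo-lr7e7-beta02",
    "learning_rate": 7e-7,
    "beta": 0.2,
    "num_train_epochs": 2,
    "max_length": 1024,
    "lora_r": 8,        # merged into the base weights in this repo
    "lora_alpha": 16,
}

def build_dpo_config(output_dir: str = "exp29-dpo-epoch2"):
    """Map the card's hyperparameters onto TRL's DPOConfig.

    trl is imported lazily so this module stays importable without it.
    """
    from trl import DPOConfig  # pip install trl

    return DPOConfig(
        output_dir=output_dir,
        learning_rate=HPARAMS["learning_rate"],
        beta=HPARAMS["beta"],                        # DPO KL-penalty strength
        num_train_epochs=HPARAMS["num_train_epochs"],
        max_length=HPARAMS["max_length"],            # prompt + completion cap
        bf16=True,                                   # weights are published in BF16
    )
```

The small beta (0.2) keeps the policy close to the SFT reference model, which matches the conservative 7e-7 learning rate reported above.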