# exp29-dpo-epoch2
This model is a fine-tuned version of ogwata/exp21-sft-dpo-lr7e7-beta02, trained with Direct Preference Optimization (DPO) via the Unsloth library.

This repository contains the fully merged 16-bit weights; no adapter loading is required.
## Training Configuration
- Base model: ogwata/exp21-sft-dpo-lr7e7-beta02
- Method: DPO (Direct Preference Optimization)
- Epochs: 2
- Learning rate: 7e-07
- Beta: 0.2
- Max sequence length: 1024
- LoRA Config: r=8, alpha=16 (merged into base)
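
To make the `Beta: 0.2` setting above concrete, here is a minimal, dependency-free sketch of the per-pair DPO loss. The function name and arguments are illustrative (they are not from this repository's training code); in practice the loss is computed over batched log-probabilities by the training framework.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.2):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Illustrative sketch only; beta=0.2 mirrors the configuration above.
    """
    margin = ((policy_chosen_logp - policy_rejected_logp)
              - (ref_chosen_logp - ref_rejected_logp))
    logits = beta * margin
    # -log(sigmoid(logits)), written out explicitly
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

With equal chosen/rejected log-probabilities the loss is log 2; as the policy widens its preference margin over the reference model, the loss decreases, which is the gradient signal DPO trains on.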