LLM2026_DPO_SFT19_v8
This model is a LoRA adapter fine-tuned from unsloth/Qwen2.5-7B-Instruct-bnb-4bit using Direct Preference Optimization (DPO). Training started from the SFT checkpoint makotonlo/LLM2026_SFT_finalv19_7B.
Training Objective
Optimized for high-quality JSON output and logical reasoning through DPO.
Training Configuration
- Base model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
- Method: DPO (Direct Preference Optimization)
- Max Steps: 500
- Learning rate: 1e-05
- Beta: 0.5
- LoRA Config: r=64, alpha=64
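The beta=0.5 above controls how strongly the policy is pushed away from the reference model. As a reference, the per-pair DPO objective can be sketched in plain Python (the function name and scalar log-probability inputs are illustrative, not part of this repository):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.5):
    # Implicit rewards: how much more (in log-prob) the policy prefers
    # each response than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Binary logistic loss on the reward margin: -log sigmoid(margin).
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference assign identical log-probs, the margin is 0
# and the loss equals ln(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))
```

A larger beta (0.5 here vs. the common 0.1) makes the loss more sensitive to small divergences from the reference model, i.e. a more conservative update.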
Usage
This is a LoRA adapter, not a full model. Load it with unsloth, PEFT, or vLLM by pointing to this repository on top of the base model.
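A minimal loading sketch with transformers + PEFT, assuming the adapter repository id follows the card title under the same namespace as the SFT model (adjust quantization and device settings to your hardware):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
ADAPTER = "makotonlo/LLM2026_DPO_SFT19_v8"  # assumed repo id, see lead-in

base_model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")
model = PeftModel.from_pretrained(base_model, ADAPTER)
tokenizer = AutoTokenizer.from_pretrained(BASE)
```

For throughput-oriented serving, vLLM can load the same adapter via its LoRA support instead of merging the weights.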
Framework versions
- PEFT 0.13.2