LLM2026_DPO_SFT19_v8

This model is a LoRA adapter fine-tuned from unsloth/Qwen2.5-7B-Instruct-bnb-4bit using Direct Preference Optimization (DPO). Training started from the SFT checkpoint makotonlo/LLM2026_SFT_finalv19_7B.

Training Objective

Optimized for high-quality JSON output and logical reasoning through DPO.

Training Configuration

  • Base model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
  • Method: DPO (Direct Preference Optimization)
  • Max Steps: 500
  • Learning rate: 1e-05
  • Beta: 0.5
  • LoRA Config: r=64, alpha=64
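The hyperparameters above can be sketched as a TRL `DPOConfig` plus a PEFT `LoraConfig`. This is a hypothetical reconstruction, not the published training script: only `max_steps`, `learning_rate`, `beta`, `r`, and `lora_alpha` come from the card; the `output_dir` and `task_type` values are illustrative.

```python
# Hedged sketch of the training configuration listed above.
# Assumes TRL's DPOTrainer was used (plausible with unsloth, which wraps TRL),
# but the actual training code is not published.
from trl import DPOConfig
from peft import LoraConfig

dpo_config = DPOConfig(
    output_dir="LLM2026_DPO_SFT19_v8",  # hypothetical output path
    max_steps=500,                      # from the card
    learning_rate=1e-5,                 # from the card
    beta=0.5,                           # DPO temperature, from the card
)

lora_config = LoraConfig(
    r=64,                # LoRA rank, from the card
    lora_alpha=64,       # from the card
    task_type="CAUSAL_LM",  # assumed for an instruct LLM
)
```

A relatively high `beta` of 0.5 keeps the policy close to the SFT reference model, trading off preference-margin gains against drift.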

Usage

This repository contains only the LoRA adapter weights, not a full model. Load them on top of the base model using unsloth, PEFT, or vLLM by pointing to this repository.

Framework versions

  • PEFT 0.13.2