LLM2026_DPO_SFT19_v8
This model is a LoRA adapter fine-tuned from unsloth/Qwen2.5-7B-Instruct-bnb-4bit using Direct Preference Optimization (DPO). Training started from the SFT checkpoint makotonlo/LLM2026_SFT_finalv19_7B.
Training Objective
Optimized for high-quality JSON output and logical reasoning through DPO.
Training Configuration
- Base model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
- Method: DPO (Direct Preference Optimization)
- Max Steps: 500
- Learning rate: 1e-05
- Beta: 0.5
- LoRA Config: r=64, alpha=64
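The beta=0.5 above controls how strongly the policy is pushed away from the reference model. As a reference, the per-pair DPO objective can be sketched in plain Python (the function name and scalar log-probability inputs are illustrative, not part of this repository):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.5):
    # Implicit rewards: how much more (in log-prob) the policy prefers
    # each response than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Binary logistic loss on the reward margin: -log sigmoid(margin).
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference assign identical log-probs, the margin is 0
# and the loss equals ln(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))
```

A larger beta (0.5 here vs. the common 0.1) makes the loss more sensitive to small divergences from the reference model, i.e. a more conservative update.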
Usage
This is a LoRA adapter, not a full model. Load it with unsloth, PEFT, or vLLM by pointing to this repository on top of the base model.
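A minimal loading sketch with transformers + PEFT, assuming the adapter repository id follows the card title under the same namespace as the SFT model (adjust quantization and device settings to your hardware):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
ADAPTER = "makotonlo/LLM2026_DPO_SFT19_v8"  # assumed repo id, see lead-in

base_model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")
model = PeftModel.from_pretrained(base_model, ADAPTER)
tokenizer = AutoTokenizer.from_pretrained(BASE)
```

For throughput-oriented serving, vLLM can load the same adapter via its LoRA support instead of merging the weights.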
Framework versions
- PEFT 0.13.2