# dpo_lora_model_stage3
This repository provides a DPO LoRA adapter fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using QLoRA (4-bit, Unsloth).
The DPO adapter was trained on top of the SFT Stage 3 adapter (DLNorb/lora_structeval_t_qwen3_4b_v2_stage3).
This repository contains LoRA adapter weights only. The base model and SFT adapter must be loaded separately.
## Training Objective
This adapter applies Direct Preference Optimization (DPO) to improve structured output accuracy (JSON / YAML / XML / TOML / CSV) by aligning the model with human preferences.
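As a hedged illustration (not code from this repository), one common way to score structured-output accuracy is to round-trip the model's output through a standard parser. The sketch below uses Python's stdlib `json` module; the `validate_json` helper name is hypothetical.

```python
import json

def validate_json(text: str) -> bool:
    """Return True if `text` parses as valid JSON.

    Hypothetical helper illustrating one way structured-output
    accuracy could be checked; analogous parsers exist for YAML,
    XML, TOML, and CSV."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

The same pattern extends to the other formats mentioned above by swapping in the corresponding parser.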
## Training Configuration
- Base model: Qwen/Qwen3-4B-Instruct-2507
- SFT adapter: lora_structeval_t_qwen3_4b_v2_stage3/checkpoint-100
- Method: DPO with QLoRA (4-bit)
- Max sequence length: 2048
- DPO beta: 0.1
- Learning rate: 1e-07
- LoRA: r=8, alpha=16
- Batch size: 2 (grad accum: 4, effective BS=8)
- Epochs: 1
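For reference, the `DPO beta: 0.1` setting above enters the standard per-pair DPO loss as follows. This is a minimal sketch of the published DPO objective, not code from this repository's training script; inputs are sequence log-probabilities under the policy and the frozen reference model.

```python
import math

def dpo_pair_loss(policy_chosen: float, policy_rejected: float,
                  ref_chosen: float, ref_rejected: float,
                  beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - ref margin)).

    Sketch of the standard DPO objective; `beta` matches the
    training configuration above."""
    margin = beta * ((policy_chosen - policy_rejected)
                     - (ref_chosen - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At zero margin the loss is log 2 ≈ 0.693; it shrinks as the policy prefers the chosen response more strongly than the reference model does.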
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = "Qwen/Qwen3-4B-Instruct-2507"
adapter = "DLNorb/dpo_lora_model_stage3"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    torch_dtype=torch.float16,
    device_map="auto",
)
# Note: this adapter was trained on top of the SFT Stage 3 adapter,
# which may need to be loaded as well (see Training Configuration above).
model = PeftModel.from_pretrained(model, adapter)
```
## Sources & Terms (IMPORTANT)
Training data:
Compliance: Users must comply with each dataset's license (including copyright notice) and the base model's original terms of use.