# Qwen3-4B Structured Data Expert (Exp13 - DPO with System Prompt)
This model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO).
This repository contains a LoRA adapter trained for structured data generation tasks (JSON, YAML, TOML, XML, CSV, etc.).
## Key Feature

Training and inference formats are fully aligned: the system prompt is embedded directly into the DPO training data, which significantly improves output quality.
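For illustration, this is roughly what one preference pair looks like with the system prompt embedded. The field names follow TRL's conversational preference format and the example contents are assumptions, not the actual schema of u-10bei/dpo-dataset-qwen-cot:

```python
# Hypothetical shape of one DPO preference pair with the system prompt embedded.
# The actual dataset schema may differ.
SYSTEM_PROMPT = "You are a structured data expert. ..."  # full text in the section below

sample = {
    "prompt": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Convert to JSON: name=Alice, age=30, city=Tokyo"},
    ],
    # chosen: raw structured data only (no code fences, no preamble)
    "chosen": [
        {"role": "assistant", "content": '{"name": "Alice", "age": 30, "city": "Tokyo"}'}
    ],
    # rejected: the style being trained away, e.g. fenced output with a preamble
    "rejected": [
        {"role": "assistant",
         "content": 'Here is the JSON:\n```json\n{"name": "Alice", "age": 30, "city": "Tokyo"}\n```'}
    ],
}
```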
## Training Configuration
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-4B-Instruct-2507 + SFT (Exp5) |
| Method | DPO (Direct Preference Optimization) |
| Dataset | u-10bei/dpo-dataset-qwen-cot |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| Learning rate | 5e-7 |
| Epochs | 2 |
| Batch size | 4 (gradient accumulation steps: 2) |
| Beta (DPO) | 0.1 |
| Max length | 1024 tokens |
| Max prompt length | 512 tokens |
| Optimizer | AdamW |
| Warmup ratio | 0.1 |
| Seed | 3407 |
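The training script itself is not included in this repository; the following is a minimal sketch of how the hyperparameters above map onto TRL's `DPOTrainer`. The Exp5 SFT checkpoint id is not named here, so the base model is used as a placeholder:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig

# Placeholder: training actually started from the Exp5 SFT checkpoint (see table),
# whose repo id is not given here.
SFT_MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"

model = AutoModelForCausalLM.from_pretrained(SFT_MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(SFT_MODEL_ID)
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

training_args = DPOConfig(
    output_dir="qwen3-4b-structeval-exp13",
    beta=0.1,                       # DPO KL-penalty coefficient
    learning_rate=5e-7,
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,  # effective batch size 8
    max_length=1024,
    max_prompt_length=512,
    optim="adamw_torch",
    warmup_ratio=0.1,
    seed=3407,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```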
## System Prompt (used at inference)

```text
You are a structured data expert. Output the requested format directly without any explanation, preamble, or markdown code blocks. Do not write ```json, ```yaml, ```toml, ```xml, ```csv or similar. Output only the raw structured data.
```
## Key Improvements over Baseline

- System prompt embedded in DPO training: training and inference formats are fully consistent
- Clean chosen responses: only the structured data portion is kept (no code blocks, no preamble; a cleanup sketch follows this list)
- Code block suppression: 0% code block usage at inference (vs. ~70% for the baseline DPO model)
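The cleaning step for chosen responses is not published; a plausible sketch, assuming markdown fences and preambles were stripped with a regex, looks like this:

```python
import re

# Hypothetical cleanup for chosen responses: if the response wraps the data in a
# markdown fence (possibly after a preamble), keep only the fenced body.
# This illustrates the idea; it is not the repo's actual preprocessing script.
FENCE_RE = re.compile(r"```[a-zA-Z]*\n(.*?)\n```", re.DOTALL)

def extract_structured_data(response: str) -> str:
    match = FENCE_RE.search(response)
    if match:
        return match.group(1).strip()  # fenced block found: keep only its body
    return response.strip()            # already raw structured data

print(extract_structured_data('Here is the JSON:\n```json\n{"a": 1}\n```'))
# -> {"a": 1}
```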
## Inference Example
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE_MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"
ADAPTER_PATH = "tenyyprn/qwen3-4b-structeval-exp13"
SYSTEM_PROMPT = (
    "You are a structured data expert. "
    "Output the requested format directly without any explanation, "
    "preamble, or markdown code blocks. "
    "Do not write ```json, ```yaml, ```toml, ```xml, ```csv or similar. "
    "Output only the raw structured data."
)

# Load the base model, then apply and merge the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
model = model.merge_and_unload()  # merge adapter weights for faster inference
model.eval()

# Build the chat with the same system prompt used during DPO training.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Convert to JSON: name=Alice, age=30, city=Tokyo"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding; print only the newly generated tokens.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
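With greedy decoding, the printed completion should be the raw JSON only, e.g. `{"name": "Alice", "age": 30, "city": "Tokyo"}` (illustrative; the exact output is not guaranteed). Since there are no fences or preamble, the output can be parsed directly:

```python
import json

# Parse the generated text directly; raises json.JSONDecodeError if the model
# produced anything other than valid raw JSON.
decoded = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
data = json.loads(decoded)
print(data["name"])  # -> Alice (assuming the illustrative output above)
```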
## Citations

```bibtex
@inproceedings{rafailov2023direct,
  title     = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
  author    = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
  year      = 2023,
  booktitle = {Advances in Neural Information Processing Systems 36},
  url       = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html},
}
```