qwen3-4b-struct-evaluation-dpo-v3

This repository provides a LoRA adapter for Qwen/Qwen3-4B-Instruct-2507. It was developed by performing DPO (Direct Preference Optimization) on top of a previously SFT-tuned model: satoyutaka/llm2025_main_0.

This repository contains LoRA adapter weights only. The base model must be loaded separately.

Training Objective

This adapter is the 3rd iteration of the LLM2025Autumn "struct-eval-comp" project. Building upon the specialized SFT model, this DPO version further refines the model's ability to produce clean, structured outputs (JSON / CSV / TOML) by penalizing common formatting artifacts and conversational preambles.

Training Configuration

Base model: Qwen/Qwen3-4B-Instruct-2507
Starting Adapter (SFT): satoyutaka/llm2025_main_0
Method: DPO (Direct Preference Optimization) via Unsloth
Samples: ~250 high-quality preference pairs
Learning rate: 2e-6
Beta: 0.05
LoRA Parameters: r=16, alpha=16
Target Modules: All major projections (q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj)

Data Preparation Process

SFT-to-DPO Pipeline: We first established a strong baseline with SFT, then used DPO to "polish" the output style, specifically targeting the elimination of Markdown code blocks and irrelevant text.
Preference Learning: Used (Chosen / Rejected) pairs to reinforce strict adherence to data formats without introductory filler.
Source: Based on u-10bei/dpo-dataset-qwen-cot.

Usage

from unsloth import FastLanguageModel
import torch

base = "Qwen/Qwen3-4B-Instruct-2507"
adapter = "satoyutaka/qwen3-4b-struct-evaluation-dpo-v3" # Replace with your actual repo ID

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = base,
    max_seq_length = 2048,
    load_in_4bit = True,
)
model.load_adapter(adapter)

Sources & License (IMPORTANT)

Training Data: u-10bei/dpo-dataset-qwen-cot
License: Follows the original base model and dataset terms.

＜日本語訳＞

qwen3-4b-struct-evaluation-dpo-v3

このリポジトリは、Qwen/Qwen3-4B-Instruct-2507 をベースモデルとし、先にSFTを施した satoyutaka/llm2025_main_0 に対してさらに DPO (Direct Preference Optimization) を実行して作成された LoRA アダプターです。

【重要】本リポジトリには LoRA アダプターの重みのみ が含まれています。ベースモデルは別途ロードする必要があります。

学習の目的

本モデルは LLM2025Autumn「struct-eval-comp」 プロジェクトの第3試行モデルです。 SFTで培った構造化能力をベースに、DPOによって「よりクリーンで、余計な装飾のない出力」を好むように最適化しました。JSON / CSV / TOML 等の形式精度を最大化し、システムによる自動パースとの互換性を高めています。

学習設定

ベースモデル: Qwen/Qwen3-4B-Instruct-2507
ベースアダプター(SFT): satoyutaka/llm2025_main_0
手法: DPO (Direct Preference Optimization) / Unsloth使用
学習サンプル数: 約250件（厳選されたペアデータ）
学習率: 2e-6
Beta: 0.05

データ作成プロセス

SFTからの進化: SFTで学習済みのモデルに対し、さらに「好ましい回答（Chosen）」と「好ましくない回答（Rejected）」を比較学習させることで、バッククォートによる囲いや不要な挨拶文を徹底的に排除しました。
ノイズ除去: 思考過程（CoT）は保持しつつ、最終的な出力からフォーマット上のノイズを消し去るように調整されています。
元データの選定: u-10bei/dpo-dataset-qwen-cot を使用。

使い方

（英語セクションの Usage コードを参照してください）

ソースおよびライセンス

学習データ: u-10bei/dpo-dataset-qwen-cot
ライセンス: 元モデルおよびデータセットの規約に準拠します。

Downloads last month: 16

Model tree for satoyutaka/LLM2025_main_0_DPO3

Base model

Qwen/Qwen3-4B-Instruct-2507

Adapter

(1645)

this model