qwen3-4b-struct-evaluation-dpo-v7

This repository provides a LoRA adapter for Qwen/Qwen3-4B-Instruct-2507. It represents the v7 iteration of the "struct-eval-comp" series, maintaining the high-tier score of 0.77136 with enhanced formatting stability.

This version was developed by performing specialized DPO (Direct Preference Optimization) on top of the v6 adapter, specifically targeting the stabilization of raw outputs over multiple training epochs.

Training Objective

The v7 adapter focuses on "Structural Persistence." Building on the logical foundations of v6, this version underwent extended training to further decouple the model's reasoning process (CoT) from its output format, ensuring that it provides deep analysis while strictly adhering to raw JSON/CSV/TOML formats without Markdown interference.

Training Configuration

Base model: Qwen/Qwen3-4B-Instruct-2507
Starting Adapter: satoyutaka/qwen3-4b-struct-evaluation-dpo-v6 (0.77136 baseline)
Method: DPO (Direct Preference Optimization) via Unsloth
Samples: 250 logic-dense preference pairs (Refined via BERT scoring)
Learning rate: 8e-7 (Conservative rate to preserve reasoning)
Beta: 0.1
Epochs: 3 (Extended for format habituation)
LoRA Parameters: r=16, alpha=16
Target Modules: All major projections (q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj)

Data Preparation Process

Stable Logic Selection: Utilized BERT-based scoring to prioritize samples that exhibit high internal consistency between the "Approach" (reasoning) and the "Output" (data).
Persistent Formatting: Reinforced the "Chosen" pairs to be devoid of Markdown markers (```) across 3 epochs, pushing the model to view raw text as the only valid structural output.
Complexity Preservation: Maintained the Chain-of-Thought (Approach) to ensure that the model does not "skip" steps in complex data extraction tasks.

Usage

from unsloth import FastLanguageModel
import torch

base = "Qwen/Qwen3-4B-Instruct-2507"
adapter = "satoyutaka/LLM2025_main_0_DPO7"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = base,
    max_seq_length = 2048,
    load_in_4bit = True,
)
model.load_adapter(adapter)

Sources & License (IMPORTANT)

Training Data: Based on u-10bei/dpo-dataset-qwen-cot
License: Follows the original base model and dataset terms.

＜日本語訳＞

qwen3-4b-struct-evaluation-dpo-v7

このリポジトリは、Qwen/Qwen3-4B-Instruct-2507 用のLoRAアダプターです。 struct-eval-comp プロジェクトにおける「V7」モデルであり、V6の最高スコア 0.77136 を維持しつつ、フォーマットの安定性を強化したモデルです。

V6アダプターをベースとし、複数エポックにわたるDPO（直接選好最適化）を施すことで、出力スタイルの定着を図っています。

学習の目的

V7の主な目的は「構造化の永続性」です。V6で培った高い論理性（CoT）を継承しつつ、さらに「思考は深く、出力は素っ気なく（Rawデータのみ）」という挙動を3エポックの学習で徹底させました。JSON / CSV / TOML 形式の抽出精度を保ちながら、Markdown等の装飾を徹底的に排除することに特化しています。

学習設定

ベースモデル: Qwen/Qwen3-4B-Instruct-2507
ベースアダプター: satoyutaka/qwen3-4b-struct-evaluation-dpo-v6 (0.77136 baseline)
手法: DPO (Direct Preference Optimization) / Unsloth使用
学習サンプル数: 250件（BERTによる論理密度スコアリングで厳選）
学習率: 8e-7 (推論能力維持のための慎重な設定)
Beta: 0.1
エポック数: 3 (スタイルの徹底定着のため)

データ作成プロセス

論理整合性による選別: BERTを用いて、思考過程（Approach）と最終出力の整合性が高いサンプルを優先的に採用しました。
Markdown排除の定着: 3エポックにわたる反復学習により、正解データ（Chosen）からバッククォート等の装飾を完全に排除するルールをモデルに強く学習させています。
推論プロセスの保持: 思考過程（Approach）を保持することで、複雑なデータ抽出タスクにおける精度低下を防止しています。

使い方

（英語セクションの Usage コードを参照してください。アダプターIDは satoyutaka/qwen3-4b-struct-evaluation-dpo-v7 に適宜読み替えてください）

ソースおよびライセンス

学習データ: u-10bei/dpo-dataset-qwen-cot をベースに独自加工
ライセンス: 元モデルおよびデータセットの規約に準拠します。

Downloads last month: 17

Model tree for satoyutaka/LLM2025_main_0_DPO7

Base model

Qwen/Qwen3-4B-Instruct-2507

Adapter

(1623)

this model