qwen3-4b-struct-evaluation-dpo-v11

This repository provides a LoRA adapter for Qwen/Qwen3-4B-Instruct-2507. It represents the "Surgical Strike" version (v11) of the "struct-eval-comp" series, specifically designed to eliminate Markdown formatting bias while preserving the core intelligence of the Attention layers.

v11 follows the structural collapse of v8 and the stagnation of v9/v10, and adopts a "Partial Freezing" philosophy. No layer is literally frozen: all major projections stay trainable (see Target Modules), but an ultra-low learning rate combined with a low Beta value is intended to keep updates small enough that the reasoning logic attributed to the Attention modules is protected, while the MLP layers are still nudged toward raw, unfenced output.

Training Objective

The primary goal of v11 is "Intelligence Preservation through Attention Stability." Previous iterations showed that aggressive DPO across all layers could degrade the model's ability to handle complex JSON nesting. v11 addresses this with a micro-learning rate (5e-7), intended to leave the Attention weights (credited with the model's 0.77136 logic score) virtually unchanged while nudging the MLP layers to drop the persistent ```json backticks.
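To make the formatting target concrete, the following is a minimal sketch of how one might check whether a model response is raw JSON or is still wrapped in a Markdown fence. The helper names (`has_markdown_fence`, `strip_fence`, `is_raw_json`) are hypothetical and are not part of this repository or of the evaluation used for the scores above.

```python
import json
import re

# Matches a response that is entirely wrapped in a Markdown code fence,
# for example "```json\n{...}\n```". Hypothetical helper pattern.
FENCE_RE = re.compile(r"^\s*```[a-zA-Z]*\s*\n(.*?)\n?\s*```\s*$", re.DOTALL)


def has_markdown_fence(text: str) -> bool:
    """Return True if the whole response is wrapped in a code fence."""
    return FENCE_RE.match(text) is not None


def strip_fence(text: str) -> str:
    """Return the fenced payload, or the text unchanged if there is no fence."""
    match = FENCE_RE.match(text)
    return match.group(1) if match else text


def is_raw_json(text: str) -> bool:
    """Return True only if the response parses as JSON exactly as emitted."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


fenced = "```json\n{\"a\": {\"b\": [1, 2]}}\n```"
raw = "{\"a\": {\"b\": [1, 2]}}"
print(has_markdown_fence(fenced), is_raw_json(fenced))  # fenced output fails the raw check
print(has_markdown_fence(raw), is_raw_json(raw))        # raw output passes it
```

A check like this separates the two failure modes discussed here: a fenced response whose payload is still valid JSON (a style problem) versus a response whose structure is broken (a logic problem).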

Training Configuration

Base model: Qwen/Qwen3-4B-Instruct-2507
Starting Adapter: satoyutaka/qwen3-4b-struct-evaluation-dpo-v10
Method: DPO (Direct Preference Optimization) via Unsloth
Learning rate: 5e-7 (Crucial for effectively "freezing" the Attention logic)
Beta: 0.025 (The "Golden Ratio" discovered to balance style and stability)
Epochs: 1 (Single-pass refinement to prevent over-fitting)
LoRA Parameters: r=16, alpha=16
Target Modules: All major projections (attention q, k, v, o and MLP gate, up, down). All are targeted so that the micro-gradient can find the least-resistance path to formatting correction without disrupting core logic.
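The settings above can be collected in one place as a plain configuration object. This is only a sketch: `DpoRunConfig` is a hypothetical container (not an Unsloth or TRL class), and the seven projection names are an assumption about what "all major projections" means for a Qwen-style decoder block.

```python
from dataclasses import dataclass

# Assumed projection names for Qwen-style decoder blocks.
ATTENTION_MODULES = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_MODULES = ("gate_proj", "up_proj", "down_proj")


@dataclass(frozen=True)
class DpoRunConfig:
    """Hypothetical container for the v11 hyperparameters listed above."""

    base_model: str = "Qwen/Qwen3-4B-Instruct-2507"
    starting_adapter: str = "satoyutaka/qwen3-4b-struct-evaluation-dpo-v10"
    learning_rate: float = 5e-7
    beta: float = 0.025
    num_train_epochs: int = 1
    lora_r: int = 16
    lora_alpha: int = 16
    target_modules: tuple = ATTENTION_MODULES + MLP_MODULES

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scaling factor (alpha / r) applied to the low-rank update.
        return self.lora_alpha / self.lora_r


cfg = DpoRunConfig()
print(cfg.target_modules)
print(cfg.lora_scaling)
```

With r=16 and alpha=16 the LoRA scaling factor is 1, so the size of each update is governed mainly by the learning rate and by the gradient scale that Beta sets.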

Data Preparation Process

Intelligence Fixation: Continued use of the BERT-scored high-logic dataset to ensure the model maintains its ability to handle deeply nested structures.
Micro-Adjustment: By pairing a low Beta with an even lower Learning Rate, we've created an environment where the model is encouraged to drop backticks without having enough "energy" to rewrite its core extraction logic.
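The "low energy" intuition can be illustrated with the standard DPO loss for a single preference pair, where Beta scales both the loss argument and the gradient. The sketch below is a pure-Python illustration, not the training code used for this adapter; the function name `dpo_loss_and_grad` and the log-probability values are made up for the example.

```python
import math


def dpo_loss_and_grad(policy_chosen_logp, policy_rejected_logp,
                      ref_chosen_logp, ref_rejected_logp, beta):
    """Standard DPO loss for one pair, plus its derivative w.r.t. the margin."""
    # Margin: how much more the policy prefers "chosen" over "rejected"
    # than the reference model does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) written as log(1 + exp(-z)).
    loss = math.log1p(math.exp(-z))
    # d(loss)/d(margin) = -beta * sigmoid(-z).
    grad = -beta / (1.0 + math.exp(z))
    return loss, grad


# At the start of training the policy equals the reference, so the margin is 0.
# The log-probabilities below are illustrative values only.
loss_low, grad_low = dpo_loss_and_grad(-12.0, -15.0, -12.0, -15.0, beta=0.025)
loss_high, grad_high = dpo_loss_and_grad(-12.0, -15.0, -12.0, -15.0, beta=0.1)
print(loss_low, grad_low)
print(loss_high, grad_high)
```

At zero margin the gradient magnitude is Beta / 2, so Beta = 0.025 yields a quarter of the push that Beta = 0.1 would, and the 5e-7 learning rate shrinks the resulting weight update further.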

Usage

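from unsloth import FastLanguageModel
import torch

# Base model and the v11 LoRA adapter repository on the Hugging Face Hub
base = "Qwen/Qwen3-4B-Instruct-2507"
adapter = "satoyutaka/LLM2025_main_0_DPO11"

# Load the base model in 4-bit to reduce GPU memory use
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = base,
    max_seq_length = 2048,
    load_in_4bit = True,
)
# Attach the LoRA adapter weights on top of the base model
model.load_adapter(adapter)
# Switch Unsloth to its inference mode before generating
FastLanguageModel.for_inference(model)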

Sources & License (IMPORTANT)

Training Data: Based on u-10bei/dpo-dataset-qwen-cot
License: Follows the original base model and dataset terms.

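<English translation of the Japanese version>

qwen3-4b-struct-evaluation-dpo-v11

This repository is a LoRA adapter for Qwen/Qwen3-4B-Instruct-2507. It is the "Precision Sniping" (V11) model of the struct-eval-comp project, and aims to complete only the formatting correction while physically protecting the high reasoning intelligence established in V7/V9.

Training Objective: Protecting Intelligence by "Effectively Freezing" the Attention Layers

The defining feature of V11 is its approach: effectively fix the logic of the Attention layers, which form the core of reasoning, and fine-tune only the MLP layers, which govern output style.

Analysis through V10 showed that applying strong DPO pressure to all layers (including q, k, v, o, etc.) damages the structuring ability itself (the V8 case). V11 combines an ultra-low learning rate (5e-7), a careful choice of target modules, and an optimized Beta value to neutralize only the Markdown bias, without harming the "0.77-level intelligence" accumulated in the Attention layers.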

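Training Configuration

Base model: Qwen/Qwen3-4B-Instruct-2507
Starting Adapter: satoyutaka/qwen3-4b-struct-evaluation-dpo-v10
Method: DPO (Direct Preference Optimization) using Unsloth
Learning rate: 5e-7 (an extremely low setting to fix and protect the intelligence of the Attention layers)
Beta: 0.025 (the "final golden ratio" that corrects only the format without breaking intelligence)
Target Modules: Aimed primarily at the MLP layers (gate, up, down), while slightly perturbing the whole model to correct the stylistic drift.
Epochs: 1 (a single run, to avoid overwriting the existing intelligence too much)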

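Data Preparation Process

Thorough Structure Preservation: Uses high-logic samples selected by BERT scoring.
Low-Energy Correction: The low learning rate gives the model no "energy to rewrite its brain" and encourages only the "minimal choice" of shedding the backticks.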

Usage

(See the Usage code earlier in this document.)

Sources and License

Training Data: Based on u-10bei/dpo-dataset-qwen-cot, with custom processing
License: Follows the terms of the original model and dataset.
