Qwen3-4B-DAPO-DoRA-StructEval-v1

This model implements DAPO (Direct Alignment from Preference Optimization), an RLVR (Reinforcement Learning from Verifiable Rewards) approach, combined with DoRA (Weight-Decomposed Low-Rank Adaptation).

🎯 Key Innovation: DAPO + DoRA

What is DAPO?

DAPO extends DPO by incorporating verifiable rewards during training:

  • Traditional DPO: Learns from preference pairs (chosen vs. rejected)
  • DAPO: Adds automated verification of structured outputs (JSON/XML/YAML validity)
  • Verification Weight: 30% of the loss signal comes from format validation

Why DoRA for DAPO?

DoRA's weight decomposition (magnitude + direction) is ideal for DAPO because:

  • Stable learning with stronger reward signals
  • Better convergence with verification-augmented loss
  • A relatively low rank (r=32) can match or exceed the quality of standard LoRA at higher ranks

📊 Training Pipeline

Stage 1: SFT + DoRA

  • Data: 70% v5 (high-quality) + 30% Hard-Mix (complex reasoning)
  • Method: DoRA (r=32, alpha=64)
  • Focus: Learn structured output generation with CoT masking

Stage 2: DAPO + DoRA (This Model)

  • Data: DPO preference dataset with CoT reasoning
  • Method: DAPO with 30% verification reward
  • Focus: Align outputs to preferred structures + validate syntax

🔧 Training Configuration

DAPO Settings:

  • Learning rate: 2e-05 (optimized for DoRA stability)
  • Beta: 0.15 (preference strength)
  • Verification weight: 0.3 (30% validation reward)
  • Max sequence length: 1536

DoRA Settings:

  • Rank: 32 (optimal for DoRA)
  • Alpha: 64 (r * 2 ratio)
  • Dropout: 0 (DoRA recommendation)
  • Target modules: All attention + MLP layers
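
The settings above map naturally onto a Hugging Face PEFT `LoraConfig` with `use_dora=True`. The exact training script is not published, so treat this as a sketch; the target-module names assume Qwen's standard projection layer naming:

```python
from peft import LoraConfig

# Hypothetical PEFT config mirroring the DoRA settings above (a sketch,
# not the actual training script).
dora_config = LoraConfig(
    r=32,                    # rank
    lora_alpha=64,           # alpha = r * 2
    lora_dropout=0.0,        # DoRA recommendation: no dropout
    use_dora=True,           # enable weight-decomposed adaptation
    target_modules=[         # all attention + MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```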

Optimization:

  • Epochs: 1
  • Batch size: 2 × 4 accumulation = 8 effective
  • Weight decay: 0.005 (light for DoRA)
  • Warmup ratio: 0.15 (DoRA stability)
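
Assuming the preference stage was run with TRL, the hyperparameters above would look roughly like the following `DPOConfig`. This is an assumption about tooling, not a published script, and the 30% verification reward is not a stock TRL feature; it would require a custom loss (see the Loss Function section) layered on top of `DPOTrainer`:

```python
from trl import DPOConfig

# Hypothetical TRL config matching the hyperparameters listed above.
training_args = DPOConfig(
    output_dir="outputs",
    learning_rate=2e-5,
    beta=0.15,                        # preference strength
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,    # 2 x 4 = 8 effective batch size
    weight_decay=0.005,
    warmup_ratio=0.15,
    max_length=1536,
)
```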

🚀 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Shion1124/dapo-dora-qwen-struct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "Convert this to JSON: Name: Alice, Age: 30, City: Tokyo"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,      # return a dict so it can be unpacked into generate()
    return_tensors="pt",
).to(model.device)         # follow the device chosen by device_map

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,        # temperature only takes effect with sampling enabled
    temperature=0.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

📈 Expected Performance

Compared to base DPO:

  • Format Accuracy: +5-10% (from verification rewards)
  • Reasoning Quality: +3-7% (from DoRA stability)
  • Overall Score: 0.85-0.92 on StructEval-T

📚 Training Data

  1. SFT Stage:
     • u-10bei/structured_data_with_cot_dataset_512_v5
     • daichira/structured-hard-sft-4k
  2. DAPO Stage:
     • u-10bei/dpo-dataset-qwen-cot (preference pairs)

⚖️ License

  • Model: Apache 2.0
  • Dataset: MIT License (see original datasets)
  • Users must comply with base model and dataset terms

🔬 Technical Details

Verifiable Rewards:

  • JSON validation: json.loads() success = 1.0 reward
  • XML validation: ElementTree.fromstring() success = 1.0 reward
  • YAML validation: yaml.safe_load() success = 1.0 reward
  • Partial credit: 0.3 for attempted format with errors
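
The validation rules above can be sketched as a single reward function. The actual reward code is not published; the "attempted format" heuristic used here for partial credit is an assumption:

```python
import json
import xml.etree.ElementTree as ET

try:
    import yaml  # PyYAML; optional third-party dependency
except ImportError:
    yaml = None

# Minimal sketch of the verifiable-reward rules described above.
def format_reward(text: str, fmt: str) -> float:
    parsers = {"json": json.loads, "xml": ET.fromstring}
    if yaml is not None:
        parsers["yaml"] = yaml.safe_load
    # Crude heuristic for "attempted the format": the text contains the
    # format's signature character (an assumption, not the published rule).
    attempt_markers = {"json": "{", "xml": "<", "yaml": ":"}
    try:
        parsers[fmt](text)
        return 1.0   # parses cleanly: full reward
    except Exception:
        if attempt_markers[fmt] in text:
            return 0.3   # attempted the format but invalid: partial credit
        return 0.0       # no recognizable attempt
```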

Loss Function:

DAPO_loss = (1 - α) × DPO_loss + α × Verification_penalty
where α = 0.3 (verification weight)
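
In code, the blend above is a straightforward weighted sum; the verification penalty could plausibly be derived as 1 minus the format reward of the chosen completion (an assumption, since the training code is not published):

```python
def dapo_loss(dpo_loss: float, verification_penalty: float,
              alpha: float = 0.3) -> float:
    """Blend the standard DPO loss with the format-verification penalty.

    alpha is the verification weight (0.3 in this model's training).
    """
    return (1 - alpha) * dpo_loss + alpha * verification_penalty
```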

Built with ❤️ using Unsloth + DAPO + DoRA
