# Qwen3-4B-DAPO-DoRA-StructEval-v1

This model is trained with DAPO (Direct Alignment from Preference Optimization), an RLVR (Reinforcement Learning from Verifiable Rewards) approach, combined with DoRA (Weight-Decomposed Low-Rank Adaptation).
## 🎯 Key Innovation: DAPO + DoRA

### What is DAPO?
DAPO extends DPO by incorporating verifiable rewards during training:
- Traditional DPO: Learns from preference pairs (chosen vs. rejected)
- DAPO: Adds automated verification of structured outputs (JSON/XML/YAML validity)
- Verification Weight: 30% of the loss signal comes from format validation
### Why DoRA for DAPO?
DoRA's weight decomposition (magnitude + direction) is ideal for DAPO because:
- Stable learning with stronger reward signals
- Better convergence with verification-augmented loss
- Lower rank (r=32) achieves higher quality than standard LoRA
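The magnitude/direction split behind DoRA can be sketched in a few lines of NumPy. This is a toy illustration of the decomposition, not the actual PEFT implementation; the shapes and the unit-magnitude initialization are illustrative assumptions:

```python
import numpy as np

def dora_update(W0, B, A, m):
    """Sketch of DoRA's weight-decomposed update.

    The merged weight V = W0 + B @ A is split into a direction
    (column-normalized V) and a per-column magnitude vector m,
    so magnitude and direction can be learned separately.
    """
    V = W0 + B @ A                    # low-rank update, as in LoRA
    norms = np.linalg.norm(V, axis=0) # per-column L2 norms
    return m * (V / norms)            # rescale direction by learned magnitude

# Toy shapes: d_out = 4, d_in = 3, rank r = 2
rng = np.random.default_rng(0)
W0 = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 2))
A = rng.standard_normal((2, 3))
m = np.ones(3)  # magnitudes set to 1 here purely for the sketch

W = dora_update(W0, B, A, m)
# With m = 1, every column of W has unit L2 norm by construction.
```

Decoupling the column norms from the direction is what lets DoRA absorb stronger gradient signals (such as DAPO's verification-augmented loss) without destabilizing the direction of the update.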
## 📊 Training Pipeline

### Stage 1: SFT + DoRA
- Data: 70% v5 (high-quality) + 30% Hard-Mix (complex reasoning)
- Method: DoRA (r=32, alpha=64)
- Focus: Learn structured output generation with CoT masking
### Stage 2: DAPO + DoRA (This Model)
- Data: DPO preference dataset with CoT reasoning
- Method: DAPO with 30% verification reward
- Focus: Align outputs to preferred structures + validate syntax
## 🔧 Training Configuration

**DAPO Settings:**
- Learning rate: 2e-05 (optimized for DoRA stability)
- Beta: 0.15 (preference strength)
- Verification weight: 0.3 (30% validation reward)
- Max sequence length: 1536
**DoRA Settings:**
- Rank: 32 (optimal for DoRA)
- Alpha: 64 (r * 2 ratio)
- Dropout: 0 (DoRA recommendation)
- Target modules: All attention + MLP layers
**Optimization:**
- Epochs: 1
- Batch size: 2 per device × 4 gradient-accumulation steps = 8 effective
- Weight decay: 0.005 (light for DoRA)
- Warmup ratio: 0.15 (DoRA stability)
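The DoRA settings above map onto a Hugging Face PEFT adapter config, which enables DoRA via `use_dora=True`. This is a hedged reconstruction from the listed hyperparameters, not the exact config used in training; the target-module names assume Qwen3's standard projection layer names:

```python
from peft import LoraConfig

# Hypothetical reconstruction of the adapter config from the settings above.
peft_config = LoraConfig(
    r=32,             # rank
    lora_alpha=64,    # alpha = 2 * r
    lora_dropout=0.0, # DoRA recommendation
    use_dora=True,    # weight-decomposed variant
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
)
```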
## 🚀 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Shion1124/dapo-dora-qwen-struct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "Convert this to JSON: Name: Alice, Age: 30, City: Tokyo"

# apply_chat_template with return_tensors="pt" returns a tensor of input ids,
# so pass it positionally rather than unpacking it with **.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 📈 Expected Performance
Compared to base DPO:
- Format Accuracy: +5-10% (from verification rewards)
- Reasoning Quality: +3-7% (from DoRA stability)
- Overall Score: 0.85-0.92 on StructEval-T
## 📚 Training Data

**SFT Stage:**
- u-10bei/structured_data_with_cot_dataset_512_v5
- daichira/structured-hard-sft-4k
**DAPO Stage:**
- u-10bei/dpo-dataset-qwen-cot (preference pairs)
## ⚖️ License
- Model: Apache 2.0
- Dataset: MIT License (see original datasets)
- Users must comply with base model and dataset terms
## 🔬 Technical Details

**Verifiable Rewards:**
- JSON validation: `json.loads()` success = 1.0 reward
- XML validation: `ElementTree.fromstring()` success = 1.0 reward
- YAML validation: `yaml.safe_load()` success = 1.0 reward
- Partial credit: 0.3 for attempted format with errors
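The reward scheme above can be sketched as a single parser-dispatch function. This is a minimal illustration, not the training code; in particular, it simplifies "attempted format with errors" to any parse failure, and the function name is hypothetical:

```python
import json
import xml.etree.ElementTree as ET

def verification_reward(text: str, fmt: str) -> float:
    """Reward sketch matching the scheme above: 1.0 if the output
    parses in the requested format, 0.3 partial credit otherwise."""
    if fmt == "json":
        parser = json.loads
    elif fmt == "xml":
        parser = ET.fromstring
    elif fmt == "yaml":
        import yaml  # PyYAML, imported lazily so it is only needed for YAML
        parser = yaml.safe_load
    else:
        raise ValueError(f"unknown format: {fmt}")
    try:
        parser(text)
        return 1.0
    except Exception:
        return 0.3  # attempted format with errors

print(verification_reward('{"name": "Alice"}', "json"))  # → 1.0
print(verification_reward('{"name": Alice}', "json"))    # → 0.3 (unquoted value)
```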
**Loss Function:**

DAPO_loss = (1 - α) × DPO_loss + α × Verification_penalty

where α = 0.3 (verification weight)
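The blend can be written out numerically. One assumption in this sketch: the penalty is taken as `1 - reward`, so a fully valid output (reward 1.0) contributes zero penalty, which is consistent with the reward table above but not stated explicitly:

```python
def dapo_loss(dpo_loss: float, verification_reward: float, alpha: float = 0.3) -> float:
    """Combine the preference loss with the verification signal, per the
    formula above. Assumes Verification_penalty = 1 - reward (a sketch)."""
    verification_penalty = 1.0 - verification_reward
    return (1 - alpha) * dpo_loss + alpha * verification_penalty

# Valid output (reward 1.0): loss is just the scaled DPO term, 0.7 * 0.5
loss_valid = dapo_loss(0.5, 1.0)
# Partial credit (reward 0.3): adds an alpha * 0.7 = 0.21 penalty on top
loss_partial = dapo_loss(0.5, 0.3)
```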
Built with ❤️ using Unsloth + DAPO + DoRA
**Base model:** Qwen/Qwen3-4B-Instruct-2507 (fine-tuned from unsloth/Qwen3-4B-Instruct-2507)