Qwen3-4B-DrGRPO-DoRA-StructEval-v1
This model implements Dr.GRPO (Direct Reward Group Relative Policy Optimization), an RLVR (reinforcement learning with verifiable rewards) technique, combined with DoRA (Weight-Decomposed Low-Rank Adaptation).
🎯 Key Innovation: Dr.GRPO + DoRA
What is Dr.GRPO?
Dr.GRPO extends standard GRPO with direct verifiable rewards:
Traditional GRPO:
- Generates multiple candidates per prompt (group)
- Uses reward model to score candidates
- Optimizes relative to group baseline
Dr.GRPO (Our Method):
- Generates 2 candidates per prompt (group sampling)
- Direct verification: JSON/XML/YAML validity checking
- Group relative optimization: Rewards relative to group mean
- No reward model needed: Uses automated format validation
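To make the "direct verification" bullet concrete, here is a minimal validity checker in the spirit of the method; the function name and format tags are illustrative, not taken from the training code:

```python
import json
import xml.etree.ElementTree as ET

import yaml  # PyYAML


def is_valid_structured_output(text: str, fmt: str) -> bool:
    """Return True if `text` parses as the requested format."""
    try:
        if fmt == "json":
            json.loads(text)
        elif fmt == "xml":
            ET.fromstring(text)
        elif fmt == "yaml":
            # Note: nearly any plain string parses as YAML, so a real
            # checker would likely also require a mapping or sequence.
            yaml.safe_load(text)
        else:
            return False
        return True
    except Exception:
        return False
```

Because validity is checked by a parser rather than scored by a learned model, the reward signal is exact and cheap to compute for every candidate in a group.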
Why DoRA for Dr.GRPO?
DoRA's weight decomposition is well suited to Dr.GRPO because:
- Stable with diverse sampling: DoRA's separate magnitude and direction updates keep training stable under group generation
- Efficient exploration: a low rank (r=32) supports sampling multiple candidates without overfitting
- Fast convergence: with DoRA and group-relative rewards, a single epoch is sufficient
📊 Training Pipeline
Stage 1: SFT + DoRA
- Data: 70% v5 (high-quality) + 30% Hard-Mix (complex)
- Method: DoRA (r=32, alpha=64)
- Output: Base structured output capability
Stage 2: Dr.GRPO + DoRA (This Model)
- Data: prompts from a DPO preference dataset (only the prompts are used for group generation; the preference pairs are discarded)
- Method: Group sampling (2 candidates/prompt) + direct verification
- Output: Optimized for valid structured outputs
🔧 Training Configuration
Dr.GRPO Settings:
- Learning rate: 3e-05 (DoRA-optimized)
- Group size: 2 samples/prompt
- Temperature: 0.7 (enables diversity)
- KL coefficient: 0.05
- Reward baseline: mean
- Reward clipping: ±5.0
DoRA Settings:
- Rank: 32 (optimal for DoRA)
- Alpha: 64 (r * 2 ratio)
- Dropout: 0 (DoRA standard)
- Target modules: All attention + MLP
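For reference, the settings above correspond roughly to the following peft configuration (a sketch: training used Unsloth, whose API differs, and the Qwen3 target-module names are an assumption):

```python
from peft import LoraConfig

# DoRA is enabled in peft by setting `use_dora=True` on a standard LoRA config.
dora_config = LoraConfig(
    r=32,               # rank
    lora_alpha=64,      # alpha = 2 * r
    lora_dropout=0.0,   # DoRA standard: no dropout
    use_dora=True,      # weight-decomposed low-rank adaptation
    target_modules=[    # all attention + MLP projections (assumed names)
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```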
Optimization:
- Epochs: 1
- Batch size: 2 per device × 4 gradient-accumulation steps (effective batch size 8)
- Weight decay: 0.005 (light regularization, as is typical with DoRA)
- Warmup steps: 20
- Max grad norm: 0.5
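Taken together, the Dr.GRPO and optimization settings map onto a TRL-style configuration roughly as follows (a sketch assuming trl.GRPOConfig; the actual run used Unsloth, and reward clipping at ±5.0 would live in the reward function rather than here):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="dr-grpo-dora-qwen-struct",
    learning_rate=3e-5,
    num_generations=2,                # group size: 2 candidates per prompt
    temperature=0.7,                  # sampling diversity within a group
    beta=0.05,                        # KL coefficient vs. the reference policy
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    weight_decay=0.005,
    warmup_steps=20,
    max_grad_norm=0.5,
)
```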
🚀 Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Shion1124/dr-grpo-dora-qwen-struct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "Convert to JSON: Name: Alice, Age: 25, City: Paris"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=512,
    do_sample=False,  # greedy decoding: deterministic for production
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
📈 Expected Performance
Compared to previous methods:
| Method | Score Range | Key Benefit |
|---|---|---|
| SFT + DoRA | 0.73-0.78 | Base learning |
| DPO + DoRA | 0.78-0.85 | Preference learning |
| DAPO + DoRA | 0.85-0.92 | Single-sample verification |
| Dr.GRPO + DoRA | 0.87-0.95 | Group-based exploration |
Dr.GRPO Advantages:
- Explores multiple solutions per prompt
- More robust to edge cases
- Better handles ambiguous instructions
🔬 Technical Details
Reward Function:
```python
reward = {
    'valid_json': 1.0,
    'valid_xml': 1.0,
    'valid_yaml': 1.0,
    'attempted_but_invalid': -0.3,
    'missing_output_marker': -0.8,
    'unknown_format': -0.5,
}
```
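A sketch of how a completion could be mapped to one of these values, reusing the validity checker from earlier; the `<output>` marker and the helper are assumptions for illustration:

```python
def compute_reward(completion: str, fmt: str) -> float:
    """Map one candidate completion to a scalar reward via direct verification."""
    if "<output>" not in completion:           # assumed output marker
        return reward['missing_output_marker']
    if fmt not in ("json", "xml", "yaml"):
        return reward['unknown_format']
    body = completion.split("<output>", 1)[1]
    if is_valid_structured_output(body, fmt):  # checker sketched above
        return reward[f'valid_{fmt}']
    return reward['attempted_but_invalid']
```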
Group Relative Loss:
For each prompt:
1. Generate K=2 candidates
2. Compute reward r_i for each candidate
3. Baseline b = mean(r_1, ..., r_K)
4. Relative rewards: Δr_i = r_i - b
5. Loss = -mean(Δr_i × log π_θ(y_i)), i.e., each candidate's log-likelihood weighted by its group-relative advantage (mean(Δr_i) alone is identically zero by construction, so the log-probability factor is essential)
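In code, steps 1–5 together with the KL term below might look like this (a schematic sketch, not the training implementation; `logprobs` and `ref_logprobs` are per-candidate sequence log-probabilities under the current policy and the frozen reference model):

```python
import torch


def dr_grpo_loss(rewards, logprobs, ref_logprobs, kl_coef=0.05):
    """Group-relative policy loss for the K candidates of one prompt."""
    rewards = torch.as_tensor(rewards, dtype=logprobs.dtype)  # shape (K,)
    baseline = rewards.mean()                      # b = mean(r_1, ..., r_K)
    advantages = rewards - baseline                # Δr_i = r_i - b
    policy_loss = -(advantages * logprobs).mean()  # advantage-weighted log-likelihood
    kl = (logprobs - ref_logprobs).mean()          # simple sample-based KL estimate
    return policy_loss + kl_coef * kl
```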
KL Constraint:
Total_loss = Policy_loss + 0.05 × KL_divergence
📚 Training Data
- SFT: v5 + Hard-Mix datasets
- Dr.GRPO: u-10bei/dpo-dataset-qwen-cot (prompts only)
⚖️ License
- Model: Apache 2.0
- Datasets: MIT License
- Usage must comply with the base model's license terms
Built with: Unsloth + Dr.GRPO + DoRA
Best for: High-accuracy structured data generation with exploration
Model tree for Shion1124/dr-grpo-dora-qwen-struct:
- Base model: Qwen/Qwen3-4B-Instruct-2507
- Finetuned from: unsloth/Qwen3-4B-Instruct-2507