Qwen3-4B-DrGRPO-DoRA-StructEval-v1

This model implements Dr.GRPO (Direct Reward Group Relative Policy Optimization), an advanced reinforcement-learning-with-verifiable-rewards (RLVR) technique, combined with DoRA (Weight-Decomposed Low-Rank Adaptation).

🎯 Key Innovation: Dr.GRPO + DoRA

What is Dr.GRPO?

Dr.GRPO extends standard GRPO with direct verifiable rewards:

Traditional GRPO:

  • Generates multiple candidates per prompt (group)
  • Uses reward model to score candidates
  • Optimizes relative to group baseline

Dr.GRPO (Our Method):

  • Generates 2 candidates per prompt (group sampling)
  • Direct verification: JSON/XML/YAML validity checking
  • Group relative optimization: Rewards relative to group mean
  • No reward model needed: Uses automated format validation

Why DoRA for Dr.GRPO?

DoRA's weight decomposition is perfect for Dr.GRPO because:

  • Stable with diverse sampling: Group generation benefits from DoRA's direction learning
  • Efficient exploration: Lower rank (r=32) enables multiple candidates without overfitting
  • Fast convergence: DoRA + group rewards = 1 epoch sufficient

📊 Training Pipeline

Stage 1: SFT + DoRA

  • Data: 70% v5 (high-quality) + 30% Hard-Mix (complex)
  • Method: DoRA (r=32, alpha=64)
  • Output: Base structured output capability

Stage 2: Dr.GRPO + DoRA (This Model)

  • Data: prompts from a DPO preference dataset (only the prompts are used, for group sampling; the preference pairs are discarded)
  • Method: Group sampling (2 candidates/prompt) + direct verification
  • Output: Optimized for valid structured outputs

🔧 Training Configuration

Dr.GRPO Settings:

  • Learning rate: 3e-05 (DoRA-optimized)
  • Group size: 2 samples/prompt
  • Temperature: 0.7 (enables diversity)
  • KL coefficient: 0.05
  • Reward baseline: mean
  • Reward clipping: ±5.0

DoRA Settings:

  • Rank: 32 (optimal for DoRA)
  • Alpha: 64 (r * 2 ratio)
  • Dropout: 0 (DoRA standard)
  • Target modules: All attention + MLP
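
As a concrete sketch, the DoRA settings above map onto PEFT's LoraConfig (use_dora=True requires peft >= 0.9; the exact target-module list is an assumption based on the Qwen3 architecture):

```python
from peft import LoraConfig

# Sketch of the DoRA adapter configuration described above.
# The target-module names are assumed from the Qwen3 architecture.
dora_config = LoraConfig(
    r=32,                 # rank
    lora_alpha=64,        # alpha = 2 * r
    lora_dropout=0.0,     # DoRA standard: no dropout
    use_dora=True,        # enable weight-decomposed adaptation
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
    task_type="CAUSAL_LM",
)
```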

Optimization:

  • Epochs: 1
  • Batch size: 2, with 4-step gradient accumulation (effective batch size 8)
  • Weight decay: 0.005 (DoRA-light)
  • Warmup steps: 20
  • Max grad norm: 0.5
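
If the run were reproduced with TRL's GRPOTrainer (one plausible backend; the card credits Unsloth, which builds on TRL), the settings above would map roughly as follows. The parameter names are TRL's and the mapping is an assumption; the ±5.0 reward clipping has no direct GRPOConfig field and is omitted.

```python
from trl import GRPOConfig

# Rough, assumed mapping of the training settings above onto TRL's GRPOConfig.
training_args = GRPOConfig(
    learning_rate=3e-5,
    num_generations=2,              # group size: 2 candidates per prompt
    temperature=0.7,                # sampling diversity during rollouts
    beta=0.05,                      # KL coefficient
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size 8
    num_train_epochs=1,
    weight_decay=0.005,
    warmup_steps=20,
    max_grad_norm=0.5,
)
```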

🚀 Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Shion1124/dr-grpo-dora-qwen-struct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

prompt = "Convert to JSON: Name: Alice, Age: 25, City: Paris"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=512,
    do_sample=False,  # greedy decoding: deterministic for production
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
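
In practice the decoded text may wrap the JSON in surrounding prose, so a small extraction step helps before parsing. The helper below is a hypothetical utility, not part of the model or its training code, and it does not handle braces inside JSON string values (an acceptable limitation for a sketch):

```python
import json

def extract_first_json(text: str):
    """Hypothetical helper: parse the first balanced {...} block in `text`."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # end of the first top-level object
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None

# Example on a literal string shaped like a model reply:
data = extract_first_json('Sure! {"Name": "Alice", "Age": 25, "City": "Paris"}')
```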

📈 Expected Performance

Compared to previous methods:

Method          | Score Range | Key Benefit
SFT + DoRA      | 0.73-0.78   | Base learning
DPO + DoRA      | 0.78-0.85   | Preference learning
DAPO + DoRA     | 0.85-0.92   | Single-sample verification
Dr.GRPO + DoRA  | 0.87-0.95   | Group-based exploration

Dr.GRPO Advantages:

  • Explores multiple solutions per prompt
  • More robust to edge cases
  • Better handles ambiguous instructions

🔬 Technical Details

Reward Function:

reward = {
    'valid_json': 1.0,
    'valid_xml': 1.0,
    'valid_yaml': 1.0,
    'attempted_but_invalid': -0.3,
    'missing_output_marker': -0.8,
    'unknown_format': -0.5,
}
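
A minimal sketch of the direct-verification idea, assuming the reward is keyed purely on parseability. The detection logic is invented for illustration; the YAML branch would need PyYAML (yaml.safe_load) and is omitted to stay dependency-free:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sketch of the direct-verification reward described above.
REWARDS = {
    "valid": 1.0,
    "attempted_but_invalid": -0.3,
    "missing_output_marker": -0.8,
    "unknown_format": -0.5,
}

def verify_reward(text: str, expected_format: str) -> float:
    """Return a scalar reward based purely on format validity."""
    text = text.strip()
    if not text:
        return REWARDS["missing_output_marker"]
    if expected_format == "json":
        try:
            json.loads(text)
            return REWARDS["valid"]
        except json.JSONDecodeError:
            return REWARDS["attempted_but_invalid"]
    if expected_format == "xml":
        try:
            ET.fromstring(text)
            return REWARDS["valid"]
        except ET.ParseError:
            return REWARDS["attempted_but_invalid"]
    return REWARDS["unknown_format"]
```

Because validity is checked mechanically, no learned reward model is needed, which is the point of the "direct verification" step.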

Group Relative Loss:

For each prompt:
1. Generate K=2 candidates
2. Compute reward r_i for each candidate
3. Baseline b = mean(r_1, ..., r_K)
4. Relative rewards: Δr_i = r_i - b
5. Loss = -mean(Δr_i × log π_θ(candidate_i))  (policy-gradient weighting; the centered rewards Δr_i alone always average to zero)
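
The five steps above, traced with made-up reward values for a K=2 group:

```python
# Hypothetical rewards for one group: candidate 1 valid, candidate 2 invalid.
rewards = [1.0, -0.3]                         # step 2: r_i
baseline = sum(rewards) / len(rewards)        # step 3: b = 0.35
advantages = [r - baseline for r in rewards]  # step 4: dr_i ~ [0.65, -0.65]
```

With K=2 the advantages are symmetric around zero, so the valid candidate is reinforced exactly as strongly as the invalid one is suppressed; in the actual loss each Δr_i weights the log-probability of its candidate.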

KL Constraint:

Total_loss = Policy_loss + 0.05 × KL_divergence

📚 Training Data

  1. SFT: v5 + Hard-Mix datasets
  2. Dr.GRPO: u-10bei/dpo-dataset-qwen-cot (prompts only)

⚖️ License

  • Model: Apache 2.0
  • Datasets: MIT License
  • Base model: comply with the Qwen license terms

Built with: Unsloth + Dr.GRPO + DoRA
Best for: High-accuracy structured data generation with exploration
