Qwen2.5-Math-7B — Baseline GRPO (EXP-011)

LoRA adapter trained with standard Group Relative Policy Optimization (GRPO) on GSM8K. This is the baseline model (no causal-aware reward) for comparison with the causal-aware variant (EXP-010).
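
GRPO estimates advantages without a value network: for each prompt it samples a group of completions and standardizes each completion's reward against its own group. A minimal sketch of that normalization (function name and epsilon are illustrative, not taken from this training run):

import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: [num_prompts, group_size] -- one row of sampled
    # completions per prompt. The advantage of each completion is
    # its reward standardized within its group (mean 0, std 1).
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-4)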

Key Results

Metric        Causal (EXP-010)   Baseline (this model)
Mean reward   0.394              0.255
Peak reward   0.560              0.400

Training Details

  • Base model: Qwen/Qwen2.5-Math-7B-Instruct
  • Method: GRPO with outcome + format reward only (see the reward sketch after this list)
  • Dataset: GSM8K (train split)
  • Steps: 500
  • Hardware: NVIDIA A40 (48GB)
  • Precision: bf16 compute with 4-bit QLoRA quantization
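
The baseline reward is the sum of an outcome check (does the final answer match the reference?) and a format check (does the completion end with a parseable answer marker?). A minimal sketch, assuming the GSM8K "#### answer" convention; the weights, regex, and helper name are hypothetical:

import re

def baseline_reward(completion: str, gold_answer: str) -> float:
    # Format reward: the completion ends with "#### <number>".
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)\s*$", completion.strip())
    format_ok = match is not None
    # Outcome reward: the extracted number equals the reference answer.
    outcome_ok = format_ok and match.group(1).replace(",", "") == gold_answer
    return 1.0 * float(outcome_ok) + 0.1 * float(format_ok)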

Usage

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16 (the precision used during training).
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Math-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
# Attach the LoRA adapter trained with baseline GRPO.
model = PeftModel.from_pretrained(base, "resonancetech/qwen2.5-math-7b-baseline-grpo")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B-Instruct")
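
A quick inference check (the prompt is illustrative):

messages = [{"role": "user", "content": "What is 15% of 240?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))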

Citation

Paper in preparation for NeurIPS 2026.

License

Apache 2.0 (following the base model's license)
