# Qwen2.5-Math-7B — Baseline GRPO (EXP-011)
A LoRA adapter for Qwen2.5-Math-7B-Instruct, trained with standard Group Relative Policy Optimization (GRPO) on GSM8K. This is the baseline model (no causal-aware reward term) for comparison with the causal-aware variant (EXP-010).
## Key Results
| Metric | Causal (EXP-010) | Baseline (this model) |
|---|---|---|
| Mean reward | 0.394 | 0.255 |
| Peak reward | 0.560 | 0.400 |
## Training Details
- Base model: Qwen/Qwen2.5-Math-7B-Instruct
- Method: GRPO (outcome + format reward only; see the sketch after this list)
- Dataset: GSM8K (train split)
- Steps: 500
- Hardware: NVIDIA A40 (48GB)
- Precision: bf16, 4-bit QLoRA
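The exact training script is not published here. Below is a minimal sketch of what "outcome + format reward only" could look like with TRL's `GRPOTrainer`; the reward definitions, answer parsing, hyperparameters, and LoRA settings are assumptions, not the actual configuration, and the 4-bit QLoRA quantization setup is omitted for brevity.

```python
# Hypothetical reconstruction of the baseline reward setup; all reward
# definitions, hyperparameters, and LoRA settings below are assumptions.
import re

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def outcome_reward(completions, answer, **kwargs):
    """1.0 if the last number in a completion equals the GSM8K reference answer."""
    rewards = []
    for completion, ref in zip(completions, answer):
        nums = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
        try:
            rewards.append(1.0 if nums and float(nums[-1]) == float(ref) else 0.0)
        except ValueError:
            rewards.append(0.0)
    return rewards

def format_reward(completions, **kwargs):
    """Small bonus for ending with a '#### <number>' answer line (assumed format)."""
    return [0.5 if re.search(r"####\s*-?\d+", c) else 0.0 for c in completions]

# GSM8K stores the reference answer after the final "####" marker.
def to_prompt(example):
    return {"prompt": example["question"],
            "answer": example["answer"].split("####")[-1].strip().replace(",", "")}

dataset = load_dataset("openai/gsm8k", "main", split="train").map(to_prompt)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-Math-7B-Instruct",
    reward_funcs=[outcome_reward, format_reward],
    args=GRPOConfig(output_dir="grpo-baseline", max_steps=500, bf16=True),
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```

The causal-aware variant (EXP-010) would add a further reward term on top of these two; this baseline intentionally stops at outcome and format.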
## Usage

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the GRPO-trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Math-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "resonancetech/qwen2.5-math-7b-baseline-grpo")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B-Instruct")
```
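A quick inference check (the problem text, chat-template usage, and generation length are illustrative):

```python
problem = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?"
)
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": problem}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```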
## Citation
Paper in preparation for NeurIPS 2026.
## License

Apache 2.0 (following the base model's license).