# Qwen2.5-Math-7B — Causal-Aware GRPO (EXP-010)
LoRA adapter trained with causal-aware Group Relative Policy Optimization (GRPO) on GSM8K.
## Key Results
| Metric | Causal (this model) | Baseline |
|---|---|---|
| Mean reward | 0.394 | 0.255 |
| Peak reward | 0.560 | 0.400 |
Causal-aware reward augments outcome reward with a lightweight proxy for causal effectiveness, encouraging reasoning chains whose content demonstrably drives correct answers.
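As a rough sketch of how such an augmented reward could be combined (the exact combination rule is not specified on this card; an additive blend with the stated λ=0.6 is assumed here):

```python
def causal_aware_reward(outcome_reward: float, causal_proxy: float, lam: float = 0.6) -> float:
    """Blend the outcome (correctness) reward with a lightweight causal-effectiveness
    proxy. The additive form below is a hypothetical illustration; the card only
    states that the outcome reward is augmented with the proxy at lambda = 0.6."""
    return outcome_reward + lam * causal_proxy
```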
## Training Details
- Base model: Qwen/Qwen2.5-Math-7B-Instruct
- Method: GRPO + causal-aware reward (λ=0.6)
- Dataset: GSM8K (train split)
- Steps: 500
- Hardware: NVIDIA A40 (48GB)
- Precision: bf16, 4-bit QLoRA
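For context, GRPO replaces a learned value function with group-relative advantages: each sampled completion's reward is normalized against the mean and standard deviation of its sampling group. A minimal sketch (standard GRPO normalization, not this repo's training code):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each completion's reward against
    the mean and (population) std of its own sampling group, so no
    separate value network is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]
```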
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Math-7B-Instruct",
    torch_dtype="auto",   # use the checkpoint's native precision (bf16)
    device_map="auto",    # place weights on available GPU(s)
)
model = PeftModel.from_pretrained(base, "resonancetech/qwen2.5-math-7b-causal-grpo")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B-Instruct")
```
## Citation
Paper in preparation for NeurIPS 2026.
## License
Apache 2.0 (following base model license)