Qwen2.5-Math-7B — Causal-Aware GRPO (EXP-010)

LoRA adapter trained with causal-aware Group Relative Policy Optimization (GRPO) on GSM8K.

Key Results

| Metric      | Causal (this model) | Baseline |
|-------------|---------------------|----------|
| Mean reward | 0.394               | 0.255    |
| Peak reward | 0.560               | 0.400    |

The causal-aware reward augments the standard outcome reward with a lightweight proxy for causal effectiveness, encouraging reasoning chains whose content demonstrably drives the correct answer rather than merely accompanying it.
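The reward combination can be sketched as a simple additive mixture. The exact causal proxy used in EXP-010 is not specified in this card, so both the additive form and the proxy are assumptions; only the weight λ=0.6 comes from the training details.

```python
# Hedged sketch of a causal-aware reward: the precise formulation used in
# EXP-010 is not published, so this additive form is an illustration only.

LAMBDA = 0.6  # weight on the causal proxy (from the training details)

def causal_aware_reward(outcome: float, causal_proxy: float,
                        lam: float = LAMBDA) -> float:
    """Combine an outcome reward with a causal-effectiveness proxy.

    outcome:      1.0 if the final answer is correct, else 0.0
    causal_proxy: score in [0, 1] estimating how strongly the reasoning
                  chain drives the answer (e.g. answer-flip rate when the
                  chain is ablated) -- a hypothetical proxy for illustration
    """
    return outcome + lam * causal_proxy
```

GRPO then normalizes these rewards within each sampled group to compute relative advantages, so only the ranking induced by the shaped reward matters.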

Training Details

  • Base model: Qwen/Qwen2.5-Math-7B-Instruct
  • Method: GRPO + causal-aware reward (λ=0.6)
  • Dataset: GSM8K (train split)
  • Steps: 500
  • Hardware: NVIDIA A40 (48GB)
  • Precision: bf16, 4-bit QLoRA
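The 4-bit QLoRA setup above can be sketched as follows. Only the 4-bit quantization and bf16 precision come from this card; the rank, alpha, dropout, and target modules are illustrative assumptions, not the experiment's actual hyperparameters.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit quantization with bf16 compute, matching the stated precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # assumption: NF4 is the usual QLoRA choice
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter settings; r/alpha/targets are assumptions for illustration.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

Passing `quantization_config=bnb_config` to `from_pretrained` and wrapping the model with `get_peft_model(model, lora_config)` would reproduce a setup of this shape before GRPO training.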

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the trained LoRA adapter on top.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Math-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "resonancetech/qwen2.5-math-7b-causal-grpo")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B-Instruct")

Citation

Paper in preparation for NeurIPS 2026.

License

Apache 2.0 (following base model license)

