Paper: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (arXiv:2402.03300)
A reasoning-enhanced version of Qwen3.5-0.8B, trained with GRPO (Group Relative Policy Optimization), the RL technique behind DeepSeek-R1, on a single RTX 5090 at Zosma AI.
Also available at: celestialcreator/Qwen3.5-0.8B-GRPO-Math
| Eval Setting | GSM8K Accuracy | Notes |
|---|---|---|
| Baseline 8-shot CoT | 53.5% | Pre-trained, no fine-tuning |
| Baseline zero-shot | 52.1% | Pre-trained, no fine-tuning |
| GRPO zero-shot | 58.0% (+5.9pp) | Best result — model reasons autonomously |
| GRPO 8-shot (plain format) | 50.4% (-3.1pp) | Few-shot examples conflict with learned policy |
| GRPO 8-shot (`<think>` aligned) | 34.1% (-19.4pp) | Format-aligned examples hurt even more |
GRPO training shifted the model from demonstration-based reasoning to policy-based reasoning.
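For context, the core idea of GRPO is to sample a group of completions per prompt and normalize each completion's reward against the group's mean and standard deviation, with no value network. A minimal sketch of that advantage computation (illustrative only, with made-up reward values; this is not the actual training code):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each completion's reward
    by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # All rewards equal: no learning signal from this group
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: 4 completions sampled for one prompt; two got full reward.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.2])
```

Completions scored above the group mean get positive advantages (their tokens are reinforced), those below get negative ones, which is what pushes the model toward its own reasoning policy.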
After training, the model:
- Reasons autonomously inside `<think>` tags without needing demonstrations; zero-shot is its best setting.
- Is disrupted by few-shot prompts: `<think>` tags in examples caused the model to confuse context with its own generation, dropping accuracy to 34.1%.

This mirrors what DeepSeek-R1 demonstrated at 670B scale.
Training notes:
- The model was taught the `<think>` tag format before RL exploration.
- Format rewards shaped the output structure (credit for `<think>` tags, 0.2 for the `#### answer` format).

Usage:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "zosmaai/Qwen3.5-0.8B-GRPO-Math"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Best used zero-shot: the model has its own reasoning policy
messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks step by step. Show your reasoning inside <think> tags before giving your final answer. End math answers with: #### <number>"},
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it go?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
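Because the system prompt asks for `<think> ... </think>` reasoning followed by a `#### <number>` answer, the generated text can be split with a small parser. A sketch (the regexes simply mirror the conventions in the prompt above):

```python
import re

def parse_output(text):
    """Split a completion into its <think> reasoning and the final
    '#### <number>' answer; returns (reasoning, answer), with None
    for any part that is missing."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"####\s*(-?[\d,]*\.?\d+)", text)
    reasoning = think.group(1).strip() if think else None
    final = answer.group(1).replace(",", "") if answer else None
    return reasoning, final

reasoning, final = parse_output(
    "<think>60 mph * 2.5 h = 150 miles</think>\nThe train travels 150 miles.\n#### 150"
)
```

Stripping commas from the matched number keeps answers like `1,000` comparable to GSM8K gold labels.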
Note: This model performs best in zero-shot mode. Avoid few-shot examples; they conflict with the model's learned reasoning policy and reduce accuracy.
Full pipeline: github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning