Paper: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (arXiv:2402.03300)
A reasoning-enhanced version of Qwen3.5-0.8B, trained with GRPO (Group Relative Policy Optimization), the RL technique behind DeepSeek-R1, on a single RTX 5090 at Zosma AI.
Also available at: celestialcreator/Qwen3.5-0.8B-GRPO-Math
| Eval Setting | GSM8K Accuracy | Notes |
|---|---|---|
| Baseline 8-shot CoT | 53.5% | Pre-trained, no fine-tuning |
| Baseline zero-shot | 52.1% | Pre-trained, no fine-tuning |
| GRPO zero-shot | 58.0% (+5.9pp) | Best result — model reasons autonomously |
| GRPO 8-shot (plain format) | 50.4% (-3.1pp) | Few-shot examples conflict with learned policy |
| GRPO 8-shot (`<think>` aligned) | 34.1% (-19.4pp) | Format-aligned examples hurt even more |
GRPO training shifted the model from demonstration-based reasoning to policy-based reasoning.
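For context, the core idea of GRPO is to sample a group of completions per prompt and normalize each completion's reward against the group's mean and standard deviation, with no value network. A minimal sketch of that advantage computation (illustrative only, with made-up reward values; this is not the actual training code):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each completion's reward
    by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # All rewards equal: no learning signal from this group
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: 4 completions sampled for one prompt; two got full reward.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.2])
```

Completions scored above the group mean get positive advantages (their tokens are reinforced), those below get negative ones, which is what pushes the model toward its own reasoning policy.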
After training, the model:
- Reasons autonomously inside `<think>` tags without needing demonstrations; zero-shot is its best setting.
- Is disrupted by few-shot prompts: `<think>` tags in examples caused the model to confuse context with its own generation, dropping accuracy to 34.1%.

This mirrors what DeepSeek-R1 demonstrated at 670B scale.
Training notes:
- The model was taught the `<think>` tag format before RL exploration.
- Format rewards shaped the output structure (credit for `<think>` tags, 0.2 for the `#### answer` format).

Usage:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "zosmaai/Qwen3.5-0.8B-GRPO-Math"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Best used zero-shot: the model has its own reasoning policy
messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks step by step. Show your reasoning inside <think> tags before giving your final answer. End math answers with: #### <number>"},
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it go?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
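Because the system prompt asks for `<think> ... </think>` reasoning followed by a `#### <number>` answer, the generated text can be split with a small parser. A sketch (the regexes simply mirror the conventions in the prompt above):

```python
import re

def parse_output(text):
    """Split a completion into its <think> reasoning and the final
    '#### <number>' answer; returns (reasoning, answer), with None
    for any part that is missing."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"####\s*(-?[\d,]*\.?\d+)", text)
    reasoning = think.group(1).strip() if think else None
    final = answer.group(1).replace(",", "") if answer else None
    return reasoning, final

reasoning, final = parse_output(
    "<think>60 mph * 2.5 h = 150 miles</think>\nThe train travels 150 miles.\n#### 150"
)
```

Stripping commas from the matched number keeps answers like `1,000` comparable to GSM8K gold labels.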
Note: This model performs best in zero-shot mode. Avoid few-shot examples; they conflict with the model's learned reasoning policy and reduce accuracy.
Full pipeline: github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning