---
language:
  - en
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B
tags:
  - reasoning
  - math
  - grpo
  - reinforcement-learning
  - rlvr
  - qwen3.5
datasets:
  - gsm8k
  - celestialcreator/Qwen3.5-0.8B-GRPO-Math-Dataset
pipeline_tag: text-generation
---

Qwen3.5-0.8B-GRPO-Math

A reasoning-enhanced version of Qwen3.5-0.8B, trained using GRPO (Group Relative Policy Optimization), the RL technique behind DeepSeek-R1, on a single RTX 5090.

Results

| Eval | Setting | GSM8K Accuracy | Notes |
|---|---|---|---|
| Baseline | 8-shot CoT | 53.5% | Pre-trained, no fine-tuning |
| Baseline | zero-shot | 52.1% | Pre-trained, no fine-tuning |
| GRPO | zero-shot | 58.0% (+5.9pp) | Best result: model reasons autonomously |
| GRPO | 8-shot (plain format) | 50.4% (-3.1pp) | Few-shot examples conflict with learned policy |
| GRPO | 8-shot (<think> aligned) | 34.1% (-19.4pp) | Format-aligned examples hurt even more |

Key Finding: Demonstration-to-Policy Shift

GRPO training shifted the model from demonstration-based reasoning to policy-based reasoning.

After training, the model:

  • Performs best zero-shot: it reasons autonomously using <think> tags
  • Is hurt by few-shot examples: any demonstrations conflict with its learned internal reasoning policy
  • Is hurt even more by format-aligned few-shot: <think> tags in the examples caused the model to confuse context with its own generation, dropping accuracy to 34.1%

This is a behavioral shift, not a regression. The model no longer needs (or wants) demonstrations. This mirrors what DeepSeek-R1 demonstrated at 670B scale.

Training Pipeline

Phase 1: SFT Warmup

  • Data: 3,558 reasoning examples from 3 sources, standardized to <think> tags
  • Purpose: solve the cold-start problem by teaching the 0.8B model the <think> tag format before RL exploration
  • Stats: 1 epoch, loss 0.932, 78% token accuracy
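The standardization step above can be sketched as follows. This is a minimal illustration, not the actual preprocessing code: the field names ("question", "reasoning", "answer") are assumed, and the real pipeline handled three differently shaped source datasets.

```python
def to_think_format(example: dict) -> dict:
    """Render one reasoning example as a <think>-tagged completion.

    Wraps the reasoning trace in <think> tags and appends the final
    answer in the "#### <number>" convention used during training.
    """
    completion = (
        f"<think>\n{example['reasoning'].strip()}\n</think>\n"
        f"#### {example['answer']}"
    )
    return {"prompt": example["question"], "completion": completion}

sample = {
    "question": "What is 7 * 8?",
    "reasoning": "7 * 8 = 56.",
    "answer": "56",
}
print(to_think_format(sample)["completion"])
```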

Phase 2: GRPO Training

  • Data: GSM8K train split (7,473 math word problems)
  • Rewards: Math correctness (1.0/0.0) + format reward (0.3 for <think> tags, 0.2 for #### answer)
  • Config: 8 generations/prompt, batch size 1 x 8 grad accum, lr 1e-6, beta=0.04
  • Hardware: Single NVIDIA RTX 5090 (32GB VRAM)
  • Duration: ~77 hours, 15,900 steps (epoch 2.13, rewards had plateaued)
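The two reward signals above can be sketched as plain functions. This is an assumed shape, not the exact training code: exact-match correctness is worth 1.0, and format bonuses add 0.3 for <think> tags plus 0.2 for a "#### <answer>" line.

```python
import re

def correctness_reward(completion: str, gold: str) -> float:
    """1.0 if the #### answer matches the gold answer exactly, else 0.0."""
    match = re.search(r"####\s*(-?[\d,\.]+)", completion)
    if match and match.group(1).replace(",", "") == gold:
        return 1.0
    return 0.0

def format_reward(completion: str) -> float:
    """0.3 for a closed <think> block, 0.2 for a #### answer line."""
    reward = 0.0
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        reward += 0.3
    if re.search(r"####\s*-?[\d,\.]+", completion):
        reward += 0.2
    return reward

completion = "<think>60 * 2.5 = 150</think>\n#### 150"
print(correctness_reward(completion, "150") + format_reward(completion))  # prints 1.5
```

In TRL's GRPOTrainer, functions like these are passed as reward functions and scored per completion.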

What is GRPO?

Unlike PPO, GRPO needs no critic network, and with verifiable rewards it also needs no learned reward model. For each prompt, it:

  1. Samples G completions from the policy
  2. Scores each with a verifiable reward (exact math answer checking)
  3. Normalizes rewards within the group (relative advantage)
  4. Updates the policy using a clipped surrogate objective

Only 2 models in memory (policy + reference) instead of 4 β€” feasible on consumer GPUs.
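The advantage computation in step 3 can be sketched in a few lines (an illustrative version, using the common mean/std normalization within each group of G completions):

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one prompt's group of completions.

    Each completion's advantage is its reward relative to its siblings,
    which is what lets GRPO drop the critic network entirely.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# G = 4 completions for one prompt: two correct (reward 1.0), two not (0.0).
# The correct ones get positive advantage, the incorrect ones negative.
print(group_relative_advantages([1.0, 1.0, 0.0, 0.0]))
```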

Lessons Learned

What worked

  • GRPO improved zero-shot reasoning: +5.9pp; the model internalized step-by-step thinking
  • Demonstration-to-policy shift: the model developed its own reasoning strategy instead of relying on examples
  • Format + correctness rewards together: the <think> tag bonus helped the model learn structured reasoning alongside accuracy
  • Single consumer GPU is viable: the full pipeline ran on one RTX 5090

What we'd do differently

  • Eval after SFT: we skipped this, so we can't isolate SFT's contribution
  • Try GRPO without SFT: an ablation would show whether the SFT warmup is necessary, or whether it trades few-shot ability for format compliance
  • Larger model: 0.8B is near its capacity ceiling; successful open GRPO reproductions start at 1.5B+

Technical findings

  • Qwen3.5 DeltaNet needs FLA: install flash-linear-attention + causal-conv1d; otherwise the torch fallback is ~10x slower
  • SDPA > FLA for inference: 3.6x faster first call; use attn_implementation="sdpa"
  • Rewards plateau around epoch 1.2: diminishing returns beyond 2 epochs at this scale
  • RL-trained models are few-shot sensitive: even format-aligned examples hurt (34.1%), suggesting the model confuses example <think> tags with its own generation context

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "celestialcreator/Qwen3.5-0.8B-GRPO-Math"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Best used zero-shot: the model has its own reasoning policy
messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks step by step. Show your reasoning inside <think> tags before giving your final answer. End math answers with: #### <number>"},
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it go?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Note: This model performs best in zero-shot mode. Avoid few-shot examples; they conflict with the model's learned reasoning policy and reduce accuracy.

Training Code

Full pipeline (Dockerfile, k8s configs, scripts): github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning

Acknowledgments

  • TRL for the GRPOTrainer implementation
  • Qwen Team for the base model
  • DeepSeek for the GRPO algorithm

Citation

@misc{qwen35-grpo-math-2026,
  author = {Akshay Mhaskar},
  title = {Qwen3.5-0.8B-GRPO-Math: Teaching a Small Model to Reason with RL},
  year = {2026},
  url = {https://huggingface.co/celestialcreator/Qwen3.5-0.8B-GRPO-Math},
}