---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B
tags:
- reasoning
- math
- grpo
- reinforcement-learning
- rlvr
- qwen3.5
datasets:
- gsm8k
- celestialcreator/Qwen3.5-0.8B-GRPO-Math-Dataset
pipeline_tag: text-generation
---
# Qwen3.5-0.8B-GRPO-Math
A reasoning-enhanced version of [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B), trained using **GRPO (Group Relative Policy Optimization)**, the RL technique behind DeepSeek-R1, on a single RTX 5090.
## Results
| Eval Setting | GSM8K Accuracy | Notes |
|---|:-:|---|
| Baseline 8-shot CoT | 53.5% | Pre-trained, no fine-tuning |
| Baseline zero-shot | 52.1% | Pre-trained, no fine-tuning |
| **GRPO zero-shot** | **58.0% (+5.9pp)** | Best result: model reasons autonomously |
| GRPO 8-shot (plain format) | 50.4% (-3.1pp) | Few-shot examples conflict with learned policy |
| GRPO 8-shot (`<think>` aligned) | 34.1% (-19.4pp) | Format-aligned examples hurt even more |
### Key Finding: Demonstration-to-Policy Shift
GRPO training shifted the model from **demonstration-based reasoning** to **policy-based reasoning**.
After training, the model:
- **Performs best in zero-shot**: it reasons autonomously using `<think>` tags
- **Is hurt by few-shot examples**: any demonstrations conflict with its learned internal reasoning policy
- **Is hurt even more by format-aligned few-shot**: `<think>` tags in examples caused the model to confuse context with its own generation, dropping to 34.1%
This is a behavioral shift, not a regression. The model no longer needs (or wants) demonstrations. This mirrors what DeepSeek-R1 demonstrated at 670B scale.
## Training Pipeline
### Phase 1: SFT Warmup
- **Data:** [3,558 reasoning examples](https://huggingface.co/datasets/celestialcreator/Qwen3.5-0.8B-GRPO-Math-Dataset) from 3 sources, standardized to `<think>` tags
- **Purpose:** Solve the cold-start problem: teach the 0.8B model the `<think>` tag format before RL exploration
- **Stats:** 1 epoch, loss 0.932, 78% token accuracy
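The standardization step above can be sketched as follows. This is an illustrative helper, not the actual preprocessing code, and the field names (`prompt`, `completion`) are assumptions:

```python
# Sketch: standardizing a reasoning example to the <think> format used for
# SFT warmup. Field names and helper are illustrative, not the real pipeline.
def to_think_format(question: str, reasoning: str, answer: str) -> dict:
    """Wrap the chain of thought in <think> tags and end with '#### <answer>'."""
    completion = f"<think>\n{reasoning}\n</think>\n#### {answer}"
    return {"prompt": question, "completion": completion}

example = to_think_format(
    "What is 6 * 7?",
    "6 * 7 = 42.",
    "42",
)
```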
### Phase 2: GRPO Training
- **Data:** GSM8K train split (7,473 math word problems)
- **Rewards:** Math correctness (1.0/0.0) + format reward (0.3 for `<think>` tags, 0.2 for `####` answer)
- **Config:** 8 generations/prompt, batch size 1 with 8 gradient-accumulation steps, lr 1e-6, beta = 0.04
- **Hardware:** Single NVIDIA RTX 5090 (32GB VRAM)
- **Duration:** ~77 hours, 15,900 steps (2.13 epochs; rewards had plateaued)
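The reward scheme above can be sketched as two simple scoring functions in the shape TRL's `GRPOTrainer` expects (callables returning one float per completion). This is a minimal sketch of the described rewards, not the exact functions used in training:

```python
import re

# Sketch of the reward scheme: correctness (1.0/0.0) plus a format bonus
# (0.3 for <think> tags, 0.2 for a '####' answer line). Illustrative only.

def correctness_reward(completions, answers):
    """1.0 if the number after '####' matches the reference answer, else 0.0."""
    rewards = []
    for completion, answer in zip(completions, answers):
        match = re.search(r"####\s*(-?[\d,\.]+)", completion)
        predicted = match.group(1).replace(",", "") if match else None
        rewards.append(1.0 if predicted == answer else 0.0)
    return rewards

def format_reward(completions):
    """0.3 for well-formed <think>...</think>, plus 0.2 for a '####' answer."""
    rewards = []
    for completion in completions:
        r = 0.0
        if re.search(r"<think>.*</think>", completion, re.DOTALL):
            r += 0.3
        if "####" in completion:
            r += 0.2
        rewards.append(r)
    return rewards
```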
### What is GRPO?
GRPO eliminates the need for a separate reward model and critic network (unlike PPO). For each prompt, it:
1. Samples G completions from the policy
2. Scores each with a verifiable reward (exact math answer checking)
3. Normalizes rewards within the group (relative advantage)
4. Updates the policy using a clipped surrogate objective
Only two models are kept in memory (policy + reference) instead of four, making GRPO feasible on consumer GPUs.
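Step 3 (group-relative normalization) is the core trick: each completion's advantage is its reward standardized against the other completions for the same prompt. A minimal sketch:

```python
# Minimal sketch of group-relative advantage normalization (step 3):
# advantage = (reward - group mean) / group std. No critic network needed.
def group_relative_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A group of G=4 completions where only one earned the correctness reward:
advs = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

The correct completion gets a positive advantage and the rest get negative ones, so the policy is pushed toward the group's relatively better samples without any learned value function.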
## Lessons Learned
### What worked
- **GRPO improved zero-shot reasoning**: +5.9pp; the model internalized step-by-step thinking
- **Demonstration-to-policy shift**: the model developed its own reasoning strategy instead of relying on examples
- **Format + correctness rewards together**: the `<think>` tag bonus helped the model learn structured reasoning alongside accuracy
- **Single consumer GPU is viable**: the full pipeline ran on one RTX 5090
### What we'd do differently
- **Eval after SFT**: we skipped this, so we can't isolate SFT's contribution
- **Try GRPO without SFT**: an ablation would show whether the SFT warmup is necessary, or whether it trades few-shot ability for format compliance
- **Larger model**: 0.8B is near its capacity ceiling; successful open GRPO reproductions start at 1.5B+
### Technical findings
- **Qwen3.5 DeltaNet needs FLA**: install `flash-linear-attention` + `causal-conv1d`; otherwise the torch fallback is ~10x slower
- **SDPA > FLA for inference**: 3.6x faster on the first call; use `attn_implementation="sdpa"`
- **Rewards plateau around epoch 1.2**: diminishing returns beyond 2 epochs at this scale
- **RL-trained models are few-shot sensitive**: even format-aligned examples hurt (34.1%), suggesting the model confuses example `<think>` tags with its own generation context
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "celestialcreator/Qwen3.5-0.8B-GRPO-Math"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
# Best used in zero-shot: the model has its own reasoning policy
messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks step by step. Show your reasoning inside <think> tags before giving your final answer. End math answers with: #### <number>"},
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it go?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
> **Note:** This model performs best in **zero-shot** mode. Do not use few-shot examples; they conflict with the model's learned reasoning policy and reduce accuracy.
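Since the model ends answers with `#### <number>`, downstream code needs to parse that format. A small sketch (the helper name is illustrative):

```python
import re

# Sketch: extracting the final numeric answer from the model's trained
# output format (reasoning in <think> tags, answer after '####').
def extract_answer(output: str):
    # Drop the reasoning block, then read the number after '####'.
    visible = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL)
    match = re.search(r"####\s*(-?[\d,\.]+)", visible)
    return match.group(1).replace(",", "") if match else None

print(extract_answer("<think>60 * 2.5 = 150</think>\n#### 150"))  # prints 150
```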
## Training Code
Full pipeline (Dockerfile, k8s configs, scripts): [github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning](https://github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning)
## Acknowledgments
- [TRL](https://github.com/huggingface/trl) for the GRPOTrainer implementation
- [Qwen Team](https://github.com/QwenLM) for the base model
- [DeepSeek](https://arxiv.org/abs/2402.03300) for the GRPO algorithm
## Citation
```bibtex
@misc{qwen35-grpo-math-2026,
  author = {Akshay Mhaskar},
  title = {Qwen3.5-0.8B-GRPO-Math: Teaching a Small Model to Reason with RL},
  year = {2026},
  url = {https://huggingface.co/celestialcreator/Qwen3.5-0.8B-GRPO-Math},
}
```