---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B
tags:
- reasoning
- math
- grpo
- reinforcement-learning
- rlvr
- qwen3.5
datasets:
- gsm8k
- celestialcreator/Qwen3.5-0.8B-GRPO-Math-Dataset
pipeline_tag: text-generation
---
# Qwen3.5-0.8B-GRPO-Math
A reasoning-enhanced version of [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B), trained using **GRPO (Group Relative Policy Optimization)**, the RL technique behind DeepSeek-R1, on a single RTX 5090.
## Results
| Eval Setting | GSM8K Accuracy (change vs. same-shot baseline) | Notes |
|---|:-:|---|
| Baseline 8-shot CoT | 53.5% | Pre-trained, no fine-tuning |
| Baseline zero-shot | 52.1% | Pre-trained, no fine-tuning |
| **GRPO zero-shot** | **58.0% (+5.9pp)** | Best result; the model reasons autonomously |
| GRPO 8-shot (plain format) | 50.4% (-3.1pp) | Few-shot examples conflict with learned policy |
| GRPO 8-shot (`<think>` aligned) | 34.1% (-19.4pp) | Format-aligned examples hurt even more |
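
The percentage-point changes in the table are measured against the baseline with the same prompting style (zero-shot against zero-shot, 8-shot against 8-shot). A minimal sketch of that arithmetic, using only the accuracies listed in the table; the dictionary layout and the helper name `delta_pp` are illustrative and not part of the evaluation code:

```python
# Accuracies copied from the results table (percent).
BASELINE = {"zero-shot": 52.1, "8-shot": 53.5}

# Each GRPO setting maps to (matching baseline key, accuracy).
GRPO = {
    "zero-shot": ("zero-shot", 58.0),
    "8-shot (plain format)": ("8-shot", 50.4),
    "8-shot (<think> aligned)": ("8-shot", 34.1),
}


def delta_pp(setting: str) -> float:
    """Percentage-point change versus the baseline with the same shot count."""
    baseline_key, accuracy = GRPO[setting]
    return round(accuracy - BASELINE[baseline_key], 1)


for name in GRPO:
    print(f"{name}: {delta_pp(name):+.1f}pp")
```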
### Key Finding: Demonstration to Policy Shift
GRPO training shifted the model from **demonstration-based reasoning** to **policy-based reasoning**.
After training, the model:
- **Performs best in zero-shot:** it reasons autonomously using `<think>` tags
- **Is hurt by few-shot examples:** any demonstrations conflict with its learned internal reasoning policy
- **Is hurt even more by format-aligned few-shot:** `<think>` tags in the examples appear to make the model confuse context with its own generation, and accuracy drops to 34.1%
This is a behavioral shift, not a regression. The model no longer needs (or wants) demonstrations. This mirrors the few-shot sensitivity reported for DeepSeek-R1 at 671B scale.
## Training Pipeline
### Phase 1: SFT Warmup
- **Data:** [3,558 reasoning examples](https://huggingface.co/datasets/celestialcreator/Qwen3.5-0.8B-GRPO-Math-Dataset) from 3 sources, standardized to `<think>` tags
- **Purpose:** Solve the cold-start problem by teaching the 0.8B model the `<think>` tag format before RL exploration
- **Stats:** 1 epoch, loss 0.932, 78% token accuracy
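
One way to standardize heterogeneous reasoning data to a single `<think>` layout is sketched here. The field names (`question`, `reasoning`, `answer`), the `#### <answer>` suffix in the SFT target, and the helper `to_think_format` are all assumptions made for illustration; the actual preprocessing may differ.

```python
def to_think_format(example: dict) -> dict:
    """Map one raw example to a prompt/completion pair with <think> tags.

    Assumes hypothetical keys: 'question', 'reasoning', 'answer'.
    """
    reasoning = example["reasoning"].strip()
    answer = str(example["answer"]).strip()
    # Reasoning goes inside the tags; the final answer follows a '####' marker.
    completion = f"<think>\n{reasoning}\n</think>\n#### {answer}"
    return {"prompt": example["question"].strip(), "completion": completion}


sample = {
    "question": "Tom has 3 apples and buys 4 more. How many apples does he have?",
    "reasoning": "He starts with 3 and adds 4, so 3 + 4 = 7.",
    "answer": 7,
}
print(to_think_format(sample)["completion"])
```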
### Phase 2: GRPO Training
- **Data:** GSM8K train split (7,473 math word problems)
- **Rewards:** Math correctness (1.0/0.0) + format reward (0.3 for `<think>` tags, 0.2 for `####` answer)
- **Config:** 8 generations/prompt, batch size 1 x 8 grad accum, lr 1e-6, beta=0.04
- **Hardware:** Single NVIDIA RTX 5090 (32GB VRAM)
- **Duration:** ~77 hours, 15,900 steps (epoch 2.13, rewards had plateaued)
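
The reward scheme above can be sketched as two plain Python functions. This is an illustration under assumptions, not the project's reward code: the regular expressions, the function names, and the choice to compare answers numerically are hypothetical; only the reward values (1.0/0.0, 0.3, 0.2) come from the list above.

```python
import re

# Hypothetical patterns: a closed <think> block and a number after '####'.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_answer(text: str):
    """Return the number after the last '####' marker as a float, or None."""
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def correctness_reward(completion: str, gold: str) -> float:
    """1.0 if the final answer equals the gold answer, else 0.0.

    `gold` is assumed to be the bare number string, e.g. "72".
    """
    predicted = extract_answer(completion)
    target = float(gold.replace(",", ""))
    return 1.0 if predicted is not None and abs(predicted - target) < 1e-6 else 0.0


def format_reward(completion: str) -> float:
    """0.3 for a closed <think> block plus 0.2 for a '####' answer."""
    reward = 0.0
    if THINK_RE.search(completion):
        reward += 0.3
    if ANSWER_RE.search(completion):
        reward += 0.2
    return reward


completion = "<think>60 * 2.5 = 150</think>\n#### 150"
total = correctness_reward(completion, "150") + format_reward(completion)
print(total)
```

Keeping correctness and format as separate functions makes it easy to log each signal on its own and to see which one drives the total during training.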
### What is GRPO?
Unlike PPO, GRPO does not need a separate critic (value) network, and with verifiable rewards it does not need a learned reward model either. For each prompt, it:
1. Samples G completions from the policy
2. Scores each with a verifiable reward (exact math answer checking)
3. Normalizes rewards within the group (relative advantage)
4. Updates the policy using a clipped surrogate objective
Only 2 models sit in memory (policy + reference) instead of the 4 used in PPO-based RLHF (policy, reference, reward model, critic), which makes training feasible on consumer GPUs.
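
Steps 3 and 4 can be made concrete with a small numeric sketch in plain Python. It assumes mean/standard-deviation normalization within the group and a PPO-style clip range `eps = 0.2`; neither choice is stated in this card, the function names are illustrative, and the KL penalty toward the reference model (weighted by beta) is left out for brevity.

```python
import statistics


def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group: (r - mean) / (std + small constant)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-4) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped objective for one sample (a quantity to maximize).

    `ratio` is new-policy probability divided by old-policy probability.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Total rewards for G = 8 sampled completions of one prompt (made-up values).
rewards = [1.5, 0.5, 0.5, 1.5, 0.3, 0.0, 1.5, 0.5]
advantages = group_advantages(rewards)
print([round(a, 2) for a in advantages])

# A ratio above 1 + eps earns no extra credit when the advantage is positive.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
```

Because advantages are computed relative to the group mean, a prompt where all G completions receive the same reward contributes no learning signal in this sketch.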
## Lessons Learned
### What worked
- **GRPO improved zero-shot reasoning:** +5.9pp; the model internalized step-by-step thinking
- **Demonstration to policy shift:** the model developed its own reasoning strategy instead of relying on examples
- **Format + correctness rewards together:** the `<think>` tag bonus helped the model learn structured reasoning alongside accuracy
- **Single consumer GPU is viable:** the full pipeline ran on one RTX 5090
### What we'd do differently
- **Eval after SFT:** we skipped this, so we can't isolate SFT's contribution
- **Try GRPO without SFT:** an ablation would show whether the SFT warmup is necessary or whether it trades few-shot ability for format
- **Larger model:** 0.8B is near its capacity ceiling. Successful open GRPO reproductions start at 1.5B+
### Technical findings
- **Qwen3.5 DeltaNet needs FLA:** install `flash-linear-attention` + `causal-conv1d`; otherwise the torch fallback is ~10x slower
- **SDPA > FLA for inference:** 3.6x faster first call. Use `attn_implementation="sdpa"`
- **Rewards plateau ~epoch 1.2:** diminishing returns beyond 2 epochs at this scale
- **RL-trained models are few-shot sensitive:** even format-aligned examples hurt (34.1%), suggesting the model confuses example `<think>` tags with its own generation context
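
The SDPA recommendation above amounts to one extra keyword argument at load time. A sketch, assuming a `transformers` version that supports this architecture; `attn_implementation` is a standard `from_pretrained` argument, while the helper name `load_for_inference` and the `INFERENCE_KWARGS` dictionary are illustrative only:

```python
# Keyword arguments for inference; "sdpa" selects PyTorch's
# scaled_dot_product_attention kernels.
INFERENCE_KWARGS = {
    "torch_dtype": "auto",
    "device_map": "auto",
    "trust_remote_code": True,
    "attn_implementation": "sdpa",
}


def load_for_inference(model_name: str = "celestialcreator/Qwen3.5-0.8B-GRPO-Math"):
    """Load the model with SDPA attention (downloads weights on first call)."""
    # Imported lazily so this module can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_name, **INFERENCE_KWARGS)
```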
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "celestialcreator/Qwen3.5-0.8B-GRPO-Math"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Best used in zero-shot: the model has its own reasoning policy
messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks step by step. Show your reasoning inside <think> tags before giving your final answer. End math answers with: #### <number>"},
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it go?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
> **Note:** This model performs best in **zero-shot** mode. Do not use few-shot examples; they conflict with the model's learned reasoning policy and reduce accuracy.
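
To use the result programmatically, the decoded text can be split into its reasoning and its final answer. A minimal sketch, assuming the output follows the `<think>...</think>` plus `#### <number>` layout requested in the system prompt and that the tags survive decoding; `split_response` is a hypothetical helper, not part of the model's tooling:

```python
import re
from typing import Optional, Tuple


def split_response(text: str) -> Tuple[str, Optional[str]]:
    """Return (reasoning, final_answer); final_answer is None if no '####' is found."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    # Use the text that follows the last '####' marker, up to the end of its line.
    answers = re.findall(r"####\s*(.+)", text)
    final_answer = answers[-1].strip() if answers else None
    return reasoning, final_answer


reasoning, answer = split_response(
    "<think>Distance = speed * time = 60 * 2.5 = 150 miles.</think>\n#### 150"
)
print(answer)
```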
## Training Code
Full pipeline (Dockerfile, k8s configs, scripts): [github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning](https://github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning)
## Acknowledgments
- [TRL](https://github.com/huggingface/trl) for the GRPOTrainer implementation
- [Qwen Team](https://github.com/QwenLM) for the base model
- [DeepSeek](https://arxiv.org/abs/2402.03300) for the GRPO algorithm
## Citation
```bibtex
@misc{qwen35-grpo-math-2026,
author = {Akshay Mhaskar},
title = {Qwen3.5-0.8B-GRPO-Math: Teaching a Small Model to Reason with RL},
year = {2026},
url = {https://huggingface.co/celestialcreator/Qwen3.5-0.8B-GRPO-Math},
}
```