---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B
tags:
- reasoning
- math
- grpo
- reinforcement-learning
- rlvr
- qwen3.5
datasets:
- gsm8k
- celestialcreator/Qwen3.5-0.8B-GRPO-Math-Dataset
pipeline_tag: text-generation
---

# Qwen3.5-0.8B-GRPO-Math

A reasoning-enhanced version of [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B), trained with **GRPO (Group Relative Policy Optimization)**, the RL technique behind DeepSeek-R1, on a single RTX 5090.

## Results

| Eval Setting | GSM8K Accuracy | Notes |
|---|:-:|---|
| Baseline 8-shot CoT | 53.5% | Pre-trained, no fine-tuning |
| Baseline zero-shot | 52.1% | Pre-trained, no fine-tuning |
| **GRPO zero-shot** | **58.0% (+5.9pp)** | Best result: the model reasons autonomously |
| GRPO 8-shot (plain format) | 50.4% (-3.1pp) | Few-shot examples conflict with the learned policy |
| GRPO 8-shot (`<think>` aligned) | 34.1% (-19.4pp) | Format-aligned examples hurt even more |

### Key Finding: Demonstration-to-Policy Shift

GRPO training shifted the model from **demonstration-based reasoning** to **policy-based reasoning**.

After training, the model:
- **Performs best zero-shot**: it reasons autonomously using `<think>` tags
- **Is hurt by few-shot examples**: any demonstrations conflict with its learned internal reasoning policy
- **Is hurt even more by format-aligned few-shot**: `<think>` tags in the examples caused the model to confuse the in-context demonstrations with its own generation, dropping accuracy to 34.1%

This is a behavioral shift, not a regression: the model no longer needs (or wants) demonstrations. It mirrors what DeepSeek-R1 demonstrated at 670B scale.

## Training Pipeline

### Phase 1: SFT Warmup
- **Data:** [3,558 reasoning examples](https://huggingface.co/datasets/celestialcreator/Qwen3.5-0.8B-GRPO-Math-Dataset) from 3 sources, standardized to `<think>` tags
- **Purpose:** solve the cold-start problem: teach the 0.8B model the `<think>` tag format before RL exploration
- **Stats:** 1 epoch, loss 0.932, 78% token accuracy
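
The standardization step can be sketched as follows. This is a hypothetical template, not the project's actual preprocessing code; it assumes each source example pairs a question with free-form reasoning and a final numeric answer:

```python
def format_sft_example(question: str, reasoning: str, answer: str) -> str:
    """Standardize one example to the <think> + '####' target format
    described above (the exact template used in training is assumed)."""
    return (
        f"{question}\n"
        f"<think>\n{reasoning.strip()}\n</think>\n"
        f"#### {answer}"
    )

sample = format_sft_example(
    "A book costs $12. How much do 3 books cost?",
    "3 * 12 = 36",
    "36",
)
```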

### Phase 2: GRPO Training
- **Data:** GSM8K train split (7,473 math word problems)
- **Rewards:** math correctness (1.0/0.0) plus a format reward (0.3 for `<think>` tags, 0.2 for a `####` answer)
- **Config:** 8 generations/prompt, batch size 1 × 8 gradient accumulation, lr 1e-6, beta = 0.04
- **Hardware:** single NVIDIA RTX 5090 (32 GB VRAM)
- **Duration:** ~77 hours, 15,900 steps (stopped at epoch 2.13, after rewards had plateaued)
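
The reward scheme above can be sketched as a small verifiable-reward function. This is a minimal illustration, not the project's actual reward code; `extract_answer` and its exact matching rules are assumptions:

```python
import re

def extract_answer(text):
    """Pull the number after the '####' delimiter (GSM8K answer format)."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    return m.group(1).replace(",", "") if m else None

def reward(completion, gold):
    """Correctness (1.0/0.0) plus format bonuses: 0.3 for a complete
    <think> block, 0.2 for a '####' answer line, per the scheme above."""
    r = 1.0 if extract_answer(completion) == gold else 0.0
    if "<think>" in completion and "</think>" in completion:
        r += 0.3
    if "####" in completion:
        r += 0.2
    return r
```

Because the correctness signal is an exact answer check rather than a learned reward model, it cannot be gamed by plausible-sounding but wrong reasoning.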

### What is GRPO?

Unlike PPO, GRPO needs no separate reward model or critic network. For each prompt, it:
1. Samples G completions from the policy
2. Scores each with a verifiable reward (exact match on the math answer)
3. Normalizes rewards within the group to get relative advantages
4. Updates the policy using a clipped surrogate objective

Only two models stay in memory (policy + reference) instead of PPO's four (policy, reference, reward model, critic), which makes training feasible on consumer GPUs.
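
Step 3 is the core trick and fits in a few lines. A minimal sketch of the group-relative normalization (not TRL's implementation, which adds clipping and KL terms around it):

```python
def group_advantages(rewards, eps=1e-4):
    """GRPO's critic-free advantage: normalize each reward against its
    own group's statistics, A_i = (r_i - mean) / (std + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Completions that beat their group's average get a positive advantage and are reinforced; the rest are suppressed, with no value network needed to estimate a baseline.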

## Lessons Learned

### What worked
- **GRPO improved zero-shot reasoning**: +5.9pp; the model internalized step-by-step thinking
- **Demonstration-to-policy shift**: the model developed its own reasoning strategy instead of relying on examples
- **Format + correctness rewards together**: the `<think>` tag bonus helped the model learn structured reasoning alongside accuracy
- **A single consumer GPU is viable**: the full pipeline ran on one RTX 5090

### What we'd do differently
- **Eval after SFT**: we skipped this, so we can't isolate SFT's contribution
- **Try GRPO without SFT**: this ablation would show whether the SFT warmup is necessary, or whether it trades few-shot ability for format compliance
- **Use a larger model**: 0.8B is near its capacity ceiling; successful open GRPO reproductions start at 1.5B+

### Technical findings
- **Qwen3.5 DeltaNet needs FLA**: install `flash-linear-attention` + `causal-conv1d`; otherwise the torch fallback is ~10x slower
- **SDPA > FLA for inference**: 3.6x faster on the first call; use `attn_implementation="sdpa"`
- **Rewards plateau around epoch 1.2**: diminishing returns beyond 2 epochs at this scale
- **RL-trained models are few-shot sensitive**: even format-aligned examples hurt (34.1%), suggesting the model confuses example `<think>` tags with its own generation context

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "celestialcreator/Qwen3.5-0.8B-GRPO-Math"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Best used zero-shot: the model has its own reasoning policy
messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks step by step. Show your reasoning inside <think> tags before giving your final answer. End math answers with: #### <number>"},
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it go?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
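
To consume the output programmatically, the `<think>` block and the `####` answer can be separated with a small parser. A sketch: the format follows the system prompt above, but real generations are not guaranteed to comply, so both fields may come back `None`:

```python
import re

def split_output(text):
    """Separate the <think> reasoning from the final '####' answer."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"####\s*(\S+)", text)
    return (
        think.group(1).strip() if think else None,
        answer.group(1) if answer else None,
    )
```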

> **Note:** This model performs best **zero-shot**. Do not use few-shot examples; they conflict with the model's learned reasoning policy and reduce accuracy.

## Training Code

Full pipeline (Dockerfile, k8s configs, scripts): [github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning](https://github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning)

## Acknowledgments

- [TRL](https://github.com/huggingface/trl) for the GRPOTrainer implementation
- [Qwen Team](https://github.com/QwenLM) for the base model
- [DeepSeek](https://arxiv.org/abs/2402.03300) for the GRPO algorithm

## Citation

```bibtex
@misc{qwen35-grpo-math-2026,
  author = {Akshay Mhaskar},
  title  = {Qwen3.5-0.8B-GRPO-Math: Teaching a Small Model to Reason with RL},
  year   = {2026},
  url    = {https://huggingface.co/celestialcreator/Qwen3.5-0.8B-GRPO-Math},
}
```