---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B
tags:
- reasoning
- math
- grpo
- reinforcement-learning
- rlvr
- qwen3.5
datasets:
- gsm8k
- celestialcreator/Qwen3.5-0.8B-GRPO-Math-Dataset
pipeline_tag: text-generation
---

# Qwen3.5-0.8B-GRPO-Math

A reasoning-enhanced version of [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B), trained with **GRPO (Group Relative Policy Optimization)**, the RL technique behind DeepSeek-R1, on a single RTX 5090.

## Results

| Eval Setting | GSM8K Accuracy | Notes |
|---|:-:|---|
| Baseline 8-shot CoT | 53.5% | Pre-trained, no fine-tuning |
| Baseline zero-shot | 52.1% | Pre-trained, no fine-tuning |
| **GRPO zero-shot** | **58.0% (+5.9pp)** | Best result: the model reasons autonomously |
| GRPO 8-shot (plain format) | 50.4% (-3.1pp) | Few-shot examples conflict with the learned policy |
| GRPO 8-shot (`<think>` aligned) | 34.1% (-19.4pp) | Format-aligned examples hurt even more |

### Key Finding: Demonstration-to-Policy Shift

GRPO training shifted the model from **demonstration-based reasoning** to **policy-based reasoning**.

After training, the model:
- **Performs best zero-shot**: it reasons autonomously using `<think>` tags
- **Is hurt by few-shot examples**: any demonstrations conflict with its learned internal reasoning policy
- **Is hurt even more by format-aligned few-shot**: `<think>` tags in the examples caused the model to confuse the in-context demonstrations with its own generation, dropping accuracy to 34.1%

This is a behavioral shift, not a regression: the model no longer needs (or wants) demonstrations. It mirrors what DeepSeek-R1 demonstrated at 670B scale.

## Training Pipeline

### Phase 1: SFT Warmup
- **Data:** [3,558 reasoning examples](https://huggingface.co/datasets/celestialcreator/Qwen3.5-0.8B-GRPO-Math-Dataset) from 3 sources, standardized to `<think>` tags
- **Purpose:** solve the cold-start problem: teach the 0.8B model the `<think>` tag format before RL exploration
- **Stats:** 1 epoch, loss 0.932, 78% token accuracy
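
The standardization step can be sketched as follows. This is a hypothetical template, not the project's actual preprocessing code; it assumes each source example pairs a question with free-form reasoning and a final numeric answer:

```python
def format_sft_example(question: str, reasoning: str, answer: str) -> str:
    """Standardize one example to the <think> + '####' target format
    described above (the exact template used in training is assumed)."""
    return (
        f"{question}\n"
        f"<think>\n{reasoning.strip()}\n</think>\n"
        f"#### {answer}"
    )

sample = format_sft_example(
    "A book costs $12. How much do 3 books cost?",
    "3 * 12 = 36",
    "36",
)
```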

### Phase 2: GRPO Training
- **Data:** GSM8K train split (7,473 math word problems)
- **Rewards:** math correctness (1.0/0.0) plus a format reward (0.3 for `<think>` tags, 0.2 for a `####` answer)
- **Config:** 8 generations/prompt, batch size 1 × 8 gradient accumulation, lr 1e-6, beta = 0.04
- **Hardware:** single NVIDIA RTX 5090 (32 GB VRAM)
- **Duration:** ~77 hours, 15,900 steps (stopped at epoch 2.13, after rewards had plateaued)
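
The reward scheme above can be sketched as a small verifiable-reward function. This is a minimal illustration, not the project's actual reward code; `extract_answer` and its exact matching rules are assumptions:

```python
import re

def extract_answer(text):
    """Pull the number after the '####' delimiter (GSM8K answer format)."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    return m.group(1).replace(",", "") if m else None

def reward(completion, gold):
    """Correctness (1.0/0.0) plus format bonuses: 0.3 for a complete
    <think> block, 0.2 for a '####' answer line, per the scheme above."""
    r = 1.0 if extract_answer(completion) == gold else 0.0
    if "<think>" in completion and "</think>" in completion:
        r += 0.3
    if "####" in completion:
        r += 0.2
    return r
```

Because the correctness signal is an exact answer check rather than a learned reward model, it cannot be gamed by plausible-sounding but wrong reasoning.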

### What is GRPO?

Unlike PPO, GRPO needs no separate reward model or critic network. For each prompt, it:
1. Samples G completions from the policy
2. Scores each with a verifiable reward (exact match on the math answer)
3. Normalizes rewards within the group to get relative advantages
4. Updates the policy using a clipped surrogate objective

Only two models stay in memory (policy + reference) instead of PPO's four (policy, reference, reward model, critic), which makes training feasible on consumer GPUs.
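
Step 3 is the core trick and fits in a few lines. A minimal sketch of the group-relative normalization (not TRL's implementation, which adds clipping and KL terms around it):

```python
def group_advantages(rewards, eps=1e-4):
    """GRPO's critic-free advantage: normalize each reward against its
    own group's statistics, A_i = (r_i - mean) / (std + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Completions that beat their group's average get a positive advantage and are reinforced; the rest are suppressed, with no value network needed to estimate a baseline.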

## Lessons Learned

### What worked
- **GRPO improved zero-shot reasoning**: +5.9pp; the model internalized step-by-step thinking
- **Demonstration-to-policy shift**: the model developed its own reasoning strategy instead of relying on examples
- **Format + correctness rewards together**: the `<think>` tag bonus helped the model learn structured reasoning alongside accuracy
- **A single consumer GPU is viable**: the full pipeline ran on one RTX 5090

### What we'd do differently
- **Eval after SFT**: we skipped this, so we can't isolate SFT's contribution
- **Try GRPO without SFT**: this ablation would show whether the SFT warmup is necessary, or whether it trades few-shot ability for format compliance
- **Use a larger model**: 0.8B is near its capacity ceiling; successful open GRPO reproductions start at 1.5B+

### Technical findings
- **Qwen3.5 DeltaNet needs FLA**: install `flash-linear-attention` + `causal-conv1d`; otherwise the torch fallback is ~10x slower
- **SDPA > FLA for inference**: 3.6x faster on the first call; use `attn_implementation="sdpa"`
- **Rewards plateau around epoch 1.2**: diminishing returns beyond 2 epochs at this scale
- **RL-trained models are few-shot sensitive**: even format-aligned examples hurt (34.1%), suggesting the model confuses example `<think>` tags with its own generation context

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "celestialcreator/Qwen3.5-0.8B-GRPO-Math"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Best used zero-shot: the model has its own reasoning policy
messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks step by step. Show your reasoning inside <think> tags before giving your final answer. End math answers with: #### <number>"},
    {"role": "user", "content": "If a train travels at 60 mph for 2.5 hours, how far does it go?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
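
To consume the output programmatically, the `<think>` block and the `####` answer can be separated with a small parser. A sketch: the format follows the system prompt above, but real generations are not guaranteed to comply, so both fields may come back `None`:

```python
import re

def split_output(text):
    """Separate the <think> reasoning from the final '####' answer."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"####\s*(\S+)", text)
    return (
        think.group(1).strip() if think else None,
        answer.group(1) if answer else None,
    )
```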

> **Note:** This model performs best **zero-shot**. Do not use few-shot examples; they conflict with the model's learned reasoning policy and reduce accuracy.

## Training Code

Full pipeline (Dockerfile, k8s configs, scripts): [github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning](https://github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning)

## Acknowledgments

- [TRL](https://github.com/huggingface/trl) for the GRPOTrainer implementation
- [Qwen Team](https://github.com/QwenLM) for the base model
- [DeepSeek](https://arxiv.org/abs/2402.03300) for the GRPO algorithm

## Citation

```bibtex
@misc{qwen35-grpo-math-2026,
  author = {Akshay Mhaskar},
  title  = {Qwen3.5-0.8B-GRPO-Math: Teaching a Small Model to Reason with RL},
  year   = {2026},
  url    = {https://huggingface.co/celestialcreator/Qwen3.5-0.8B-GRPO-Math},
}
```