celestialcreator committed · verified
Commit 57b9a33 · 1 Parent(s): 6bc26de

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +34 -18
README.md CHANGED
@@ -22,26 +22,32 @@ A reasoning-enhanced version of [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3
 
 ## Results
 
- | Model | GSM8K 8-shot CoT | GSM8K Zero-shot |
- |-------|:-----------------:|:---------------:|
- | Qwen3.5-0.8B (baseline) | 53.5% | 52.1% |
- | **Qwen3.5-0.8B-GRPO-Math** | 50.4% | **58.0% (+5.9pp)** |
+ | Model | GSM8K 8-shot CoT | GSM8K Zero-shot | Notes |
+ |-------|:-----------------:|:---------------:|-------|
+ | Qwen3.5-0.8B (baseline) | 53.5% | 52.1% | Pre-trained, no fine-tuning |
+ | + SFT only | not evaluated | not evaluated | Teaches `<think>` format |
+ | **+ SFT + GRPO (ours)** | 50.4% (-3.1pp) | **58.0% (+5.9pp)** | Internalized reasoning |
 
- The model was trained to reason using `<think>` tags. **Zero-shot performance improved by +5.9 percentage points** because the model internalized step-by-step reasoning — it no longer needs few-shot examples. The 8-shot drop is expected: few-shot examples conflict with the model's learned reasoning format.
+ **Zero-shot improved by +5.9pp** — the model internalized step-by-step reasoning and no longer needs few-shot examples. The 8-shot drop (-3.1pp) is because few-shot examples conflict with the model's learned `<think>` tag format.
+
+ > **Honest note:** We did not evaluate the SFT-only checkpoint, so we cannot isolate whether the 8-shot regression came from SFT (format overwrite) or GRPO. See [Lessons Learned](#lessons-learned).
 
 ## Training Pipeline
 
 ### Phase 1: SFT Warmup
- - **Data:** [3,558 reasoning examples](https://huggingface.co/datasets/celestialcreator/Qwen3.5-0.8B-GRPO-Math-Dataset) (Claude-generated math chains + community reasoning data)
- - **Purpose:** Teach the model `<think>` tag format before RL
+ - **Data:** [3,558 reasoning examples](https://huggingface.co/datasets/celestialcreator/Qwen3.5-0.8B-GRPO-Math-Dataset) from 3 sources, standardized to `<think>` tags
+   - 1K Claude Sonnet math chains (originally `<thought>` → converted to `<think>`)
+   - ~250 from TeichAI Opus reasoning (already `<think>`)
+   - ~2.3K from Opus 4.6 Reasoning (thinking/solution → combined into `<think>`)
+ - **Purpose:** Solve the cold-start problem — teach the 0.8B model `<think>` tag format before RL exploration
 - **Stats:** 1 epoch, loss 0.932, 78% token accuracy
 
 ### Phase 2: GRPO Training
 - **Data:** GSM8K train split (7,473 math word problems)
 - **Rewards:** Math correctness (1.0/0.0) + format reward (0.3 for `<think>` tags, 0.2 for `####` answer)
- - **Config:** 8 generations/prompt, batch size 1 x 8 grad accum, lr 1e-6, beta=0.04
+ - **Config:** 8 generations/prompt, batch size 1 × 8 grad accum, lr 1e-6, β=0.04
 - **Hardware:** Single NVIDIA RTX 5090 (32GB VRAM)
- - **Duration:** ~77 hours, 15,900 steps (epoch 2.13)
+ - **Duration:** ~77 hours, 15,900 steps (epoch 2.13, rewards had plateaued)
 
 ### What is GRPO?
 
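The reward design in the hunk above (binary math correctness plus format bonuses: 0.3 for `<think>` tags, 0.2 for a `####` answer line) can be sketched in a few lines. This is an illustrative reconstruction, not the repo's actual reward code; the function names and the exact answer-matching regex are assumptions:

```python
import re

def format_reward(completion: str) -> float:
    """Format bonuses from the README: 0.3 for a <think>...</think> block,
    0.2 for a GSM8K-style '#### <answer>' line."""
    reward = 0.0
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        reward += 0.3
    if re.search(r"^#### ", completion, re.MULTILINE):
        reward += 0.2
    return reward

def correctness_reward(completion: str, gold_answer: str) -> float:
    """Binary 1.0/0.0: compare the number after '####' to the gold answer."""
    match = re.search(r"####\s*(-?[\d,.]+)", completion)
    if match is None:
        return 0.0
    predicted = match.group(1).replace(",", "").rstrip(".")
    return 1.0 if predicted == gold_answer else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    # Correctness and format rewards are summed, as described in Phase 2.
    return correctness_reward(completion, gold_answer) + format_reward(completion)
```

Under this scheme a correct, well-formatted completion scores 1.5 (1.0 + 0.3 + 0.2), so the format signal is present but subordinate to correctness.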
@@ -51,7 +57,24 @@ GRPO eliminates the need for a separate reward model and critic network (unlike 
 3. Normalizes rewards within the group (relative advantage)
 4. Updates the policy using a clipped surrogate objective
 
- This means only 2 models in memory (policy + reference) instead of 4, making it feasible on consumer GPUs.
+ Only 2 models in memory (policy + reference) instead of 4 — feasible on consumer GPUs.
+
+ ## Lessons Learned
+
+ ### What worked
+ - **GRPO improved zero-shot reasoning** — +5.9pp on GSM8K; the model internalized step-by-step thinking
+ - **Format + correctness rewards together** — the `<think>` tag bonus helped the model learn structured reasoning alongside accuracy
+ - **Single consumer GPU is viable** — full pipeline on one RTX 5090, using ~10-12 GB of 32 GB
+
+ ### What we'd do differently
+ - **Eval after SFT** — we skipped this, so we can't isolate SFT's contribution. Critical missing data point.
+ - **Try GRPO without SFT** — an ablation would show whether SFT warmup is necessary or trades few-shot ability for format
+ - **Larger model** — 0.8B is near its capacity ceiling; successful open GRPO reproductions start at 1.5B+
+
+ ### Technical findings
+ - **Qwen3.5 DeltaNet needs FLA** — install `flash-linear-attention` + `causal-conv1d`; otherwise the torch fallback is ~10x slower
+ - **SDPA > FLA for inference** — 3.6x faster on the first call; use `attn_implementation="sdpa"` for eval
+ - **Rewards plateau ~epoch 1.2** — math_reward at 0.45-0.53, diminishing returns beyond 2 epochs at this scale
 
 ## Usage
 
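The normalization in step 3 is the heart of GRPO: each completion's reward is compared against its own group's statistics, so the group itself acts as the baseline and no learned critic is needed. A minimal sketch of the group-relative advantage (plain Python; the function name is ours, not the TRL implementation):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style normalization for one group of G completions:
    advantage_i = (r_i - mean(group)) / (std(group) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps keeps the division defined when all rewards in the group are equal
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Completions that beat their group's average get positive advantages and are reinforced; the clipped surrogate update then proceeds as in PPO, but with these group-normalized advantages in place of critic-based estimates.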
@@ -80,14 +103,7 @@ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_t
 
 ## Training Code
 
- Full training pipeline (Dockerfile, k8s configs, scripts): [github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning](https://github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning)
-
- ## Key Findings
-
- - **Qwen3.5-0.8B uses DeltaNet** (hybrid Gated DeltaNet + Gated Attention layers). Install `flash-linear-attention` + `causal-conv1d` for fast generation.
- - **SDPA > FLA for inference** — 3.6x faster on first call. Use `attn_implementation="sdpa"`.
- - **Zero-shot is the right eval for RL-trained reasoning models** — few-shot examples conflict with learned reasoning patterns.
- - **0.8B is near the capacity ceiling for GRPO** — the model internalizes reasoning format but has limited room for math accuracy gains. Consider 1.5B+ for stronger results.
+ Full pipeline (Dockerfile, k8s configs, scripts): [github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning](https://github.com/CelestialCreator/gpu-lab/tree/main/projects/05-grpo-reasoning)
 
 ## Acknowledgments
 