codelion/sprog-9m

@@ -55,7 +55,7 @@ Evaluated on the **full GSM8K test set** (1,319 problems), 3 training seeds:
 | metric | GSM8K test |
 |---|---|
 | **accuracy (symbolic verifier)** | **11.8%** mean — 12.6% best seed |
-| accuracy (plurality vote, no verifier) | ~9.3% |
+| accuracy (plurality vote, no verifier) | ≈9.3% |
 | trainable parameters | **9.37M** |
 | LLM used at inference | **none** |
@@ -71,7 +71,7 @@ stable across seeds on the test set (range 11.1–12.6%).
 - **Self-consistency + free verifier.** 96 sampled programs are scored by a 0-parameter
   symbolic verifier (number-coverage, magnitude sanity, intermediate-value sanity), tie-broken
   by vote frequency.
-- **Data is the main lever.** Trained on real GSM8K-train plus ~117K LLM-generated
+- **Data is the main lever.** Trained on real GSM8K-train plus ≈117K LLM-generated
   GSM8K-style problems (Claude + Gemini). What mattered most was **matching the real GSM8K
   step-distribution** and **rigorous decontamination** (0% test overlap), not raw data volume
   or model size — a deeper/bigger model did not help beyond noise.
@@ -102,7 +102,7 @@ dependencies beyond `mlx` and `numpy`.
 ## Limitations
 This is a research model demonstrating how far a tiny, LLM-free, from-scratch solver can go
-on GSM8K (~12%). It handles 1–4 step arithmetic word problems with common operations; it
+on GSM8K (≈12%). It handles 1–4 step arithmetic word problems with common operations; it
 misses many multi-step problems that require deeper reading comprehension. It is not a
 general math model and should not be used as one.
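
For readers comparing the two accuracy rows in the table, the difference between plain plurality voting and the verifier-scored selection described in the "Self-consistency + free verifier" bullet could be sketched roughly as below. This is a minimal illustration and not the repository's code: the function names, the `(answer, score)` pair layout, and the choice to keep each answer's best score are all assumptions, and the verifier scores are passed in as plain numbers rather than computed from number-coverage or magnitude checks.

```python
from collections import Counter


def plurality_vote(answers):
    """Return the most frequent answer among the sampled programs (no verifier)."""
    return Counter(answers).most_common(1)[0][0]


def verified_vote(candidates):
    """Pick an answer by verifier score, breaking ties by vote frequency.

    `candidates` is a list of (answer, verifier_score) pairs, one per sampled
    program. Both the pair layout and the max-aggregation are hypothetical.
    """
    votes = Counter(answer for answer, _ in candidates)
    best_score = {}
    for answer, score in candidates:
        # Keep the highest score any program producing this answer received.
        best_score[answer] = max(score, best_score.get(answer, float("-inf")))
    # Sort key: verifier score first, then how many programs agreed.
    return max(best_score, key=lambda a: (best_score[a], votes[a]))


# Toy example: 7 wins the raw vote, but 18 and 20 score higher with the
# verifier, and 18 has more votes than 20.
samples = [(18, 2), (18, 2), (20, 2), (7, 1), (7, 1), (7, 1)]
print(plurality_vote([a for a, _ in samples]))
print(verified_vote(samples))
```

In the toy data the two rules disagree, which is the kind of case where a verifier can change the final answer relative to a plain vote.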