Fix strikethrough rendering (tildes -> approx symbol)
README.md
CHANGED
@@ -55,7 +55,7 @@ Evaluated on the **full GSM8K test set** (1,319 problems), 3 training seeds:
 | metric | GSM8K test |
 |---|---|
 | **accuracy (symbolic verifier)** | **11.8%** mean — 12.6% best seed |
-| accuracy (plurality vote, no verifier) |
+| accuracy (plurality vote, no verifier) | ≈9.3% |
 | trainable parameters | **9.37M** |
 | LLM used at inference | **none** |
 
@@ -71,7 +71,7 @@ stable across seeds on the test set (range 11.1–12.6%).
 - **Self-consistency + free verifier.** 96 sampled programs are scored by a 0-parameter
 symbolic verifier (number-coverage, magnitude sanity, intermediate-value sanity), tie-broken
 by vote frequency.
-- **Data is the main lever.** Trained on real GSM8K-train plus
+- **Data is the main lever.** Trained on real GSM8K-train plus ≈117K LLM-generated
 GSM8K-style problems (Claude + Gemini). What mattered most was **matching the real GSM8K
 step-distribution** and **rigorous decontamination** (0% test overlap), not raw data volume
 or model size — a deeper/bigger model did not help beyond noise.
@@ -102,7 +102,7 @@ dependencies beyond `mlx` and `numpy`.
 ## Limitations
 
 This is a research model demonstrating how far a tiny, LLM-free, from-scratch solver can go
-on GSM8K (
+on GSM8K (≈12%). It handles 1–4 step arithmetic word problems with common operations; it
 misses many multi-step problems that require deeper reading comprehension. It is not a
 general math model and should not be used as one.
 
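
The "Self-consistency + free verifier" bullet in the second hunk describes scoring sampled programs with a parameter-free verifier and breaking ties by vote frequency. A minimal sketch of that selection step follows. It is not the model's actual code: the `Candidate` structure, the function names, the equal weighting of the three checks, and the `max_magnitude` threshold are all assumptions made for illustration; only the three check names and the tie-break rule come from the README text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One sampled program after execution (hypothetical structure)."""
    answer: float              # final value the program produced
    used_numbers: frozenset    # numbers from the problem the program referenced
    intermediates: tuple       # values of the program's intermediate steps


def verifier_score(c, problem_numbers, max_magnitude=1e9):
    """Parameter-free score from three assumed checks; higher is better."""
    score = 0.0
    # Number coverage: fraction of the problem's numbers the program used.
    if problem_numbers:
        score += len(c.used_numbers & problem_numbers) / len(problem_numbers)
    # Magnitude sanity: final answer is non-negative and not absurdly large.
    if 0 <= c.answer <= max_magnitude:
        score += 1
    # Intermediate-value sanity: every step value passes the same range check.
    if all(0 <= v <= max_magnitude for v in c.intermediates):
        score += 1
    return score


def select_answer(candidates, problem_numbers):
    """Return the best-scoring answer; ties go to the more frequent answer."""
    votes = Counter(c.answer for c in candidates)
    best = max(
        candidates,
        key=lambda c: (verifier_score(c, problem_numbers), votes[c.answer]),
    )
    return best.answer


def plurality_answer(candidates):
    """Baseline without the verifier: the most frequently sampled answer."""
    return Counter(c.answer for c in candidates).most_common(1)[0][0]


# Toy data: two samples ignore the number 5, one sample uses every number.
problem_numbers = frozenset({3, 4, 5})
samples = [
    Candidate(12.0, frozenset({3, 4}), (12.0,)),
    Candidate(12.0, frozenset({3, 4}), (12.0,)),
    Candidate(17.0, frozenset({3, 4, 5}), (12.0, 17.0)),
]
print(select_answer(samples, problem_numbers))  # verifier-ranked choice
print(plurality_answer(samples))                # vote-only baseline
```

On this toy input the two strategies disagree: the verifier prefers the lone sample that covers all of the problem's numbers, while plurality voting follows the majority. That mirrors the table's two accuracy rows, which report the verifier and the vote-only baseline separately.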