---
license: mit
library_name: numpy
tags:
- neural-cellular-automata
- mixture-of-experts
- arithmetic
- tiny-models
- numpy
- from-scratch
datasets: []
pipeline_tag: other
---
# gary-neuron 🧠➕
**A mesh of ~100 neurons that fire asynchronously and, between them, do arithmetic.**
gary-neuron is an **asynchronous Neural Cellular Automaton** whose per-cell update rule is a **top-2 Mixture-of-Experts**. It is not a transformer. It adds integers the way a ripple-carry adder does in silicon, by letting a **carry ripple across a strip of cells**, and it does so in **26,448 parameters of pure numpy** (34 KB int8), with a hand-written autograd engine and **zero ML frameworks**.
Same numpy-only soul as [gary-4-petite](https://huggingface.co/gary23w/gary-4-petite). Different question. petite asked *"can a tiny model speak?"* gary-neuron asks *"what if the model isn't one network, but a mesh of tiny neurons firing out of sync — can that compute?"*
It can. **99.97% exact-match on held-out 7-digit addition; 100% with a 9-vote ensemble.**
---
## The three ideas
gary-neuron is the intersection of three research lines, each contributing one piece:
| Idea | What it gives | Source |
|---|---|---|
| **Neural Cellular Automaton** | A strip of identical cells with one **shared** local update rule. Cell *i* perceives only `[left, self, right]`. | Mordvintsev et al., *Growing NCA* (Distill 2020); *Self-Organising Textures* (Distill 2021) |
| **Asynchrony** | Each step, only a **random subset of cells fire**. Breaks grid symmetry, buys robustness, and — crucially — lets carries settle in any order. | *Mesh Neural Cellular Automata* (arXiv:2311.02820, ACM TOG 2024) |
| **Mixture-of-Experts rule** | The shared rule is a router + **K=6 experts, top-2 gating** — so each firing cell uses only *some* of its neurons. A load-balancing loss makes them specialize. | Shazeer et al., *Sparsely-Gated MoE* (2017); Fedus et al., *Switch Transformer* (2021) |
And the task itself rides on a fourth:
> **Reversed-digit format.** The answer is emitted **least-significant digit first**: `12+34 → 64`, not `46`. This is the single change that flips tiny-model addition from "never quite right" to a sharp phase transition to ~100%, because the model predicts the least-significant digit first, the same direction in which carries flow. *(Lee et al., "Teaching Arithmetic to Small Transformers", arXiv:2307.03381.)*
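
A minimal sketch of the reversed-digit layout, assuming one digit per cell and zero-padding up to the strip width. The helper names `encode_reversed` and `decode_reversed` are illustrative and are not taken from the release:

```python
def encode_reversed(n: int, width: int = 8) -> list[int]:
    """Digits of n, least-significant first, zero-padded to `width` cells."""
    if not 0 <= n < 10 ** width:
        raise ValueError(f"{n} does not fit in {width} digit cells")
    return [(n // 10 ** i) % 10 for i in range(width)]


def decode_reversed(digits: list[int]) -> int:
    """Inverse of encode_reversed: cell i carries the 10**i place."""
    return sum(d * 10 ** i for i, d in enumerate(digits))


print(encode_reversed(12 + 34, width=2))        # [6, 4]
print(decode_reversed(encode_reversed(58323)))  # 58323
```
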
The beautiful part: **addition-with-carry *is* a cellular automaton.** Cell *i* holds digit *i* of each operand; it needs its own two digits and the carry from cell *i−1*. Carry propagation is local message-passing. So the NCA substrate isn't a gimmick bolted onto arithmetic — it's the natural shape of the problem.
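
To make that claim concrete, here is a hand-written (not learned) asynchronous automaton for addition. It is a sketch under stated assumptions: each cell stores its two operand digits, an output digit and a carry-out; a firing cell reads only its left neighbor's carry from the previous step; cells fire independently with probability `p`. The name `async_add` and its parameters are hypothetical, and this is not the rule gary-neuron learns.

```python
import random


def async_add(a, b, cells=8, steps=64, p=0.5, seed=0):
    """Ripple-carry addition as an asynchronous 1-D cellular automaton."""
    rng = random.Random(seed)
    A = [(a // 10 ** i) % 10 for i in range(cells)]   # cell i holds digit i of a
    B = [(b // 10 ** i) % 10 for i in range(cells)]   # ... and digit i of b
    digit = [0] * cells
    carry = [0] * cells                               # carry-out of each cell
    for _ in range(steps):
        prev_carry = carry[:]                         # cells read last step's state
        for i in range(cells):
            if rng.random() < p:                      # only a random subset fires
                carry_in = prev_carry[i - 1] if i > 0 else 0
                total = A[i] + B[i] + carry_in
                digit[i], carry[i] = total % 10, total // 10
    # A carry out of the last cell is dropped, so sums >= 10**cells overflow.
    return sum(d * 10 ** i for i, d in enumerate(digit))


print(async_add(9999999, 1))        # 10000000
print(async_add(1234567, 7654321))  # 8888888
```
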
---
## Stats
| | |
|---|---|
| Parameters | **26,448** |
| Weights (int8) | **34 KB** |
| Full release (model + engine + trainer) | ~40 KB |
| Architecture | async 1-D NCA, 8 cells · state dim 32 · 6 experts (top-2) · 3d→32→d expert MLPs |
| Substrate | reversed-digit strip, carry ripples low→high |
| Training | pure-numpy, CPU only, ~9k steps in 35-s bursts, from-scratch autograd |
| Inference | numpy. that's it. no tokenizer, no torch. |
| Hardware | anything that runs python |
The 6 experts end up **evenly used** (utilization 0.16–0.18 each) — the mesh genuinely distributes work across specialists rather than collapsing to one.
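
One way the shared update rule could be organised, as a sketch only: the cell perceives `[left, self, right]` (3d inputs), a linear router scores the K=6 experts, the two highest-scoring experts run and are mixed by a softmax over their two logits, and only cells picked by a random mask apply the result. The weights here are random, and the ReLU, the residual update, the zero padding at the strip ends, the omitted biases and every name (`perceive`, `moe_step`, `Wr`, `W1`, `W2`) are assumptions for illustration rather than details of `garyneuron.py`. The last line computes utilization as the share of top-2 picks that went to each expert.

```python
import numpy as np

rng = np.random.default_rng(0)
S, D, H, K = 8, 32, 32, 6           # cells, state dim, expert hidden width, experts

# Hypothetical random weights; the released model's parameters are learned.
Wr = rng.normal(0.0, 0.1, (3 * D, K))      # router: perception -> expert logits
W1 = rng.normal(0.0, 0.1, (K, 3 * D, H))   # expert layer 1 (3d -> 32)
W2 = rng.normal(0.0, 0.1, (K, H, D))       # expert layer 2 (32 -> d)


def perceive(state):
    """Stack [left, self, right] for every cell; strip ends see zeros."""
    padded = np.pad(state, ((1, 1), (0, 0)))
    return np.concatenate([padded[:-2], padded[1:-1], padded[2:]], axis=1)


def moe_step(state, fire_p=0.5):
    x = perceive(state)                                  # (S, 3D)
    logits = x @ Wr                                      # (S, K)
    top2 = np.argsort(logits, axis=1)[:, -2:]            # two best experts per cell
    picked = np.take_along_axis(logits, top2, axis=1)    # (S, 2)
    gates = np.exp(picked - picked.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)            # softmax over the two picks
    delta = np.zeros_like(state)
    for slot in range(2):
        for k in range(K):
            rows = np.flatnonzero(top2[:, slot] == k)    # cells routed to expert k
            if rows.size:
                hidden = np.maximum(x[rows] @ W1[k], 0.0)            # ReLU
                delta[rows] += gates[rows, slot][:, None] * (hidden @ W2[k])
    fired = rng.random(S) < fire_p                       # async mask: who fires now
    return state + fired[:, None] * delta, top2, fired


state = rng.normal(size=(S, D))
new_state, top2, fired = moe_step(state)
utilization = np.bincount(top2.ravel(), minlength=K) / top2.size
```
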
---
## How well it adds (measured, held-out, never-trained pairs)
The test space is ~10¹⁴ operand pairs; random train/test overlap is negligible.
| Benchmark | Result |
|---|---|
| **Held-out 10k, ≤7-digit, single async order** | **99.97%** exact-match (mean over 8 random orders, **std 0.02%**) |
| **Held-out 10k, 9-vote async ensemble** | **100%** exact-match |
| Exact-match by operand length (1→7 digits) | 99.9% – 100% across the board |
| Adversarial maximal-carry ripples (22 hand-picked) | 21/22 (the one miss is an 8-digit input — out of range for an 8-cell strip) |
| Random spot-check, 300 sums, vote(9) | 300/300 |
**Robustness to update order** is the headline an async CA should own: across 8 totally different random firing orders, exact-match moves by only **±0.02%**. The computation does not depend on *when* each neuron fires.
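
The 9-vote ensemble in the table above can be read as a plain majority vote over independent async runs. The sketch below assumes voting on whole answers (the release may vote per digit instead) and uses a deterministic stand-in, `flaky_add`, in place of the real mesh; both `vote` and `flaky_add` are hypothetical names.

```python
from collections import Counter


def vote(solve_once, a, b, n_votes=9, seed=0):
    """Run the solver n_votes times with different seeds; return the most common answer."""
    answers = [solve_once(a, b, seed=seed + k) for k in range(n_votes)]
    return Counter(answers).most_common(1)[0][0]


def flaky_add(a, b, seed):
    """Stand-in for one async run: deliberately wrong on every fourth seed."""
    return a + b + (10 if seed % 4 == 3 else 0)


print(vote(flaky_add, 48591, 9732))  # 58323
```

Because each vote uses a different firing order, an error tied to one unlucky order is outvoted as long as most orders agree.
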
### Train short, run a little longer
The mesh is trained at 20 async steps but you can run it longer at inference — classic NCA "iterate toward a fixed point":
```
steps : 8 12 16 20 24 28
exact%: 84.7 98.7 99.9 99.95 99.97 99.94
```
24 steps is the sweet spot; by 28 steps accuracy has started to drift down slightly (it's a learned attractor, not a perfect fixed point). The released engine defaults to 24.
> **Fully-synchronous (every cell fires every step) is *worse*, not better**: the model learned to rely on asynchrony, exactly the symmetry-breaking the asynchronous-NCA literature predicts.
---
## Watch the mesh think
`python solve.py 9999999 1 --show` runs the hardest case, a single `+1` whose carry must ripple through all seven 9s and into the eighth cell, and prints every step. `·` = a cell that didn't fire that step; the number on the right is the live readout.
```
9999999 + 1 (mesh = 8 cells, 6 experts, top-2, async p=0.5, 24 steps)
digit place (10^): 7 6 5 4 3 2 1 0
----------------------------------------------------
step 0 digits: 1 1 1 1 9 9 9 1 | fired(expert#): · · · · 4 4 4 · = 11119991
step 4 digits: 0 1 9 9 9 9 9 0 | fired(expert#): 4 · · 5 5 · · 2 = 1999990
step 8 digits: 1 9 9 9 9 0 0 0 | fired(expert#): · · 5 5 · 4 · 4 = 19999000
step 12 digits: 0 9 0 0 0 0 0 0 | fired(expert#): · · 3 4 · · · 4 = 9000000
step 16 digits: 1 0 0 0 0 0 0 0 | fired(expert#): 4 2 · · · · 0 0 = 10000000
step 23 digits: 1 0 0 0 0 0 0 0 | fired(expert#): · 2 · · · 1 1 · = 10000000
----------------------------------------------------
=> 9999999 + 1 = 10000000 OK
```
You can see the carry climb from cell 0 to cell 7 and the readout lock onto `10000000` by step ~16, then hold steady — a stable attractor. Different experts (4, 5, 2, 3, 0, 1) fire at different cells: *some neurons fire*, which ones depends on the local situation.
---
## Run it
```bash
pip install numpy
python solve.py 1234567 + 7654321 # -> 8888888
python solve.py 9999999 1 --show # watch the carry ripple, step by step
python solve.py --vote 9 48591 + 9732 # robust ensemble over 9 async orders
python solve.py # interactive
```
No tokenizer, no weights download step beyond this repo, no GPU.
## Reproduce / keep training it
The full pure-numpy pipeline is in `training/` — including the from-scratch reverse-mode autograd (`garyneuron.py`), the finite-difference gradient check (`test_grad.py`), the carry-heavy hard-case miner (`data.py`), and the benchmark harness.
```bash
cd training
python test_grad.py # verify the autograd (analytic vs numeric)
SEC=40 MAXDIG=7 HARD=0.35 python train.py # one 40-s training burst (resumes from ckpt)
python benchmark.py main # held-out + adversarial + by-length
python export_int8.py # re-quantize -> release
```
Trained and served entirely in numpy. The autograd, the MoE, the async CA, the int8 packing — all of it, ~700 lines, no frameworks.
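
A gradient check of the kind `test_grad.py` is described as performing compares a hand-derived gradient with central finite differences. The following is a generic numpy sketch on a toy loss, not the contents of `test_grad.py`; `numeric_grad`, `loss` and `analytic_grad` are illustrative names.

```python
import numpy as np


def numeric_grad(f, w, eps=1e-5):
    """Central finite differences of scalar f with respect to every entry of w."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f(w)
        w[idx] = old - eps
        f_minus = f(w)
        w[idx] = old                      # restore the entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
w = rng.normal(size=(4, 3))


def loss(w):
    return 0.5 * np.sum(np.tanh(x @ w) ** 2)


def analytic_grad(w):
    y = np.tanh(x @ w)
    return x.T @ (y * (1.0 - y ** 2))     # chain rule through tanh


a, n = analytic_grad(w), numeric_grad(loss, w)
rel_err = np.max(np.abs(a - n)) / (np.max(np.abs(a)) + np.max(np.abs(n)))
print(f"relative error: {rel_err:.2e}")
```
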
---
## What it can't do (yet)
- **8 cells = 8 output digits.** Sums ≥ 10⁸ don't fit; widen `S` and retrain.
- **The single hardest full-length ripple** is right at the edge of the 24-step dynamics; the 9-vote ensemble cleans it up, but on a wider strip a maximally adversarial carry chain will eventually outrun any fixed step budget. (This is the known hard case for *any* fixed-iteration local model.)
- It adds. That's the whole job. Subtraction/multiplication are future meshes.
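
The fixed-step-budget limit in the second bullet can be illustrated with a toy calculation. It assumes an idealised automaton, not the trained mesh: the carry advances one cell only when the next cell in the chain happens to fire, each cell fires independently with probability `p` per step, and updates read the previous step's state, so at most one hop happens per step. Under those assumptions the number of hops completed in `steps` steps is binomial. `p_chain_completes` is a hypothetical helper, and nothing here is a measurement of gary-neuron.

```python
from math import comb


def p_chain_completes(hops, steps=24, p=0.5):
    """P(at least `hops` of `steps` independent firings succeed), each with probability p."""
    return sum(comb(steps, k) * p ** k * (1 - p) ** (steps - k)
               for k in range(hops, steps + 1))


# Longer chains need more sequential hops, so the same budget covers them less often.
for hops in (4, 8, 12, 16):
    print(hops, round(p_chain_completes(hops), 4))
```

In this toy model the completion probability falls as the chain grows while the budget stays fixed, which is why widening the strip would also call for more steps.
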
## Why this exists
To show that "intelligence" at tiny scale doesn't have to be one monolithic network. gary-neuron is **a hundred-odd neurons, firing out of sync, passing notes to their neighbors**, and the *collective* computes something exact. It's a toy — but it's a toy that makes the mesh-of-specialists idea concrete, measurable, and 34 KB.
## Citations
- N. Lee, K. Sreenivasan, J. D. Lee, K. Lee, D. Papailiopoulos. *Teaching Arithmetic to Small Transformers.* arXiv:2307.03381 (2023).
- A. Mordvintsev, E. Niklasson, et al. *Growing Neural Cellular Automata.* Distill (2020). · E. Niklasson et al. *Self-Organising Textures.* Distill (2021).
- *Mesh Neural Cellular Automata.* arXiv:2311.02820, ACM TOG (2024).
- N. Shazeer et al. *Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.* (2017). · W. Fedus, B. Zoph, N. Shazeer. *Switch Transformers.* (2021).
*Built with numpy. That's it.*