---
license: mit
library_name: numpy
tags:
  - neural-cellular-automata
  - mixture-of-experts
  - arithmetic
  - tiny-models
  - numpy
  - from-scratch
datasets: []
pipeline_tag: other
---

# gary-neuron 🧠➕

**A mesh of ~100 neurons that fire asynchronously and, between them, do arithmetic.**

gary-neuron is an **asynchronous Neural Cellular Automaton** whose per-cell update rule is a **top-2 Mixture-of-Experts**. It is not a transformer. It adds integers the way silicon actually does — by letting a **carry ripple across a strip of cells** — and it does so in **26,448 parameters of pure numpy** (34 KB int8), with a hand-written autograd engine and **zero ML frameworks**.

Same numpy-only soul as [gary-4-petite](https://huggingface.co/gary23w/gary-4-petite). Different question. petite asked *"can a tiny model speak?"* gary-neuron asks *"what if the model isn't one network, but a mesh of tiny neurons firing out of sync — can that compute?"*

It can. **99.97% exact-match on held-out 7-digit addition; 100% with a 9-vote ensemble.**

---

## The three ideas

gary-neuron is the intersection of three research lines, each contributing one piece:

| Idea | What it gives | Source |
|---|---|---|
| **Neural Cellular Automaton** | A strip of identical cells with one **shared** local update rule. Cell *i* perceives only `[left, self, right]`. | Mordvintsev et al., *Growing NCA* (Distill 2020); *Self-Organising Textures* (Distill 2021) |
| **Asynchrony** | Each step, only a **random subset of cells fire**. Breaks grid symmetry, buys robustness, and — crucially — lets carries settle in any order. | *Mesh Neural Cellular Automata* (arXiv:2311.02820, ACM TOG 2024) |
| **Mixture-of-Experts rule** | The shared rule is a router + **K=6 experts, top-2 gating** — so each firing cell uses only *some* of its neurons. A load-balancing loss makes them specialize. | Shazeer et al., *Sparsely-Gated MoE* (2017); Fedus et al., *Switch Transformer* (2021) |

And the task itself rides on a fourth:

> **Reversed-digit format.** The answer is emitted **least-significant digit first** — `12+34 → 64`, not `46`. This is the single change that flips tiny-model addition from "never quite right" to a sharp phase transition to ~100%, because the model predicts the LSB first, the same direction carries flow. *(Lee et al., "Teaching Arithmetic to Small Transformers", arXiv:2307.03381.)*

The beautiful part: **addition-with-carry *is* a cellular automaton.** Cell *i* holds digit *i* of each operand; it needs its own two digits and the carry from cell *i−1*. Carry propagation is local message-passing. So the NCA substrate isn't a gimmick bolted onto arithmetic — it's the natural shape of the problem.
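
To make the layout concrete, here is a minimal sketch of the reversed-digit strip and the textbook carry rule it encodes. This is a hand-written reference, not the learned rule; the helper names `to_reversed_digits`, `from_reversed_digits`, and `ripple_add` are hypothetical and do not come from the released code.

```python
def to_reversed_digits(n, width):
    """Return the `width` lowest decimal digits of n, least-significant first."""
    return [(n // 10**i) % 10 for i in range(width)]


def from_reversed_digits(digits):
    """Inverse mapping: digits[i] is the coefficient of 10**i."""
    return sum(d * 10**i for i, d in enumerate(digits))


def ripple_add(a, b, cells=8):
    """Schoolbook addition on a strip: cell i sees its two digits and carry i-1."""
    da, db = to_reversed_digits(a, cells), to_reversed_digits(b, cells)
    out, carry = [], 0
    for i in range(cells):               # low -> high, the direction carries flow
        total = da[i] + db[i] + carry    # own two digits + carry from cell i-1
        out.append(total % 10)
        carry = total // 10
    return out                           # a carry out of the last cell is dropped


print(to_reversed_digits(12 + 34, 2))                 # [6, 4]
print(from_reversed_digits(ripple_add(9999999, 1)))   # 10000000
```

The loop body only ever reads cell *i* and the carry from cell *i−1*, which is the locality the NCA rule has to learn.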

---

## Stats

| | |
|---|---|
| Parameters | **26,448** |
| Weights (int8) | **34 KB** |
| Full release (model + engine + trainer) | ~40 KB |
| Architecture | async 1-D NCA, 8 cells · state dim 32 · 6 experts (top-2) · 3d→32→d expert MLPs |
| Substrate | reversed-digit strip, carry ripples low→high |
| Training | pure-numpy, CPU only, ~9k steps in 35-s bursts, from-scratch autograd |
| Inference | numpy. that's it. no tokenizer, no torch. |
| Hardware | anything that runs python |

The 6 experts end up **evenly used** (utilization 0.16–0.18 each) — the mesh genuinely distributes work across specialists rather than collapsing to one.
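
For readers who want to see what "top-2 gating" and "utilization" mean in code, the following is a rough numpy sketch under stated assumptions. The shapes follow the Stats table (perception `3d`, hidden 32, output `d`, 6 experts), but the ReLU activation, the softmax over the two selected logits, the random weights, and the function name `top2_moe` are illustrative assumptions, not the released implementation.

```python
import numpy as np


def top2_moe(x, router_w, experts):
    """Route each row of x to its two highest-scoring experts.

    x:        (n, d_in) perception vectors
    router_w: (d_in, K) router weights
    experts:  list of K tuples (w1, b1, w2, b2)
    Returns (output, utilization); utilization sums to 1 over experts.
    """
    logits = x @ router_w                                 # (n, K)
    top2 = np.argsort(logits, axis=1)[:, -2:]             # two best experts per row
    picked = np.take_along_axis(logits, top2, axis=1)     # their logits, (n, 2)
    # Softmax over only the two selected logits gives the mixing weights.
    gate = np.exp(picked - picked.max(axis=1, keepdims=True))
    gate /= gate.sum(axis=1, keepdims=True)

    out = np.zeros((x.shape[0], experts[0][2].shape[1]))
    counts = np.zeros(len(experts))
    for k, (w1, b1, w2, b2) in enumerate(experts):
        rows, slot = np.nonzero(top2 == k)                # rows that chose expert k
        if rows.size:
            h = np.maximum(x[rows] @ w1 + b1, 0.0)        # ReLU is an assumption
            out[rows] += gate[rows, slot][:, None] * (h @ w2 + b2)
            counts[k] = rows.size
    return out, counts / counts.sum()


rng = np.random.default_rng(0)
d, hidden, K = 32, 32, 6
experts = [(rng.normal(0, 0.1, (3 * d, hidden)), np.zeros(hidden),
            rng.normal(0, 0.1, (hidden, d)), np.zeros(d)) for _ in range(K)]
x = rng.normal(size=(8, 3 * d))                           # 8 cells, [left, self, right]
out, util = top2_moe(x, rng.normal(size=(3 * d, K)), experts)
print(out.shape, util.round(2))
```

With this definition a perfectly balanced 6-expert mesh would show 1/6 ≈ 0.167 per expert, which is the neighbourhood of the 0.16–0.18 figures quoted above.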

---

## How well it adds (measured, held-out, never-trained pairs)

The test space is ~10¹⁴ operand pairs; random train/test overlap is negligible.

| Benchmark | Result |
|---|---|
| **Held-out 10k, ≤7-digit, single async order** | **99.97%** exact-match (mean over 8 random orders, **std 0.02%**) |
| **Held-out 10k, 9-vote async ensemble** | **100.000%** exact-match |
| Exact-match by operand length (1→7 digits) | 99.9% – 100% across the board |
| Adversarial maximal-carry ripples (22 hand-picked) | 21/22 (the one miss is an 8-digit input — out of range for an 8-cell strip) |
| Random spot-check, 300 sums, vote(9) | 300/300 |

**Robustness to update order** is the headline an async CA should own: across 8 totally different random firing orders, exact-match moves by only **±0.02%**. The computation does not depend on *when* each neuron fires.
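
The 9-vote row in the table can be read as a majority vote over independent async runs. The sketch below assumes the vote is taken over whole answers (how `solve.py --vote` actually aggregates is not described here), and `solve_once` and `flaky` are hypothetical stand-ins for one seeded run of the mesh.

```python
from collections import Counter
from typing import Callable


def vote(solve_once: Callable[[int, int, int], int], a: int, b: int,
         n_votes: int = 9) -> int:
    """Run solve_once under n_votes seeds and return the most common answer."""
    answers = [solve_once(a, b, seed) for seed in range(n_votes)]
    return Counter(answers).most_common(1)[0][0]


def flaky(a: int, b: int, seed: int) -> int:
    """Stand-in solver: off by one on every fourth seed, correct otherwise."""
    return a + b + (1 if seed % 4 == 0 else 0)


print(vote(flaky, 48591, 9732))   # 58323
```

Here three of the nine runs are wrong, yet the majority answer is still exact; an ensemble only helps when the errors of individual orders are uncorrelated enough for the correct answer to dominate.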

### Train short, run a little longer

The mesh is trained at 20 async steps but you can run it longer at inference — classic NCA "iterate toward a fixed point":

```
steps :  8     12     16     20     24     28
exact%: 84.7   98.7   99.9   99.95  99.97  99.94
```

24 steps is the sweet spot; by 28 steps accuracy has already slipped slightly (it's a learned attractor, not a perfect fixed point). The released engine defaults to 24.

> **Fully-synchronous (every cell fires every step) is *worse*, not better.** The model learned to rely on asynchrony, exactly the symmetry-breaking the asynchronous-NCA literature predicts.

---

## Watch the mesh think

`python solve.py 9999999 1 --show` runs the hardest case — a single `+1` that must ripple a carry through all 8 cells — and prints every step. `·` = a cell that didn't fire that step; the number on the right is the live readout.

```
  9999999 + 1   (mesh = 8 cells, 6 experts, top-2, async p=0.5, 24 steps)
  digit place (10^):   7  6  5  4  3  2  1  0
  ----------------------------------------------------
  step  0 digits: 1  1  1  1  9  9  9  1   |  fired(expert#): ·  ·  ·  ·  4  4  4  ·   = 11119991
  step  4 digits: 0  1  9  9  9  9  9  0   |  fired(expert#): 4  ·  ·  5  5  ·  ·  2   = 1999990
  step  8 digits: 1  9  9  9  9  0  0  0   |  fired(expert#): ·  ·  5  5  ·  4  ·  4   = 19999000
  step 12 digits: 0  9  0  0  0  0  0  0   |  fired(expert#): ·  ·  3  4  ·  ·  ·  4   = 9000000
  step 16 digits: 1  0  0  0  0  0  0  0   |  fired(expert#): 4  2  ·  ·  ·  ·  0  0   = 10000000
  step 23 digits: 1  0  0  0  0  0  0  0   |  fired(expert#): ·  2  ·  ·  ·  1  1  ·   = 10000000
  ----------------------------------------------------
  => 9999999 + 1 = 10000000   OK
```

You can see the carry climb from cell 0 to cell 7 and the readout lock onto `10000000` by step ~16, then hold steady — a stable attractor. Different experts (4, 5, 2, 3, 0, 1) fire at different cells: *some neurons fire*, which ones depends on the local situation.
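
The same ripple can be imitated with a hand-written rule in place of the learned one. The sketch below is an illustration only: the function `async_add` is hypothetical, each cell stores an explicit `(digit, carry)` pair rather than a 32-dimensional state, and the loop runs until a fixed point instead of a fixed 24 steps.

```python
import numpy as np


def async_add(a, b, cells=8, p=0.5, seed=0):
    """Add a and b on a strip whose cells fire in random subsets.

    Returns (sum, steps_taken). Assumes a + b < 10**cells.
    """
    rng = np.random.default_rng(seed)
    # Reversed-digit layout: index i holds the 10**i digit.
    da = [(a // 10**i) % 10 for i in range(cells)]
    db = [(b // 10**i) % 10 for i in range(cells)]
    digit = np.zeros(cells, dtype=int)    # each cell's current output digit
    carry = np.zeros(cells, dtype=int)    # each cell's current carry-out

    def rule(i):
        # Local perception: own two digits plus the lower neighbour's carry.
        carry_in = carry[i - 1] if i > 0 else 0
        total = da[i] + db[i] + carry_in
        return total % 10, total // 10

    steps = 0
    # Stop once every cell already agrees with its own rule (a fixed point).
    while any(rule(i) != (digit[i], carry[i]) for i in range(cells)):
        fire = rng.random(cells) < p      # random subset fires this step
        # Evaluate firing cells against the pre-step state, then commit.
        updates = {i: rule(i) for i in np.flatnonzero(fire)}
        for i, (d, c) in updates.items():
            digit[i], carry[i] = d, c
        steps += 1

    return sum(int(d) * 10**i for i, d in enumerate(digit)), steps


total, steps = async_add(9999999, 1, seed=0)
print(total)   # 10000000
```

Because the fixed point is unique (cell 0 depends only on its own digits, and each higher cell only on the one below), every firing order reaches the same answer; only the number of steps varies with the seed.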

---

## Run it

```bash
pip install numpy
python solve.py 1234567 + 7654321      # -> 8888888
python solve.py 9999999 1 --show       # watch the carry ripple, step by step
python solve.py --vote 9 48591 + 9732  # robust ensemble over 9 async orders
python solve.py                        # interactive
```

No tokenizer, no weights download step beyond this repo, no GPU.

## Reproduce / keep training it

The full pure-numpy pipeline is in `training/` — including the from-scratch reverse-mode autograd (`garyneuron.py`), the finite-difference gradient check (`test_grad.py`), the carry-heavy hard-case miner (`data.py`), and the benchmark harness.

```bash
cd training
python test_grad.py                                  # verify the autograd (analytic vs numeric)
SEC=40 MAXDIG=7 HARD=0.35 python train.py            # one 40-s training burst (resumes from ckpt)
python benchmark.py main                             # held-out + adversarial + by-length
python export_int8.py                                # re-quantize -> release
```
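
As an illustration of what an analytic-vs-numeric check involves, here is a small central-difference sketch on a toy function with a known gradient. It is not the contents of `test_grad.py`; the helper `numeric_grad`, the toy function, and the tolerance are assumptions chosen for the example.

```python
import numpy as np


def numeric_grad(f, x, eps=1e-5):
    """Central-difference estimate of df/dx for a scalar-valued f."""
    g = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        up = f(x)
        x[idx] = old - eps
        down = f(x)
        x[idx] = old                       # restore the entry
        g[idx] = (up - down) / (2 * eps)
    return g


# Toy function with a known gradient: f(W) = sum(tanh(W @ v)).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
v = rng.normal(size=3)


def f(W):
    return np.tanh(W @ v).sum()


analytic = np.outer(1.0 - np.tanh(W @ v) ** 2, v)    # d/dW of sum(tanh(W v))
rel_err = np.abs(analytic - numeric_grad(f, W)).max() / np.abs(analytic).max()
print(rel_err < 1e-6)   # True
```

A hand-written autograd would supply `analytic` from its backward pass; agreement with the finite-difference estimate to several digits is the usual sign that the backward rules are correct.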

Trained and served entirely in numpy. The autograd, the MoE, the async CA, the int8 packing — all of it, ~700 lines, no frameworks.
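
The exact int8 format used by `export_int8.py` is not described in this README. As a rough idea of what such packing can look like, the sketch below uses symmetric per-tensor quantization with one float scale per weight array; the function names and the scheme itself are assumptions.

```python
import numpy as np


def quantize_int8(w):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)


def dequantize_int8(q, scale):
    """Recover float32 weights from int8 codes and their scale."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=(96, 32)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize_int8(q, scale) - w).max()
print(q.dtype, q.nbytes, err <= 0.51 * scale)
```

Each weight costs one byte plus a shared scale per array, and the rounding error per weight is bounded by about half a quantization step.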

---

## What it can't do (yet)

- **8 cells = 8 output digits.** Sums ≥ 10⁸ don't fit; widen `S` and retrain.
- **The single hardest full-length ripple** is right at the edge of the 24-step dynamics; the 9-vote ensemble cleans it up, but a maximally adversarial carry chain longer than the strip will defeat a fixed step budget. (This is the known hard case for *any* fixed-iteration local model.)
- It adds. That's the whole job. Subtraction/multiplication are future meshes.

## Why this exists

To show that "intelligence" at tiny scale doesn't have to be one monolithic network. gary-neuron is **a hundred-odd neurons, firing out of sync, passing notes to their neighbors**, and the *collective* computes something exact. It's a toy — but it's a toy that makes the mesh-of-specialists idea concrete, measurable, and 34 KB.

## Citations

- N. Lee, K. Sreenivasan, J. D. Lee, K. Lee, D. Papailiopoulos. *Teaching Arithmetic to Small Transformers.* arXiv:2307.03381 (2023).
- A. Mordvintsev, E. Niklasson, et al. *Growing Neural Cellular Automata.* Distill (2020). · E. Niklasson et al. *Self-Organising Textures.* Distill (2021).
- E. Pajouheshgar et al. *Mesh Neural Cellular Automata.* arXiv:2311.02820, ACM TOG (2024).
- N. Shazeer et al. *Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.* (2017). · W. Fedus, B. Zoph, N. Shazeer. *Switch Transformers.* (2021).

*Built with numpy. That's it.*