| --- |
| license: mit |
| library_name: numpy |
| tags: |
| - neural-cellular-automata |
| - mixture-of-experts |
| - arithmetic |
| - tiny-models |
| - numpy |
| - from-scratch |
| datasets: [] |
| pipeline_tag: other |
| --- |
| |
| # gary-neuron 🧠➕ |
|
|
| **A mesh of ~100 neurons that fire asynchronously and, between them, do arithmetic.** |
|
|
| gary-neuron is an **asynchronous Neural Cellular Automaton** whose per-cell update rule is a **top-2 Mixture-of-Experts**. It is not a transformer. It adds integers the way a ripple-carry adder does, by letting a **carry ripple across a strip of cells**, and it does so in **26,448 parameters of pure numpy** (34 KB int8), with a hand-written autograd engine and **zero ML frameworks**. |
|
|
| Same numpy-only soul as [gary-4-petite](https://huggingface.co/gary23w/gary-4-petite). Different question. petite asked *"can a tiny model speak?"* gary-neuron asks *"what if the model isn't one network, but a mesh of tiny neurons firing out of sync — can that compute?"* |
|
|
| It can. **99.97% exact-match on held-out 7-digit addition; 100% with a 9-vote ensemble.** |
|
|
| --- |
|
|
| ## The three ideas |
|
|
| gary-neuron is the intersection of three research lines, each contributing one piece: |
|
|
| | Idea | What it gives | Source | |
| |---|---|---| |
| | **Neural Cellular Automaton** | A strip of identical cells with one **shared** local update rule. Cell *i* perceives only `[left, self, right]`. | Mordvintsev et al., *Growing NCA* (Distill 2020); *Self-Organising Textures* (Distill 2021) | |
| | **Asynchrony** | Each step, only a **random subset of cells fire**. Breaks grid symmetry, buys robustness, and — crucially — lets carries settle in any order. | *Mesh Neural Cellular Automata* (arXiv:2311.02820, ACM TOG 2024) | |
| | **Mixture-of-Experts rule** | The shared rule is a router + **K=6 experts, top-2 gating**, so each firing cell uses only *some* of its neurons. A load-balancing loss keeps every expert in use, so the experts can specialize instead of collapsing onto one. | Shazeer et al., *Sparsely-Gated MoE* (2017); Fedus et al., *Switch Transformers* (2021) | |
|
|
| And the task itself rides on a fourth: |
|
|
| > **Reversed-digit format.** The answer is emitted **least-significant digit first**: `12+34 → 64`, not `46`. This is the single change that flips tiny-model addition from "never quite right" to a sharp phase transition to ~100%, because the model predicts the least-significant digit first, in the same direction that carries flow. *(Lee et al., "Teaching Arithmetic to Small Transformers", arXiv:2307.03381.)* |
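To make the format concrete, here is a minimal plain-Python sketch. It is not code from this repo, and `add_reversed` is a hypothetical helper. It emits a sum least-significant digit first; each output digit depends only on the digits already consumed plus one running carry, which is why this order is easy to produce left to right.

```python
from itertools import zip_longest

def add_reversed(a: int, b: int) -> str:
    """Emit the digits of a + b least-significant first, in a single pass."""
    xs = str(a)[::-1]  # operand digits, least-significant first
    ys = str(b)[::-1]
    out, carry = [], 0
    for x, y in zip_longest(xs, ys, fillvalue="0"):
        carry, digit = divmod(int(x) + int(y) + carry, 10)
        out.append(str(digit))
    if carry:
        out.append("1")  # final carry becomes the most-significant digit
    return "".join(out)

print(add_reversed(12, 34))      # 64  (i.e. 46 read backwards)
print(add_reversed(9999999, 1))  # 00000001  (i.e. 10000000 read backwards)
```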
|
|
| The beautiful part: **addition-with-carry *is* a cellular automaton.** Cell *i* holds digit *i* of each operand; it needs its own two digits and the carry from cell *i−1*. Carry propagation is local message-passing. So the NCA substrate isn't a gimmick bolted onto arithmetic — it's the natural shape of the problem. |
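The same idea can be written as a hand-coded cellular automaton. The sketch below is an illustration only, not the learned rule: `async_ripple_add` and its parameters are made-up names, and the update is ordinary grade-school addition. Each cell reads its own two digits and the carry-out of cell *i−1*, a random subset of cells fires each step, and the strip settles on the correct sum whatever the firing order.

```python
import random

def async_ripple_add(a, b, cells=8, p=0.5, max_steps=500, seed=0):
    """Add a + b with a hand-written local rule fired on random subsets of cells.

    Returns (readout, steps_taken). Hypothetical helper, not the trained mesh.
    """
    rng = random.Random(seed)
    da = [(a // 10**i) % 10 for i in range(cells)]  # cell 0 = least-significant digit
    db = [(b // 10**i) % 10 for i in range(cells)]
    digit = [0] * cells  # each cell's current output digit
    carry = [0] * cells  # each cell's current carry-out

    def rule(i, carries):
        carry_in = carries[i - 1] if i > 0 else 0
        total = da[i] + db[i] + carry_in
        return total % 10, total // 10

    for step in range(1, max_steps + 1):
        snapshot = list(carry)  # firing cells all read the pre-step state
        for i in range(cells):
            if rng.random() < p:  # this cell fires on this step
                digit[i], carry[i] = rule(i, snapshot)
        # Fixed point: no cell would change if it fired now.
        if all(rule(i, carry) == (digit[i], carry[i]) for i in range(cells)):
            return sum(d * 10**i for i, d in enumerate(digit)), step
    raise RuntimeError("did not settle within max_steps")

for seed in range(3):
    print(async_ripple_add(9999999, 1, seed=seed))
```

Because a firing cell only sees the pre-step state, the carry moves at most one cell per step, so `9999999 + 1` needs at least 8 steps on an 8-cell strip in this toy version.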
|
|
| --- |
|
|
| ## Stats |
|
|
| | | | |
| |---|---| |
| | Parameters | **26,448** | |
| | Weights (int8) | **34 KB** | |
| | Full release (model + engine + trainer) | ~40 KB | |
| | Architecture | async 1-D NCA, 8 cells · state dim 32 · 6 experts (top-2) · 3d→32→d expert MLPs | |
| | Substrate | reversed-digit strip, carry ripples low→high | |
| | Training | pure-numpy, CPU only, ~9k steps in 35-s bursts, from-scratch autograd | |
| | Inference | numpy. that's it. no tokenizer, no torch. | |
| | Hardware | anything that runs python | |
|
|
| The 6 experts end up **evenly used** (utilization 0.16–0.18 each) — the mesh genuinely distributes work across specialists rather than collapsing to one. |
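As a rough sketch of what top-2 gating and a utilization figure can look like, assume utilization means each expert's share of routing slots (two slots per firing cell). The function names and the random router logits below are illustrative, not taken from this repo.

```python
import math
import random

def top2_gate(logits):
    """Pick the two largest router logits and softmax over just those two."""
    top = sorted(range(len(logits)), key=lambda k: logits[k], reverse=True)[:2]
    peak = logits[top[0]]  # subtract the max for numerical stability
    exps = [math.exp(logits[k] - peak) for k in top]
    total = sum(exps)
    return {k: e / total for k, e in zip(top, exps)}

def utilization(batch_of_logits, n_experts):
    """Share of routing slots (2 per firing cell) that went to each expert."""
    counts = [0] * n_experts
    for logits in batch_of_logits:
        for k in top2_gate(logits):
            counts[k] += 1
    slots = sum(counts)
    return [c / slots for c in counts]

rng = random.Random(0)
fake_logits = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(6000)]
print([round(u, 3) for u in utilization(fake_logits, 6)])
```

An even split over 6 experts is 1/6 ≈ 0.167 per expert, which is where the reported 0.16–0.18 range sits if utilization is measured this way.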
|
|
| --- |
|
|
| ## How well it adds (measured, held-out, never-trained pairs) |
|
|
| The test space is ~10¹⁴ operand pairs; random train/test overlap is negligible. |
|
|
| | Benchmark | Result | |
| |---|---| |
| | **Held-out 10k, ≤7-digit, single async order** | **99.97%** exact-match (mean over 8 random orders, **std 0.02%**) | |
| | **Held-out 10k, 9-vote async ensemble** | **100%** exact-match (10,000/10,000) | |
| | Exact-match by operand length (1→7 digits) | 99.9% – 100% across the board | |
| | Adversarial maximal-carry ripples (22 hand-picked) | 21/22 (the one miss is an 8-digit input, beyond the ≤7-digit operands the 8-cell strip is trained and evaluated on) | |
| | Random spot-check, 300 sums, vote(9) | 300/300 | |
|
|
| **Robustness to update order** is the headline an async CA should own: across 8 totally different random firing orders, exact-match moves by only **±0.02%**. The computation does not depend on *when* each neuron fires. |
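The 9-vote ensemble can be pictured as a plain majority vote over independent async runs. The sketch below assumes the vote is taken over whole answers (the README does not say whether it is per answer or per digit). `solve_with_votes` and the stand-in solver `flaky_add` are hypothetical names, not this repo's API.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def solve_with_votes(solve_once, a, b, k=9):
    """Run a stochastic solver k times with different seeds and vote."""
    return majority_vote([solve_once(a, b, seed) for seed in range(k)])

def flaky_add(a, b, seed):
    """Stand-in for the real mesh: wrong on every fourth seed, right otherwise."""
    return a + b + (10 if seed % 4 == 0 else 0)

print(solve_with_votes(flaky_add, 48591, 9732))  # 58323
```

With 9 votes, three bad runs (seeds 0, 4 and 8 here) are still outvoted by the six good ones.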
|
|
| ### Train short, run a little longer |
|
|
| The mesh is trained at 20 async steps but you can run it longer at inference — classic NCA "iterate toward a fixed point": |
|
|
| ``` |
| steps : 8 12 16 20 24 28 |
| exact%: 84.7 98.7 99.9 99.95 99.97 99.94 |
| ``` |
|
|
| 24 steps is the sweet spot; by 28 steps accuracy has already slipped to 99.94% (it's a learned attractor, not a perfect fixed point). The released engine defaults to 24. |
|
|
| > **Fully-synchronous updating (every cell fires every step) is *worse*, not better.** The model learned to rely on asynchrony, exactly the symmetry-breaking the asynchronous-NCA literature predicts. |
|
|
| --- |
|
|
| ## Watch the mesh think |
|
|
| `python solve.py 9999999 1 --show` runs the hardest case, a single `+1` that must ripple a carry through all 8 cells, and prints the mesh state step by step (the trace below shows steps 0, 4, 8, 12, 16 and 23). `·` = a cell that didn't fire that step; the number on the right is the live readout. |
|
|
| ``` |
| 9999999 + 1 (mesh = 8 cells, 6 experts, top-2, async p=0.5, 24 steps) |
| digit place (10^): 7 6 5 4 3 2 1 0 |
| ---------------------------------------------------- |
| step 0 digits: 1 1 1 1 9 9 9 1 | fired(expert#): · · · · 4 4 4 · = 11119991 |
| step 4 digits: 0 1 9 9 9 9 9 0 | fired(expert#): 4 · · 5 5 · · 2 = 1999990 |
| step 8 digits: 1 9 9 9 9 0 0 0 | fired(expert#): · · 5 5 · 4 · 4 = 19999000 |
| step 12 digits: 0 9 0 0 0 0 0 0 | fired(expert#): · · 3 4 · · · 4 = 9000000 |
| step 16 digits: 1 0 0 0 0 0 0 0 | fired(expert#): 4 2 · · · · 0 0 = 10000000 |
| step 23 digits: 1 0 0 0 0 0 0 0 | fired(expert#): · 2 · · · 1 1 · = 10000000 |
| ---------------------------------------------------- |
| => 9999999 + 1 = 10000000 OK |
| ``` |
|
|
| You can see the carry climb from cell 0 to cell 7 and the readout lock onto `10000000` by step ~16, then hold steady — a stable attractor. Different experts (4, 5, 2, 3, 0, 1) fire at different cells: *some neurons fire*, which ones depends on the local situation. |
|
|
| --- |
|
|
| ## Run it |
|
|
| ```bash |
| pip install numpy |
| python solve.py 1234567 + 7654321 # -> 8888888 |
| python solve.py 9999999 1 --show # watch the carry ripple, step by step |
| python solve.py --vote 9 48591 + 9732 # robust ensemble over 9 async orders |
| python solve.py # interactive |
| ``` |
|
|
| No tokenizer, no weights download step beyond this repo, no GPU. |
|
|
| ## Reproduce / keep training it |
|
|
| The full pure-numpy pipeline is in `training/` — including the from-scratch reverse-mode autograd (`garyneuron.py`), the finite-difference gradient check (`test_grad.py`), the carry-heavy hard-case miner (`data.py`), and the benchmark harness. |
|
|
| ```bash |
| cd training |
| python test_grad.py # verify the autograd (analytic vs numeric) |
| SEC=40 MAXDIG=7 HARD=0.35 python train.py # one 40-s training burst (resumes from ckpt) |
| python benchmark.py main # held-out + adversarial + by-length |
| python export_int8.py # re-quantize -> release |
| ``` |
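For readers unfamiliar with gradient checking, the idea behind a script like `test_grad.py` can be sketched in a few lines. This is a generic central-difference check on a toy loss, not the repo's actual test; `numeric_grad`, `max_rel_error` and the toy functions are illustrative names.

```python
import math

def numeric_grad(f, x, eps=1e-5):
    """Central-difference estimate of df/dx_i for each coordinate of x."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def max_rel_error(analytic, numeric):
    """Largest relative disagreement between two gradient vectors."""
    return max(abs(a - n) / max(1e-8, abs(a) + abs(n))
               for a, n in zip(analytic, numeric))

def loss(w):
    """Toy loss: sum of tanh(w_i)^2."""
    return sum(math.tanh(v) ** 2 for v in w)

def loss_grad(w):
    """Hand-derived gradient of the toy loss: 2 * tanh(w_i) * (1 - tanh(w_i)^2)."""
    return [2 * math.tanh(v) * (1 - math.tanh(v) ** 2) for v in w]

w = [0.3, -1.2, 0.7]
err = max_rel_error(loss_grad(w), numeric_grad(loss, w))
print(f"max relative error: {err:.2e}")
```

A correct analytic gradient agrees with the numeric one to many decimal places; a backward-pass bug shows up as a relative error orders of magnitude larger.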
|
|
| Trained and served entirely in numpy. The autograd, the MoE, the async CA, the int8 packing — all of it, ~700 lines, no frameworks. |
|
|
| --- |
|
|
| ## What it can't do (yet) |
|
|
| - **8 cells = 8 output digits.** Sums ≥ 10⁸ don't fit; widen `S` and retrain. |
| - **The single hardest full-length ripple** is right at the edge of the 24-step dynamics. The 9-vote ensemble cleans it up, but on a wider strip a longer adversarial carry chain would eventually outrun any fixed step budget. (This is the known hard case for *any* fixed-iteration local model.) |
| - It adds. That's the whole job. Subtraction/multiplication are future meshes. |
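A back-of-the-envelope model shows why a fixed step budget gets tight on long carry chains. It assumes an idealized rule in which the carry front advances exactly one cell each time the next cell fires, with each cell firing independently with probability `p` per step (the trace header above reports `p=0.5`). The trained mesh need not behave this way, and `p_settled` is a hypothetical helper.

```python
from math import comb

def p_settled(chain_len, steps, p=0.5):
    """P(the carry front crosses chain_len cells within `steps` async steps).

    The front advances on a given step with probability p, so we need at
    least chain_len successes in `steps` independent Bernoulli(p) trials.
    """
    return sum(comb(steps, k) * p**k * (1 - p) ** (steps - k)
               for k in range(chain_len, steps + 1))

for steps in (16, 20, 24, 28):
    print(steps, round(p_settled(8, steps), 3))
```

Under these assumptions a full 8-cell ripple settles within 24 steps about 97% of the time, which is one way to read "right at the edge" and to see why voting over several firing orders helps.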
|
|
| ## Why this exists |
|
|
| To show that "intelligence" at tiny scale doesn't have to be one monolithic network. gary-neuron is **a hundred-odd neurons, firing out of sync, passing notes to their neighbors**, and the *collective* computes something exact. It's a toy — but it's a toy that makes the mesh-of-specialists idea concrete, measurable, and 34 KB. |
|
|
| ## Citations |
|
|
| - N. Lee, K. Sreenivasan, J. D. Lee, K. Lee, D. Papailiopoulos. *Teaching Arithmetic to Small Transformers.* arXiv:2307.03381 (2023). |
| - A. Mordvintsev, E. Randazzo, E. Niklasson, M. Levin. *Growing Neural Cellular Automata.* Distill (2020). · E. Niklasson et al. *Self-Organising Textures.* Distill (2021). |
| - E. Pajouheshgar et al. *Mesh Neural Cellular Automata.* arXiv:2311.02820, ACM TOG (2024). |
| - N. Shazeer et al. *Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.* (2017). · W. Fedus, B. Zoph, N. Shazeer. *Switch Transformers.* (2021). |
|
|
| *Built with numpy. That's it.* |
|
|