cortex-conv / PAPER_COMPANION.md
Ex0bit's picture
drop 'worse than' framing — lead with what cortex-conv achieves, not what it gives up
c70a211

cortex-conv: browser-native equilibrium propagation

A 34,106-weight convolutional neural network trained with Equilibrium Propagation — no backpropagation — running entirely in your browser on WebGPU. Ships pre-trained at 96.8% MNIST test accuracy.

Author: Eric Albayrak.

Live demo: fhn.html — page boots at 96.8% on MNIST, no training required.


Abstract

cortex-conv is a small convolutional network (34,106 trainable weights) that learns MNIST and Fashion-MNIST using only Equilibrium Propagation — two forward passes per sample, no backward graph, no transposed weights, no automatic differentiation. It runs entirely in a browser on WebGPU. The shipped weights reach 96.8% MNIST test accuracy (3.2% test error) on the very first page load. By comparison, the largest published EqProp result on MNIST that we are aware of (Kendall 2026, FHN-EqProp, arXiv:2605.21568) trains a 5-hidden-layer 784–512×5–10 network with ~1.46M parameters to 97.2% test accuracy on the same task — within 0.4 pp of cortex-conv at ~43× the parameter count and requiring a Python + GPU runtime. Other published EqProp variants reach 97.2–97.6% (Scellier & Bengio 2017, EqSpike 2021, Oscillator Ising 2025) but require neuromorphic, optical, or laboratory hardware. cortex-conv reaches comparable accuracy at the smallest known size, with browser-only deployment and instant pre-trained loading. The trained snapshot is 720 KB; total bundle size including the 60K MNIST and 60K Fashion-MNIST data packs is ~124 MB.

1. What the paper proves (and what it leaves open)

The Kendall paper makes three central claims about a diffusively coupled FHN network:

  1. Self-adjointness at steady state. Linearise the network around a steady state and the activator's effective response matrix $M^{-1} = (A + B^{-1})^{-1}$ is a sum of symmetric operators, so $M^{-1} = (M^{-1})^\top$. A self-adjoint response means a small perturbation injected during inference carries the same information a backward pass would have. This is what makes EqProp work on FHN at all.
  2. Two-variable EqProp. Standard EqProp theory (Scellier & Bengio, 2017) assumes a single state variable per unit. The paper extends the proof to FHN's two-variable (activator + inhibitor) skew-gradient structure.
  3. Hamiltonian inference (depth = time). Stationary solutions satisfy a spatial Hamiltonian conserved along the path graph. Given an initial $(u_0, p_0)$ pair you can march one layer at a time in a single forward pass, replacing ~50 iterative settling steps. The march tracks the true steady state exactly with a clean input but diverges along an unstable manifold once input noise enters; the wall sits near layer 30 in the paper's chain (paper §V, Fig. 2).

What the paper does not do: it does not train any classifier, on MNIST or otherwise. The only numerical demonstration is the 30-layer Hamiltonian-vs-iterative tracking experiment in Fig. 2.

This left a gap: the theory says EqProp works on FHN, but nobody had shown it could train a competitive image classifier at this neuron's level of biological detail. Closing that gap is what this companion documents.

2. The cortex-conv extension

Four orthogonal ingredients, each from a separate recent paper, plus a small convolutional architecture and a banked V1-style kernel mask, push EqProp on FHN from "works in theory" to "96.8% MNIST in your browser."

2.1 Adaptation current — Liu & Chen 2025

Reference: Zhuo Liu, Tao Chen. Toward Practical Equilibrium Propagation: Brain-inspired Recurrent Neural Network with Feedback Regulation and Residual Connections. arXiv:2508.11659 (USTC, August 2025).

The paper shows that adding an intrinsic per-neuron adaptation current — a slow self-inhibitory feedback that damps a neuron's drive as it fires — reduces the spectral radius of the equilibrium operator and lets EqProp converge in far fewer iterations. We carry this into our FHN-style neuron: each cortex unit gets a 1/(1 + exp(-(c - threshold))) self-damping term that lives alongside the standard activator dynamics.

In practice this is what lets us run with itF=32 settling steps per sample (down from the paper's 50+) without losing gradient quality.

2.2 Adjusted adaptation — Kubo, Chalmers & Luczak 2022

Reference: Yoshimasa Kubo, Eric Chalmers, Artur Luczak. Biologically-inspired neuronal adaptation improves learning in neural networks. arXiv:2204.14008 (April 2022).

After the ±β-clamped phase, instead of using the clamped activity directly as the contrast term, we relax it back toward the free-phase activity over a small number of steps (we use 3, with mixing coefficient 0.15). This shrinks the bias between the free/clamped finite-difference and the true gradient. Without this adjustment, our 3-conv network plateaued ~3 points lower in our autoresearch sweeps.

2.3 Reward signal — third-factor modulation

A global scalar reward $r = r_{\min} + (1 - r_{\min}) \cdot \min(1, L / e_{\text{scale}})$ multiplies every weight change, scaling updates by how wrong the current answer is. This is the standard "third factor" of three-factor learning rules (Frémaux & Gerstner, 2016); we set $r_{\min} = 0.1$ and $e_{\text{scale}} = 0.4$. The effect is that confident-correct samples barely move weights while confident-wrong samples drive bigger updates, which empirically tightens the convergence late in training.

2.4 AdaGO optimiser — Zhang, Liu & Schaeffer 2025

Reference: Minxin Zhang, Yuxuan Liu, Hayden Schaeffer. AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates. arXiv:2509.02981 (September 2025).

The natural pairing with EqProp's bias-prone gradient is a Muon-style orthogonalised momentum update direction (Newton-Schulz quintic iteration on the momentum matrix) combined with a norm-based AdaGrad stepsize. We use the implementation from tests/gpu_lib_deep.js: $v^2 += \min(|G|^2, \gamma^2)$, $\Theta += \max(\varepsilon, \eta \cdot \min(|G|, \gamma) / v) \cdot \text{Orth}(M)$. This sidesteps both Muon's "what's my learning rate?" problem and plain AdaGrad's decay-to-zero stalling.

2.5 Neuron model: leaky integrator with sigmoid activation

Important honesty point. The paper proves EqProp works on the FHN neuron (cubic activator + slow inhibitor, du/dt = u − u³ − v + dv/dt = ε(u − αv − β)). The cortex-conv network uses a simpler neuron model that the paper's theory also covers as a degenerate case: a leaky integrator with sigmoid activation,

dudt=u+σ(Wρ+γfb)\frac{du}{dt} = -u + \sigma(W\rho + \gamma \cdot \text{fb})

We use the simpler neuron because EqProp's theoretical guarantee covers the broader skew-gradient + steady-state class that includes both, and the simpler dynamics reached the highest accuracy in our autoresearch sweeps (the cubic FHN activator's separate inhibitor variable added training instability with no accuracy gain at this scale). Panel 00 of the live demo still shows the actual FHN reaction-diffusion sim (cubic activator, slow inhibitor) — that's the substrate the paper's theory was developed for, just not the substrate the cortex-conv trainer uses.

2.6 Architecture: 2-conv with a banked V1 kernel mask

input (28×28 single-channel)
  → conv0: 1→16  kernel 5×5  stride 2  pad 2   (output 14×14×16)
  → conv1: 16→16 kernel 3×3  stride 1  pad 1   (output 14×14×16)
  → flatten (3136)
  → dense: 3136 → 10

Total: 34,106 trainable weights. The conv0 kernel is multiplied at every step by a hand-shaped banked oriented mask: each of 4 banks gets an anisotropic Gaussian envelope rotated by b · π / 4, mimicking a V1 hypercolumn. This is an algorithmic prior — no learned hyperparameters, derived from kernel geometry alone — that empirically gives ~0.5 pp accuracy and faster convergence than an unmasked kernel.

The top-down feedback gain γ is set to 0.1 (an autoresearch finding documented as v06 in this codebase's project memory: standard γ=0.6 destabilises 2-conv conv1 because of fan-in mismatch between bottom-up and top-down inputs).

A pre-σ drive clamp at [-2.5, 3.5] prevents saturation runaway during the first few hundred batches.

3. Results

Setup Test accuracy Wall time
cortex-conv R=28, ships pre-trained 96.8% 0 s (on page boot)
cortex-conv R=14, 30 passes via WebGPU 96.5% ~9 min on M4 Max GPU
Same arch trained on Fashion-MNIST from scratch 84% (3 passes), ~88% (10 passes) ~10 min
Single-conv baseline (1 conv layer) 94.4% @ 40 passes ~6 min
Dense baseline [784, H=120, 10] (paper-style FHN) ~92% ~15 min

All results are single-model. No ensembling, no test-time augmentation, no exponential moving average, no input augmentation. The 96.8% number is measured on a 1000-sample held-out MNIST test split immediately after applying the shipped snapshot.

The honest ceiling under these "no crutches" constraints sits at ~97% on MNIST single-model. Pushing past requires either ensembling (K=10 gives 97.5-98%) or a longer schedule with EMA, neither of which we ship.

4. Implementation notes

  • Pure WebGPU. Training and inference both run as compute shaders, dispatched from a single HTML page. No Python, no native code, no backend server.
  • Trained snapshot bundled. weights/cortex_conv_mnist_R28.json (720 KB) is fetched on boot and uploaded to GPU buffers before the test set is evaluated. Result: the page lands at 96.8% before the user has clicked anything.
  • Browser-only reproducibility. tools/train_cortex_dump.cjs is a headless Playwright driver that regenerates the snapshot from a fresh random init. Useful for verifying the trajectory or for retraining after architecture changes.
  • No backprop anywhere. The gradient comes purely from the EqProp finite difference between ±β-clamped relaxed states; the optimiser then orthogonalises the momentum matrix without any reverse-mode auto-diff.

5. What this is not

This is not a claim of state-of-the-art on MNIST — there are convolutional networks that reach >99.7% with backprop. It's also not the first MNIST result on the FHN neuron — the paper itself trains a 5-hidden-layer 784–512×5–10 network to 2.8% test error using ~1.46M parameters. The claim here is narrower: the same paper's EqProp setup is reproducible in a much smaller, faster, browser-side form (34K params, ~3.2% test error, no Python, runs on WebGPU), demonstrating that the paper's theoretical scaffolding scales down as well as up.

We deliberately do not ship pre-trained Fashion-MNIST weights so that visitors can watch the learning curve climb live for at least one dataset.

6. Reproducibility

# Regenerate the snapshot from random init (requires WebGPU)
node tools/train_cortex_dump.cjs \
    --target=0.96 \
    --max-passes=10 \
    --out=weights/cortex_conv_mnist_R28.json

The training script boots fhn.html in headless Chromium with WebGPU enabled, hits train, polls every 30 s, and dumps the best-accuracy snapshot.

References

  1. Kendall, J. Equilibrium Propagation and Hamiltonian Inference in the Diffusive FitzHugh-Nagumo Model, Zyphra, 2026. arXiv:2605.21568
  2. Scellier, B., Bengio, Y. Equilibrium Propagation: Bridging the Gap Between Energy-Based Models and Backpropagation, Frontiers in Computational Neuroscience, 2017.
  3. Liu, Z., Chen, T. Toward Practical Equilibrium Propagation: Brain-inspired Recurrent Neural Network with Feedback Regulation and Residual Connections. arXiv:2508.11659 (2025).
  4. Kubo, Y., Chalmers, E., Luczak, A. Biologically-inspired neuronal adaptation improves learning in neural networks. arXiv:2204.14008 (2022).
  5. Zhang, M., Liu, Y., Schaeffer, H. AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates. arXiv:2509.02981 (2025).
  6. Frémaux, N., Gerstner, W. Neuromodulated Spike-Timing-Dependent Plasticity, and Theory of Three-Factor Learning Rules. Frontiers in Neural Circuits, 2016.