---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- snn
- spiking-neural-network
- neuromorphic
- language-model
- from-scratch
- energy-efficient
- mixture-of-experts
- brain-inspired
---
# ⚡ Nord v4.2: Brain-Inspired Spiking Neural Network Language Model (140M)

*The first SNN language model with spike-driven MoE, zonal specialization, and a memory cortex, trained from scratch.*

## What's New in v4.2

Nord v4.2 is a complete architectural rebuild from v3. The key breakthrough: the model self-organizes into functionally distinct brain zones during training. Sensory zones learn low firing rates and executive zones learn high firing rates, with no explicit supervision.
| | v3 (previous) | v4.2 (current) |
|---|---|---|
| Parameters | 144M | 140M |
| Sparsity | 97% (but spikes broken at scale) | 91% (spikes working) |
| MoE | None | Spike-driven, 4 experts, top-2 |
| Memory | None | 128-neuron cortex, τ=0.99 |
| Zonal architecture | No | Yes (self-organizing) |
| Loss at 39K steps | ~4.9 | 4.3 |
| Training speed | Slower convergence | 35% faster to the same loss |
## Model Description

Nord v4.2 is a 140M-parameter Spiking Neural Network (SNN) for text generation. It uses biologically inspired Leaky Integrate-and-Fire (LIF) neurons with membrane potentials, firing thresholds, and binary spikes. Unlike transformers, where 100% of neurons activate for every token, Nord activates only 3-9%, with different brain-inspired zones specializing in different functions.

Trained entirely from scratch: no transformer teacher, no distillation, no ANN-to-SNN conversion.
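The LIF dynamics described above can be sketched in a few lines. This is an illustrative minimal neuron: the decay, threshold, and hard-reset scheme are generic textbook choices, not Nord's learned per-neuron parameters.

```python
import numpy as np

def lif_step(v, i_syn, decay=0.9, threshold=1.0):
    """One Leaky Integrate-and-Fire timestep (illustrative constants)."""
    v = decay * v + i_syn                          # leaky integration of input current
    spikes = (v >= threshold).astype(np.float32)   # binary spike emission
    v = v * (1.0 - spikes)                         # hard reset for neurons that fired
    return v, spikes

# Drive 4 neurons with a constant current for 3 timesteps
v = np.zeros(4, dtype=np.float32)
i_syn = np.array([0.5, 0.3, 0.05, 0.6], dtype=np.float32)
for _ in range(3):
    v, s = lif_step(v, i_syn)
```

Only neurons whose accumulated potential crosses threshold emit a 1; weakly driven neurons stay silent indefinitely, which is where the model's high sparsity comes from.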
## Key Features

| Feature | Details |
|---|---|
| Parameters | 139.9M |
| Architecture | Original brain-inspired zonal SNN |
| Zones | Sensory → Association (MoE) → Memory → Executive |
| MoE | 4 spike-driven experts, top-2 routing |
| Memory | 128 persistent neurons, gated temporal attention |
| Sparsity | 89-95% (dynamic, input-dependent) |
| Timesteps | 10 (8 fast + 2 slow) |
| Training method | Surrogate gradients + spike homeostasis |
| Training data | ~2.2M samples, general English corpus |
| Training cost | ~$15 USD |
| Online learning | STDP available during inference |
## Architecture

```
┌─────────────────────────────────────────────────┐
│  Temporal Spike Encoder                         │
│  Token → 8 fast + 2 slow timestep currents      │
├─────────────────────────────────────────────────┤
│  Sensory Zone (2 blocks)          rates: 8-10%  │
│  Standard FFN + LIF, feature extraction         │
├─────────────────────────────────────────────────┤
│  Association Zone (2 blocks)      rates: 10-14% │
│  Spike-Driven MoE (4 experts, top-2) + LIF      │
├─────────────────────────────────────────────────┤
│  Memory Cortex                    rates: 0.5-1% │
│  128 neurons, τ=0.99, gated temporal attn       │
├─────────────────────────────────────────────────┤
│  Executive Zone (2 blocks)        rates: 11-26% │
│  Standard FFN + LIF, decision & output          │
├─────────────────────────────────────────────────┤
│  Readout (EMA over membrane potential)          │
│  → LM Head → vocabulary logits                  │
└─────────────────────────────────────────────────┘
```
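The memory cortex's slow persistence (τ = 0.99) is, at its core, an exponential-moving-average state update. A minimal sketch, with the gating and attention readout omitted:

```python
import numpy as np

def memory_update(m, x, tau=0.99):
    # Persistent slow state: with tau = 0.99 the effective time constant
    # is ~1 / (1 - tau) = 100 steps, so the state outlives the 10 spike
    # timesteps of any single token.
    return tau * m + (1.0 - tau) * x

m = np.zeros(128)               # 128 memory neurons
for _ in range(100):            # feed a constant input for 100 steps
    m = memory_update(m, np.ones(128))
# m has climbed to roughly 1 - 0.99**100 ~ 0.63 of the input
```

In the full model this slow state is read out with gated multi-head temporal attention; the sketch only shows why writes decay gradually rather than being overwritten each step.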
### Key Components

- **Associative LIF Neurons**: learnable membrane time constants, voltage thresholds, and synaptic currents, with cascade amplification across 64 neural clusters
- **ATan Surrogate Gradient**: differentiable spike function for backpropagation
- **Spike-Driven MoE**: expert routing based on cluster spike-rate activity rather than a dense gating network
- **Memory Cortex**: persistent slow memory with a multi-head temporal attention readout
- **Adaptive Spike Regulator**: asymmetric homeostasis that penalizes too-low firing 3x more than too-high, with an anti-death floor at 1%
- **RoPE**: rotary position embeddings for sequence position encoding
- **Synaptic Resonance Attention**: temporal mixing over spike patterns (not naive flattening)
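Binary spikes have a zero-almost-everywhere derivative, so backpropagation needs a surrogate. A common ATan formulation from the SNN literature looks like the following; the width α here is illustrative, and Nord's exact choice isn't stated.

```python
import numpy as np

def spike_forward(v_minus_thresh):
    # Forward pass: hard Heaviside step -> binary spikes
    return (v_minus_thresh >= 0).astype(np.float32)

def atan_surrogate_grad(v_minus_thresh, alpha=2.0):
    # Backward pass: derivative of the smooth approximation
    # (1/pi) * arctan(pi/2 * alpha * x) + 1/2, used in place of the
    # Heaviside's true (useless) gradient
    x = np.asarray(v_minus_thresh, dtype=np.float64)
    return alpha / (2.0 * (1.0 + (np.pi / 2.0 * alpha * x) ** 2))
```

The surrogate gradient peaks at the threshold (value α/2 at x = 0) and falls off smoothly, so only neurons near their firing threshold receive meaningful learning signal.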
## Model Configuration

```yaml
d_model: 496
n_heads: 8
n_layers: 6          # 2 sensory + 2 association + 2 executive
d_ff: 1024
n_experts: 4
top_k_experts: 2
memory_size: 128
T_fast: 8
T_slow: 2
max_seq_len: 512
vocab_size: 128256
tokenizer: meta-llama/Llama-3.2-1B   # Llama-3.2 tokenizer
```
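With `n_experts: 4` and `top_k_experts: 2`, each token is routed to its two best experts. A minimal sketch of top-2 selection and weight renormalization; in the real model the gate scores come from learned cluster spike-rate activity, here they are just given numbers.

```python
import numpy as np

def route_top2(scores):
    """Select the 2 highest-scoring experts per token and softmax their
    scores so the two expert outputs can be mixed with weights summing to 1."""
    top2 = np.argsort(scores, axis=-1)[:, -2:][:, ::-1]   # expert ids, best first
    picked = np.take_along_axis(scores, top2, axis=-1)
    w = np.exp(picked) / np.exp(picked).sum(-1, keepdims=True)
    return top2, w

scores = np.array([[0.1, 0.9, 0.3, 0.5]])   # one token, 4 expert scores
ids, w = route_top2(scores)                 # experts 1 and 3 win
```

Only the selected experts run for that token, so compute scales with top-k rather than with the total expert count.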
## Emergent Zonal Specialization

The most significant finding: the model self-organizes into functionally distinct zones during standard training. No manual assignment, no hardcoded rates.

| Zone | Spike Rate | Biological Analog |
|---|---|---|
| Sensory | 8-10% | Primary sensory cortex |
| Association | 10-14% | Parietal/temporal cortex |
| Memory Cortex | 0.5-1% | Hippocampus (selective) |
| Executive [0] | 11-15% | Premotor cortex |
| Executive [1] | 22-26% | Prefrontal cortex |
This mirrors biological cortical organization where prefrontal cortex has higher baseline activity than sensory cortex.
## Training

- Dataset: ~2.2M text samples, general English corpus
- Hardware: NVIDIA A5000 24GB (rented on Vast.ai)
- Optimizer: AdamW (lr 3e-4 → 1e-5, cosine decay; weight_decay=0.01)
- Batch size: 2 × grad_accum=16 (effective 32)
- Sequence length: 512
### Loss Progression

| Step | Loss | Sparsity | LR | Event |
|---|---|---|---|---|
| 0 | 8.9 | 68% | warmup | Start |
| 1,500 | 6.2 | 69% | 3.0e-04 | Rapid descent |
| 10,000 | 4.95 | 99% | 3.0e-04 | v4.1 plateau, spikes dying |
| 14,000 | 7.6 → 5.2 | 75% | 3.0e-04 | v4.2 fixes, spike revival |
| 20,000 | 4.70 | 91% | 3.0e-04 | Surpassed v4.1 |
| 30,000 | 4.50 | 91% | 1.2e-04 | Cosine decay |
| 39,000 | 4.30 | 91% | 6.0e-05 | Current best |
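The spike revival at step 14,000 came from the adaptive spike regulator. A sketch of one plausible asymmetric homeostasis penalty; the target rate, the 3x low-side weight, and the floor weight here are illustrative assumptions, not the trained model's actual loss term.

```python
import numpy as np

def homeostasis_penalty(rate, target=0.09, floor=0.01, low_weight=3.0):
    """Asymmetric firing-rate penalty (illustrative sketch).
    Firing below target is penalized 3x harder than firing above it,
    with an extra 'anti-death' term once a zone drops below the 1% floor."""
    dev = rate - target
    penalty = np.where(dev < 0.0, low_weight * dev**2, dev**2)
    # strong additional push for zones whose neurons are nearly dead
    penalty = penalty + np.where(rate < floor, 10.0 * (floor - rate)**2, 0.0)
    return penalty
```

The asymmetry is the point: symmetric regularizers let spike rates collapse to zero (the v4.1 failure mode at 99% sparsity), while a harsher low-side penalty plus a floor keeps every zone alive.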
## Parameter Breakdown
| Component | Parameters |
|---|---|
| Sensory Zone | 4.0M (2 blocks) |
| Association Zone | 4.1M (2 blocks, MoE) |
| Memory Cortex | 0.2M |
| Executive Zone | 4.0M (2 blocks) |
| Encoder + Readout + LM Head | ~127.6M |
| Total | 139.9M |
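The ~127.6M line item is dominated by the large Llama-3.2 vocabulary. A quick sanity check, assuming an untied input embedding and output head (an assumption the numbers support, since a tied pair would be half this size):

```python
vocab_size, d_model = 128_256, 496

embed = vocab_size * d_model      # input token embedding table
lm_head = vocab_size * d_model    # output projection (assumed untied)
total = embed + lm_head           # ~127.2M, nearly all of the 127.6M item
```

In other words, the SNN zones themselves hold only ~12.3M parameters; most of the model's size is vocabulary plumbing inherited from the tokenizer choice.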
## Usage

```python
import torch
from nord_core_v4 import NordConfig, NordModel
from transformers import AutoTokenizer

# Load checkpoint and rebuild the model
ckpt = torch.load("nord_v4_latest.pt", map_location="cuda")
cfg = NordConfig(**ckpt["config"])
model = NordModel(cfg).cuda()

# Filter out persistent state buffers (their size varies with batch)
state = {k: v for k, v in ckpt["model_state_dict"].items()
         if "_v_mem_state" not in k and "_i_syn_state" not in k}
model.load_state_dict(state, strict=False)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
```
Or use the interactive chat:

```shell
python chat_v4.py
# Commands: /stats, /memory, /expert, /stdp on|off, /reset, /quit
```
## Generation Examples

**Step 3,600 (loss 5.5), no coherence:**

> "Queen was being too late. The lake is not to be found in a variety of birds and stynesan trees."

**Step 29,000 (loss 4.5), topic understanding, broken logic:**

> "The internet is equipped with computers that harness data from television and radio vehicles. Its central and large uses can help business use and share information on devices and systems."

**Step 39,000 (loss 4.3), thematic coherence, real entities:**

> "A cybersecurity campaign that uses a computer science machine learning robot to guide players, and has refined algorithms. The popular game research software made by OpenAI security researchers..."
## Spike Dynamics

| Context | Sparsity | Interpretation |
|---|---|---|
| Simple tokens | 95-96% | Confident: minimal firing |
| Complex tokens | 89-91% | More neurons recruited |
| Training average | 91% | Healthy spike activity |

Sparsity is dynamic and input-dependent: the model recruits more neurons for harder inputs, much like a biological brain.
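Sparsity in these tables is simply the fraction of neurons that stay silent, averaged over timesteps:

```python
import numpy as np

def sparsity(spikes):
    # spikes: (timesteps, neurons) binary array;
    # sparsity is the fraction of entries that are zero
    return 1.0 - float(np.mean(spikes))

s = np.zeros((10, 100))
s[:, :9] = 1.0          # 9% of neurons fire on every timestep
```

A 9% firing rate corresponds to the 91% training-average sparsity reported above; per-token sparsity moves as the firing rate does.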
## Comparison with Other SNN Language Models

| Model | Params | From Scratch? | MoE | Zonal | Sparsity |
|---|---|---|---|---|---|
| Nord v4.2 | 140M | Yes | Yes | Yes | 91% |
| Nord v3 | 144M | Yes | No | No | 97% |
| SpikeGPT | 216M | Yes | No | No | ~90% |
| SpikeLLM | 7-70B | No | No | No | varies |
| SpikeBERT | ~110M | No | No | No | varies |
## Version History

| Version | Key Change | Result |
|---|---|---|
| v3 | First SNN LLM | 97% sparsity, 51K Reddit views |
| v3.5 | Scale to 500M | Failed: sparsity stuck at 100% |
| v4.1 | MoE + Zonal + Memory | Fixed spikes, loss 4.95 |
| v4.2 | Adaptive regulator + Executive fix | Loss 4.3, stable 91% sparsity |
## Limitations
- Text quality not competitive with GPT-2 at same parameter count (loss 4.3 vs ~3.0)
- Coherence degrades after 2-3 sentences at 140M scale
- Multilingual leakage in long generations (dataset artifact)
- Scaling beyond 140M untested for v4.2
- No formal benchmark evaluation yet
- Hallucination present
## Scaling Hypothesis

If zonal specialization persists at scale, an 86B SNN could potentially:

- Match 86B transformer quality
- Run inference with the compute of a 3-4B dense model (96% sparsity)
- Deploy on neuromorphic hardware (Intel Loihi) with orders-of-magnitude energy savings

This is unproven. The roadmap: 140M → 500M → 1-2B, testing at each scale.
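The 3-4B figure follows directly from the sparsity arithmetic: at 96% sparsity only 4% of neurons are active for any given token.

```python
total_params = 86e9      # hypothetical 86B-parameter SNN
sparsity = 0.96          # assumed sparsity at that scale
active = total_params * (1.0 - sparsity)   # parameters doing work per token
# active = 3.44e9, i.e. roughly a 3-4B dense model's per-token compute
```

This assumes per-token compute scales with active neurons, which holds on event-driven neuromorphic hardware but not automatically on GPUs, where sparse activations need dedicated kernels to pay off.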
## Citation

```bibtex
@software{nord2026,
  title={Nord v4.2: Brain-Inspired Spiking Neural Network Language Model with Spike-Driven MoE and Zonal Specialization},
  author={Zemondsa},
  year={2026},
  url={https://github.com/zemondsa/nord-ai}
}
```
## About

Built solo by an 18-year-old Ukrainian student studying electronics in Norway. No PhD, no team, no funding, just a rented A5000 and curiosity.