---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- snn
- spiking-neural-network
- neuromorphic
- language-model
- from-scratch
- energy-efficient
- mixture-of-experts
- brain-inspired
---

# ⚡ Nord v4.2 — Brain-Inspired Spiking Neural Network Language Model (140M)

**The first SNN language model with spike-driven MoE, zonal specialization, and memory cortex — trained from scratch.**

## What's New in v4.2

Nord v4.2 is a complete architectural rebuild from v3. The key breakthrough: **the model self-organizes into functionally distinct brain zones during training** — sensory zones learn low firing rates, executive zones learn high firing rates, with no explicit supervision.

| | v3 (previous) | v4.2 (current) |
|---|---|---|
| **Parameters** | 144M | 140M |
| **Sparsity** | 97% (but spikes broken at scale) | 91% (spikes working) |
| **MoE** | None | Spike-driven, 4 experts top-2 |
| **Memory** | None | 128-neuron cortex, τ=0.99 |
| **Zonal architecture** | No | Yes (self-organizing) |
| **Loss at 39K steps** | ~4.9 | **4.3** |
| **Training speed** | Slower convergence | 35% faster to same loss |

## Model Description

Nord v4.2 is a 140M-parameter Spiking Neural Network (SNN) for text generation. It uses biologically inspired Leaky Integrate-and-Fire (LIF) neurons with membrane potentials, firing thresholds, and binary spikes. Unlike transformers, where 100% of neurons activate per token, Nord activates only **3-9%** — with different brain-inspired zones specializing in different functions.

Trained **entirely from scratch** — no transformer teacher, no distillation, no ANN-to-SNN conversion.
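The LIF dynamics described above can be sketched in a few lines: leak the membrane potential, integrate the input current, emit a binary spike at threshold, and reset. This is a minimal, illustrative NumPy sketch with hypothetical constants `tau` and `v_th` (Nord learns these per neuron), not the model's actual implementation:

```python
import numpy as np

def lif_step(v_mem, i_syn, tau=0.9, v_th=1.0):
    """One LIF timestep: leak, integrate input current, spike, reset."""
    v_mem = tau * v_mem + i_syn                   # leaky integration
    spikes = (v_mem >= v_th).astype(np.float32)   # binary spikes at threshold
    v_mem = v_mem * (1.0 - spikes)                # hard reset where a neuron fired
    return v_mem, spikes

# Drive a small population for a few timesteps and measure the firing rate.
rng = np.random.default_rng(0)
v = np.zeros(1000)
total_spikes, total_neurons = 0, 0
for _ in range(10):
    v, s = lif_step(v, rng.uniform(0.0, 0.5, size=1000))
    total_spikes += s.sum()
    total_neurons += s.size
print(f"firing rate: {total_spikes / total_neurons:.1%}")
```

Only the binary `spikes` tensor propagates forward, which is where the sparsity figures above come from: a neuron that stays below threshold contributes nothing to downstream compute.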
## Key Features

| Feature | Details |
|---------|---------|
| Parameters | 139.9M |
| Architecture | Original brain-inspired zonal SNN |
| Zones | Sensory → Association (MoE) → Memory → Executive |
| MoE | 4 spike-driven experts, top-2 routing |
| Memory | 128 persistent neurons, gated temporal attention |
| Sparsity | 89-95% (dynamic, input-dependent) |
| Timesteps | 10 (8 fast + 2 slow) |
| Training method | Surrogate gradients + spike homeostasis |
| Training data | ~2.2M samples, general English corpus |
| Training cost | ~$15 USD |
| Online learning | STDP available during inference |

## Architecture

```
┌───────────────────────────────────────────────┐
│          Temporal Spike Encoder               │
│  Token → 8 fast + 2 slow timestep currents    │
├───────────────────────────────────────────────┤
│  Sensory Zone (2 blocks)      rates: 8-10%    │
│  Standard FFN + LIF, feature extraction       │
├───────────────────────────────────────────────┤
│  Association Zone (2 blocks)  rates: 10-14%   │
│  Spike-Driven MoE (4 experts, top-2) + LIF    │
├───────────────────────────────────────────────┤
│  Memory Cortex                rates: 0.5-1%   │
│  128 neurons, τ=0.99, gated temporal attn     │
├───────────────────────────────────────────────┤
│  Executive Zone (2 blocks)    rates: 11-26%   │
│  Standard FFN + LIF, decision & output        │
├───────────────────────────────────────────────┤
│  Readout (EMA over membrane potential)        │
│  → LM Head → vocabulary logits                │
└───────────────────────────────────────────────┘
```

### Key Components

- **Associative LIF Neurons** — Learnable membrane time constants, voltage thresholds, and synaptic currents, with cascade amplification across 64 neural clusters
- **ATan Surrogate Gradient** — Differentiable spike function for backpropagation
- **Spike-Driven MoE** — Expert routing based on cluster spike-rate activity, not dense networks
- **Memory Cortex** — Persistent slow memory with multi-head temporal attention readout
- **Adaptive Spike Regulator** — Asymmetric homeostasis: penalizes too-low firing 3× more than too-high, with an anti-death floor at 1%
- **RoPE** — Rotary position embeddings for sequence position encoding
- **Synaptic Resonance Attention** — Temporal mixing over spike patterns (not naive flattening)

### Model Configuration

```python
d_model: 496
n_heads: 8
n_layers: 6   # 2 sensory + 2 association + 2 executive
d_ff: 1024
n_experts: 4
top_k_experts: 2
memory_size: 128
T_fast: 8
T_slow: 2
max_seq_len: 512
vocab_size: 128256
tokenizer: Llama-3.2 (meta-llama/Llama-3.2-1B)
```

## Emergent Zonal Specialization

The most significant finding: **the model self-organizes into functionally distinct zones** during standard training — no manual assignment, no hardcoded rates.

| Zone | Spike Rate | Biological Analog |
|------|-----------|-------------------|
| Sensory | 8-10% | Primary sensory cortex |
| Association | 10-14% | Parietal/temporal cortex |
| Memory Cortex | 0.5-1% | Hippocampus (selective) |
| Executive [0] | 11-15% | Premotor cortex |
| Executive [1] | 22-26% | Prefrontal cortex |

This mirrors biological cortical organization, where the prefrontal cortex has higher baseline activity than sensory cortex.
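The ATan surrogate gradient listed under Key Components works around the fact that the spike function is a step and has zero gradient almost everywhere: the forward pass emits a hard binary spike, while the backward pass substitutes the derivative of a scaled arctan. A minimal NumPy sketch, with an illustrative width parameter `alpha` that is not necessarily Nord's exact setting:

```python
import numpy as np

def spike_forward(v_mem, v_th=1.0):
    """Forward pass: non-differentiable Heaviside step -> binary spikes."""
    return (v_mem >= v_th).astype(np.float32)

def spike_backward(v_mem, v_th=1.0, alpha=2.0):
    """Backward pass: derivative of the smooth surrogate
    f(v) = (1/pi) * arctan(pi * alpha * (v - v_th) / 2) + 1/2,
    so f'(v) = (alpha / 2) / (1 + (pi * alpha * (v - v_th) / 2)**2).
    It peaks at the threshold and decays smoothly on both sides.
    """
    x = np.pi * alpha * (v_mem - v_th) / 2.0
    return (alpha / 2.0) / (1.0 + x * x)

v = np.array([0.0, 0.9, 1.0, 1.1, 2.0])
print(spike_forward(v))   # hard 0/1 spikes
print(spike_backward(v))  # smooth gradient, largest near the threshold
```

During training, gradients flow through `spike_backward` even for neurons that did not fire, which is what lets sub-threshold membrane potentials keep learning instead of dying.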
## Training

- **Dataset:** ~2.2M text samples, general English corpus
- **Hardware:** NVIDIA A5000 24GB (rented on Vast.ai)
- **Optimizer:** AdamW (lr=3e-4 → 1e-5 cosine decay, weight_decay=0.01)
- **Batch size:** 2 × grad_accum=16 (effective 32)
- **Sequence length:** 512

### Loss Progression

| Step | Loss | Sparsity | LR | Event |
|------|------|----------|-----|-------|
| 0 | 8.9 | 68% | warmup | Start |
| 1,500 | 6.2 | 69% | 3.0e-04 | Rapid descent |
| 10,000 | 4.95 | 99% | 3.0e-04 | v4.1 plateau, spikes dying |
| 14,000 | 7.6→5.2 | 75% | 3.0e-04 | v4.2 fixes, spike revival |
| 20,000 | 4.70 | 91% | 3.0e-04 | Surpassed v4.1 |
| 30,000 | 4.50 | 91% | 1.2e-04 | Cosine decay |
| 39,000 | 4.30 | 91% | 6.0e-05 | Current best |

### Parameter Breakdown

| Component | Parameters |
|-----------|-----------|
| Sensory Zone | 4.0M (2 blocks) |
| Association Zone | 4.1M (2 blocks, MoE) |
| Memory Cortex | 0.2M |
| Executive Zone | 4.0M (2 blocks) |
| Encoder + Readout + LM Head | ~127.6M |
| **Total** | **139.9M** |

## Usage

```python
import torch
from nord_core_v4 import NordConfig, NordModel
from transformers import AutoTokenizer

# Load the checkpoint and rebuild the model from its saved config
ckpt = torch.load("nord_v4_latest.pt", map_location="cuda")
cfg = NordConfig(**ckpt["config"])
model = NordModel(cfg).cuda()

# Filter out persistent state buffers (their size varies with batch size)
state = {
    k: v
    for k, v in ckpt["model_state_dict"].items()
    if "_v_mem_state" not in k and "_i_syn_state" not in k
}
model.load_state_dict(state, strict=False)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
```

Or use the interactive chat:

```bash
python chat_v4.py
# Commands: /stats, /memory, /expert, /stdp on|off, /reset, /quit
```

## Generation Examples

**Step 3,600 (loss 5.5)** — no coherence:

> "Queen was being too late. The lake is not to be found in a variety of birds and stynesan trees."
**Step 29,000 (loss 4.5)** — topic understanding, broken logic:

> "The internet is equipped with computers that harness data from television and radio vehicles. Its central and large uses can help business use and share information on devices and systems."

**Step 39,000 (loss 4.3)** — thematic coherence, real entities:

> "A cybersecurity campaign that uses a computer science machine learning robot to guide players, and has refined algorithms. The popular game research software made by OpenAI security researchers..."

## Spike Dynamics

| Context | Sparsity | Interpretation |
|---------|----------|----------------|
| Simple tokens | 95-96% | Confident — minimal firing |
| Complex tokens | 89-91% | More neurons recruited |
| Training average | 91% | Healthy spike activity |

Sparsity is **dynamic and input-dependent** — the model recruits more neurons for harder inputs, just like a biological brain.

## Comparison with Other SNN Language Models

| Model | Params | From Scratch? | MoE | Zonal | Sparsity |
|-------|--------|:---:|:---:|:---:|---|
| **Nord v4.2** | 140M | ✅ | ✅ | ✅ | 91% |
| Nord v3 | 144M | ✅ | ❌ | ❌ | 97% |
| SpikeGPT | 216M | ✅ | ❌ | ❌ | ~90% |
| SpikeLLM | 7-70B | ❌ | ❌ | ❌ | varies |
| SpikeBERT | ~110M | ❌ | ❌ | ❌ | varies |

## Version History

| Version | Key Change | Result |
|---------|-----------|--------|
| v3 | First SNN LLM | 97% sparsity, 51K Reddit views |
| v3.5 | Scale to 500M | Failed — sparsity stuck at 100% |
| v4.1 | MoE + Zonal + Memory | Fixed spikes, loss 4.95 |
| **v4.2** | **Adaptive regulator + Executive fix** | **Loss 4.3, stable 91% sparsity** |

## Limitations

- Text quality not competitive with GPT-2 at the same parameter count (loss 4.3 vs ~3.0)
- Coherence degrades after 2-3 sentences at 140M scale
- Multilingual leakage in long generations (dataset artifact)
- Scaling beyond 140M untested for v4.2
- No formal benchmark evaluation yet
- Hallucination present

## Scaling Hypothesis

If zonal specialization persists at scale, an 86B SNN could potentially:

- Match 86B transformer quality
- Run inference with the compute of a 3-4B dense model (96% sparsity)
- Deploy on neuromorphic hardware (Intel Loihi) with orders-of-magnitude energy savings

This is unproven. The roadmap: 140M → 500M → 1-2B, testing at each scale.

## Citation

```bibtex
@software{nord2026,
  title={Nord v4.2: Brain-Inspired Spiking Neural Network Language Model with Spike-Driven MoE and Zonal Specialization},
  author={Zemondsa},
  year={2026},
  url={https://github.com/zemondsa/nord-ai}
}
```

## About

Built solo by an 18-year-old Ukrainian student studying electronics in Norway. No PhD, no team, no funding — just a rented A5000 and curiosity.