---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- snn
- spiking-neural-network
- neuromorphic
- language-model
- from-scratch
- energy-efficient
- mixture-of-experts
- brain-inspired
---

# ⚡ Nord v4.2: Brain-Inspired Spiking Neural Network Language Model (140M)

**The first SNN language model with spike-driven MoE, zonal specialization, and a memory cortex, trained from scratch.**

## What's New in v4.2

Nord v4.2 is a complete architectural rebuild from v3. The key breakthrough: **the model self-organizes into functionally distinct brain zones during training**. Sensory zones learn low firing rates and executive zones learn high firing rates, with no explicit supervision.

| | v3 (previous) | v4.2 (current) |
|---|---|---|
| **Parameters** | 144M | 140M |
| **Sparsity** | 97% (but spikes broken at scale) | 91% (spikes working) |
| **MoE** | None | Spike-driven, 4 experts, top-2 |
| **Memory** | None | 128-neuron cortex, τ=0.99 |
| **Zonal architecture** | No | Yes (self-organizing) |
| **Loss at 39K steps** | ~4.9 | **4.3** |
| **Training speed** | Slower convergence | 35% faster to same loss |

## Model Description

Nord v4.2 is a 140M-parameter Spiking Neural Network (SNN) for text generation. It uses biologically inspired Leaky Integrate-and-Fire (LIF) neurons with membrane potentials, firing thresholds, and binary spikes. Unlike a dense transformer, where every neuron activates on every token, Nord activates only **3-9%** of its neurons, with different brain-inspired zones specializing in different functions.

It was trained **entirely from scratch**: no transformer teacher, no distillation, no ANN-to-SNN conversion.

## Key Features

| Feature | Details |
|---------|---------|
| Parameters | 139.9M |
| Architecture | Original brain-inspired zonal SNN |
| Zones | Sensory → Association (MoE) → Memory → Executive |
| MoE | 4 spike-driven experts, top-2 routing |
| Memory | 128 persistent neurons, gated temporal attention |
| Sparsity | 89-95% (dynamic, input-dependent) |
| Timesteps | 10 (8 fast + 2 slow) |
| Training method | Surrogate gradients + spike homeostasis |
| Training data | ~2.2M samples, general English corpus |
| Training cost | ~$15 USD |
| Online learning | STDP available during inference |

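To make the "spike-driven experts, top-2 routing" idea concrete, here is a minimal sketch of how routing by cluster activity can work. This is an illustration under assumptions, not Nord's actual code: the function name, the list-based data layout, and the choice of normalizing raw spike rates into mixing weights are all hypothetical.

```python
def spike_driven_top2_route(spike_rates, expert_outputs):
    """Illustrative top-2 routing driven by spike rates.

    spike_rates:    per-expert mean firing rate of the associated
                    neural cluster (hypothetical routing signal).
    expert_outputs: one output vector (list of floats) per expert.
    """
    # Indices of the two most active clusters.
    top2 = sorted(range(len(spike_rates)), key=lambda i: spike_rates[i])[-2:]
    # Normalize the two selected rates into mixing weights.
    total = sum(spike_rates[i] for i in top2)
    gates = [spike_rates[i] / total for i in top2]
    # Weighted mix of the selected experts only; the others stay idle.
    dim = len(expert_outputs[0])
    mixed = [sum(g * expert_outputs[e][j] for g, e in zip(gates, top2))
             for j in range(dim)]
    return mixed, top2, gates

rates = [0.10, 0.40, 0.05, 0.30]          # toy per-cluster firing rates
outs = [[1, 0], [0, 1], [2, 2], [1, 1]]   # toy per-expert outputs (d=2)
mixed, chosen, gates = spike_driven_top2_route(rates, outs)
# chosen -> experts 3 and 1, the two most active clusters
```

The point of routing on spike activity rather than a dense gating network is that the routing signal is already computed by the SNN itself.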
## Architecture

```
┌─────────────────────────────────────────────────┐
│ Temporal Spike Encoder                          │
│   Token → 8 fast + 2 slow timestep currents     │
├─────────────────────────────────────────────────┤
│ Sensory Zone (2 blocks)          rates: 8-10%   │
│   Standard FFN + LIF, feature extraction        │
├─────────────────────────────────────────────────┤
│ Association Zone (2 blocks)      rates: 10-14%  │
│   Spike-Driven MoE (4 experts, top-2) + LIF     │
├─────────────────────────────────────────────────┤
│ Memory Cortex                    rates: 0.5-1%  │
│   128 neurons, τ=0.99, gated temporal attn      │
├─────────────────────────────────────────────────┤
│ Executive Zone (2 blocks)        rates: 11-26%  │
│   Standard FFN + LIF, decision & output         │
├─────────────────────────────────────────────────┤
│ Readout (EMA over membrane potential)           │
│   → LM Head → vocabulary logits                 │
└─────────────────────────────────────────────────┘
```

### Key Components

- **Associative LIF Neurons**: learnable membrane time constants, voltage thresholds, synaptic currents, and cascade amplification across 64 neural clusters
- **ATan Surrogate Gradient**: differentiable spike function for backpropagation
- **Spike-Driven MoE**: expert routing based on cluster spike-rate activity, not a dense gating network
- **Memory Cortex**: persistent slow memory with multi-head temporal attention readout
- **Adaptive Spike Regulator**: asymmetric homeostasis that penalizes too-low firing 3x more than too-high, with an anti-death floor at 1%
- **RoPE**: rotary position embeddings for sequence position encoding
- **Synaptic Resonance Attention**: temporal mixing over spike patterns (not naive flattening)

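The two core mechanisms above, LIF dynamics and the ATan surrogate, can be sketched in a few lines. All constants here (leak factors, threshold, alpha, the constant input current) are illustrative defaults, not Nord's learned values, and the hard reset is one of several common reset conventions.

```python
import math

def atan_surrogate_grad(v, threshold=1.0, alpha=2.0):
    """ATan surrogate: smooth derivative used in place of the
    non-differentiable spike step function during backprop."""
    return alpha / (2 * (1 + (math.pi / 2 * alpha * (v - threshold)) ** 2))

def lif_step(v, i_syn, x, tau_mem=0.9, tau_syn=0.5, threshold=1.0):
    """One timestep of a Leaky Integrate-and-Fire neuron."""
    i_syn = tau_syn * i_syn + x      # synaptic current integrates the input
    v = tau_mem * v + i_syn          # membrane potential leaks and integrates
    spike = 1.0 if v >= threshold else 0.0
    v = v * (1.0 - spike)            # hard reset after a spike
    return v, i_syn, spike

# Drive one neuron with a constant current over 10 timesteps:
# the membrane charges, crosses threshold, resets, and repeats,
# producing a sparse binary spike train.
v = i_syn = 0.0
spikes = []
for _ in range(10):
    v, i_syn, s = lif_step(v, i_syn, 0.4)
    spikes.append(s)
```

During the backward pass, the gradient of the step function is replaced by `atan_surrogate_grad(v)`, which peaks at the threshold and decays away from it, so learning signal flows mainly through neurons near their firing point.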
### Model Configuration

```yaml
d_model: 496
n_heads: 8
n_layers: 6                           # 2 sensory + 2 association + 2 executive
d_ff: 1024
n_experts: 4
top_k_experts: 2
memory_size: 128
T_fast: 8
T_slow: 2
max_seq_len: 512
vocab_size: 128256
tokenizer: meta-llama/Llama-3.2-1B    # Llama-3.2 tokenizer
```

## Emergent Zonal Specialization

The most significant finding: **the model self-organizes into functionally distinct zones** during standard training. No manual assignment, no hardcoded rates.

```
Zone            Spike Rate   Biological Analog
─────────────────────────────────────────────────────
Sensory         8-10%        Primary sensory cortex
Association     10-14%       Parietal/temporal cortex
Memory Cortex   0.5-1%       Hippocampus (selective)
Executive [0]   11-15%       Premotor cortex
Executive [1]   22-26%       Prefrontal cortex
─────────────────────────────────────────────────────
```

This mirrors biological cortical organization, where prefrontal cortex has higher baseline activity than sensory cortex.

## Training

- **Dataset:** ~2.2M text samples, general English corpus
- **Hardware:** NVIDIA A5000 24GB (rented on Vast.ai)
- **Optimizer:** AdamW (lr 3e-4 → 1e-5 cosine decay, weight_decay 0.01)
- **Batch size:** 2 × grad_accum 16 (effective 32)
- **Sequence length:** 512

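The learning-rate schedule above (3e-4 decaying to 1e-5 on a cosine curve) can be sketched as follows. The warmup length and total step count are hypothetical fill-ins for illustration; the card does not state them.

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=1e-5, warmup=500):
    """Cosine decay from lr_max to lr_min, with linear warmup.
    warmup and total_steps are illustrative, not Nord's exact values."""
    if step < warmup:
        return lr_max * step / warmup          # linear ramp from 0
    progress = (step - warmup) / (total_steps - warmup)
    # Half-cosine from lr_max (progress=0) down to lr_min (progress=1).
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

With `total_steps=40_000`, this curve roughly matches the LR column in the loss table below (3.0e-04 early, ~1.2e-04 around step 30K, ~6e-05 near 39K).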
### Loss Progression

| Step | Loss | Sparsity | LR | Event |
|------|------|----------|-----|-------|
| 0 | 8.9 | 68% | warmup | Start |
| 1,500 | 6.2 | 69% | 3.0e-04 | Rapid descent |
| 10,000 | 4.95 | 99% | 3.0e-04 | v4.1 plateau, spikes dying |
| 14,000 | 7.6 → 5.2 | 75% | 3.0e-04 | v4.2 fixes, spike revival |
| 20,000 | 4.70 | 91% | 3.0e-04 | Surpassed v4.1 |
| 30,000 | 4.50 | 91% | 1.2e-04 | Cosine decay |
| 39,000 | 4.30 | 91% | 6.0e-05 | Current best |

### Parameter Breakdown

| Component | Parameters |
|-----------|-----------|
| Sensory Zone | 4.0M (2 blocks) |
| Association Zone | 4.1M (2 blocks, MoE) |
| Memory Cortex | 0.2M |
| Executive Zone | 4.0M (2 blocks) |
| Encoder + Readout + LM Head | ~127.6M |
| **Total** | **139.9M** |

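A quick sanity check on the dominant row: with `vocab_size` 128,256 and `d_model` 496, the token embedding table alone is ~63.6M parameters, so an embedding plus a separate LM head projection accounts for almost all of the ~127.6M "Encoder + Readout + LM Head" entry. The assumption that the embedding and LM head are untied is mine, inferred from the arithmetic, not stated on the card.

```python
vocab_size, d_model = 128_256, 496

embed = vocab_size * d_model    # input embedding table
lm_head = vocab_size * d_model  # output projection, assumed untied

print(embed / 1e6)              # ~63.6M parameters each
print((embed + lm_head) / 1e6)  # ~127.2M, close to the ~127.6M row
```

This also explains why only ~12M of the 140M parameters sit in the spiking zones themselves: at this scale the vocabulary projection dominates the budget.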
## Usage

```python
import torch
from nord_core_v4 import NordConfig, NordModel
from transformers import AutoTokenizer

# Load the checkpoint and rebuild the model
ckpt = torch.load("nord_v4_latest.pt", map_location="cuda")
cfg = NordConfig(**ckpt["config"])
model = NordModel(cfg).cuda()

# Filter persistent state buffers (their size varies with batch)
state = {k: v for k, v in ckpt["model_state_dict"].items()
         if "_v_mem_state" not in k and "_i_syn_state" not in k}
model.load_state_dict(state, strict=False)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
```

Or use the interactive chat:

```bash
python chat_v4.py
# Commands: /stats, /memory, /expert, /stdp on|off, /reset, /quit
```

## Generation Examples

**Step 3,600 (loss 5.5), no coherence:**
> "Queen was being too late. The lake is not to be found in a variety of birds and stynesan trees."

**Step 29,000 (loss 4.5), topic understanding but broken logic:**
> "The internet is equipped with computers that harness data from television and radio vehicles. Its central and large uses can help business use and share information on devices and systems."

**Step 39,000 (loss 4.3), thematic coherence with real entities:**
> "A cybersecurity campaign that uses a computer science machine learning robot to guide players, and has refined algorithms. The popular game research software made by OpenAI security researchers..."

## Spike Dynamics

| Context | Sparsity | Interpretation |
|---------|----------|----------------|
| Simple tokens | 95-96% | Confident, minimal firing |
| Complex tokens | 89-91% | More neurons recruited |
| Training average | 91% | Healthy spike activity |

Sparsity is **dynamic and input-dependent**: the model recruits more neurons for harder inputs, much like a biological brain.

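For clarity on how the sparsity numbers above are defined, here is the usual computation: the fraction of neuron-timestep slots that produced no spike. The function and data layout are illustrative, not Nord's instrumentation code.

```python
def sparsity(spike_trains):
    """Fraction of neuron-timestep slots with no spike.
    spike_trains: one list of binary spikes (0/1) per timestep."""
    total = sum(len(t) for t in spike_trains)
    fired = sum(sum(t) for t in spike_trains)
    return 1.0 - fired / total

# Toy example: 4 neurons over 5 timesteps, 2 spikes in 20 slots.
trains = [[0, 0, 0, 0],
          [0, 1, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 1],
          [0, 0, 0, 0]]
print(sparsity(trains))  # 0.9, i.e. 90% sparsity
```

On this definition, 91% sparsity means roughly 9% of neuron-timestep slots fire, which is where the card's "3-9% activation" figures come from.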
## Comparison with Other SNN Language Models

| Model | Params | From Scratch? | MoE | Zonal | Sparsity |
|-------|--------|:---:|:---:|:---:|---|
| **Nord v4.2** | 140M | ✅ | ✅ | ✅ | 91% |
| Nord v3 | 144M | ✅ | ❌ | ❌ | 97% |
| SpikeGPT | 216M | ✅ | ❌ | ❌ | ~90% |
| SpikeLLM | 7-70B | ❌ | ❌ | ❌ | varies |
| SpikeBERT | ~110M | ❌ | ❌ | ❌ | varies |

## Version History

| Version | Key Change | Result |
|---------|-----------|--------|
| v3 | First SNN LLM | 97% sparsity, 51K Reddit views |
| v3.5 | Scale to 500M | Failed: sparsity stuck at 100% |
| v4.1 | MoE + Zonal + Memory | Fixed spikes, loss 4.95 |
| **v4.2** | **Adaptive regulator + Executive fix** | **Loss 4.3, stable 91% sparsity** |

## Limitations

- Text quality is not competitive with GPT-2 at the same parameter count (loss 4.3 vs ~3.0)
- Coherence degrades after 2-3 sentences at 140M scale
- Multilingual leakage in long generations (a dataset artifact)
- Scaling beyond 140M is untested for v4.2
- No formal benchmark evaluation yet
- Hallucinations are present

## Scaling Hypothesis

If zonal specialization persists at scale, an 86B SNN could potentially:

- Match 86B transformer quality
- Run inference with roughly the compute of a 3-4B dense model (at 96% sparsity)
- Deploy on neuromorphic hardware (Intel Loihi) with orders-of-magnitude energy savings

This is unproven. The roadmap: 140M → 500M → 1-2B, testing at each scale.

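The back-of-envelope arithmetic behind the "3-4B dense model" figure, assuming inference compute scales linearly with the fraction of active parameters (an idealization that ignores routing, memory traffic, and overhead):

```python
total_params = 86e9
assumed_sparsity = 0.96                       # hypothesized at scale
active = total_params * (1 - assumed_sparsity)
print(active / 1e9)                           # ~3.4B active parameters
```

At 91% sparsity (today's measured value) the same arithmetic gives ~7.7B active, which is why the hypothesis depends on sparsity improving with scale.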
## Citation

```bibtex
@software{nord2026,
  title={Nord v4.2: Brain-Inspired Spiking Neural Network Language Model with Spike-Driven MoE and Zonal Specialization},
  author={Zemondsa},
  year={2026},
  url={https://github.com/zemondsa/nord-ai}
}
```

## About

Built solo by an 18-year-old Ukrainian student studying electronics in Norway. No PhD, no team, no funding: just a rented A5000 and curiosity.