---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- snn
- spiking-neural-network
- neuromorphic
- language-model
- from-scratch
- energy-efficient
- mixture-of-experts
- brain-inspired
---

# ⚡ Nord v4.2 — Brain-Inspired Spiking Neural Network Language Model (140M)

**The first SNN language model with spike-driven MoE, zonal specialization, and memory cortex — trained from scratch.**

## What's New in v4.2

Nord v4.2 is a complete architectural rebuild from v3. The key breakthrough: **the model self-organizes into functionally distinct brain zones during training** — sensory zones learn low firing rates, executive zones learn high firing rates, with no explicit supervision.

| | v3 (previous) | v4.2 (current) |
|---|---|---|
| **Parameters** | 144M | 140M |
| **Sparsity** | 97% (but spikes broken at scale) | 91% (spikes working) |
| **MoE** | None | Spike-driven, 4 experts top-2 |
| **Memory** | None | 128-neuron cortex, τ=0.99 |
| **Zonal architecture** | No | Yes (self-organizing) |
| **Loss at 39K steps** | ~4.9 | **4.3** |
| **Training speed** | Slower convergence | 35% faster to same loss |

## Model Description

Nord v4.2 is a 140M-parameter Spiking Neural Network (SNN) for text generation. It uses biologically inspired Leaky Integrate-and-Fire (LIF) neurons with membrane potentials, firing thresholds, and binary spikes. Unlike transformers, where 100% of neurons activate per token, Nord activates only **3-9%** — with different brain-inspired zones specializing in different functions.

Trained **entirely from scratch** — no transformer teacher, no distillation, no ANN-to-SNN conversion.
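The LIF dynamics described above can be sketched in a few lines: leak the membrane potential, integrate the input current, emit a binary spike at threshold, and reset. This is a minimal, illustrative NumPy sketch with hypothetical constants `tau` and `v_th` (Nord learns these per neuron), not the model's actual implementation:

```python
import numpy as np

def lif_step(v_mem, i_syn, tau=0.9, v_th=1.0):
    """One LIF timestep: leak, integrate input current, spike, reset."""
    v_mem = tau * v_mem + i_syn                   # leaky integration
    spikes = (v_mem >= v_th).astype(np.float32)   # binary spikes at threshold
    v_mem = v_mem * (1.0 - spikes)                # hard reset where a neuron fired
    return v_mem, spikes

# Drive a small population for a few timesteps and measure the firing rate.
rng = np.random.default_rng(0)
v = np.zeros(1000)
total_spikes, total_neurons = 0, 0
for _ in range(10):
    v, s = lif_step(v, rng.uniform(0.0, 0.5, size=1000))
    total_spikes += s.sum()
    total_neurons += s.size
print(f"firing rate: {total_spikes / total_neurons:.1%}")
```

Only the binary `spikes` tensor propagates forward, which is where the sparsity figures above come from: a neuron that stays below threshold contributes nothing to downstream compute.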
## Key Features

| Feature | Details |
|---------|---------|
| Parameters | 139.9M |
| Architecture | Original brain-inspired zonal SNN |
| Zones | Sensory → Association (MoE) → Memory → Executive |
| MoE | 4 spike-driven experts, top-2 routing |
| Memory | 128 persistent neurons, gated temporal attention |
| Sparsity | 89-95% (dynamic, input-dependent) |
| Timesteps | 10 (8 fast + 2 slow) |
| Training method | Surrogate gradients + spike homeostasis |
| Training data | ~2.2M samples, general English corpus |
| Training cost | ~$15 USD |
| Online learning | STDP available during inference |

## Architecture

```
┌───────────────────────────────────────────────┐
│          Temporal Spike Encoder               │
│  Token → 8 fast + 2 slow timestep currents    │
├───────────────────────────────────────────────┤
│  Sensory Zone (2 blocks)      rates: 8-10%    │
│  Standard FFN + LIF, feature extraction       │
├───────────────────────────────────────────────┤
│  Association Zone (2 blocks)  rates: 10-14%   │
│  Spike-Driven MoE (4 experts, top-2) + LIF    │
├───────────────────────────────────────────────┤
│  Memory Cortex                rates: 0.5-1%   │
│  128 neurons, τ=0.99, gated temporal attn     │
├───────────────────────────────────────────────┤
│  Executive Zone (2 blocks)    rates: 11-26%   │
│  Standard FFN + LIF, decision & output        │
├───────────────────────────────────────────────┤
│  Readout (EMA over membrane potential)        │
│  → LM Head → vocabulary logits                │
└───────────────────────────────────────────────┘
```

### Key Components

- **Associative LIF Neurons** — Learnable membrane time constants, voltage thresholds, and synaptic currents, with cascade amplification across 64 neural clusters
- **ATan Surrogate Gradient** — Differentiable spike function for backpropagation
- **Spike-Driven MoE** — Expert routing based on cluster spike-rate activity, not dense networks
- **Memory Cortex** — Persistent slow memory with multi-head temporal attention readout
- **Adaptive Spike Regulator** — Asymmetric homeostasis: penalizes too-low firing 3× more than too-high, with an anti-death floor at 1%
- **RoPE** — Rotary position embeddings for sequence position encoding
- **Synaptic Resonance Attention** — Temporal mixing over spike patterns (not naive flattening)

### Model Configuration

```python
d_model: 496
n_heads: 8
n_layers: 6   # 2 sensory + 2 association + 2 executive
d_ff: 1024
n_experts: 4
top_k_experts: 2
memory_size: 128
T_fast: 8
T_slow: 2
max_seq_len: 512
vocab_size: 128256
tokenizer: Llama-3.2 (meta-llama/Llama-3.2-1B)
```

## Emergent Zonal Specialization

The most significant finding: **the model self-organizes into functionally distinct zones** during standard training — no manual assignment, no hardcoded rates.

| Zone | Spike Rate | Biological Analog |
|------|-----------|-------------------|
| Sensory | 8-10% | Primary sensory cortex |
| Association | 10-14% | Parietal/temporal cortex |
| Memory Cortex | 0.5-1% | Hippocampus (selective) |
| Executive [0] | 11-15% | Premotor cortex |
| Executive [1] | 22-26% | Prefrontal cortex |

This mirrors biological cortical organization, where the prefrontal cortex has higher baseline activity than sensory cortex.
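The ATan surrogate gradient listed under Key Components works around the fact that the spike function is a step and has zero gradient almost everywhere: the forward pass emits a hard binary spike, while the backward pass substitutes the derivative of a scaled arctan. A minimal NumPy sketch, with an illustrative width parameter `alpha` that is not necessarily Nord's exact setting:

```python
import numpy as np

def spike_forward(v_mem, v_th=1.0):
    """Forward pass: non-differentiable Heaviside step -> binary spikes."""
    return (v_mem >= v_th).astype(np.float32)

def spike_backward(v_mem, v_th=1.0, alpha=2.0):
    """Backward pass: derivative of the smooth surrogate
    f(v) = (1/pi) * arctan(pi * alpha * (v - v_th) / 2) + 1/2,
    so f'(v) = (alpha / 2) / (1 + (pi * alpha * (v - v_th) / 2)**2).
    It peaks at the threshold and decays smoothly on both sides.
    """
    x = np.pi * alpha * (v_mem - v_th) / 2.0
    return (alpha / 2.0) / (1.0 + x * x)

v = np.array([0.0, 0.9, 1.0, 1.1, 2.0])
print(spike_forward(v))   # hard 0/1 spikes
print(spike_backward(v))  # smooth gradient, largest near the threshold
```

During training, gradients flow through `spike_backward` even for neurons that did not fire, which is what lets sub-threshold membrane potentials keep learning instead of dying.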
## Training

- **Dataset:** ~2.2M text samples, general English corpus
- **Hardware:** NVIDIA A5000 24GB (rented on Vast.ai)
- **Optimizer:** AdamW (lr=3e-4 → 1e-5 cosine decay, weight_decay=0.01)
- **Batch size:** 2 × grad_accum=16 (effective 32)
- **Sequence length:** 512

### Loss Progression

| Step | Loss | Sparsity | LR | Event |
|------|------|----------|-----|-------|
| 0 | 8.9 | 68% | warmup | Start |
| 1,500 | 6.2 | 69% | 3.0e-04 | Rapid descent |
| 10,000 | 4.95 | 99% | 3.0e-04 | v4.1 plateau, spikes dying |
| 14,000 | 7.6→5.2 | 75% | 3.0e-04 | v4.2 fixes, spike revival |
| 20,000 | 4.70 | 91% | 3.0e-04 | Surpassed v4.1 |
| 30,000 | 4.50 | 91% | 1.2e-04 | Cosine decay |
| 39,000 | 4.30 | 91% | 6.0e-05 | Current best |

### Parameter Breakdown

| Component | Parameters |
|-----------|-----------|
| Sensory Zone | 4.0M (2 blocks) |
| Association Zone | 4.1M (2 blocks, MoE) |
| Memory Cortex | 0.2M |
| Executive Zone | 4.0M (2 blocks) |
| Encoder + Readout + LM Head | ~127.6M |
| **Total** | **139.9M** |

## Usage

```python
import torch
from nord_core_v4 import NordConfig, NordModel
from transformers import AutoTokenizer

# Load the checkpoint and rebuild the model from its saved config
ckpt = torch.load("nord_v4_latest.pt", map_location="cuda")
cfg = NordConfig(**ckpt["config"])
model = NordModel(cfg).cuda()

# Filter out persistent state buffers (their size varies with batch size)
state = {
    k: v
    for k, v in ckpt["model_state_dict"].items()
    if "_v_mem_state" not in k and "_i_syn_state" not in k
}
model.load_state_dict(state, strict=False)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
```

Or use the interactive chat:

```bash
python chat_v4.py
# Commands: /stats, /memory, /expert, /stdp on|off, /reset, /quit
```

## Generation Examples

**Step 3,600 (loss 5.5)** — no coherence:

> "Queen was being too late. The lake is not to be found in a variety of birds and stynesan trees."
**Step 29,000 (loss 4.5)** — topic understanding, broken logic:

> "The internet is equipped with computers that harness data from television and radio vehicles. Its central and large uses can help business use and share information on devices and systems."

**Step 39,000 (loss 4.3)** — thematic coherence, real entities:

> "A cybersecurity campaign that uses a computer science machine learning robot to guide players, and has refined algorithms. The popular game research software made by OpenAI security researchers..."

## Spike Dynamics

| Context | Sparsity | Interpretation |
|---------|----------|----------------|
| Simple tokens | 95-96% | Confident — minimal firing |
| Complex tokens | 89-91% | More neurons recruited |
| Training average | 91% | Healthy spike activity |

Sparsity is **dynamic and input-dependent** — the model recruits more neurons for harder inputs, just like a biological brain.

## Comparison with Other SNN Language Models

| Model | Params | From Scratch? | MoE | Zonal | Sparsity |
|-------|--------|:---:|:---:|:---:|---|
| **Nord v4.2** | 140M | ✅ | ✅ | ✅ | 91% |
| Nord v3 | 144M | ✅ | ❌ | ❌ | 97% |
| SpikeGPT | 216M | ✅ | ❌ | ❌ | ~90% |
| SpikeLLM | 7-70B | ❌ | ❌ | ❌ | varies |
| SpikeBERT | ~110M | ❌ | ❌ | ❌ | varies |

## Version History

| Version | Key Change | Result |
|---------|-----------|--------|
| v3 | First SNN LLM | 97% sparsity, 51K Reddit views |
| v3.5 | Scale to 500M | Failed — sparsity stuck at 100% |
| v4.1 | MoE + Zonal + Memory | Fixed spikes, loss 4.95 |
| **v4.2** | **Adaptive regulator + Executive fix** | **Loss 4.3, stable 91% sparsity** |

## Limitations

- Text quality not competitive with GPT-2 at the same parameter count (loss 4.3 vs ~3.0)
- Coherence degrades after 2-3 sentences at 140M scale
- Multilingual leakage in long generations (dataset artifact)
- Scaling beyond 140M untested for v4.2
- No formal benchmark evaluation yet
- Hallucination present

## Scaling Hypothesis

If zonal specialization persists at scale, an 86B SNN could potentially:

- Match 86B transformer quality
- Run inference with the compute of a 3-4B dense model (96% sparsity)
- Deploy on neuromorphic hardware (Intel Loihi) with orders-of-magnitude energy savings

This is unproven. The roadmap: 140M → 500M → 1-2B, testing at each scale.

## Citation

```bibtex
@software{nord2026,
  title={Nord v4.2: Brain-Inspired Spiking Neural Network Language Model with Spike-Driven MoE and Zonal Specialization},
  author={Zemondsa},
  year={2026},
  url={https://github.com/zemondsa/nord-ai}
}
```

## About

Built solo by an 18-year-old Ukrainian student studying electronics in Norway. No PhD, no team, no funding — just a rented A5000 and curiosity.