---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- snn
- spiking-neural-network
- neuromorphic
- language-model
- from-scratch
- energy-efficient
- mixture-of-experts
- brain-inspired
---
# ⚡ Nord v4.2: Brain-Inspired Spiking Neural Network Language Model (140M)
**The first SNN language model with spike-driven MoE, zonal specialization, and a memory cortex – trained from scratch.**
## What's New in v4.2
Nord v4.2 is a complete architectural rebuild from v3. The key breakthrough: **the model self-organizes into functionally distinct brain zones during training** – sensory zones learn low firing rates, executive zones learn high firing rates, with no explicit supervision.
| | v3 (previous) | v4.2 (current) |
|---|---|---|
| **Parameters** | 144M | 140M |
| **Sparsity** | 97% (but spikes broken at scale) | 91% (spikes working) |
| **MoE** | None | Spike-driven, 4 experts, top-2 |
| **Memory** | None | 128-neuron cortex, τ=0.99 |
| **Zonal architecture** | No | Yes (self-organizing) |
| **Loss at 39K steps** | ~4.9 | **4.3** |
| **Training speed** | Slower convergence | 35% faster to same loss |
## Model Description
Nord v4.2 is a 140M-parameter Spiking Neural Network (SNN) for text generation. It uses biologically inspired Leaky Integrate-and-Fire (LIF) neurons with membrane potentials, firing thresholds, and binary spikes. Unlike transformers, where 100% of neurons activate per token, Nord activates only **3-9%**, with different brain-inspired zones specializing in different functions.
Trained **entirely from scratch**: no transformer teacher, no distillation, no ANN-to-SNN conversion.
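The LIF dynamics described above can be sketched in a few lines. This is a minimal, illustrative PyTorch version with a simple hard-reset neuron and made-up constants; Nord's actual neurons add learnable time constants and cluster-level amplification:

```python
import torch

def lif_step(v_mem, i_syn, x, tau=0.9, threshold=1.0):
    """One Leaky Integrate-and-Fire timestep (illustrative values, not Nord's exact code)."""
    i_syn = 0.5 * i_syn + x                # synaptic current integrates the input
    v_mem = tau * v_mem + i_syn            # leaky membrane potential integration
    spikes = (v_mem >= threshold).float()  # binary spike where the threshold is crossed
    v_mem = v_mem * (1.0 - spikes)         # hard reset for neurons that fired
    return v_mem, i_syn, spikes

v, i = torch.zeros(4), torch.zeros(4)
for _ in range(10):                        # 10 timesteps, as in Nord (8 fast + 2 slow)
    v, i, s = lif_step(v, i, torch.rand(4))
```

Neurons that never accumulate enough current stay silent, which is where the sparsity figures below come from.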
## Key Features
| Feature | Details |
|---------|---------|
| Parameters | 139.9M |
| Architecture | Original brain-inspired zonal SNN |
| Zones | Sensory → Association (MoE) → Memory → Executive |
| MoE | 4 spike-driven experts, top-2 routing |
| Memory | 128 persistent neurons, gated temporal attention |
| Sparsity | 89-95% (dynamic, input-dependent) |
| Timesteps | 10 (8 fast + 2 slow) |
| Training method | Surrogate gradients + spike homeostasis |
| Training data | ~2.2M samples, general English corpus |
| Training cost | ~$15 USD |
| Online learning | STDP available during inference |
## Architecture
```
┌──────────────────────────────────────────────┐
│ Temporal Spike Encoder                       │
│   Token → 8 fast + 2 slow timestep currents  │
├──────────────────────────────────────────────┤
│ Sensory Zone (2 blocks)         rates: 8-10% │
│   Standard FFN + LIF, feature extraction     │
├──────────────────────────────────────────────┤
│ Association Zone (2 blocks)    rates: 10-14% │
│   Spike-Driven MoE (4 experts, top-2) + LIF  │
├──────────────────────────────────────────────┤
│ Memory Cortex                  rates: 0.5-1% │
│   128 neurons, τ=0.99, gated temporal attn   │
├──────────────────────────────────────────────┤
│ Executive Zone (2 blocks)      rates: 11-26% │
│   Standard FFN + LIF, decision & output      │
├──────────────────────────────────────────────┤
│ Readout (EMA over membrane potential)        │
│   → LM Head → vocabulary logits              │
└──────────────────────────────────────────────┘
```
### Key Components
- **Associative LIF Neurons** – learnable membrane time constants, voltage thresholds, and synaptic currents, with cascade amplification across 64 neural clusters
- **ATan Surrogate Gradient** – differentiable spike function that makes backpropagation possible
- **Spike-Driven MoE** – expert routing based on cluster spike-rate activity rather than a dense gating network
- **Memory Cortex** – persistent slow memory with a multi-head temporal attention readout
- **Adaptive Spike Regulator** – asymmetric homeostasis: under-firing is penalized 3x more than over-firing, with an anti-death floor at 1%
- **RoPE** – rotary position embeddings for sequence position encoding
- **Synaptic Resonance Attention** – temporal mixing over spike patterns rather than naive flattening
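The ATan surrogate gradient is the standard trick that makes the non-differentiable spike trainable: a Heaviside step in the forward pass, a smooth arctan-based pseudo-derivative in the backward pass. A minimal sketch (the width parameter `alpha` here is an assumption, not Nord's value):

```python
import torch

class ATanSpike(torch.autograd.Function):
    """Heaviside spike in forward, ATan surrogate gradient in backward (sketch)."""

    @staticmethod
    def forward(ctx, v, alpha=2.0):
        ctx.save_for_backward(v)
        ctx.alpha = alpha
        return (v >= 0).float()          # binary spike at threshold crossing

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        a = ctx.alpha
        # derivative of (1/pi)*arctan(pi*a*v/2) + 1/2: a smooth stand-in for
        # the Dirac delta that the true Heaviside derivative would be
        surrogate = (a / 2) / (1 + (torch.pi / 2 * a * v) ** 2)
        return grad_out * surrogate, None

v = torch.randn(8, requires_grad=True)
spikes = ATanSpike.apply(v - 1.0)        # spike where membrane exceeds 1.0
spikes.sum().backward()                  # gradients flow via the surrogate
```

The forward output stays strictly binary; only the gradient is smoothed.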
### Model Configuration
```python
d_model = 496
n_heads = 8
n_layers = 6           # 2 sensory + 2 association + 2 executive
d_ff = 1024
n_experts = 4
top_k_experts = 2
memory_size = 128
T_fast, T_slow = 8, 2
max_seq_len = 512
vocab_size = 128256    # Llama-3.2 tokenizer (meta-llama/Llama-3.2-1B)
```
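With `n_experts = 4` and `top_k_experts = 2`, spike-driven routing can be illustrated as follows. This is a sketch under the assumption that gating weights come from per-expert spike-rate activity; Nord's exact gating mechanism is not published in this card:

```python
import torch

def spike_moe_route(spike_rates, expert_out, top_k=2):
    """Mix the outputs of the top-k most spike-active experts (illustrative).

    spike_rates: [batch, n_experts] mean spike activity per expert's clusters
    expert_out:  [batch, n_experts, d_model] outputs of all experts
    """
    weights, idx = spike_rates.topk(top_k, dim=-1)  # pick the 2 most active experts
    weights = torch.softmax(weights, dim=-1)        # normalize over the selected pair
    idx = idx.unsqueeze(-1).expand(-1, -1, expert_out.size(-1))
    chosen = expert_out.gather(1, idx)              # [batch, top_k, d_model]
    return (weights.unsqueeze(-1) * chosen).sum(dim=1)

rates = torch.tensor([[0.12, 0.03, 0.30, 0.08]])    # experts 2 and 0 most active
out = spike_moe_route(rates, torch.randn(1, 4, 496))
```

Because the gate reads spike rates instead of a dense projection, routing itself stays in the sparse, event-driven regime.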
## Emergent Zonal Specialization
The most significant finding: **the model self-organizes into functionally distinct zones** during standard training, with no manual zone assignment and no hardcoded firing rates.
```
Zone Spike Rate Biological Analog
─────────────────────────────────────────────────────
Sensory 8-10% Primary sensory cortex
Association 10-14% Parietal/temporal cortex
Memory Cortex 0.5-1% Hippocampus (selective)
Executive [0] 11-15% Premotor cortex
Executive [1] 22-26% Prefrontal cortex
─────────────────────────────────────────────────────
```
This mirrors biological cortical organization where prefrontal cortex has higher baseline activity than sensory cortex.
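The Adaptive Spike Regulator listed under Key Components keeps these rates from collapsing. A minimal sketch of an asymmetric homeostasis loss; the 3x under-firing weight and 1% floor come from this card, while the target rate, floor weight, and exact functional form are assumptions:

```python
import torch

def spike_reg_loss(rates, target=0.09, floor=0.01, low_weight=3.0):
    """Asymmetric homeostasis sketch: under-firing is weighted 3x more than
    over-firing, plus an extra penalty below the 1% anti-death floor."""
    dev = rates - target
    loss = torch.where(dev < 0, low_weight * dev ** 2, dev ** 2).mean()
    loss = loss + 10.0 * ((floor - rates).clamp(min=0) ** 2).mean()  # anti-death term
    return loss

rates = torch.tensor([0.005, 0.09, 0.22])  # per-zone mean firing rates
loss = spike_reg_loss(rates)
```

The asymmetry matters because dead neurons never recover their gradient signal, whereas over-active neurons can simply be pushed back down.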
## Training
- **Dataset:** ~2.2M text samples, general English corpus
- **Hardware:** NVIDIA A5000 24GB (rented on Vast.ai)
- **Optimizer:** AdamW (lr 3e-4 → 1e-5 with cosine decay, weight_decay=0.01)
- **Batch size:** 2 × grad_accum=16 (effective 32)
- **Sequence length:** 512
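The schedule and accumulation above can be sketched as follows. The lr endpoints, weight decay, and micro-batch size come from this card; `total_steps`, `warmup`, and the toy model are assumptions for illustration:

```python
import math
import torch

def cosine_lr(step, total_steps=40_000, warmup=1_000, lr_max=3e-4, lr_min=1e-5):
    """Linear warmup, then cosine decay from lr_max down to lr_min."""
    if step < warmup:
        return lr_max * step / warmup
    t = (step - warmup) / (total_steps - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

# Effective batch 32 = micro-batch 2 x grad_accum 16 (toy linear model)
model = torch.nn.Linear(8, 8)
opt = torch.optim.AdamW(model.parameters(), lr=cosine_lr(1), weight_decay=0.01)
for micro in range(16):
    loss = model(torch.randn(2, 8)).pow(2).mean() / 16  # scale loss for accumulation
    loss.backward()                                     # gradients accumulate in-place
opt.step()
opt.zero_grad()
```

Scaling each micro-loss by `1/grad_accum` keeps the accumulated gradient equal to that of one full batch of 32.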
### Loss Progression
| Step | Loss | Sparsity | LR | Event |
|------|------|----------|-----|-------|
| 0 | 8.9 | 68% | warmup | Start |
| 1,500 | 6.2 | 69% | 3.0e-04 | Rapid descent |
| 10,000 | 4.95 | 99% | 3.0e-04 | v4.1 plateau, spikes dying |
| 14,000 | 7.6 → 5.2 | 75% | 3.0e-04 | v4.2 fixes, spike revival |
| 20,000 | 4.70 | 91% | 3.0e-04 | Surpassed v4.1 |
| 30,000 | 4.50 | 91% | 1.2e-04 | Cosine decay |
| 39,000 | 4.30 | 91% | 6.0e-05 | Current best |
### Parameter Breakdown
| Component | Parameters |
|-----------|-----------|
| Sensory Zone | 4.0M (2 blocks) |
| Association Zone | 4.1M (2 blocks, MoE) |
| Memory Cortex | 0.2M |
| Executive Zone | 4.0M (2 blocks) |
| Encoder + Readout + LM Head | ~127.6M |
| **Total** | **139.9M** |
## Usage
```python
import torch
from nord_core_v4 import NordConfig, NordModel
from transformers import AutoTokenizer
# Load
ckpt = torch.load("nord_v4_latest.pt", map_location="cuda")
cfg = NordConfig(**ckpt["config"])
model = NordModel(cfg).cuda()
# Filter persistent state buffers (size varies with batch)
state = {k: v for k, v in ckpt["model_state_dict"].items()
if "_v_mem_state" not in k and "_i_syn_state" not in k}
model.load_state_dict(state, strict=False)
model.eval()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
```
Or use the interactive chat:
```bash
python chat_v4.py
# Commands: /stats, /memory, /expert, /stdp on|off, /reset, /quit
```
## Generation Examples
**Step 3,600 (loss 5.5)** – no coherence:
> "Queen was being too late. The lake is not to be found in a variety of birds and stynesan trees."
**Step 29,000 (loss 4.5)** – topic understanding, broken logic:
> "The internet is equipped with computers that harness data from television and radio vehicles. Its central and large uses can help business use and share information on devices and systems."
**Step 39,000 (loss 4.3)** – thematic coherence, real entities:
> "A cybersecurity campaign that uses a computer science machine learning robot to guide players, and has refined algorithms. The popular game research software made by OpenAI security researchers..."
## Spike Dynamics
| Context | Sparsity | Interpretation |
|---------|----------|----------------|
| Simple tokens | 95-96% | Confident – minimal firing |
| Complex tokens | 89-91% | More neurons recruited |
| Training average | 91% | Healthy spike activity |
Sparsity is **dynamic and input-dependent**: the model recruits more neurons for harder inputs, much like a biological brain.
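With binary spikes, sparsity is simply the fraction of silent neuron-timesteps. A minimal sketch using shapes roughly matching Nord's regime (10 timesteps, d_model=496; the Bernoulli input is synthetic):

```python
import torch

def sparsity(spikes):
    """Fraction of silent neuron-timesteps. spikes: binary tensor, e.g. [T, batch, d_model]."""
    return 1.0 - spikes.float().mean().item()

# ~9% Bernoulli firing over 10 timesteps and 496 neurons -> sparsity near 0.91
spikes = (torch.rand(10, 1, 496) < 0.09).float()
```

Computing this per zone (rather than globally) is what exposes the zonal rates reported above.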
## Comparison with Other SNN Language Models
| Model | Params | From Scratch? | MoE | Zonal | Sparsity |
|-------|--------|:---:|:---:|:---:|---|
| **Nord v4.2** | 140M | ✅ | ✅ | ✅ | 91% |
| Nord v3 | 144M | ✅ | ❌ | ❌ | 97% |
| SpikeGPT | 216M | ✅ | ❌ | ❌ | ~90% |
| SpikeLLM | 7-70B | ❌ | ❌ | ❌ | varies |
| SpikeBERT | ~110M | ❌ | ❌ | ❌ | varies |
## Version History
| Version | Key Change | Result |
|---------|-----------|--------|
| v3 | First SNN LLM | 97% sparsity, 51K Reddit views |
| v3.5 | Scale to 500M | Failed – sparsity stuck at 100% |
| v4.1 | MoE + Zonal + Memory | Fixed spikes, loss 4.95 |
| **v4.2** | **Adaptive regulator + Executive fix** | **Loss 4.3, stable 91% sparsity** |
## Limitations
- Text quality not competitive with GPT-2 at same parameter count (loss 4.3 vs ~3.0)
- Coherence degrades after 2-3 sentences at 140M scale
- Multilingual leakage in long generations (dataset artifact)
- Scaling beyond 140M untested for v4.2
- No formal benchmark evaluation yet
- Hallucination present
## Scaling Hypothesis
If zonal specialization persists at scale, an 86B SNN could potentially:
- Match 86B transformer quality
- Run inference with compute of a 3-4B dense model (96% sparsity)
- Deploy on neuromorphic hardware (Intel Loihi) with orders of magnitude energy savings
This is unproven. The roadmap: 140M → 500M → 1-2B, testing at each scale.
## Citation
```bibtex
@software{nord2026,
title={Nord v4.2: Brain-Inspired Spiking Neural Network Language Model with Spike-Driven MoE and Zonal Specialization},
author={Zemondsa},
year={2026},
url={https://github.com/zemondsa/nord-ai}
}
```
## About
Built solo by an 18-year-old Ukrainian student studying electronics in Norway. No PhD, no team, no funding – just a rented A5000 and curiosity.