---
language:
- en
pipeline_tag: text-generation
tags:
- snn
- spiking-neural-network
- neuromorphic
- language-model
- from-scratch
- energy-efficient
- mixture-of-experts
- brain-inspired
---

# ⚡ Nord v4.2 – Brain-Inspired Spiking Neural Network Language Model (140M)

**The first SNN language model with spike-driven MoE, zonal specialization, and memory cortex – trained from scratch.**

## What's New in v4.2

Nord v4.2 is a complete architectural rebuild from v3. The key breakthrough: **the model self-organizes into functionally distinct brain zones during training** – sensory zones learn low firing rates, executive zones learn high firing rates, with no explicit supervision.

| | v3 (previous) | v4.2 (current) |
|---|---|---|
| **Parameters** | 144M | 140M |
| **Sparsity** | 97% (but spikes broken at scale) | 91% (spikes working) |
| **MoE** | None | Spike-driven, 4 experts top-2 |
| **Memory** | None | 128-neuron cortex, τ=0.99 |
| **Zonal architecture** | No | Yes (self-organizing) |
| **Loss at 39K steps** | ~4.9 | **4.3** |
| **Training speed** | Slower convergence | 35% faster to same loss |

## Model Description

Nord v4.2 is a 140M-parameter Spiking Neural Network (SNN) for text generation. It uses biologically inspired Leaky Integrate-and-Fire neurons with membrane potentials, firing thresholds, and binary spikes. Unlike transformers, where 100% of neurons activate per token, Nord activates only **3-9%** – with different brain-inspired zones specializing in different functions.

Trained **entirely from scratch** – no transformer teacher, no distillation, no ANN-to-SNN conversion.
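
The LIF dynamics described above can be sketched in a few lines. This is a hypothetical minimal illustration, not the actual `NordModel` neuron (which adds learnable time constants, synaptic currents, and cluster-level cascades); the `beta` leak, the threshold, and the soft reset are illustrative choices.

```python
# Minimal leaky integrate-and-fire (LIF) sketch. Illustrative values only:
# `beta` (membrane leak) and `threshold` are not taken from the Nord source.
def lif_step(v, current, beta=0.9, threshold=1.0):
    """One timestep: integrate current, spike if threshold crossed, soft-reset."""
    v = beta * v + current              # leaky integration of input current
    spike = 1 if v >= threshold else 0  # binary spike
    if spike:
        v = v - threshold               # soft reset after firing
    return v, spike

# Drive one neuron with a constant current for 10 timesteps (Nord uses T=10).
v, spikes = 0.0, []
for _ in range(10):
    v, s = lif_step(v, current=0.3)
    spikes.append(s)
print(spikes)   # prints [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```

Most timesteps produce no spike – this per-timestep silence is exactly where the sparsity numbers above come from.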

## Key Features

| Feature | Details |
|---------|---------|
| Parameters | 139.9M |
| Architecture | Original brain-inspired zonal SNN |
| Zones | Sensory → Association (MoE) → Memory → Executive |
| MoE | 4 spike-driven experts, top-2 routing |
| Memory | 128 persistent neurons, gated temporal attention |
| Sparsity | 89-95% (dynamic, input-dependent) |
| Timesteps | 10 (8 fast + 2 slow) |
| Training method | Surrogate gradients + spike homeostasis |
| Training data | ~2.2M samples, general English corpus |
| Training cost | ~$15 USD |
| Online learning | STDP available during inference |

## Architecture

```
┌──────────────────────────────────────────────────┐
│ Temporal Spike Encoder                           │
│ Token → 8 fast + 2 slow timestep currents        │
├──────────────────────────────────────────────────┤
│ Sensory Zone (2 blocks)          rates: 8-10%    │
│ Standard FFN + LIF, feature extraction           │
├──────────────────────────────────────────────────┤
│ Association Zone (2 blocks)      rates: 10-14%   │
│ Spike-Driven MoE (4 experts, top-2) + LIF        │
├──────────────────────────────────────────────────┤
│ Memory Cortex                    rates: 0.5-1%   │
│ 128 neurons, τ=0.99, gated temporal attn         │
├──────────────────────────────────────────────────┤
│ Executive Zone (2 blocks)        rates: 11-26%   │
│ Standard FFN + LIF, decision & output            │
├──────────────────────────────────────────────────┤
│ Readout (EMA over membrane potential)            │
│ → LM Head → vocabulary logits                    │
└──────────────────────────────────────────────────┘
```

### Key Components

- **Associative LIF Neurons** – Learnable membrane time constants, voltage thresholds, synaptic currents, cascade amplification across 64 neural clusters
- **ATan Surrogate Gradient** – Differentiable spike function for backpropagation
- **Spike-Driven MoE** – Expert routing based on cluster spike-rate activity, not dense networks
- **Memory Cortex** – Persistent slow memory with multi-head temporal attention readout
- **Adaptive Spike Regulator** – Asymmetric homeostasis: penalizes too-low firing 3x more than too-high, anti-death floor at 1%
- **RoPE** – Rotary position embeddings for sequence position encoding
- **Synaptic Resonance Attention** – Temporal mixing over spike patterns (not naive flattening)
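
The ATan surrogate listed above swaps the Heaviside step's zero-almost-everywhere true derivative for a smooth stand-in during backprop. A minimal sketch of the standard ATan form; the `alpha` sharpness value here is an assumption, not a value from the Nord source:

```python
import math

# Forward pass: hard Heaviside step (non-differentiable at the threshold).
def spike_forward(v, threshold=1.0):
    return 1.0 if v >= threshold else 0.0

# Backward pass: ATan surrogate derivative used instead of the true gradient.
# `alpha` (sharpness) is an illustrative value, not from the Nord source.
def spike_backward(v, threshold=1.0, alpha=2.0):
    x = v - threshold
    return alpha / (2 * (1 + (math.pi / 2 * alpha * x) ** 2))

# The surrogate peaks at the threshold and decays smoothly on both sides,
# so gradients flow mainly through neurons near their firing point.
print(spike_backward(1.0))   # at threshold: alpha / 2
print(spike_backward(3.0))   # far above threshold: near zero
```

In a full framework this pair would live in a custom autograd function; only the backward rule differs from an ordinary step activation.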

### Model Configuration

```
d_model: 496
n_heads: 8
n_layers: 6 (2 sensory + 2 association + 2 executive)
d_ff: 1024
n_experts: 4
top_k_experts: 2
memory_size: 128
T_fast: 8, T_slow: 2
max_seq_len: 512
vocab_size: 128,256
tokenizer: Llama-3.2 (meta-llama/Llama-3.2-1B)
```
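
Given `n_experts: 4` and `top_k_experts: 2` above, spike-driven routing can be sketched as follows. This is a hypothetical illustration of routing on cluster spike rates – the cluster-to-expert mapping and the averaging scheme are assumptions, not the actual Nord implementation.

```python
# Hypothetical sketch: route tokens to the top-2 of 4 experts using cluster
# spike-rate activity instead of a dense learned gate. Names are illustrative.
def route_top2(cluster_spike_rates, n_experts=4):
    """Average 64 cluster rates into per-expert scores, pick the top-2."""
    per_expert = len(cluster_spike_rates) // n_experts
    scores = [
        sum(cluster_spike_rates[e * per_expert:(e + 1) * per_expert]) / per_expert
        for e in range(n_experts)
    ]
    top2 = sorted(range(n_experts), key=lambda e: scores[e], reverse=True)[:2]
    total = scores[top2[0]] + scores[top2[1]]
    weights = [scores[e] / total for e in top2]   # normalized mixing weights
    return top2, weights

# 64 clusters, mostly quiet; clusters 16-31 (mapped to expert 1) fire hardest.
rates = [0.02] * 16 + [0.30] * 16 + [0.10] * 16 + [0.05] * 16
experts, weights = route_top2(rates)
print(experts)   # [1, 2]
```

Because the router reads spike rates that are already computed, the gating itself adds almost no dense computation.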

## Emergent Zonal Specialization

The most significant finding: **the model self-organizes into functionally distinct zones** during standard training. No manual assignment, no hardcoded rates.

```
Zone            Spike Rate   Biological Analog
─────────────────────────────────────────────────────
Sensory         8-10%        Primary sensory cortex
Association     10-14%       Parietal/temporal cortex
Memory Cortex   0.5-1%       Hippocampus (selective)
Executive [0]   11-15%       Premotor cortex
Executive [1]   22-26%       Prefrontal cortex
─────────────────────────────────────────────────────
```

This mirrors biological cortical organization, where prefrontal cortex has higher baseline activity than sensory cortex.

## Training

- **Dataset:** ~2.2M text samples, general English corpus
- **Hardware:** NVIDIA A5000 24GB (rented on Vast.ai)
- **Optimizer:** AdamW (lr=3e-4 → 1e-5 cosine decay, weight_decay=0.01)
- **Batch size:** 2 × grad_accum=16 (effective 32)
- **Sequence length:** 512
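
The stated cosine schedule (3e-4 decaying to 1e-5) can be sketched as below; `total_steps` is a hypothetical horizon chosen for illustration, since the full training length beyond the 39K-step checkpoint is not stated.

```python
import math

# Cosine learning-rate decay from lr_max to lr_min, as stated in the
# optimizer settings. `total_steps` is a hypothetical value for illustration.
def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=1e-5):
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 50_000))       # lr_max at the start
print(cosine_lr(50_000, 50_000))  # lr_min at the end
```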

### Loss Progression

| Step | Loss | Sparsity | LR | Event |
|------|------|----------|-----|-------|
| 0 | 8.9 | 68% | warmup | Start |
| 1,500 | 6.2 | 69% | 3.0e-04 | Rapid descent |
| 10,000 | 4.95 | 99% | 3.0e-04 | v4.1 plateau, spikes dying |
| 14,000 | 7.6→5.2 | 75% | 3.0e-04 | v4.2 fixes, spike revival |
| 20,000 | 4.70 | 91% | 3.0e-04 | Surpassed v4.1 |
| 30,000 | 4.50 | 91% | 1.2e-04 | Cosine decay |
| 39,000 | 4.30 | 91% | 6.0e-05 | Current best |

### Parameter Breakdown

| Component | Parameters |
|-----------|-----------|
| Sensory Zone | 4.0M (2 blocks) |
| Association Zone | 4.1M (2 blocks, MoE) |
| Memory Cortex | 0.2M |
| Executive Zone | 4.0M (2 blocks) |
| Encoder + Readout + LM Head | ~127.6M |
| **Total** | **139.9M** |

## Usage

```python
import torch
from nord_core_v4 import NordConfig, NordModel
from transformers import AutoTokenizer

# Load the checkpoint and rebuild the model
ckpt = torch.load("nord_v4_latest.pt", map_location="cuda")
cfg = NordConfig(**ckpt["config"])
model = NordModel(cfg).cuda()

# Filter persistent state buffers (size varies with batch)
state = {k: v for k, v in ckpt["model_state_dict"].items()
         if "_v_mem_state" not in k and "_i_syn_state" not in k}
model.load_state_dict(state, strict=False)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
```

Or use the interactive chat:

```bash
python chat_v4.py
# Commands: /stats, /memory, /expert, /stdp on|off, /reset, /quit
```

## Generation Examples

**Step 3,600 (loss 5.5)** – no coherence:
> "Queen was being too late. The lake is not to be found in a variety of birds and stynesan trees."

**Step 29,000 (loss 4.5)** – topic understanding, broken logic:
> "The internet is equipped with computers that harness data from television and radio vehicles. Its central and large uses can help business use and share information on devices and systems."

**Step 39,000 (loss 4.3)** – thematic coherence, real entities:
> "A cybersecurity campaign that uses a computer science machine learning robot to guide players, and has refined algorithms. The popular game research software made by OpenAI security researchers..."

## Spike Dynamics

| Context | Sparsity | Interpretation |
|---------|----------|----------------|
| Simple tokens | 95-96% | Confident → minimal firing |
| Complex tokens | 89-91% | More neurons recruited |
| Training average | 91% | Healthy spike activity |

Sparsity is **dynamic and input-dependent** – the model recruits more neurons for harder inputs, just like a biological brain.
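
The sparsity figures in the table are simply the fraction of neurons that stay silent, averaged over timesteps. A toy sketch with made-up spike trains:

```python
# Sparsity = fraction of (neuron, timestep) slots with no spike.
# The spike trains below are made up for illustration.
def sparsity(spike_trains):
    """spike_trains: list of per-timestep binary spike lists."""
    total = sum(len(t) for t in spike_trains)
    fired = sum(sum(t) for t in spike_trains)
    return 1.0 - fired / total

easy = [[0, 0, 0, 1], [0, 0, 0, 0]]   # confident token: 1 of 8 slots fired
hard = [[1, 0, 1, 1], [0, 1, 0, 1]]   # harder token: 5 of 8 slots fired
print(sparsity(easy), sparsity(hard))  # 0.875 0.375
```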

## Comparison with Other SNN Language Models

| Model | Params | From Scratch? | MoE | Zonal | Sparsity |
|-------|--------|:---:|:---:|:---:|---|
| **Nord v4.2** | 140M | ✅ | ✅ | ✅ | 91% |
| Nord v3 | 144M | ✅ | ❌ | ❌ | 97% |
| SpikeGPT | 216M | ✅ | ❌ | ❌ | ~90% |
| SpikeLLM | 7-70B | ❌ | ❌ | ❌ | varies |
| SpikeBERT | ~110M | ❌ | ❌ | ❌ | varies |

## Version History

| Version | Key Change | Result |
|---------|-----------|--------|
| v3 | First SNN LLM | 97% sparsity, 51K Reddit views |
| v3.5 | Scale to 500M | Failed – sparsity stuck at 100% |
| v4.1 | MoE + Zonal + Memory | Fixed spikes, loss 4.95 |
| **v4.2** | **Adaptive regulator + Executive fix** | **Loss 4.3, stable 91% sparsity** |

## Limitations

- Text quality not competitive with GPT-2 at the same parameter count (loss 4.3 vs ~3.0)
- Coherence degrades after 2-3 sentences at 140M scale
- Multilingual leakage in long generations (dataset artifact)
- Scaling beyond 140M untested for v4.2
- No formal benchmark evaluation yet
- Hallucination present

## Scaling Hypothesis

If zonal specialization persists at scale, an 86B SNN could potentially:
- Match 86B transformer quality
- Run inference with the compute of a 3-4B dense model (96% sparsity)
- Deploy on neuromorphic hardware (Intel Loihi) with orders-of-magnitude energy savings

This is unproven. The roadmap: 140M → 500M → 1-2B, testing at each scale.
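
The 3-4B figure is back-of-envelope arithmetic, treating active compute as directly proportional to the fraction of neurons that fire:

```python
# Back-of-envelope: at 96% sparsity, only 4% of an 86B model fires per token.
# This equates "active fraction" with "dense-equivalent compute", which is
# the hypothesis being stated, not an established result.
total_params = 86e9
sparsity = 0.96
active = total_params * (1 - sparsity)
print(f"~{active / 1e9:.2f}B active")   # ~3.44B, i.e. in the stated 3-4B range
```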

## Citation

```bibtex
@software{nord2026,
  title={Nord v4.2: Brain-Inspired Spiking Neural Network Language Model with Spike-Driven MoE and Zonal Specialization},
  author={Zemondsa},
  year={2026},
  url={https://github.com/zemondsa/nord-ai}
}
```

## About

Built solo by an 18-year-old Ukrainian student studying electronics in Norway. No PhD, no team, no funding – just a rented A5000 and curiosity.