zerdovzad committed on
Commit 68c9b40 · verified · 1 Parent(s): c1e0fc8

Update README.md

Files changed (1):
  1. README.md +171 -66

README.md CHANGED
@@ -5,85 +5,169 @@ language:
  pipeline_tag: text-generation
  tags:
  - snn
- ---
- language:
- - en
- tags:
  - spiking-neural-network
- - SNN
  - neuromorphic
  - language-model
  - from-scratch
  - energy-efficient
  ---

- # ⚡ Nord — Spiking Neural Network Language Model (144M)

- **The first pure SNN language model with a fully original architecture, trained from scratch.**

  ## Model Description

- Nord is a 144M-parameter Spiking Neural Network (SNN) for text generation. It uses biologically-inspired neurons with membrane potentials, firing thresholds, and binary spikes. Unlike other SNN language models, Nord was trained **entirely from scratch** — no transformer teacher, no distillation, no ANN-to-SNN conversion.

  ## Key Features

  | Feature | Details |
  |---------|---------|
- | Parameters | 144.3M |
- | Architecture | Original (not RWKV, not Transformer) |
- | Training method | From scratch with surrogate gradients |
- | Training data | FineWeb-Edu |
- | Sparsity (training) | 97% |
- | Sparsity (inference) | 97-99.8% |
- | Online learning | STDP active during inference |
- | Mobile deployment | Android via Termux |
- | Training cost | ~$10 USD |

  ## Architecture

- Nord combines five mechanisms from different subfields:

- - **LeakyClamp** — Prevents gradient death in deep SNN layers
- - **Multi-Scale Temporal Encoding** — T_fast=8 + T_slow=2 timesteps
- - **Associative Cascade** — Chain reactions keep sparse networks alive
- - **Temporal Co-firing Resonance** — Feature binding without attention
- - **Reward-Modulated STDP** — Aligns Hebbian learning with backprop

  ### Model Configuration

- ```
- d_model: 512
- n_layers: 6
  n_heads: 8
  d_ff: 1024
- T_fast: 8
- T_slow: 2
  max_seq_len: 512
  vocab_size: 128,256
  tokenizer: Llama-3.2 (meta-llama/Llama-3.2-1B)
  ```

  ## Training

- - **Dataset:** FineWeb-Edu (~950M tokens, 10GB subset)
- - **Hardware:** RTX A5000 24GB (rented on Vast.ai at $0.177/hr)
- - **Optimizer:** AdamW (lr=6e-4, weight_decay=0.05)
- - **Batch size:** 48 (4 × grad_accum=12)
  - **Sequence length:** 512
- - **Loss progression:** 6.28 (1k) → 5.30 (5k) → 5.00 (10k) → 4.59 (34k)

  ## Usage

  ```python
  import torch
- from nord_core import NordConfig, NordModel
  from transformers import AutoTokenizer

  # Load
- ckpt = torch.load("nord_latest.pt", map_location="cuda")
  cfg = NordConfig(**ckpt["config"])
  model = NordModel(cfg).cuda()
- model.load_state_dict(ckpt["model_state_dict"])
  model.eval()

  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
@@ -92,58 +176,79 @@ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

  Or use the interactive chat:

  ```bash
- python chat.py
  ```

- ## Generation Examples (Step 34k, loss 4.59)
-
- **Prompt:** "If you don't write properly I will delete your file"

- **Nord:** "(I'm really careful) that a project is pretty good for the following reasons... The file will then turn this data into your code. It's not an algorithm to say that all the file is a project called 'read'. In order to get a lot of information on you are using this program, and we'll have no more knowledge about them!"

- **Prompt:** "bro my laptop have problems and off wifi how on wifi"

- **Nord:** "The project has been carried out by researchers at a site in 2013, including three main projects that work with wireless phones... using Wi-Fi sensors to monitor their devices such as wireless LANs... network that would have been created using more than 2% of 4 million people to use Bluetooth technology."

- ## Spike Statistics

  | Context | Sparsity | Interpretation |
  |---------|----------|----------------|
- | Familiar topic | 99.8% | Confident — minimal neural activity |
- | Training | 97% | Active learning — neurons spiking |
- | Out-of-distribution | 77% | Uncertain — massive activation |

- Sparsity functions as a **built-in uncertainty detector** — no separate calibration needed.

  ## Limitations

- - Repetition remains an issue (mitigated with repetition penalty in decoding)
- - Not competitive with GPT-2 in raw quality
- - Scaling above 144M is untested
  - No formal benchmark evaluation yet
- - Hallucination present (generates plausible but fictional details)

- ## Comparison with Other SNN Language Models

- | Model | Params | From Scratch? | Architecture |
- |-------|--------|:-------------:|-------------|
- | **Nord** | 144M | ✅ | Fully original |
- | SpikeGPT | 216M | ✅ | Modified RWKV |
- | SpikeLLM | 7-70B | ❌ | Converted LLaMA |
- | SpikeBERT | ~110M | ❌ | Distilled from BERT |
- | BrainTransformers | 3B | ❌ | Converted Qwen2 |

  ## Citation

  ```bibtex
- @misc{nord2025,
- title={Nord: A Spiking Neural Network Language Model Trained from Scratch},
- author={zerdovzad},
- year={2025},
- url={https://github.com/gtausa197-svg/-Project-Nord-Spiking-Neural-Network-Language-Model}
  }
  ```

  ## About

- Built by an 18-year-old electronics student from Ukraine, studying in Norway. No PhD, no team, no funding.

  pipeline_tag: text-generation
  tags:
  - snn
  - spiking-neural-network
  - neuromorphic
  - language-model
  - from-scratch
  - energy-efficient
+ - mixture-of-experts
+ - brain-inspired
  ---

+ # ⚡ Nord v4.2 — Brain-Inspired Spiking Neural Network Language Model (140M)

+ **The first SNN language model with spike-driven MoE, zonal specialization, and memory cortex — trained from scratch.**

+ ## What's New in v4.2

+ Nord v4.2 is a complete architectural rebuild of v3. The key breakthrough: **the model self-organizes into functionally distinct brain zones during training** — sensory zones learn low firing rates and executive zones learn high ones, with no explicit supervision.

+ | | v3 (previous) | v4.2 (current) |
+ |---|---|---|
+ | **Parameters** | 144M | 140M |
+ | **Sparsity** | 97% (but spikes broken at scale) | 91% (spikes working) |
+ | **MoE** | None | Spike-driven, 4 experts, top-2 |
+ | **Memory** | None | 128-neuron cortex, τ=0.99 |
+ | **Zonal architecture** | No | Yes (self-organizing) |
+ | **Loss at 39K steps** | ~4.9 | **4.3** |
+ | **Training speed** | Slower convergence | 35% faster to the same loss |

  ## Model Description

+ Nord v4.2 is a 140M-parameter Spiking Neural Network (SNN) for text generation. It uses biologically inspired Leaky Integrate-and-Fire neurons with membrane potentials, firing thresholds, and binary spikes. Unlike a transformer, where 100% of neurons activate for every token, Nord activates only **3-9%**, with different brain-inspired zones specializing in different functions.

+ Trained **entirely from scratch** — no transformer teacher, no distillation, no ANN-to-SNN conversion.

  ## Key Features

  | Feature | Details |
  |---------|---------|
+ | Parameters | 139.9M |
+ | Architecture | Original brain-inspired zonal SNN |
+ | Zones | Sensory → Association (MoE) → Memory → Executive |
+ | MoE | 4 spike-driven experts, top-2 routing |
+ | Memory | 128 persistent neurons, gated temporal attention |
+ | Sparsity | 89-95% (dynamic, input-dependent) |
+ | Timesteps | 10 (8 fast + 2 slow) |
+ | Training method | Surrogate gradients + spike homeostasis |
+ | Training data | ~2.2M samples, general English corpus |
+ | Training cost | ~$15 USD |
+ | Online learning | STDP available during inference |

  ## Architecture

+ ```
+ ┌──────────────────────────────────────────────┐
+ │ Temporal Spike Encoder                       │
+ │ Token → 8 fast + 2 slow timestep currents    │
+ ├──────────────────────────────────────────────┤
+ │ Sensory Zone (2 blocks)        rates: 8-10%  │
+ │ Standard FFN + LIF, feature extraction       │
+ ├──────────────────────────────────────────────┤
+ │ Association Zone (2 blocks)    rates: 10-14% │
+ │ Spike-Driven MoE (4 experts, top-2) + LIF    │
+ ├──────────────────────────────────────────────┤
+ │ Memory Cortex                  rates: 0.5-1% │
+ │ 128 neurons, τ=0.99, gated temporal attn     │
+ ├──────────────────────────────────────────────┤
+ │ Executive Zone (2 blocks)      rates: 11-26% │
+ │ Standard FFN + LIF, decision & output        │
+ ├──────────────────────────────────────────────┤
+ │ Readout (EMA over membrane potential)        │
+ │ → LM Head → vocabulary logits                │
+ └──────────────────────────────────────────────┘
+ ```
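
Every zone in the diagram is built from LIF units trained through a surrogate gradient. As a rough illustration only (class names, constants, and the hard-reset rule are assumptions, not Nord's actual code; the real Associative LIF also has learnable synaptic currents and cascade amplification):

```python
import torch

class ATanSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass; smooth ATan surrogate in the backward pass."""
    @staticmethod
    def forward(ctx, v, alpha=2.0):
        ctx.save_for_backward(v)
        ctx.alpha = alpha
        return (v >= 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        a = ctx.alpha
        # derivative of (1/pi) * arctan(pi * a * v / 2) + 1/2
        surrogate = a / (2 * (1 + (torch.pi / 2 * a * v) ** 2))
        return grad_out * surrogate, None

class LIFLayer(torch.nn.Module):
    """Hard-reset Leaky Integrate-and-Fire with learnable decay and threshold."""
    def __init__(self, size, tau=0.9, v_th=1.0):
        super().__init__()
        self.tau = torch.nn.Parameter(torch.full((size,), tau))    # membrane decay
        self.v_th = torch.nn.Parameter(torch.full((size,), v_th))  # firing threshold

    def forward(self, currents):
        # currents: (T, batch, size) input current per timestep
        v = torch.zeros_like(currents[0])
        spikes = []
        for i_t in currents:
            v = self.tau * v + i_t              # leaky integration
            s = ATanSpike.apply(v - self.v_th)  # binary spike on threshold crossing
            v = v * (1.0 - s)                   # hard reset for fired neurons
            spikes.append(s)
        return torch.stack(spikes)              # binary tensor, shape (T, batch, size)
```

The forward pass stays binary (real spikes), while gradients flow through the ATan approximation, which is what makes from-scratch backprop training possible.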

+ ### Key Components

+ - **Associative LIF Neurons** — Learnable membrane time constants, voltage thresholds, and synaptic currents, with cascade amplification across 64 neural clusters
+ - **ATan Surrogate Gradient** — Differentiable spike function for backpropagation
+ - **Spike-Driven MoE** — Expert routing driven by cluster spike-rate activity rather than a dense gating network
+ - **Memory Cortex** — Persistent slow memory with a multi-head temporal-attention readout
+ - **Adaptive Spike Regulator** — Asymmetric homeostasis: penalizes too-low firing 3× harder than too-high, with an anti-death floor at 1%
+ - **RoPE** — Rotary position embeddings for sequence position encoding
+ - **Synaptic Resonance Attention** — Temporal mixing over spike patterns (not naive flattening)

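
To make spike-driven routing concrete, here is a minimal sketch (the router shape, expert MLPs, and rate computation are illustrative guesses; `d_model` is rounded to 512 so the 64 clusters divide evenly — Nord's actual routing over cluster activity may differ):

```python
import torch

class SpikeDrivenMoE(torch.nn.Module):
    """Top-2 expert routing driven by per-cluster spike rates instead of a dense gate."""
    def __init__(self, d_model=512, d_ff=1024, n_experts=4, top_k=2, n_clusters=64):
        super().__init__()
        self.n_clusters, self.top_k = n_clusters, top_k
        self.router = torch.nn.Linear(n_clusters, n_experts)  # routes on firing rates
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(d_model, d_ff),
                torch.nn.GELU(),
                torch.nn.Linear(d_ff, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x, spikes):
        # x: (batch, seq, d_model) features; spikes: same shape, binary
        b, s, d = spikes.shape
        # mean firing rate inside each cluster of d_model // n_clusters neurons
        rates = spikes.view(b, s, self.n_clusters, -1).mean(dim=-1)
        weights, idx = self.router(rates).topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # dense loop for clarity; real MoE code gathers only the routed tokens
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = (idx[..., k] == e).unsqueeze(-1).float()
                out = out + mask * weights[..., k : k + 1] * expert(x)
        return out
```

The point of the design: the gate reads which neuron clusters are firing, so routing itself stays in the spike domain instead of requiring a separate dense network.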
  ### Model Configuration

+ ```yaml
+ d_model: 496
  n_heads: 8
+ n_layers: 6 (2 sensory + 2 association + 2 executive)
  d_ff: 1024
+ n_experts: 4
+ top_k_experts: 2
+ memory_size: 128
+ T_fast: 8, T_slow: 2
  max_seq_len: 512
  vocab_size: 128,256
  tokenizer: Llama-3.2 (meta-llama/Llama-3.2-1B)
  ```

+ ## Emergent Zonal Specialization

+ The most significant finding: **the model self-organizes into functionally distinct zones** during standard training. No manual assignment, no hardcoded rates.

+ ```
+ Zone            Spike Rate   Biological Analog
+ ─────────────────────────────────────────────────
+ Sensory         8-10%        Primary sensory cortex
+ Association     10-14%       Parietal/temporal cortex
+ Memory Cortex   0.5-1%       Hippocampus (selective)
+ Executive [0]   11-15%       Premotor cortex
+ Executive [1]   22-26%       Prefrontal cortex
+ ─────────────────────────────────────────────────
+ ```

+ This mirrors biological cortical organization, where prefrontal cortex shows higher baseline activity than sensory cortex.
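
These rates are kept in a healthy range by the Adaptive Spike Regulator described under Key Components. A sketch of that asymmetric homeostasis penalty (the target rate, quadratic form, and floor weight are assumptions; only the 3× asymmetry and the 1% floor come from this card):

```python
import torch

def spike_regulator_loss(rate, target=0.10, low_weight=3.0, floor=0.01, floor_weight=10.0):
    """Asymmetric homeostasis: under-firing costs `low_weight`x more than over-firing,
    and rates below the 1% anti-death floor draw an extra penalty."""
    rate = torch.as_tensor(rate)
    below = torch.clamp(target - rate, min=0.0)  # distance under the target rate
    above = torch.clamp(rate - target, min=0.0)  # distance over the target rate
    dead = torch.clamp(floor - rate, min=0.0)    # distance under the hard floor
    return (low_weight * below.pow(2) + above.pow(2) + floor_weight * dead.pow(2)).mean()
```

With this shape, a zone firing at 4% (0.06 under a 10% target) is penalized three times harder than one firing at 16% (0.06 over), which pushes zones away from silence without forcing them all to the same rate.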

  ## Training

+ - **Dataset:** ~2.2M text samples, general English corpus
+ - **Hardware:** NVIDIA A5000 24GB (rented on Vast.ai)
+ - **Optimizer:** AdamW (lr=3e-4 → 1e-5 cosine decay, weight_decay=0.01)
+ - **Batch size:** 2 × grad_accum=16 (effective 32)
  - **Sequence length:** 512

+ ### Loss Progression

+ | Step | Loss | Sparsity | LR | Event |
+ |------|------|----------|-----|-------|
+ | 0 | 8.9 | 68% | warmup | Start |
+ | 1,500 | 6.2 | 69% | 3.0e-04 | Rapid descent |
+ | 10,000 | 4.95 | 99% | 3.0e-04 | v4.1 plateau, spikes dying |
+ | 14,000 | 7.6 → 5.2 | 75% | 3.0e-04 | v4.2 fixes, spike revival |
+ | 20,000 | 4.70 | 91% | 3.0e-04 | Surpassed v4.1 |
+ | 30,000 | 4.50 | 91% | 1.2e-04 | Cosine decay |
+ | 39,000 | 4.30 | 91% | 6.0e-05 | Current best |

+ ### Parameter Breakdown

+ | Component | Parameters |
+ |-----------|-----------|
+ | Sensory Zone | 4.0M (2 blocks) |
+ | Association Zone | 4.1M (2 blocks, MoE) |
+ | Memory Cortex | 0.2M |
+ | Executive Zone | 4.0M (2 blocks) |
+ | Encoder + Readout + LM Head | ~127.6M |
+ | **Total** | **139.9M** |

  ## Usage

  ```python
  import torch
+ from nord_core_v4 import NordConfig, NordModel
  from transformers import AutoTokenizer

  # Load
+ ckpt = torch.load("nord_v4_latest.pt", map_location="cuda")
  cfg = NordConfig(**ckpt["config"])
  model = NordModel(cfg).cuda()
+
+ # Filter persistent state buffers (their size varies with batch)
+ state = {k: v for k, v in ckpt["model_state_dict"].items()
+          if "_v_mem_state" not in k and "_i_syn_state" not in k}
+ model.load_state_dict(state, strict=False)
  model.eval()

  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")


  Or use the interactive chat:

  ```bash
+ python chat_v4.py
+ # Commands: /stats, /memory, /expert, /stdp on|off, /reset, /quit
  ```
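
The usage snippet above stops before decoding, so here is a generic sampling loop you could pair with it. Everything in it is an assumption, not Nord's API: it only requires a model whose `model(ids)` call returns `(batch, seq, vocab)` logits, and the decoding constants are illustrative.

```python
import torch

@torch.no_grad()
def generate(model, input_ids, max_new_tokens=50, temperature=0.8, repetition_penalty=1.2):
    """Sample a continuation token by token (batch size 1 assumed for the penalty loop)."""
    ids = input_ids
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature
        # penalize already-seen tokens; the card notes repetition is mitigated this way
        for tok in ids[0].unique():
            score = logits[0, tok]
            logits[0, tok] = score / repetition_penalty if score > 0 else score * repetition_penalty
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_tok], dim=1)
    return ids
```

Feed it `tokenizer(prompt, return_tensors="pt").input_ids.cuda()` and decode the result with `tokenizer.decode`.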

+ ## Generation Examples

+ **Step 3,600 (loss 5.5)** — no coherence:
+ > "Queen was being too late. The lake is not to be found in a variety of birds and stynesan trees."

+ **Step 29,000 (loss 4.5)** — topic understanding, broken logic:
+ > "The internet is equipped with computers that harness data from television and radio vehicles. Its central and large uses can help business use and share information on devices and systems."

+ **Step 39,000 (loss 4.3)** — thematic coherence, real entities:
+ > "A cybersecurity campaign that uses a computer science machine learning robot to guide players, and has refined algorithms. The popular game research software made by OpenAI security researchers..."

+ ## Spike Dynamics

  | Context | Sparsity | Interpretation |
  |---------|----------|----------------|
+ | Simple tokens | 95-96% | Confident — minimal firing |
+ | Complex tokens | 89-91% | More neurons recruited |
+ | Training average | 91% | Healthy spike activity |

+ Sparsity is **dynamic and input-dependent** — the model recruits more neurons for harder inputs, much as a biological brain does.
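
If the model exposes its spike tensors, the table's sparsity reading reduces to one line. The band thresholds below are lifted from the table; the function names are mine:

```python
import torch

def sparsity(spikes: torch.Tensor) -> float:
    """Fraction of neuron-timesteps that stayed silent; `spikes` is a binary tensor."""
    return 1.0 - spikes.float().mean().item()

def sparsity_band(spikes: torch.Tensor) -> str:
    """Rough confidence bands matching the table above."""
    s = sparsity(spikes)
    if s >= 0.95:
        return "confident"    # simple tokens: 95-96%
    if s >= 0.89:
        return "recruiting"   # complex tokens: 89-91%
    return "out-of-band"      # unusually dense activity
```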

+ ## Comparison with Other SNN Language Models

+ | Model | Params | From Scratch? | MoE | Zonal | Sparsity |
+ |-------|--------|:---:|:---:|:---:|---|
+ | **Nord v4.2** | 140M | ✅ | ✅ | ✅ | 91% |
+ | Nord v3 | 144M | ✅ | ❌ | ❌ | 97% |
+ | SpikeGPT | 216M | ✅ | ❌ | ❌ | ~90% |
+ | SpikeLLM | 7-70B | ❌ | ❌ | ❌ | varies |
+ | SpikeBERT | ~110M | ❌ | ❌ | ❌ | varies |

+ ## Version History

+ | Version | Key Change | Result |
+ |---------|-----------|--------|
+ | v3 | First SNN LLM | 97% sparsity, 51K Reddit views |
+ | v3.5 | Scale to 500M | Failed — sparsity stuck at 100% |
+ | v4.1 | MoE + Zonal + Memory | Fixed spikes, loss 4.95 |
+ | **v4.2** | **Adaptive regulator + Executive fix** | **Loss 4.3, stable 91% sparsity** |

  ## Limitations

+ - Text quality not competitive with GPT-2 at the same parameter count (loss 4.3 vs ~3.0)
+ - Coherence degrades after 2-3 sentences at 140M scale
+ - Multilingual leakage in long generations (a dataset artifact)
+ - Scaling beyond 140M untested for v4.2
  - No formal benchmark evaluation yet
+ - Hallucination present

+ ## Scaling Hypothesis

+ If zonal specialization persists at scale, an 86B SNN could potentially:
+ - Match 86B transformer quality
+ - Run inference with the compute of a 3-4B dense model (at 96% sparsity)
+ - Deploy on neuromorphic hardware (Intel Loihi) with orders-of-magnitude energy savings

+ This is unproven. The roadmap: 140M → 500M → 1-2B, testing at each scale.

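
The 3-4B figure in the scaling hypothesis follows directly from the sparsity claim; back-of-envelope:

```python
# Back-of-envelope for the compute claim: at 96% sparsity only 4% of neurons
# fire per token, so the active parameter count of a hypothetical 86B SNN is
total_params = 86e9   # hypothetical 86B-parameter SNN
sparsity = 0.96       # claimed inference sparsity at scale
active = total_params * (1 - sparsity)
print(f"{active / 1e9:.2f}B active parameters per token")  # 3.44B, inside the 3-4B range
```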
  ## Citation

  ```bibtex
+ @software{nord2026,
+   title={Nord v4.2: Brain-Inspired Spiking Neural Network Language Model with Spike-Driven MoE and Zonal Specialization},
+   author={Zemondsa},
+   year={2026},
+   url={https://github.com/zemondsa/nord-ai}
  }
  ```

  ## About

+ Built solo by an 18-year-old Ukrainian student studying electronics in Norway. No PhD, no team, no funding — just a rented A5000 and curiosity.