---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- snn
- spiking-neural-network
- neuromorphic
- language-model
- from-scratch
- energy-efficient
- mixture-of-experts
- brain-inspired
---
# ⚡ Nord v4.2: Brain-Inspired Spiking Neural Network Language Model (140M)
**The first SNN language model with spike-driven MoE, zonal specialization, and a memory cortex – trained from scratch.**
## What's New in v4.2
Nord v4.2 is a complete architectural rebuild from v3. The key breakthrough: **the model self-organizes into functionally distinct brain zones during training** – sensory zones learn low firing rates, executive zones learn high firing rates, with no explicit supervision.
| | v3 (previous) | v4.2 (current) |
|---|---|---|
| **Parameters** | 144M | 140M |
| **Sparsity** | 97% (but spikes broken at scale) | 91% (spikes working) |
| **MoE** | None | Spike-driven, 4 experts, top-2 |
| **Memory** | None | 128-neuron cortex, τ=0.99 |
| **Zonal architecture** | No | Yes (self-organizing) |
| **Loss at 39K steps** | ~4.9 | **4.3** |
| **Training speed** | Slower convergence | 35% faster to same loss |
## Model Description
Nord v4.2 is a 140M-parameter Spiking Neural Network (SNN) for text generation. It uses biologically inspired Leaky Integrate-and-Fire (LIF) neurons with membrane potentials, firing thresholds, and binary spikes. Unlike transformers, where 100% of neurons activate per token, Nord activates only **3-9%**, with different brain-inspired zones specializing in different functions.
Trained **entirely from scratch**: no transformer teacher, no distillation, no ANN-to-SNN conversion.
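The LIF dynamics described above can be sketched in a few lines. This is a minimal, illustrative PyTorch version with a simple hard-reset neuron and made-up constants; Nord's actual neurons add learnable time constants and cluster-level amplification:

```python
import torch

def lif_step(v_mem, i_syn, x, tau=0.9, threshold=1.0):
    """One Leaky Integrate-and-Fire timestep (illustrative values, not Nord's exact code)."""
    i_syn = 0.5 * i_syn + x                # synaptic current integrates the input
    v_mem = tau * v_mem + i_syn            # leaky membrane potential integration
    spikes = (v_mem >= threshold).float()  # binary spike where the threshold is crossed
    v_mem = v_mem * (1.0 - spikes)         # hard reset for neurons that fired
    return v_mem, i_syn, spikes

v, i = torch.zeros(4), torch.zeros(4)
for _ in range(10):                        # 10 timesteps, as in Nord (8 fast + 2 slow)
    v, i, s = lif_step(v, i, torch.rand(4))
```

Neurons that never accumulate enough current stay silent, which is where the sparsity figures below come from.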
## Key Features
| Feature | Details |
|---------|---------|
| Parameters | 139.9M |
| Architecture | Original brain-inspired zonal SNN |
| Zones | Sensory → Association (MoE) → Memory → Executive |
| MoE | 4 spike-driven experts, top-2 routing |
| Memory | 128 persistent neurons, gated temporal attention |
| Sparsity | 89-95% (dynamic, input-dependent) |
| Timesteps | 10 (8 fast + 2 slow) |
| Training method | Surrogate gradients + spike homeostasis |
| Training data | ~2.2M samples, general English corpus |
| Training cost | ~$15 USD |
| Online learning | STDP available during inference |
## Architecture
```
┌──────────────────────────────────────────────┐
│ Temporal Spike Encoder                       │
│   Token → 8 fast + 2 slow timestep currents  │
├──────────────────────────────────────────────┤
│ Sensory Zone (2 blocks)         rates: 8-10% │
│   Standard FFN + LIF, feature extraction     │
├──────────────────────────────────────────────┤
│ Association Zone (2 blocks)    rates: 10-14% │
│   Spike-Driven MoE (4 experts, top-2) + LIF  │
├──────────────────────────────────────────────┤
│ Memory Cortex                  rates: 0.5-1% │
│   128 neurons, τ=0.99, gated temporal attn   │
├──────────────────────────────────────────────┤
│ Executive Zone (2 blocks)      rates: 11-26% │
│   Standard FFN + LIF, decision & output      │
├──────────────────────────────────────────────┤
│ Readout (EMA over membrane potential)        │
│   → LM Head → vocabulary logits              │
└──────────────────────────────────────────────┘
```
### Key Components
- **Associative LIF Neurons** – learnable membrane time constants, voltage thresholds, and synaptic currents, with cascade amplification across 64 neural clusters
- **ATan Surrogate Gradient** – differentiable spike function that makes backpropagation possible
- **Spike-Driven MoE** – expert routing based on cluster spike-rate activity rather than a dense gating network
- **Memory Cortex** – persistent slow memory with a multi-head temporal attention readout
- **Adaptive Spike Regulator** – asymmetric homeostasis: under-firing is penalized 3x more than over-firing, with an anti-death floor at 1%
- **RoPE** – rotary position embeddings for sequence position encoding
- **Synaptic Resonance Attention** – temporal mixing over spike patterns rather than naive flattening
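The ATan surrogate gradient is the standard trick that makes the non-differentiable spike trainable: a Heaviside step in the forward pass, a smooth arctan-based pseudo-derivative in the backward pass. A minimal sketch (the width parameter `alpha` here is an assumption, not Nord's value):

```python
import torch

class ATanSpike(torch.autograd.Function):
    """Heaviside spike in forward, ATan surrogate gradient in backward (sketch)."""

    @staticmethod
    def forward(ctx, v, alpha=2.0):
        ctx.save_for_backward(v)
        ctx.alpha = alpha
        return (v >= 0).float()          # binary spike at threshold crossing

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        a = ctx.alpha
        # derivative of (1/pi)*arctan(pi*a*v/2) + 1/2: a smooth stand-in for
        # the Dirac delta that the true Heaviside derivative would be
        surrogate = (a / 2) / (1 + (torch.pi / 2 * a * v) ** 2)
        return grad_out * surrogate, None

v = torch.randn(8, requires_grad=True)
spikes = ATanSpike.apply(v - 1.0)        # spike where membrane exceeds 1.0
spikes.sum().backward()                  # gradients flow via the surrogate
```

The forward output stays strictly binary; only the gradient is smoothed.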
### Model Configuration
```python
d_model = 496
n_heads = 8
n_layers = 6           # 2 sensory + 2 association + 2 executive
d_ff = 1024
n_experts = 4
top_k_experts = 2
memory_size = 128
T_fast, T_slow = 8, 2
max_seq_len = 512
vocab_size = 128256    # Llama-3.2 tokenizer (meta-llama/Llama-3.2-1B)
```
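With `n_experts = 4` and `top_k_experts = 2`, spike-driven routing can be illustrated as follows. This is a sketch under the assumption that gating weights come from per-expert spike-rate activity; Nord's exact gating mechanism is not published in this card:

```python
import torch

def spike_moe_route(spike_rates, expert_out, top_k=2):
    """Mix the outputs of the top-k most spike-active experts (illustrative).

    spike_rates: [batch, n_experts] mean spike activity per expert's clusters
    expert_out:  [batch, n_experts, d_model] outputs of all experts
    """
    weights, idx = spike_rates.topk(top_k, dim=-1)  # pick the 2 most active experts
    weights = torch.softmax(weights, dim=-1)        # normalize over the selected pair
    idx = idx.unsqueeze(-1).expand(-1, -1, expert_out.size(-1))
    chosen = expert_out.gather(1, idx)              # [batch, top_k, d_model]
    return (weights.unsqueeze(-1) * chosen).sum(dim=1)

rates = torch.tensor([[0.12, 0.03, 0.30, 0.08]])    # experts 2 and 0 most active
out = spike_moe_route(rates, torch.randn(1, 4, 496))
```

Because the gate reads spike rates instead of a dense projection, routing itself stays in the sparse, event-driven regime.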
## Emergent Zonal Specialization
The most significant finding: **the model self-organizes into functionally distinct zones** during standard training, with no manual zone assignment and no hardcoded firing rates.
```
Zone Spike Rate Biological Analog
─────────────────────────────────────────────────────
Sensory 8-10% Primary sensory cortex
Association 10-14% Parietal/temporal cortex
Memory Cortex 0.5-1% Hippocampus (selective)
Executive [0] 11-15% Premotor cortex
Executive [1] 22-26% Prefrontal cortex
─────────────────────────────────────────────────────
```
This mirrors biological cortical organization where prefrontal cortex has higher baseline activity than sensory cortex.
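The Adaptive Spike Regulator listed under Key Components keeps these rates from collapsing. A minimal sketch of an asymmetric homeostasis loss; the 3x under-firing weight and 1% floor come from this card, while the target rate, floor weight, and exact functional form are assumptions:

```python
import torch

def spike_reg_loss(rates, target=0.09, floor=0.01, low_weight=3.0):
    """Asymmetric homeostasis sketch: under-firing is weighted 3x more than
    over-firing, plus an extra penalty below the 1% anti-death floor."""
    dev = rates - target
    loss = torch.where(dev < 0, low_weight * dev ** 2, dev ** 2).mean()
    loss = loss + 10.0 * ((floor - rates).clamp(min=0) ** 2).mean()  # anti-death term
    return loss

rates = torch.tensor([0.005, 0.09, 0.22])  # per-zone mean firing rates
loss = spike_reg_loss(rates)
```

The asymmetry matters because dead neurons never recover their gradient signal, whereas over-active neurons can simply be pushed back down.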
## Training
- **Dataset:** ~2.2M text samples, general English corpus
- **Hardware:** NVIDIA A5000 24GB (rented on Vast.ai)
- **Optimizer:** AdamW (lr 3e-4 → 1e-5 with cosine decay, weight_decay=0.01)
- **Batch size:** 2 × grad_accum=16 (effective 32)
- **Sequence length:** 512
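The schedule and accumulation above can be sketched as follows. The lr endpoints, weight decay, and micro-batch size come from this card; `total_steps`, `warmup`, and the toy model are assumptions for illustration:

```python
import math
import torch

def cosine_lr(step, total_steps=40_000, warmup=1_000, lr_max=3e-4, lr_min=1e-5):
    """Linear warmup, then cosine decay from lr_max down to lr_min."""
    if step < warmup:
        return lr_max * step / warmup
    t = (step - warmup) / (total_steps - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

# Effective batch 32 = micro-batch 2 x grad_accum 16 (toy linear model)
model = torch.nn.Linear(8, 8)
opt = torch.optim.AdamW(model.parameters(), lr=cosine_lr(1), weight_decay=0.01)
for micro in range(16):
    loss = model(torch.randn(2, 8)).pow(2).mean() / 16  # scale loss for accumulation
    loss.backward()                                     # gradients accumulate in-place
opt.step()
opt.zero_grad()
```

Scaling each micro-loss by `1/grad_accum` keeps the accumulated gradient equal to that of one full batch of 32.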
### Loss Progression
| Step | Loss | Sparsity | LR | Event |
|------|------|----------|-----|-------|
| 0 | 8.9 | 68% | warmup | Start |
| 1,500 | 6.2 | 69% | 3.0e-04 | Rapid descent |
| 10,000 | 4.95 | 99% | 3.0e-04 | v4.1 plateau, spikes dying |
| 14,000 | 7.6 → 5.2 | 75% | 3.0e-04 | v4.2 fixes, spike revival |
| 20,000 | 4.70 | 91% | 3.0e-04 | Surpassed v4.1 |
| 30,000 | 4.50 | 91% | 1.2e-04 | Cosine decay |
| 39,000 | 4.30 | 91% | 6.0e-05 | Current best |
### Parameter Breakdown
| Component | Parameters |
|-----------|-----------|
| Sensory Zone | 4.0M (2 blocks) |
| Association Zone | 4.1M (2 blocks, MoE) |
| Memory Cortex | 0.2M |
| Executive Zone | 4.0M (2 blocks) |
| Encoder + Readout + LM Head | ~127.6M |
| **Total** | **139.9M** |
## Usage
```python
import torch
from nord_core_v4 import NordConfig, NordModel
from transformers import AutoTokenizer
# Load
ckpt = torch.load("nord_v4_latest.pt", map_location="cuda")
cfg = NordConfig(**ckpt["config"])
model = NordModel(cfg).cuda()
# Filter persistent state buffers (size varies with batch)
state = {k: v for k, v in ckpt["model_state_dict"].items()
if "_v_mem_state" not in k and "_i_syn_state" not in k}
model.load_state_dict(state, strict=False)
model.eval()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
```
Or use the interactive chat:
```bash
python chat_v4.py
# Commands: /stats, /memory, /expert, /stdp on|off, /reset, /quit
```
## Generation Examples
**Step 3,600 (loss 5.5)** – no coherence:
> "Queen was being too late. The lake is not to be found in a variety of birds and stynesan trees."
**Step 29,000 (loss 4.5)** – topic understanding, broken logic:
> "The internet is equipped with computers that harness data from television and radio vehicles. Its central and large uses can help business use and share information on devices and systems."
**Step 39,000 (loss 4.3)** – thematic coherence, real entities:
> "A cybersecurity campaign that uses a computer science machine learning robot to guide players, and has refined algorithms. The popular game research software made by OpenAI security researchers..."
## Spike Dynamics
| Context | Sparsity | Interpretation |
|---------|----------|----------------|
| Simple tokens | 95-96% | Confident – minimal firing |
| Complex tokens | 89-91% | More neurons recruited |
| Training average | 91% | Healthy spike activity |
Sparsity is **dynamic and input-dependent**: the model recruits more neurons for harder inputs, much like a biological brain.
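With binary spikes, sparsity is simply the fraction of silent neuron-timesteps. A minimal sketch using shapes roughly matching Nord's regime (10 timesteps, d_model=496; the Bernoulli input is synthetic):

```python
import torch

def sparsity(spikes):
    """Fraction of silent neuron-timesteps. spikes: binary tensor, e.g. [T, batch, d_model]."""
    return 1.0 - spikes.float().mean().item()

# ~9% Bernoulli firing over 10 timesteps and 496 neurons -> sparsity near 0.91
spikes = (torch.rand(10, 1, 496) < 0.09).float()
```

Computing this per zone (rather than globally) is what exposes the zonal rates reported above.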
## Comparison with Other SNN Language Models
| Model | Params | From Scratch? | MoE | Zonal | Sparsity |
|-------|--------|:---:|:---:|:---:|---|
| **Nord v4.2** | 140M | ✅ | ✅ | ✅ | 91% |
| Nord v3 | 144M | ✅ | ❌ | ❌ | 97% |
| SpikeGPT | 216M | ✅ | ❌ | ❌ | ~90% |
| SpikeLLM | 7-70B | ❌ | ❌ | ❌ | varies |
| SpikeBERT | ~110M | ❌ | ❌ | ❌ | varies |
## Version History
| Version | Key Change | Result |
|---------|-----------|--------|
| v3 | First SNN LLM | 97% sparsity, 51K Reddit views |
| v3.5 | Scale to 500M | Failed – sparsity stuck at 100% |
| v4.1 | MoE + Zonal + Memory | Fixed spikes, loss 4.95 |
| **v4.2** | **Adaptive regulator + Executive fix** | **Loss 4.3, stable 91% sparsity** |
## Limitations
- Text quality not competitive with GPT-2 at same parameter count (loss 4.3 vs ~3.0)
- Coherence degrades after 2-3 sentences at 140M scale
- Multilingual leakage in long generations (dataset artifact)
- Scaling beyond 140M untested for v4.2
- No formal benchmark evaluation yet
- Hallucination present
## Scaling Hypothesis
If zonal specialization persists at scale, an 86B SNN could potentially:
- Match 86B transformer quality
- Run inference with compute of a 3-4B dense model (96% sparsity)
- Deploy on neuromorphic hardware (Intel Loihi) with orders of magnitude energy savings
This is unproven. The roadmap: 140M → 500M → 1-2B, testing at each scale.
## Citation
```bibtex
@software{nord2026,
title={Nord v4.2: Brain-Inspired Spiking Neural Network Language Model with Spike-Driven MoE and Zonal Specialization},
author={Zemondsa},
year={2026},
url={https://github.com/zemondsa/nord-ai}
}
```
## About
Built solo by an 18-year-old Ukrainian student studying electronics in Norway. No PhD, no team, no funding – just a rented A5000 and curiosity.