---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- snn
- spiking-neural-network
- neuromorphic
- language-model
- from-scratch
- energy-efficient
- mixture-of-experts
- brain-inspired
---
# ⚡ Nord v4.2: Brain-Inspired Spiking Neural Network Language Model (140M)

*The first SNN language model with spike-driven MoE, zonal specialization, and a memory cortex, trained from scratch.*

## What's New in v4.2

Nord v4.2 is a complete architectural rebuild from v3. The key breakthrough: the model self-organizes into functionally distinct brain zones during training. Sensory zones learn low firing rates and executive zones learn high firing rates, with no explicit supervision.
| | v3 (previous) | v4.2 (current) |
|---|---|---|
| Parameters | 144M | 140M |
| Sparsity | 97% (but spikes broken at scale) | 91% (spikes working) |
| MoE | None | Spike-driven, 4 experts, top-2 |
| Memory | None | 128-neuron cortex, τ=0.99 |
| Zonal architecture | No | Yes (self-organizing) |
| Loss at 39K steps | ~4.9 | 4.3 |
| Training speed | Slower convergence | 35% faster to the same loss |
## Model Description

Nord v4.2 is a 140M-parameter Spiking Neural Network (SNN) for text generation. It uses biologically inspired Leaky Integrate-and-Fire (LIF) neurons with membrane potentials, firing thresholds, and binary spikes. Unlike transformers, where 100% of neurons activate for every token, Nord activates only 3-9%, with different brain-inspired zones specializing in different functions.

Trained entirely from scratch: no transformer teacher, no distillation, no ANN-to-SNN conversion.
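The LIF dynamics described above can be sketched in a few lines. This is an illustrative minimal neuron: the decay, threshold, and hard-reset scheme are generic textbook choices, not Nord's learned per-neuron parameters.

```python
import numpy as np

def lif_step(v, i_syn, decay=0.9, threshold=1.0):
    """One Leaky Integrate-and-Fire timestep (illustrative constants)."""
    v = decay * v + i_syn                          # leaky integration of input current
    spikes = (v >= threshold).astype(np.float32)   # binary spike emission
    v = v * (1.0 - spikes)                         # hard reset for neurons that fired
    return v, spikes

# Drive 4 neurons with a constant current for 3 timesteps
v = np.zeros(4, dtype=np.float32)
i_syn = np.array([0.5, 0.3, 0.05, 0.6], dtype=np.float32)
for _ in range(3):
    v, s = lif_step(v, i_syn)
```

Only neurons whose accumulated potential crosses threshold emit a 1; weakly driven neurons stay silent indefinitely, which is where the model's high sparsity comes from.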
## Key Features

| Feature | Details |
|---|---|
| Parameters | 139.9M |
| Architecture | Original brain-inspired zonal SNN |
| Zones | Sensory → Association (MoE) → Memory → Executive |
| MoE | 4 spike-driven experts, top-2 routing |
| Memory | 128 persistent neurons, gated temporal attention |
| Sparsity | 89-95% (dynamic, input-dependent) |
| Timesteps | 10 (8 fast + 2 slow) |
| Training method | Surrogate gradients + spike homeostasis |
| Training data | ~2.2M samples, general English corpus |
| Training cost | ~$15 USD |
| Online learning | STDP available during inference |
## Architecture

```
┌─────────────────────────────────────────────────┐
│  Temporal Spike Encoder                         │
│  Token → 8 fast + 2 slow timestep currents      │
├─────────────────────────────────────────────────┤
│  Sensory Zone (2 blocks)          rates: 8-10%  │
│  Standard FFN + LIF, feature extraction         │
├─────────────────────────────────────────────────┤
│  Association Zone (2 blocks)      rates: 10-14% │
│  Spike-Driven MoE (4 experts, top-2) + LIF      │
├─────────────────────────────────────────────────┤
│  Memory Cortex                    rates: 0.5-1% │
│  128 neurons, τ=0.99, gated temporal attn       │
├─────────────────────────────────────────────────┤
│  Executive Zone (2 blocks)        rates: 11-26% │
│  Standard FFN + LIF, decision & output          │
├─────────────────────────────────────────────────┤
│  Readout (EMA over membrane potential)          │
│  → LM Head → vocabulary logits                  │
└─────────────────────────────────────────────────┘
```
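The memory cortex's slow persistence (τ = 0.99) is, at its core, an exponential-moving-average state update. A minimal sketch, with the gating and attention readout omitted:

```python
import numpy as np

def memory_update(m, x, tau=0.99):
    # Persistent slow state: with tau = 0.99 the effective time constant
    # is ~1 / (1 - tau) = 100 steps, so the state outlives the 10 spike
    # timesteps of any single token.
    return tau * m + (1.0 - tau) * x

m = np.zeros(128)               # 128 memory neurons
for _ in range(100):            # feed a constant input for 100 steps
    m = memory_update(m, np.ones(128))
# m has climbed to roughly 1 - 0.99**100 ~ 0.63 of the input
```

In the full model this slow state is read out with gated multi-head temporal attention; the sketch only shows why writes decay gradually rather than being overwritten each step.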
### Key Components

- **Associative LIF Neurons**: learnable membrane time constants, voltage thresholds, and synaptic currents, with cascade amplification across 64 neural clusters
- **ATan Surrogate Gradient**: differentiable spike function for backpropagation
- **Spike-Driven MoE**: expert routing based on cluster spike-rate activity rather than a dense gating network
- **Memory Cortex**: persistent slow memory with a multi-head temporal attention readout
- **Adaptive Spike Regulator**: asymmetric homeostasis that penalizes too-low firing 3x more than too-high, with an anti-death floor at 1%
- **RoPE**: rotary position embeddings for sequence position encoding
- **Synaptic Resonance Attention**: temporal mixing over spike patterns (not naive flattening)
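Binary spikes have a zero-almost-everywhere derivative, so backpropagation needs a surrogate. A common ATan formulation from the SNN literature looks like the following; the width α here is illustrative, and Nord's exact choice isn't stated.

```python
import numpy as np

def spike_forward(v_minus_thresh):
    # Forward pass: hard Heaviside step -> binary spikes
    return (v_minus_thresh >= 0).astype(np.float32)

def atan_surrogate_grad(v_minus_thresh, alpha=2.0):
    # Backward pass: derivative of the smooth approximation
    # (1/pi) * arctan(pi/2 * alpha * x) + 1/2, used in place of the
    # Heaviside's true (useless) gradient
    x = np.asarray(v_minus_thresh, dtype=np.float64)
    return alpha / (2.0 * (1.0 + (np.pi / 2.0 * alpha * x) ** 2))
```

The surrogate gradient peaks at the threshold (value α/2 at x = 0) and falls off smoothly, so only neurons near their firing threshold receive meaningful learning signal.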
## Model Configuration

```yaml
d_model: 496
n_heads: 8
n_layers: 6          # 2 sensory + 2 association + 2 executive
d_ff: 1024
n_experts: 4
top_k_experts: 2
memory_size: 128
T_fast: 8
T_slow: 2
max_seq_len: 512
vocab_size: 128256
tokenizer: meta-llama/Llama-3.2-1B   # Llama-3.2 tokenizer
```
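With `n_experts: 4` and `top_k_experts: 2`, each token is routed to its two best experts. A minimal sketch of top-2 selection and weight renormalization; in the real model the gate scores come from learned cluster spike-rate activity, here they are just given numbers.

```python
import numpy as np

def route_top2(scores):
    """Select the 2 highest-scoring experts per token and softmax their
    scores so the two expert outputs can be mixed with weights summing to 1."""
    top2 = np.argsort(scores, axis=-1)[:, -2:][:, ::-1]   # expert ids, best first
    picked = np.take_along_axis(scores, top2, axis=-1)
    w = np.exp(picked) / np.exp(picked).sum(-1, keepdims=True)
    return top2, w

scores = np.array([[0.1, 0.9, 0.3, 0.5]])   # one token, 4 expert scores
ids, w = route_top2(scores)                 # experts 1 and 3 win
```

Only the selected experts run for that token, so compute scales with top-k rather than with the total expert count.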
## Emergent Zonal Specialization

The most significant finding: the model self-organizes into functionally distinct zones during standard training. No manual assignment, no hardcoded rates.

| Zone | Spike Rate | Biological Analog |
|---|---|---|
| Sensory | 8-10% | Primary sensory cortex |
| Association | 10-14% | Parietal/temporal cortex |
| Memory Cortex | 0.5-1% | Hippocampus (selective) |
| Executive [0] | 11-15% | Premotor cortex |
| Executive [1] | 22-26% | Prefrontal cortex |
This mirrors biological cortical organization where prefrontal cortex has higher baseline activity than sensory cortex.
## Training

- Dataset: ~2.2M text samples, general English corpus
- Hardware: NVIDIA A5000 24GB (rented on Vast.ai)
- Optimizer: AdamW (lr 3e-4 → 1e-5, cosine decay; weight_decay=0.01)
- Batch size: 2 × grad_accum=16 (effective 32)
- Sequence length: 512
### Loss Progression

| Step | Loss | Sparsity | LR | Event |
|---|---|---|---|---|
| 0 | 8.9 | 68% | warmup | Start |
| 1,500 | 6.2 | 69% | 3.0e-04 | Rapid descent |
| 10,000 | 4.95 | 99% | 3.0e-04 | v4.1 plateau, spikes dying |
| 14,000 | 7.6 → 5.2 | 75% | 3.0e-04 | v4.2 fixes, spike revival |
| 20,000 | 4.70 | 91% | 3.0e-04 | Surpassed v4.1 |
| 30,000 | 4.50 | 91% | 1.2e-04 | Cosine decay |
| 39,000 | 4.30 | 91% | 6.0e-05 | Current best |
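The spike revival at step 14,000 came from the adaptive spike regulator. A sketch of one plausible asymmetric homeostasis penalty; the target rate, the 3x low-side weight, and the floor weight here are illustrative assumptions, not the trained model's actual loss term.

```python
import numpy as np

def homeostasis_penalty(rate, target=0.09, floor=0.01, low_weight=3.0):
    """Asymmetric firing-rate penalty (illustrative sketch).
    Firing below target is penalized 3x harder than firing above it,
    with an extra 'anti-death' term once a zone drops below the 1% floor."""
    dev = rate - target
    penalty = np.where(dev < 0.0, low_weight * dev**2, dev**2)
    # strong additional push for zones whose neurons are nearly dead
    penalty = penalty + np.where(rate < floor, 10.0 * (floor - rate)**2, 0.0)
    return penalty
```

The asymmetry is the point: symmetric regularizers let spike rates collapse to zero (the v4.1 failure mode at 99% sparsity), while a harsher low-side penalty plus a floor keeps every zone alive.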
## Parameter Breakdown
| Component | Parameters |
|---|---|
| Sensory Zone | 4.0M (2 blocks) |
| Association Zone | 4.1M (2 blocks, MoE) |
| Memory Cortex | 0.2M |
| Executive Zone | 4.0M (2 blocks) |
| Encoder + Readout + LM Head | ~127.6M |
| Total | 139.9M |
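The ~127.6M line item is dominated by the large Llama-3.2 vocabulary. A quick sanity check, assuming an untied input embedding and output head (an assumption the numbers support, since a tied pair would be half this size):

```python
vocab_size, d_model = 128_256, 496

embed = vocab_size * d_model      # input token embedding table
lm_head = vocab_size * d_model    # output projection (assumed untied)
total = embed + lm_head           # ~127.2M, nearly all of the 127.6M item
```

In other words, the SNN zones themselves hold only ~12.3M parameters; most of the model's size is vocabulary plumbing inherited from the tokenizer choice.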
## Usage

```python
import torch
from nord_core_v4 import NordConfig, NordModel
from transformers import AutoTokenizer

# Load checkpoint and rebuild the model
ckpt = torch.load("nord_v4_latest.pt", map_location="cuda")
cfg = NordConfig(**ckpt["config"])
model = NordModel(cfg).cuda()

# Filter out persistent state buffers (their size varies with batch)
state = {k: v for k, v in ckpt["model_state_dict"].items()
         if "_v_mem_state" not in k and "_i_syn_state" not in k}
model.load_state_dict(state, strict=False)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
```
Or use the interactive chat:

```shell
python chat_v4.py
# Commands: /stats, /memory, /expert, /stdp on|off, /reset, /quit
```
## Generation Examples

**Step 3,600 (loss 5.5), no coherence:**

> "Queen was being too late. The lake is not to be found in a variety of birds and stynesan trees."

**Step 29,000 (loss 4.5), topic understanding, broken logic:**

> "The internet is equipped with computers that harness data from television and radio vehicles. Its central and large uses can help business use and share information on devices and systems."

**Step 39,000 (loss 4.3), thematic coherence, real entities:**

> "A cybersecurity campaign that uses a computer science machine learning robot to guide players, and has refined algorithms. The popular game research software made by OpenAI security researchers..."
## Spike Dynamics

| Context | Sparsity | Interpretation |
|---|---|---|
| Simple tokens | 95-96% | Confident: minimal firing |
| Complex tokens | 89-91% | More neurons recruited |
| Training average | 91% | Healthy spike activity |

Sparsity is dynamic and input-dependent: the model recruits more neurons for harder inputs, much like a biological brain.
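Sparsity in these tables is simply the fraction of neurons that stay silent, averaged over timesteps:

```python
import numpy as np

def sparsity(spikes):
    # spikes: (timesteps, neurons) binary array;
    # sparsity is the fraction of entries that are zero
    return 1.0 - float(np.mean(spikes))

s = np.zeros((10, 100))
s[:, :9] = 1.0          # 9% of neurons fire on every timestep
```

A 9% firing rate corresponds to the 91% training-average sparsity reported above; per-token sparsity moves as the firing rate does.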
## Comparison with Other SNN Language Models

| Model | Params | From Scratch? | MoE | Zonal | Sparsity |
|---|---|---|---|---|---|
| Nord v4.2 | 140M | Yes | Yes | Yes | 91% |
| Nord v3 | 144M | Yes | No | No | 97% |
| SpikeGPT | 216M | Yes | No | No | ~90% |
| SpikeLLM | 7-70B | No | No | No | varies |
| SpikeBERT | ~110M | No | No | No | varies |
## Version History

| Version | Key Change | Result |
|---|---|---|
| v3 | First SNN LLM | 97% sparsity, 51K Reddit views |
| v3.5 | Scale to 500M | Failed: sparsity stuck at 100% |
| v4.1 | MoE + Zonal + Memory | Fixed spikes, loss 4.95 |
| v4.2 | Adaptive regulator + Executive fix | Loss 4.3, stable 91% sparsity |
## Limitations
- Text quality not competitive with GPT-2 at same parameter count (loss 4.3 vs ~3.0)
- Coherence degrades after 2-3 sentences at 140M scale
- Multilingual leakage in long generations (dataset artifact)
- Scaling beyond 140M untested for v4.2
- No formal benchmark evaluation yet
- Hallucination present
## Scaling Hypothesis

If zonal specialization persists at scale, an 86B SNN could potentially:

- Match 86B transformer quality
- Run inference with the compute of a 3-4B dense model (96% sparsity)
- Deploy on neuromorphic hardware (Intel Loihi) with orders-of-magnitude energy savings

This is unproven. The roadmap: 140M → 500M → 1-2B, testing at each scale.
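The 3-4B figure follows directly from the sparsity arithmetic: at 96% sparsity only 4% of neurons are active for any given token.

```python
total_params = 86e9      # hypothetical 86B-parameter SNN
sparsity = 0.96          # assumed sparsity at that scale
active = total_params * (1.0 - sparsity)   # parameters doing work per token
# active = 3.44e9, i.e. roughly a 3-4B dense model's per-token compute
```

This assumes per-token compute scales with active neurons, which holds on event-driven neuromorphic hardware but not automatically on GPUs, where sparse activations need dedicated kernels to pay off.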
## Citation

```bibtex
@software{nord2026,
  title={Nord v4.2: Brain-Inspired Spiking Neural Network Language Model with Spike-Driven MoE and Zonal Specialization},
  author={Zemondsa},
  year={2026},
  url={https://github.com/zemondsa/nord-ai}
}
```
## About

Built solo by an 18-year-old Ukrainian student studying electronics in Norway. No PhD, no team, no funding, just a rented A5000 and curiosity.