---
license: gpl-3.0
tags:
- pytorch
- gpt2
- transformer
- oscillating-activation
- bio-inspired
- language-model
language:
- en
datasets:
- openwebtext
- HuggingFaceTB/smoltalk
pipeline_tag: text-generation
---
# WiggleGPT
A 124M parameter transformer that challenges a 56-year-old assumption in neural network design.
![WiggleGPT Architecture](model_architecture.png)
## What Makes It Different?
Since Minsky and Papert's *Perceptrons* (1969), neural networks have relied on **monotonic activation functions** (Sigmoid, ReLU, GELU), which need at least one hidden layer to solve non-linearly separable problems like XOR.
WiggleGPT replaces monotonic activations with **learnable oscillating functions**, enabling single neurons to create multiple decision boundaries:
```
f(x) = sin(ωx + φ) · tanh(x) + baseline
```
Where ω (frequency), φ (phase), and the baseline offset are **learnable per-neuron parameters**.
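A minimal PyTorch sketch of this activation is shown below. The module and parameter names are illustrative rather than the actual implementation in `model_bio.py`, and the initialization only roughly follows the values reported in the table further down.

```python
import torch
import torch.nn as nn

class OscillatingActivation(nn.Module):
    """Illustrative sketch: f(x) = sin(omega * x + phi) * tanh(x) + baseline, per neuron."""

    def __init__(self, num_neurons: int):
        super().__init__()
        # Rough initialization: omega around 1.0 with std 0.1, phi and baseline at zero.
        self.omega = nn.Parameter(torch.ones(num_neurons) + 0.1 * torch.randn(num_neurons))
        self.phi = nn.Parameter(torch.zeros(num_neurons))
        self.baseline = nn.Parameter(torch.zeros(num_neurons))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., num_neurons); per-neuron parameters broadcast over leading dims.
        return torch.sin(self.omega * x + self.phi) * torch.tanh(x) + self.baseline
```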
## Results
| Model | Parameters | Val Loss | Notes |
|-------|------------|----------|-------|
| **WiggleGPT** | 124M | **3.1621** | Oscillating activation |
| GPT-2 | 124M | ~3.12 | Standard GELU baseline |
**Within about 1.3% of the GPT-2 baseline's validation loss**, indicating that oscillating activations are a viable drop-in replacement at this scale.
### The Model Actually Learned to Oscillate
| Parameter | Init | After Training | Change |
|-----------|------|----------------|--------|
| ω mean | 1.0 | 1.096 | +9.6% |
| ω std | 0.1 | **0.602** | **6× increase** |
| ω range | [0.8, 1.2] | [-0.19, 5.17] | Massive expansion |
- **95% of neurons retained active oscillation** (ω > 0.1)
- Some neurons learned frequencies as high as ω = 5.17 (over five radians of phase advance per unit input)
- Full phase coverage [-π, +π] after training
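The statistics above could be reproduced from a checkpoint with a sketch like the following, assuming the frequency parameters are stored under names containing `omega` (the actual parameter names in `model_bio.py` may differ):

```python
import torch

# Hypothetical inspection script; the 'omega' parameter naming is an assumption.
state_dict = torch.load('ckpt_pretrain.pt', map_location='cpu')['model']
omegas = torch.cat([p.detach().flatten() for name, p in state_dict.items() if 'omega' in name])

print(f"omega mean:  {omegas.mean().item():.3f}")
print(f"omega std:   {omegas.std().item():.3f}")
print(f"omega range: [{omegas.min().item():.2f}, {omegas.max().item():.2f}]")
print(f"still oscillating (omega > 0.1): {(omegas > 0.1).float().mean().item():.1%}")
```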
## Checkpoints
| File | Description |
|------|-------------|
| `ckpt_pretrain.pt` | Base model trained on OpenWebText (~600k iterations) |
| `ckpt_finetune.pt` | Fine-tuned on SmolTalk2 (instruction following) |
## Architecture
| Component | Specification |
|-----------|---------------|
| Parameters | 123,697,920 |
| Layers | 12 |
| Attention Heads | 12 |
| Embedding Dimension | 768 |
| Oscillating Neurons | 36,864 (each with learnable ω, φ, baseline) |
| Normalization | RMSNorm |
| Position Encoding | RoPE (Rotary) |
| Attention | Flash Attention (when available) |
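For orientation, the table maps onto a model config roughly as follows. This is a hypothetical sketch assuming nanoGPT-style `GPTConfig` field names (`n_layer`, `n_head`, `n_embd`); the actual fields in `model_bio.py` may differ.

```python
from model_bio import GPTConfig  # field names below are assumed, not confirmed

config = GPTConfig(
    n_layer=12,   # transformer blocks
    n_head=12,    # attention heads per block
    n_embd=768,   # embedding dimension; MLP hidden size 4 * 768 = 3072,
                  # giving 12 * 3072 = 36,864 oscillating neurons in total
)
```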
## Usage
See the [GitHub repository](https://github.com/Eden-Eldith/WiggleGPT) for full training, inference, and chat scripts.
```python
# Quick inference example
import torch
from model_bio import GPT, GPTConfig

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load checkpoint and rebuild the model
checkpoint = torch.load('ckpt_pretrain.pt', map_location=device)
config = GPTConfig(**checkpoint['config'])
model = GPT(config)
model.load_state_dict(checkpoint['model'])
model.to(device)
model.eval()

# Generate text (see sample_bio.py for full implementation)
```
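Building on the snippet above, generation might look like the following sketch. It assumes a nanoGPT-style `model.generate(idx, max_new_tokens, ...)` method and GPT-2 BPE tokenization via `tiktoken`, neither of which is confirmed here; see `sample_bio.py` in the repository for the actual interface.

```python
import torch
import tiktoken

# Hypothetical generation sketch; model.generate() with these arguments is an assumption.
enc = tiktoken.get_encoding("gpt2")
device = next(model.parameters()).device

prompt = "The wiggle in the activation"
idx = torch.tensor([enc.encode(prompt)], dtype=torch.long, device=device)

with torch.no_grad():
    out = model.generate(idx, max_new_tokens=100, temperature=0.8, top_k=200)

print(enc.decode(out[0].tolist()))
```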
## Training Details
**Pretraining:**
- Dataset: OpenWebText (~9B tokens)
- Iterations: 600,000
- Hardware: RTX 3070 (steps 0–354k) → RTX 5060 Ti 16GB (steps 354k–600k)
- Time: Roughly 20 days total (~15 days on 3070, ~5 days on 5060 Ti)
**Fine-tuning:**
- Dataset: SmolTalk2 (406K examples)
- Oscillation parameters (ω, φ) remained stable — 0.0% of neurons shifted by >0.1
## Citation
```bibtex
@software{wigglegpt2025,
  author = {O'Brien, Phillip C.},
  title  = {WiggleGPT: Revisiting the Monotonicity Assumption in Neural Networks via Oscillating Activation Functions},
  year   = {2025},
  url    = {https://github.com/Eden-Eldith/WiggleGPT}
}
```
## Author
**Eden (Phillip C. O'Brien)**
Independent AI Researcher | ORCID: [0009-0007-3961-1182](https://orcid.org/0009-0007-3961-1182)
Built in a garage lab in Gosport, UK. No academic affiliation, no institutional funding — just curiosity and an RTX 3070.
## License
GPL-3.0 — if you build on this, keep it open source.