---
license: mit
language:
- en
tags:
- complex-valued
- oscillating-neurons
- language-model
- autoregressive
- character-level
- linear-time
- pytorch
- from-scratch
datasets:
- edeneldith/DCDM
pipeline_tag: text-generation
library_name: pytorch
---
# COLM — Complex Oscillating Language Model
> **Paper:** [Zenodo (PDF)](https://doi.org/10.5281/zenodo.20118034) |
> **Code:** [GitHub](https://github.com/Eden-Eldith/COLM) |
> **Dataset:** [edeneldith/DCDM](https://huggingface.co/datasets/edeneldith/DCDM) |
> **Predecessor:** [WiggleGPT (Zenodo)](https://doi.org/10.5281/zenodo.17919011)
**Author:** Phillip C. O'Brien — ORCID [0009-0007-3961-1182](https://orcid.org/0009-0007-3961-1182)
## What is COLM?
COLM is a novel autoregressive language model that operates entirely in the complex number plane using oscillatory neurons. It replaces the transformer's quadratic-complexity self-attention with an O(N) causal recurrence driven by complex-valued gates, and replaces all learned linear transformations in its core blocks with fixed unitary rotations and element-wise complex oscillatory activations.
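As an illustration of what a fixed, energy-preserving mixer can look like, the sketch below uses an orthonormal discrete Fourier transform, which is unitary and has no learned parameters. The DFT is an assumption chosen for illustration, and `fixed_unitary_mix` is a hypothetical name; this card does not state which unitary transform COLM actually uses.

```python
import torch


def fixed_unitary_mix(z: torch.Tensor) -> torch.Tensor:
    """Mix channels with a parameter-free unitary map (orthonormal DFT).

    Assumption: the DFT stands in for COLM's mixer, whose exact form is
    not given in this card.
    """
    return torch.fft.fft(z, dim=-1, norm="ortho")


torch.manual_seed(0)
z = torch.randn(2, 8, 324, dtype=torch.cfloat)  # (batch, time, n_embd)
mixed = fixed_unitary_mix(z)

# A unitary map preserves the squared magnitude summed over channels.
energy_in = z.abs().pow(2).sum(dim=-1)
energy_out = mixed.abs().pow(2).sum(dim=-1)
```

Because the map is unitary, `energy_in` and `energy_out` agree up to floating-point error, which is the sense in which such a mixer is energy-preserving.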
**Zero `nn.Linear` layers in the processing blocks:** all transformation is performed by the oscillating activation `sin(W * Z + B) * tanh(Z)`, where `W` and `B` are complex-valued, with signals routed through fixed energy-preserving complex mixers.
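A minimal sketch of the oscillating activation as a PyTorch module follows. The class name `Oscillator`, the per-channel parameter shape, and the initialisation are assumptions made for illustration; the released `model.py` is the authoritative implementation.

```python
import torch
import torch.nn as nn


class Oscillator(nn.Module):
    """Element-wise activation sin(W * Z + B) * tanh(Z) with complex W, B."""

    def __init__(self, n_embd: int):
        super().__init__()
        # Hypothetical initialisation: W near 1 and B near 0, both complex.
        self.W = nn.Parameter(
            torch.ones(n_embd, dtype=torch.cfloat)
            + 0.02 * torch.randn(n_embd, dtype=torch.cfloat)
        )
        self.B = nn.Parameter(0.02 * torch.randn(n_embd, dtype=torch.cfloat))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # torch.sin and torch.tanh both accept complex tensors.
        return torch.sin(self.W * z + self.B) * torch.tanh(z)


torch.manual_seed(0)
osc = Oscillator(324)
z = torch.randn(2, 8, 324, dtype=torch.cfloat)  # (batch, time, n_embd)
out = osc(z)  # same shape and dtype as z
```

The `tanh(Z)` factor means a zero input always maps to a zero output, regardless of `W` and `B`.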
## Key Results
| Metric | Value |
|--------|-------|
| **Parameters** | 498,214 |
| **Best validation loss** | 1.1449 |
| **Creativity score** (GPT-5.4 blind eval) | 4.83 / 10 |
| **Age group estimate** | 84% rated age 13-16 |
| **Training time** | 8.7 hours |
| **Hardware** | Single RTX 5060 Ti 16GB |
| **Tokenizer** | 499-token word+character hybrid (396 word tokens, 98 character fallback) |
| **Domain** | Theological-philosophical prose |
At 498k parameters, roughly half the size of TinyStories' smallest coherent model, COLM generates thematically coherent philosophical prose at temperature 1 with no spell correction.
## Architecture
| Component | COLM |
|-----------|------|
| State | Native `torch.cfloat` throughout |
| Activation | `sin(W * Z + B) * tanh(Z)`, complex W, B |
| Sequence routing | O(N) causal recurrence via `torch.cumsum` |
| MLP/FFN | Fixed unitary mixer -> Oscillator -> mixer -> Oscillator |
| Residual | Complex sinc resonance coupling |
| Normalisation | ComplexRMSNorm (phase-preserving) |
| Sparsity | Learnable sigmoidal gate on magnitude |
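Two rows of the table above can be sketched in a few lines: a causal sequence mixer built on `torch.cumsum`, and a phase-preserving RMS normalisation. Both functions are illustrative assumptions (a gated running mean and a division by the real RMS magnitude); the card does not give COLM's exact recurrence or norm, so the names and formulas here are hypothetical.

```python
import torch


def causal_gated_mean(z: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    """Causal running mean of gate * z along time, using a single cumsum.

    z, gate: complex tensors of shape (batch, time, channels).
    Position t only sees positions <= t, and the cost is linear in time.
    """
    prefix = torch.cumsum(gate * z, dim=1)
    steps = torch.arange(1, z.shape[1] + 1, dtype=torch.float32, device=z.device)
    return prefix / steps.view(1, -1, 1)


def complex_rms_norm(z: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Divide by the real RMS magnitude, leaving every phase angle unchanged."""
    rms = z.abs().pow(2).mean(dim=-1, keepdim=True).add(eps).sqrt()
    return z / rms


torch.manual_seed(0)
z = torch.randn(2, 16, 8, dtype=torch.cfloat)
gate = torch.randn(2, 16, 8, dtype=torch.cfloat)  # stand-in for learned gates
h = causal_gated_mean(z, gate)
normed = complex_rms_norm(h)
```

Since the prefix sum at position `t` never includes later positions, changing a future token leaves all earlier outputs untouched, which is the property an autoregressive model needs.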
## Model Configuration
```json
{
  "n_embd": 324,
  "n_layer": 16,
  "embed_dim": 66,
  "block_size": 128,
  "vocab_size": 499
}
```
## Files
| File | Description |
|------|-------------|
| `colm_best_Final.pt` | Best checkpoint (step 860,000, val loss 1.1449) |
| `colm_config.json` | Full training and architecture configuration |
| `colm_tokenizer.json` | 499-token word+character hybrid tokenizer vocabulary |
| `model.py` | All `nn.Module` classes needed to load the model |
## Usage
```python
import json

import torch

from model import COLM

# Load the architecture section of the config
with open("colm_config.json") as f:
    config = json.load(f)
arch = config["architecture"]

# Build the model from the saved hyperparameters
model = COLM(
    vocab_size=arch["vocab_size"],
    n_embd=arch["n_embd"],
    n_layer=arch["n_layer"],
    block_size=arch["block_size"],
    embed_dim=arch["embed_dim"],
)

# Load weights on CPU and switch to inference mode
checkpoint = torch.load("colm_best_Final.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
```
See the [GitHub repository](https://github.com/Eden-Eldith/COLM) for full training, generation, and evaluation scripts.
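For orientation, a generic temperature-sampling loop is sketched below. It is not the repository's generation script: `logits_fn` is a hypothetical callable that maps token ids of shape `(batch, time)` to logits of shape `(batch, time, vocab)`, and the real forward signature of `COLM` in `model.py` may differ. The demonstration uses a random stand-in instead of the trained model.

```python
import torch


@torch.no_grad()
def generate(logits_fn, idx, max_new_tokens, block_size=128, temperature=1.0):
    """Sample tokens one at a time, feeding each back as context.

    logits_fn is a hypothetical callable: (batch, time) ids ->
    (batch, time, vocab) logits. Check model.py for the real interface.
    """
    for _ in range(max_new_tokens):
        context = idx[:, -block_size:]         # crop to the context window
        logits = logits_fn(context)[:, -1, :]  # logits for the next token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx


def dummy_logits_fn(ids):
    """Stand-in for a model: random logits over a 499-token vocabulary."""
    return torch.randn(ids.shape[0], ids.shape[1], 499)


torch.manual_seed(0)
start = torch.zeros(1, 1, dtype=torch.long)
out = generate(dummy_logits_fn, start, max_new_tokens=20)
```

With `temperature=1.0` the logits are sampled as-is, matching the temperature quoted in the results above; the crop to `block_size` keeps the context within the 128-token window from the model configuration.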
## Training Data
COLM was trained on the [DCDM dataset](https://huggingface.co/datasets/edeneldith/DCDM): 47 million tokens of synthetic theological-philosophical prose generated from 93 public-domain works through a locally run Gemma 3 12B pipeline.
## Limitations
- **Spelling:** The 499-token vocabulary contains 396 whole-word tokens covering common English and corpus-specific domain words. Words outside this vocabulary must be assembled character by character, which produces spelling variation on out-of-vocabulary terms.
- **Single trained model:** The released checkpoint has only generated text in the DCDM theological-philosophical register; cross-domain output from the trained model is untested. The data generation pipeline has been validated across approximately 894,000 tokens of private source material spanning archaeology, theology, mythology, philosophy, political history, intelligence studies, science fiction, and AI research.
- **Batch size:** The final run used `batch_size=4` rather than the intended 32, so the reported results should be read as a lower bound.
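The spelling limitation follows from how a word+character hybrid tokenizer encodes text. The toy encoder below shows the fallback behaviour on a three-word vocabulary. The vocabularies, the `encode` function, and the handling of spaces are all assumptions for illustration; the real mapping lives in `colm_tokenizer.json`.

```python
def encode(text, word_vocab, char_vocab):
    """Encode text with whole-word tokens, falling back to characters."""
    ids = []
    for i, word in enumerate(text.split(" ")):
        if i > 0:
            ids.append(char_vocab[" "])  # assumed: spaces are character tokens
        if word in word_vocab:
            ids.append(word_vocab[word])                   # one token per known word
        else:
            ids.extend(char_vocab[ch] for ch in word)      # one token per character
    return ids


# Toy vocabularies, not the released tokenizer.
word_vocab = {"the": 0, "soul": 1, "is": 2}
char_vocab = {ch: 3 + i for i, ch in enumerate(" abcdefghijklmnopqrstuvwxyz")}

ids = encode("the soul is eternal", word_vocab, char_vocab)
```

Here the three known words cost one token each, while the out-of-vocabulary word "eternal" costs seven character tokens. At generation time the model has to emit each of those characters correctly in sequence, which is where spelling variation can enter.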
## Citation
```bibtex
@misc{obrien2026colm,
author = {O'Brien, Phillip C.},
title = {COLM: Complex Oscillating Language Model — Coherent Language from Sub-500k Parameter Oscillatory Models},
year = {2026},
publisher = {Zenodo},
url = {https://github.com/Eden-Eldith/COLM}
}
```
## Licence
MIT License. Copyright (c) 2025-2026 Phillip C. O'Brien.