edeneldith committed on
Commit
172d88b
·
verified ·
1 Parent(s): e254270

Upload README.md with huggingface_hub

  1. README.md +135 -3
README.md CHANGED
@@ -1,3 +1,135 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ language:
+ - en
+ tags:
+ - complex-valued
+ - oscillating-neurons
+ - language-model
+ - autoregressive
+ - character-level
+ - linear-time
+ - pytorch
+ - from-scratch
+ datasets:
+ - edeneldith/DCDM
+ pipeline_tag: text-generation
+ library_name: pytorch
+ ---
+
+ # COLM: Complex Oscillating Language Model
+
+ > **Paper:** [Zenodo (PDF)](https://doi.org/10.5281/zenodo.XXXXXXX) |
+ > **Code:** [GitHub](https://github.com/Eden-Eldith/COLM) |
+ > **Dataset:** [edeneldith/DCDM](https://huggingface.co/datasets/edeneldith/DCDM) |
+ > **Predecessor:** [WiggleGPT (Zenodo)](https://doi.org/10.5281/zenodo.17919011)
+
+ **Author:** Phillip C. O'Brien (ORCID [0009-0007-3961-1182](https://orcid.org/0009-0007-3961-1182))
+
+ ## What is COLM?
+
+ COLM is a novel autoregressive language model that operates entirely in the complex plane using oscillatory neurons. It replaces the transformer's quadratic-cost self-attention with an O(N) causal recurrence driven by complex-valued gates, and it replaces every learned linear transformation in its core blocks with fixed unitary rotations and element-wise complex oscillatory activations.
+
+ **Zero `nn.Linear` layers in the processing blocks.** All transformation is performed by the oscillating activation `sin(W * Z + B) * tanh(Z)`, where `W` and `B` are complex-valued, routed through fixed energy-preserving complex mixers.
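The activation formula above can be tried out without any dependencies. The following is a minimal sketch in plain Python using `cmath`, not the code from `model.py`: the function names `oscillate`, `unitary_dft`, and `energy` are hypothetical, and the normalised discrete Fourier transform is only a stand-in for "a fixed energy-preserving complex mixer" (the README does not say which unitary COLM actually uses).

```python
import cmath
import math


def oscillate(z: complex, w: complex, b: complex) -> complex:
    """Apply sin(w * z + b) * tanh(z) to a single complex element."""
    return cmath.sin(w * z + b) * cmath.tanh(z)


def unitary_dft(zs):
    """Stand-in fixed mixer: the DFT scaled by 1/sqrt(n), which is unitary.

    Assumption: chosen only because it has no learned weights and
    preserves energy; COLM's real mixer may be a different unitary.
    """
    n = len(zs)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(z * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, z in enumerate(zs))
        for k in range(n)
    ]


def energy(zs):
    """Sum of squared magnitudes of a complex vector."""
    return sum(abs(z) ** 2 for z in zs)


# Mix a small complex vector, then apply the activation element-wise
# with illustrative (not trained) per-element W and B values.
state = [1 + 2j, -0.5j, 3 + 0j, 0.25 - 1j]
mixed = unitary_dft(state)
activated = [oscillate(z, 1.0 + 0.1j, 0.05j) for z in mixed]
```

The mixer has nothing to train and leaves `energy(state)` unchanged up to rounding, so in this sketch the only adjustable quantities are the per-element `w` and `b`. In PyTorch the same element-wise expression can be written with `torch.sin` and `torch.tanh`, both of which accept complex tensors.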
34
+
35
+ ## Key Results
36
+
37
+ | Metric | Value |
38
+ |--------|-------|
39
+ | **Parameters** | 498,214 |
40
+ | **Best validation loss** | 1.1449 |
41
+ | **Creativity score** (GPT-5.4 blind eval) | 4.83 / 10 |
42
+ | **Age group estimate** | 84% rated age 13-16 |
43
+ | **Training time** | 8.7 hours |
44
+ | **Hardware** | Single RTX 5060 Ti 16GB |
45
+ | **Tokenizer** | 499-token word+character hybrid |
46
+ | **Domain** | Theological-philosophical prose |
47
+
48
+ At 498k parameters — roughly half the size of TinyStories' smallest coherent model — COLM generates thematically coherent philosophical prose at temperature 1 with no spell correction.
49
+
50
+ ## Architecture
51
+
52
+ | Component | COLM |
53
+ |-----------|------|
54
+ | State | Native `torch.cfloat` throughout |
55
+ | Activation | `sin(W * Z + B) * tanh(Z)`, complex W, B |
56
+ | Sequence routing | O(N) causal recurrence via `torch.cumsum` |
57
+ | MLP/FFN | Fixed unitary mixer -> Oscillator -> mixer -> Oscillator |
58
+ | Residual | Complex sinc resonance coupling |
59
+ | Normalisation | ComplexRMSNorm (phase-preserving) |
60
+ | Sparsity | Learnable sigmoidal gate on magnitude |
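To illustrate why a cumulative sum gives linear-time causal routing, here is a small sketch in plain Python. It assumes the simplest possible form, `h_t = sum over s <= t of gate_s * x_s`, which may well differ from the recurrence COLM actually implements; the name `causal_prefix_mix` is hypothetical, and `itertools.accumulate` plays the role that `torch.cumsum` plays on tensors.

```python
from itertools import accumulate


def causal_prefix_mix(xs, gates):
    """Gated running sum over a sequence of complex values.

    Position t only sees positions 0..t, and the whole sequence is
    processed in a single O(N) pass. Illustrative assumption only.
    """
    gated = [g * x for g, x in zip(gates, xs)]
    return list(accumulate(gated))


xs = [1 + 1j, 2 - 1j, 0.5j]
gates = [1j, 0.5, 1]
hidden = causal_prefix_mix(xs, gates)

# Changing a later input must not affect earlier outputs (causality).
perturbed = causal_prefix_mix([1 + 1j, 2 - 1j, 9 + 9j], gates)
```

Each output reuses the previous running total, so the cost grows linearly with sequence length, whereas self-attention compares every position with every earlier one.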
+
+ ## Model Configuration
+
+ ```json
+ {
+   "n_embd": 324,
+   "n_layer": 16,
+   "embed_dim": 66,
+   "block_size": 128,
+   "vocab_size": 499
+ }
+ ```
+
+ ## Files
+
+ | File | Description |
+ |------|-------------|
+ | `colm_best_Final.pt` | Best checkpoint (step 860,000, val loss 1.1449) |
+ | `colm_config.json` | Full training and architecture configuration |
+ | `colm_tokenizer.json` | 499-token word+character hybrid tokenizer vocabulary |
+ | `model.py` | All `nn.Module` classes needed to load the model |
+
+ ## Usage
+
+ ```python
+ import json
+
+ import torch
+ from model import COLM
+
+ # Load the training and architecture configuration
+ with open("colm_config.json") as f:
+     config = json.load(f)
+
+ # Build the model from the "architecture" section of the config
+ arch = config["architecture"]
+ model = COLM(
+     vocab_size=arch["vocab_size"],
+     n_embd=arch["n_embd"],
+     n_layer=arch["n_layer"],
+     block_size=arch["block_size"],
+     embed_dim=arch["embed_dim"],
+ )
+
+ # Load the best checkpoint on CPU and switch to inference mode
+ checkpoint = torch.load("colm_best_Final.pt", map_location="cpu")
+ model.load_state_dict(checkpoint["model_state_dict"])
+ model.eval()
+ ```
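The snippet above stops once the weights are loaded. As a rough sketch of what autoregressive sampling at temperature 1 with a 128-token context looks like, the following plain-Python loop may help. It is not the repository's generation script: `next_logits` is a hypothetical callable standing in for a forward pass (the real `COLM.forward` signature is not shown in this README), and `fake_next_logits` exists only so the example runs on its own.

```python
import math
import random


def sample_next(logits, temperature=1.0, rng=random):
    """Sample one token id from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate(next_logits, prompt_ids, max_new_tokens,
             block_size=128, temperature=1.0, rng=random):
    """Append sampled tokens, feeding at most block_size ids as context."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        context = ids[-block_size:]  # crop to the model's context window
        ids.append(sample_next(next_logits(context), temperature, rng))
    return ids


def fake_next_logits(context, vocab_size=5):
    """Toy stand-in for a model: always favours (last id + 1) mod vocab_size."""
    logits = [-1e9] * vocab_size
    logits[(context[-1] + 1) % vocab_size] = 0.0
    return logits


sampled = generate(fake_next_logits, [0], max_new_tokens=4)
```

With a real model, `next_logits` would wrap a forward pass under `torch.no_grad()` and return the logits for the last position; how exactly to obtain those depends on `model.py`.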
+
+ See the [GitHub repository](https://github.com/Eden-Eldith/COLM) for full training, generation, and evaluation scripts.
+
+ ## Training Data
+
+ Trained on the [DCDM dataset](https://huggingface.co/datasets/edeneldith/DCDM): 47 million tokens of synthetic theological-philosophical prose, generated from 93 public domain works by a locally run Gemma 3 12B pipeline.
+
+ ## Limitations
+
+ - **Spelling:** With only 499 tokens in the vocabulary, most words are assembled from character tokens, which produces spelling variation in generated text
+ - **Single domain:** Trained only on theological-philosophical text; cross-domain performance is untested
+ - **Batch size:** The final run used batch_size=4 rather than the intended 32, so the reported results should be read as a lower bound
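The spelling limitation follows from how a word+character hybrid vocabulary behaves. The sketch below assumes a simple rule, "emit a whole-word token when the word is in the vocabulary, otherwise fall back to one token per character"; the actual rules and file layout of `colm_tokenizer.json` are not described here, so `hybrid_encode`, the tiny vocabularies, and the `<unk>` token are all hypothetical.

```python
import re


def hybrid_encode(text, word_vocab, char_vocab, unk="<unk>"):
    """Split text into whole-word tokens where possible, else characters."""
    tokens = []
    # Pieces are runs of whitespace, runs of word characters, or single symbols.
    for piece in re.findall(r"\s+|\w+|[^\w\s]", text):
        if piece in word_vocab:
            tokens.append(piece)
        else:
            # Out-of-vocabulary piece: spell it out character by character.
            tokens.extend(ch if ch in char_vocab else unk for ch in piece)
    return tokens


word_vocab = {"the", "soul"}
char_vocab = set("abcdefghijklmnopqrstuvwxyz .")
tokens = hybrid_encode("the soul endures.", word_vocab, char_vocab)
```

Here "the" and "soul" each cost one token, while "endures" is spelled out letter by letter, so the model has to predict every character of a rare word correctly, which is where spelling variation creeps in.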
+
+ ## Citation
+
+ ```bibtex
+ @misc{obrien2026colm,
+   author = {O'Brien, Phillip C.},
+   title = {COLM: Complex Oscillating Language Model --- Coherent Language from Sub-500k Parameter Oscillatory Models},
+   year = {2026},
+   publisher = {Zenodo},
+   url = {https://github.com/Eden-Eldith/COLM}
+ }
+ ```
+
+ ## Licence
+
+ MIT License. Copyright (c) 2025-2026 Phillip C. O'Brien.