LisaMegaWatts committed on
Commit 3e52795 · verified · 1 Parent(s): a3a67ef

Restore v1 model (val loss 3.62, full 12305-step training)

Files changed (1): README.md +114 -146

README.md CHANGED
@@ -16,7 +16,10 @@ tags:
  - swiglu
  - bpe
  - text-generation
 pipeline_tag: text-generation
 model-index:
 - name: SymbioSLM
   results:
@@ -28,196 +31,161 @@ model-index:
      name: philosophy-corpus
      metrics:
      - type: perplexity
-      value: 79.9
-      name: Val PPL (step 1000)
 ---
 
 # SymbioSLM
 
- A ~5M parameter decoder-only language model using the **Symbiogenesis** architecture — a novel multi-organelle sequence mixing design inspired by biological endosymbiosis (Margulis, 1967). Implemented entirely in Julia using Lux.jl and trained on classical philosophy texts.
 
 ## Architecture
 
- Symbiogenesis replaces softmax attention with three complementary "organelles" per block, fused via a learned per-channel gate:
 
- ```
- SymbioBlock (x6)
- +-- RMSNorm
- +-- SymbioSequenceMixer
- |   +-- Organelle 1: CausalDepthwiseConv1d (local n-gram patterns, K=4)
- |   +-- Organelle 2: Multi-head MonarchMatrix (global sub-quadratic mixing)
- |   +-- Organelle 3: LongConv (global dense causal filter)
- |   +-- OrganelleGate (per-channel softmax fusion)
- +-- RMSNorm
- +-- SwiGLU FFN
- ```
-
- ### How It Works
 
- 1. **CausalConv** captures local bigram/trigram/4-gram patterns via depthwise convolution (1 kernel per channel, length 4).
 
- 2. **Monarch matrices** provide global sequence mixing through the factored form M = P^T * BlockDiag(L1) * P * BlockDiag(L2), achieving an 87.5% parameter reduction vs dense mixing (8,192 vs 65,536 params per head at T=256).
 
- 3. **LongConv** learns a full-length (T=256) causal filter per channel, enabling arbitrary position-dependent mixing.
 
- 4. **OrganelleGate** fuses all three via a per-channel softmax: each of the 256 embedding channels independently learns which organelle to rely on.
 
- No positional encoding (RoPE) is needed — the Monarch matrices and LongConv kernels implicitly learn position-dependent patterns.
-
- ## Model Details
 
  | Parameter | Value |
- |---|---|
- | Architecture | Symbiogenesis (3 organelles + gate) |
- | Parameters | ~4.1M |
- | Embed dim | 256 |
- | Layers | 6 |
- | Monarch heads | 4 |
- | Context length | 256 tokens |
- | Vocabulary | 2,000 (ByteLevel BPE) |
- | FFN | SwiGLU (hidden=640) |
  | Normalization | RMSNorm (pre-norm) |
- | Weight tying | Yes (shared input/output embeddings) |
- | Precision | Float32 (F16 slower for Monarch block sizes) |
 
  ### Parameter Breakdown
 
  | Component | Params | % |
- |---|---|---|
- | Token embedding (tied) | 512K | 12.6% |
- | CausalConv (x6) | 6.1K | 0.2% |
- | Monarch heads (x6, 4 heads each) | 197K | 4.8% |
- | LongConv (x6) | 393K | 9.7% |
- | OrganelleGate (x6) | 4.6K | 0.1% |
- | SwiGLU FFN (x6) | 2.95M | 72.6% |
- | RMSNorm (x13) | 3.3K | <0.1% |
- | **Total** | **~4.1M** | |
-
- ### Sequence Mixing Efficiency
-
- | | Transformer | Monarch | Symbiogenesis |
- |---|---|---|---|
- | Seq mixer params/block | 262K | 67K | 100K |
- | Reduction vs Transformer | - | 74% | **62%** |
- | Position encoding | RoPE (separate) | None | None |
-
- ## Training
-
- | | Value |
- |---|---|
- | Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
- | Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
- | Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
- | Optimizer | AdamW (lr=1e-3, min_lr=1e-4, cosine decay) |
  | Batch size | 32 |
- | Hardware | NVIDIA RTX 3060 12GB |
- | Throughput | ~19K tok/s (Float32) |
- | Framework | Julia + Lux.jl + Zygote.jl + CUDA.jl |
-
- ### Training Progress (partial)
-
- | Step | Train Loss | Val Loss | Val PPL | Gate Entropy |
- |---|---|---|---|---|
- | 1 | 17.10 | 17.03 | 24.9M | 1.099 |
- | 500 | 6.50 | 4.92 | 137.5 | 1.098 |
- | 1,000 | 4.43 | 4.38 | 79.9 | 1.094 |
-
- ### Gelation Monitoring
-
- Training includes phase transition detection inspired by polymer physics:
-
- - **CUSUM on loss curvature**: Detects sudden changes in 2nd derivative of loss curve
- - **Gate entropy**: Tracks organelle specialization (1.099 = uniform, 0 = fully specialized)
- - **Kuramoto order parameter**: Measures synchronization of block dynamics (R > 0.9 = gelation)
- ## Comparison with Other Julia SLM Variants
-
- | | [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | **SymbioSLM** |
- |---|---|---|---|
- | Architecture | Transformer | Monarch Mixer | Symbiogenesis |
- | Sequence mixing | 4-head attention | 8-head Monarch + conv | 3 organelles + gate |
- | Parameters | 5.04M | 4.98M | ~4.1M |
- | Layers | 6 | 8 | 6 |
- | Val PPL | **34.5** | 38.4 | TBD |
- | Throughput | 26K tok/s | 19K tok/s | 19K tok/s |
- | Position encoding | RoPE | None | None |
 
  ## Usage
 
- ### Generate with Julia
 
  ```julia
- using Pkg; Pkg.activate("julia-slm")
- include("src/JuliaGPT.jl")
- using .JuliaGPT
- using .JuliaGPT: Lux, CUDA
-
- tok = BPETokenizer("vocab.json", "merges.txt")
- device = Lux.gpu_device()
- ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device)
-
- model = create_model(ModelConfig(;
-     arch="symbiogenesis", vocab_size=vocab_size(tok),
-     embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
-     n_monarch_heads=4, conv_kernel_size=4,
-     ffn_mult=4, context_length=256, weight_tying=true,
- ))
-
- text = generate(model, ps, st, tok, "the nature of ";
-     max_new_tokens=200, temperature=0.8, top_k=40)
- println(text)
- ```
 
- ### OpenAI-Compatible API
 
- The model is served via [SymbioSLM Space](https://huggingface.co/spaces/LisaMegaWatts/SymbioSLM):
 
- ```bash
- curl -X POST https://lisamegawatts-symbioslm.hf.space/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{
-     "messages": [{"role": "user", "content": "the nature of"}],
-     "max_tokens": 200,
-     "temperature": 0.8,
-     "top_k": 40
-   }'
  ```
 
- Streaming supported with `"stream": true`.
-
- ## Files
-
- | File | Description |
- |---|---|
- | `final.jld2` | Trained model parameters (JLD2 format) |
- | `config.toml` | Model architecture configuration |
- | `vocab.json` | BPE vocabulary (2000 tokens) |
- | `merges.txt` | BPE merge rules |
-
- ## Biological Inspiration
-
- The architecture is named after Lynn Margulis' theory of **symbiogenesis** (1967): the proposal that eukaryotic cells originated through the endosymbiotic fusion of distinct prokaryotic organisms. Mitochondria and chloroplasts retain their own DNA, demonstrating their origin as once-independent organisms that became specialized organelles within a larger cell.
 
- Similarly, each SymbioBlock contains three "organelles" with different mathematical properties (local convolution, global structured mixing, global dense filtering) that are fused into a single functional unit through the learned OrganelleGate. The gate entropy tracks how strongly the network differentiates between organelles, analogous to the degree of specialization achieved through evolutionary integration.
 
203
  ## Citation
204
 
205
  ```bibtex
206
- @misc{symbioslm2026,
207
- title={Symbiogenesis: Multi-Organelle Sequence Mixing for Small Language Models},
208
  author={LisaMegaWatts},
209
  year={2026},
210
  url={https://huggingface.co/LisaMegaWatts/SymbioSLM}
211
  }
212
  ```
213
 
214
- ## References
215
-
216
- - Margulis, L. (1967). On the origin of mitosing cells. *J. Theoretical Biology*, 14(3), 225-274.
217
- - Dao, T., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
218
- - Poli, M., et al. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. *ICML 2023*.
219
- - Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
220
-
221
  ## License
222
 
223
  MIT
 
  - swiglu
  - bpe
  - text-generation
+ - attention-free
 pipeline_tag: text-generation
+ datasets:
+ - LisaMegaWatts/philosophy-corpus
 model-index:
 - name: SymbioSLM
   results:
      name: philosophy-corpus
      metrics:
      - type: perplexity
+      value: 37.3
+      name: Val PPL
+      verified: false
+      - type: loss
+      value: 3.62
+      name: Val Loss
+      verified: false
 ---
 
 # SymbioSLM
 
+ A **5.05M parameter** attention-free language model using the **Symbiogenesis** architecture — multi-organelle sequence mixing with learned per-channel gating. Trained on a philosophy corpus of 981 classical texts (~795M tokens).
 
 ## Architecture
 
+ Symbiogenesis replaces self-attention with three complementary "organelles" for sequence mixing, inspired by the biological theory of symbiogenesis (Margulis, 1967) — where complex organelles like mitochondria were once independent organisms that fused into eukaryotic cells.
 
+ Each of the 8 SymbioBlocks contains:
 
+ | Organelle | Function | Scale | Complexity |
+ |-----------|----------|-------|------------|
+ | **CausalDepthwiseConv1d** | Local n-gram pattern detection | Local (kernel=4) | O(n) |
+ | **Monarch Matrix** | Sub-quadratic global sequence mixing | Global | O(n√n) |
+ | **LongConv** | Dense causal convolution filtering | Global | O(n log n) |
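The sub-quadratic cost of the Monarch organelle follows from its factored form M = P^T * BlockDiag(L1) * P * BlockDiag(L2): the permutations are fixed, so only the two block-diagonal factors carry parameters. A quick arithmetic sketch, assuming √T-sized blocks (this reproduces the 8,192-vs-65,536 per-head count at T=256 quoted in the previous version of this card):

```python
import math

def monarch_param_count(T: int) -> int:
    """Parameters in a Monarch-factored T x T mixing matrix:
    two block-diagonal factors, each holding sqrt(T) dense blocks
    of size sqrt(T) x sqrt(T); the permutations P are parameter-free."""
    b = math.isqrt(T)
    assert b * b == T, "T must be a perfect square"
    per_factor = b * b * b           # sqrt(T) blocks, sqrt(T)^2 entries each
    return 2 * per_factor            # BlockDiag(L1) and BlockDiag(L2)

T = 256
dense = T * T                        # dense mixing matrix over T positions
monarch = monarch_param_count(T)
print(dense, monarch, 1 - monarch / dense)  # 65536 8192 0.875
```

The 2·T^1.5 parameter count is where the O(n√n) entry in the table comes from; at T=256 it is an 87.5% reduction over a dense mixer.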
 
+ An **OrganelleGate** (per-channel softmax) learns which organelle each embedding channel relies on, creating specialized "fused organisms" per block.
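The fusion step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the Lux.jl implementation: the name `organelle_gate` and the shapes are assumptions, with the gate taken to hold one learned logit per organelle per channel. With all-zero logits the gate is uniform over the 3 organelles, i.e. entropy ln 3 ≈ 1.099, the reported starting value of the gate-entropy metric.

```python
import numpy as np

def organelle_gate(outs, logits):
    """Fuse organelle outputs with a per-channel softmax gate.
    outs:   list of 3 arrays, each (T, D) — one output per organelle.
    logits: (3, D) learned gate logits — one weight per organelle per channel."""
    w = np.exp(logits - logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)            # softmax over the 3 organelles
    return sum(w[i] * outs[i] for i in range(len(outs)))  # (D,) weights broadcast over T

T, D = 4, 8                           # toy sizes (the model uses T=256, D=256)
rng = np.random.default_rng(0)
outs = [rng.normal(size=(T, D)) for _ in range(3)]
logits = np.zeros((3, D))             # uniform gate: each organelle weighted 1/3
y = organelle_gate(outs, logits)
print(y.shape)                        # (4, 8)
```

Gating per channel rather than per token keeps the gate tiny — 3 × 256 = 768 logits per block, roughly consistent with the ~769-parameter OrganelleGate in the breakdown below.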
 
+ ### No Positional Encoding
 
+ SymbioSLM requires **no explicit positional encoding** (no RoPE, no sinusoidal embeddings). The Monarch matrices and LongConv kernels implicitly learn position-dependent mixing patterns, while CausalConv captures local ordering through its convolutional structure.
 
+ ### Model Specifications
 
  | Parameter | Value |
+ |-----------|-------|
+ | Architecture | Symbiogenesis |
+ | Parameters | 5,052,672 (5.05M) |
+ | Embedding dim | 256 |
+ | Layers | 8 |
+ | Monarch heads | 1 per block |
+ | Conv kernel | 4 |
+ | FFN | SwiGLU (4x, 2/3 adjusted) |
  | Normalization | RMSNorm (pre-norm) |
+ | Context length | 256 tokens |
+ | Vocab size | 2,000 (BPE) |
+ | Weight tying | Yes |
+ | Free energy reg | 0.001 |
 
  ### Parameter Breakdown
 
  | Component | Params | % |
+ |-----------|--------|---|
+ | Token embedding | 512,000 | 10.1% |
+ | SymbioBlocks (8x) | 4,540,672 | 89.9% |
+ | &nbsp;&nbsp; CausalConv | ~8K/block | |
+ | &nbsp;&nbsp; Monarch | ~131K/block | |
+ | &nbsp;&nbsp; LongConv | ~65K/block | |
+ | &nbsp;&nbsp; OrganelleGate | ~769/block | |
+ | &nbsp;&nbsp; SwiGLU FFN | ~350K/block | |
+ | &nbsp;&nbsp; RMSNorm (2x) | ~512/block | |
+ | Final RMSNorm | 256 | <0.1% |
+
+ ## Results
+
+ Trained for 12,305 steps on an NVIDIA RTX 3060 (12GB).
+
+ | Metric | Value |
+ |--------|-------|
+ | **Val Loss** | **3.62** |
+ | **Val PPL** | **37.3** |
+ | Training steps | 12,305 |
  | Batch size | 32 |
+ | Precision | Float16 (AMP) |
+
+ ### Comparison with Other 5M Julia SLMs
+
+ All models trained on the same philosophy corpus with identical tokenizer and training budget (12,305 steps):
+
+ | Model | Architecture | Params | Val Loss | Val PPL |
+ |-------|-------------|--------|----------|---------|
+ | [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Transformer (RoPE) | 5.04M | **3.54** | **34.5** |
+ | **SymbioSLM** | **Symbiogenesis** | **5.05M** | **3.62** | **37.3** |
+ | [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Monarch Mixer | 5.04M | 3.65 | 38.4 |
+
+ SymbioSLM outperforms the Monarch-only baseline while using no attention mechanism. The multi-organelle fusion provides complementary mixing at different scales that a single mixer cannot achieve alone.
+
+ ## Training Configuration
+
+ ```toml
+ [model]
+ arch = "symbiogenesis"
+ embed_dim = 256
+ n_layers = 8
+ n_monarch_heads = 1
+ conv_kernel_size = 4
+ ffn_mult = 4
+ context_length = 256
+ weight_tying = true
+ free_energy_beta = 0.001
+
+ [training]
+ optimizer = "adamw"
+ lr = 6e-4
+ min_lr = 6e-5
+ warmup_steps = 500
+ max_steps = 12305
+ batch_size = 32
+ grad_clip = 1.0
+ precision = "f16"
+ ```
 
+ ## Gelation Monitoring
 
+ Training includes gelation monitoring via CUSUM change-point detection on gate entropy. This tracks when the organelle gates transition from uniform mixing to specialized configurations — a phase transition analogous to gel formation in polymer physics.
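The detector can be sketched generically: a one-sided CUSUM accumulates drops of gate entropy below its initial (uniform, ln 3 ≈ 1.099) value and flags a change once the cumulative drop clears a threshold. The `drift` and `threshold` values below are illustrative, not the ones used in training:

```python
def cusum_change_step(series, drift=0.0, threshold=0.05):
    """One-sided CUSUM: flag the first index where the cumulative
    downward deviation from the initial value exceeds `threshold`.
    Returns the index, or None if no change is detected."""
    s, baseline = 0.0, series[0]
    for i, x in enumerate(series):
        s = max(0.0, s + (baseline - x) - drift)  # accumulate drops below baseline
        if s > threshold:
            return i
    return None

# Toy gate-entropy trace: flat near ln(3), then specialization begins.
entropy = [1.099, 1.099, 1.098, 1.094, 1.080, 1.050, 0.990]
print(cusum_change_step(entropy))  # prints 5
```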
 
  ## Usage
 
+ ### Julia (Lux.jl)
 
  ```julia
+ using JuliaGPT
+
+ # Load model
+ config = load_config("config.toml")
+ model = create_model(config.model)
+ ps, st, _, _, _ = load_checkpoint("final.jld2")
+
+ # Load tokenizer
+ tokenizer = BPETokenizer("vocab.json", "merges.txt")
+
+ # Generate text
+ prompt = "The nature of reality"
+ output = generate(model, ps, st, tokenizer, prompt;
+     max_new_tokens=200, temperature=0.8, top_k=40)
+ println(output)
  ```
 
+ ## References
 
+ - **Symbiogenesis framework**: [DavinciDreams/symbiogenesis](https://github.com/DavinciDreams/symbiogenesis) — Evolutionary NAS via organism fusion
+ - **Monarch Mixer**: Dao et al., 2023 — Sub-quadratic GEMM-based sequence mixing
+ - **Hyena**: Poli et al., 2023 — Long convolutions for sequence modeling
+ - **Endosymbiotic theory**: Margulis, 1967 — Origin of eukaryotic organelles
 
  ## Citation
 
  ```bibtex
+ @misc{symbio-slm-2026,
+   title={SymbioSLM: Multi-Organelle Sequence Mixing for Attention-Free Language Modeling},
    author={LisaMegaWatts},
    year={2026},
    url={https://huggingface.co/LisaMegaWatts/SymbioSLM}
  }
  ```
 
  ## License
 
  MIT