LisaMegaWatts committed (verified) · Commit fed3ca7 · Parent: 3cd6b53

Add model card with architecture details, provenance, and training metrics

Files changed (1): README.md (+223 lines)

---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- symbiogenesis
- monarch-mixer
- long-convolution
- causal-conv
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
model-index:
- name: SymbioSLM
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: LisaMegaWatts/philosophy-corpus
      name: philosophy-corpus
    metrics:
    - type: perplexity
      value: 79.9
      name: Val PPL (step 1000)
---

# SymbioSLM

A ~4M-parameter decoder-only language model using the **Symbiogenesis** architecture, a novel multi-organelle sequence-mixing design inspired by biological endosymbiosis (Margulis, 1967). Implemented entirely in Julia using Lux.jl and trained on classical philosophy texts.

## Architecture

Symbiogenesis replaces softmax attention with three complementary "organelles" per block, fused via a learned per-channel gate:

```
SymbioBlock (x6)
+-- RMSNorm
+-- SymbioSequenceMixer
|   +-- Organelle 1: CausalDepthwiseConv1d (local n-gram patterns, K=4)
|   +-- Organelle 2: Multi-head MonarchMatrix (global sub-quadratic mixing)
|   +-- Organelle 3: LongConv (global dense causal filter)
|   +-- OrganelleGate (per-channel softmax fusion)
+-- RMSNorm
+-- SwiGLU FFN
```
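
Organelle 1's computation can be sketched outside Julia. The following is an illustrative NumPy version of a K=4 causal depthwise convolution (this is not the model's Lux implementation, just the math it computes):

```python
import numpy as np

# Illustrative NumPy sketch of Organelle 1 (causal depthwise conv, K=4):
# one private kernel per channel, with left padding so the output at
# position t only sees inputs t-3..t.
def causal_depthwise_conv(x, kernels):
    # x: (T, C) sequence of embeddings; kernels: (C, K)
    T, C = x.shape
    K = kernels.shape[1]
    xp = np.concatenate([np.zeros((K - 1, C)), x], axis=0)  # pad the past
    out = np.zeros_like(x)
    for t in range(T):
        out[t] = (xp[t:t + K] * kernels.T).sum(axis=0)  # per-channel dot
    return out

# Causality check: an impulse at position 3 never leaks backwards.
x = np.zeros((8, 2)); x[3, 0] = 1.0
y = causal_depthwise_conv(x, np.ones((2, 4)))
print(y[:, 0])   # zero before position 3, nonzero for positions 3..6
```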

### How It Works

1. **CausalConv** captures local bigram/trigram/4-gram patterns via depthwise convolution (one kernel per channel, length 4).

2. **Monarch matrices** provide global sequence mixing through the factorization M = P^T * BlockDiag(L1) * P * BlockDiag(L2), achieving an 87.5% parameter reduction vs. dense mixing (8,192 vs. 65,536 params per head at T=256).

3. **LongConv** learns a full-length (T=256) causal filter per channel, enabling arbitrary position-dependent mixing.

4. **OrganelleGate** fuses all three via a per-channel softmax: each of the 256 embedding channels independently learns which organelle to rely on.

No positional encoding (RoPE) is needed: the Monarch matrices and LongConv kernels implicitly learn position-dependent patterns.
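
The factorization in step 2 can be checked numerically. A NumPy sketch (illustrative only; the model is Julia/Lux), assuming T=256 is split into 16 blocks of 16x16 and P is the standard reshape-transpose permutation used by Monarch matrices:

```python
import numpy as np

# NumPy sketch of one Monarch head. Assumption: T = 256 factors into
# 16 blocks of size 16, with P the reshape-transpose permutation.
T = 256
b = 16                                   # block size = sqrt(T)

rng = np.random.default_rng(0)
L1 = rng.standard_normal((b, b, b))      # 16 learned 16x16 blocks
L2 = rng.standard_normal((b, b, b))

def block_diag(blocks):
    n = len(blocks) * b
    M = np.zeros((n, n))
    for i, blk in enumerate(blocks):
        M[i * b:(i + 1) * b, i * b:(i + 1) * b] = blk
    return M

P = np.zeros((T, T))                     # permutation: transpose the 16x16 grid
for i in range(b):
    for j in range(b):
        P[i * b + j, j * b + i] = 1.0

# M = P^T * BlockDiag(L1) * P * BlockDiag(L2): global mixing from local blocks
M = P.T @ block_diag(L1) @ P @ block_diag(L2)

learned = 2 * b * b * b                  # only the block entries are learned
print(learned, T * T, 1 - learned / (T * T))   # 8192 65536 0.875
```

Only the 2 x 16 x 16 x 16 block entries are trainable; P is fixed, which is where the 87.5% reduction comes from.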

## Model Details

| Parameter | Value |
|---|---|
| Architecture | Symbiogenesis (3 organelles + gate) |
| Parameters | ~4.1M |
| Embed dim | 256 |
| Layers | 6 |
| Monarch heads | 4 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| FFN | SwiGLU (hidden=640) |
| Normalization | RMSNorm (pre-norm) |
| Weight tying | Yes (shared input/output embeddings) |
| Precision | Float32 (F16 slower for Monarch block sizes) |

### Parameter Breakdown

| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 12.6% |
| CausalConv (x6) | 6.1K | 0.2% |
| Monarch heads (x6, 4 heads each) | 197K | 4.8% |
| LongConv (x6) | 393K | 9.7% |
| OrganelleGate (x6) | 4.6K | 0.1% |
| SwiGLU FFN (x6) | 2.95M | 72.6% |
| RMSNorm (x13) | 3.3K | <0.1% |
| **Total** | **~4.1M** | |
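
The table's counts can be re-derived from the hyperparameters. A Python sketch, assuming bias-free layers, a three-matrix SwiGLU (gate/up/down), and 16x16 Monarch blocks (2 * 16^3 = 8,192 params per head):

```python
# Re-deriving the parameter table from the hyperparameters. Assumptions:
# bias-free layers, a three-matrix SwiGLU (gate/up/down), and 16x16
# Monarch blocks (2 * 16^3 = 8,192 params per head).
vocab, d, T, layers = 2000, 256, 256, 6
ffn_hidden, monarch_heads, kernel = 640, 4, 4

embed    = vocab * d                            # 512,000  tied embedding
conv     = d * kernel * layers                  # 6,144    depthwise causal conv
monarch  = 2 * 16**3 * monarch_heads * layers   # 196,608  block-diag factors
longconv = d * T * layers                       # 393,216  length-T filter/channel
gate     = 3 * d * layers                       # 4,608    3 logits per channel
swiglu   = 3 * d * ffn_hidden * layers          # 2,949,120 gate+up+down mats
rmsnorm  = d * 13                               # 3,328    2 per block + final

total = embed + conv + monarch + longconv + gate + swiglu + rmsnorm
print(total)   # 4065024, i.e. the "~4.1M" in the table
```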

### Sequence Mixing Efficiency

| | Transformer | Monarch | Symbiogenesis |
|---|---|---|---|
| Seq mixer params/block | 262K | 67K | 100K |
| Reduction vs Transformer | - | 74% | **62%** |
| Position encoding | RoPE (separate) | None | None |

## Training

| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (above the Chinchilla-optimal 20 tok/param) |
| Optimizer | AdamW (lr=1e-3, min_lr=1e-4, cosine decay) |
| Batch size | 32 |
| Hardware | NVIDIA RTX 3060 12GB |
| Throughput | ~19K tok/s (Float32) |
| Framework | Julia + Lux.jl + Zygote.jl + CUDA.jl |

### Training Progress (partial)

| Step | Train Loss | Val Loss | Val PPL | Gate Entropy |
|---|---|---|---|---|
| 1 | 17.10 | 17.03 | 24.9M | 1.099 |
| 500 | 6.50 | 4.92 | 137.5 | 1.098 |
| 1,000 | 4.43 | 4.38 | 79.9 | 1.094 |
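
The Val PPL column follows the standard definition PPL = exp(mean cross-entropy), so it can be sanity-checked against the Val Loss column (agreement is up to rounding of the reported losses):

```python
import math

# Sanity check: perplexity is exp(mean cross-entropy), so the Val PPL
# column should match exp(Val Loss) up to rounding of the reported losses.
for loss, reported in [(17.03, 24.9e6), (4.92, 137.5), (4.38, 79.9)]:
    print(f"exp({loss}) = {math.exp(loss):,.1f}  (table: {reported:,})")
```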

### Gelation Monitoring

Training includes phase-transition detection inspired by polymer physics:

- **CUSUM on loss curvature**: detects sudden changes in the 2nd derivative of the loss curve
- **Gate entropy**: tracks organelle specialization (ln 3 ≈ 1.099 = uniform, 0 = fully specialized)
- **Kuramoto order parameter**: measures synchronization of block dynamics (R > 0.9 = gelation)
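
Gate entropy here is the mean per-channel entropy of the 3-way organelle softmax, so its range is 0 to ln 3 ≈ 1.099. A NumPy sketch (function name hypothetical; the model's OrganelleGate is a Lux layer):

```python
import numpy as np

# The gate-entropy metric, sketched in NumPy. Each of the 256 channels
# holds 3 logits, one per organelle; the fusion weights are their softmax,
# and the reported entropy is the mean per-channel entropy of those weights.
def gate_entropy(logits):                        # logits: (channels, 3)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return float(-(w * np.log(w)).sum(axis=1).mean())

uniform = np.zeros((256, 3))                     # equal logits at init
special = np.tile([10.0, 0.0, 0.0], (256, 1))    # one dominant organelle

print(round(gate_entropy(uniform), 3))           # 1.099 (= ln 3)
print(round(gate_entropy(special), 3))           # 0.001 (nearly specialized)
```

The table's drift from 1.099 toward 1.094 over the first 1,000 steps shows the organelles just beginning to specialize.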

## Comparison with Other Julia SLM Variants

| | [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | **SymbioSLM** |
|---|---|---|---|
| Architecture | Transformer | Monarch Mixer | Symbiogenesis |
| Sequence mixing | 4-head attention | 8-head Monarch + conv | 3 organelles + gate |
| Parameters | 5.04M | 4.98M | ~4.1M |
| Layers | 6 | 8 | 6 |
| Val PPL | **34.5** | 38.4 | TBD |
| Throughput | 26K tok/s | 19K tok/s | 19K tok/s |
| Position encoding | RoPE | None | None |

## Usage

### Generate with Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT
using .JuliaGPT: Lux, CUDA

tok = BPETokenizer("vocab.json", "merges.txt")
device = Lux.gpu_device()
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device)

model = create_model(ModelConfig(;
    arch="symbiogenesis", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    n_monarch_heads=4, conv_kernel_size=4,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
println(text)
```

### OpenAI-Compatible API

The model is served via the [SymbioSLM Space](https://huggingface.co/spaces/LisaMegaWatts/SymbioSLM):

```bash
curl -X POST https://lisamegawatts-symbioslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

Streaming is supported with `"stream": true`.
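
For programmatic access, the curl call above translates to a few lines of standard-library Python (the response shape is assumed to follow the usual OpenAI `choices[0].message.content` layout, since the endpoint is OpenAI-compatible):

```python
import json
import urllib.request

# Minimal client for the Space's OpenAI-style endpoint shown above.
# URL and fields are taken from the curl example; "stream" is omitted.
URL = "https://lisamegawatts-symbioslm.hf.space/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment to call the live endpoint:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```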

## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |

## Biological Inspiration

The architecture is named after Lynn Margulis's theory of **symbiogenesis** (1967): the proposal that eukaryotic cells originated through the endosymbiotic fusion of distinct prokaryotic organisms. Mitochondria and chloroplasts retain their own DNA, demonstrating their origin as once-independent organisms that became specialized organelles within a larger cell.

Similarly, each SymbioBlock contains three "organelles" with different mathematical properties (local convolution, global structured mixing, global dense filtering) that are fused into a single functional unit through the learned OrganelleGate. The gate entropy tracks how strongly the network differentiates between organelles, analogous to the degree of specialization achieved through evolutionary integration.

## Citation

```bibtex
@misc{symbioslm2026,
  title={Symbiogenesis: Multi-Organelle Sequence Mixing for Small Language Models},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/SymbioSLM}
}
```

## References

- Margulis, L. (1967). On the origin of mitosing cells. *Journal of Theoretical Biology*, 14(3), 225-274.
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
- Poli, M., et al. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. *ICML 2023*.
- Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. *arXiv:2312.00752*.

## License

MIT