---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- symbiogenesis
- monarch-mixer
- long-convolution
- causal-conv
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
model-index:
- name: SymbioSLM
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: LisaMegaWatts/philosophy-corpus
      name: philosophy-corpus
    metrics:
    - type: perplexity
      value: 79.9
      name: Val PPL (step 1000)
---

# SymbioSLM

A ~4.1M-parameter decoder-only language model using the **Symbiogenesis** architecture — a novel multi-organelle sequence-mixing design inspired by biological endosymbiosis (Margulis, 1967). Implemented entirely in Julia using Lux.jl and trained on classical philosophy texts.

## Architecture

Symbiogenesis replaces softmax attention with three complementary "organelles" per block, fused via a learned per-channel gate:

```
SymbioBlock (x6)
+-- RMSNorm
+-- SymbioSequenceMixer
|   +-- Organelle 1: CausalDepthwiseConv1d (local n-gram patterns, K=4)
|   +-- Organelle 2: Multi-head MonarchMatrix (global sub-quadratic mixing)
|   +-- Organelle 3: LongConv (global dense causal filter)
|   +-- OrganelleGate (per-channel softmax fusion)
+-- RMSNorm
+-- SwiGLU FFN
```

### How It Works

1. **CausalConv** captures local bigram/trigram/4-gram patterns via depthwise convolution (one kernel per channel, length 4).

2. **Monarch matrices** provide global sequence mixing through the factorization M = P^T * BlockDiag(L1) * P * BlockDiag(L2), achieving an 87.5% parameter reduction vs dense mixing (8,192 vs 65,536 params per head at T=256).

3. **LongConv** learns a full-length (T=256) causal filter per channel, enabling arbitrary position-dependent mixing.

4. **OrganelleGate** fuses all three via a per-channel softmax: each of the 256 embedding channels independently learns which organelle to rely on.

No positional encoding (RoPE) is needed — the Monarch matrices and LongConv kernels implicitly learn position-dependent patterns.
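
The fusion step can be sketched in a few lines of plain Julia. This is an illustrative toy, not the repository's implementation: all names here are hypothetical, and the Monarch organelle is stubbed with an identity branch so the sketch stays short.

```julia
using Random

# Organelle 1/3: depthwise causal convolution. x is D×T (channels × time),
# k is D×K; K=4 gives the local n-gram organelle, K=T the LongConv organelle.
function causal_depthwise_conv(x::Matrix{Float64}, k::Matrix{Float64})
    D, T = size(x); K = size(k, 2)
    y = zeros(D, T)
    for t in 1:T, j in 1:K
        s = t - K + j                      # only past/current positions
        s >= 1 && (y[:, t] .+= k[:, j] .* x[:, s])
    end
    return y
end

softmax(v) = (e = exp.(v .- maximum(v)); e ./ sum(e))

# OrganelleGate: each channel d mixes the three branches with its own softmax.
function fuse(branches, gate_logits::Matrix{Float64})
    D, T = size(branches[1])
    y = zeros(D, T)
    for d in 1:D
        w = softmax(gate_logits[d, :])     # 3 weights, sum to 1
        for (i, b) in enumerate(branches)
            y[d, :] .+= w[i] .* b[d, :]
        end
    end
    return y
end

Random.seed!(0)
D, T = 4, 8
x = randn(D, T)
local_br = causal_depthwise_conv(x, randn(D, 4))   # K=4 n-gram organelle
long_br  = causal_depthwise_conv(x, randn(D, T))   # full-length causal filter
y = fuse([local_br, long_br, x], randn(D, 3))      # x stands in for the Monarch branch
```

Causality holds because `s = t - K + j <= t`; a real block wraps this mixer between the two RMSNorms shown in the diagram above.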

## Model Details

| Parameter | Value |
|---|---|
| Architecture | Symbiogenesis (3 organelles + gate) |
| Parameters | ~4.1M |
| Embed dim | 256 |
| Layers | 6 |
| Monarch heads | 4 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| FFN | SwiGLU (hidden=640) |
| Normalization | RMSNorm (pre-norm) |
| Weight tying | Yes (shared input/output embeddings) |
| Precision | Float32 (Float16 is slower at these Monarch block sizes) |

### Parameter Breakdown

| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 12.6% |
| CausalConv (x6) | 6.1K | 0.2% |
| Monarch heads (x6, 4 heads each) | 197K | 4.8% |
| LongConv (x6) | 393K | 9.7% |
| OrganelleGate (x6) | 4.6K | 0.1% |
| SwiGLU FFN (x6) | 2.95M | 72.6% |
| RMSNorm (x13) | 3.3K | <0.1% |
| **Total** | **~4.1M** | |
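
The breakdown can be reproduced from the configuration numbers. A sketch, where the 16×16 block-diagonal shape of the Monarch factors is inferred from the 8,192 params/head figure rather than taken from the code:

```julia
D, V, T, L = 256, 2000, 256, 6            # embed dim, vocab, context, layers
K, H, F    = 4, 4, 640                    # conv kernel, Monarch heads, SwiGLU hidden

embed    = V * D                          # 512_000 (tied with the output head)
causal   = L * D * K                      # depthwise kernels: 6_144
monarch  = L * H * 2 * (T ÷ 16) * 16^2    # two 16×16 block-diagonal factors per head
longconv = L * D * T                      # one full-length filter per channel: 393_216
gate     = L * D * 3                      # three logits per channel: 4_608
swiglu   = L * 3 * D * F                  # gate/up/down projections: 2_949_120

total = embed + causal + monarch + longconv + gate + swiglu   # 4_061_696
```

Adding the ~3.3K RMSNorm scales gives the ~4.1M total in the table.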

### Sequence Mixing Efficiency

| | Transformer | Monarch | Symbiogenesis |
|---|---|---|---|
| Seq mixer params/block | 262K | 67K | 100K |
| Reduction vs Transformer | - | 74% | **62%** |
| Position encoding | RoPE (separate) | None | None |

## Training

| Setting | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=1e-3, min_lr=1e-4, cosine decay) |
| Batch size | 32 |
| Hardware | NVIDIA RTX 3060 12GB |
| Throughput | ~19K tok/s (Float32) |
| Framework | Julia + Lux.jl + Zygote.jl + CUDA.jl |

### Training Progress (partial)

| Step | Train Loss | Val Loss | Val PPL | Gate Entropy |
|---|---|---|---|---|
| 1 | 17.10 | 17.03 | 24.9M | 1.099 |
| 500 | 6.50 | 4.92 | 137.5 | 1.098 |
| 1,000 | 4.43 | 4.38 | 79.9 | 1.094 |
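
Val PPL is simply `exp` of the validation cross-entropy (in nats); the rows are consistent up to rounding of the reported losses:

```julia
# Perplexity from mean cross-entropy in nats.
ppl(loss) = exp(loss)

for (loss, reported) in [(17.03, 24.9e6), (4.92, 137.5), (4.38, 79.9)]
    @assert isapprox(ppl(loss), reported; rtol=0.01)
end
```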

### Gelation Monitoring

Training includes phase-transition detection inspired by polymer physics:

- **CUSUM on loss curvature**: detects sudden changes in the second derivative of the loss curve
- **Gate entropy**: tracks organelle specialization (ln 3 ≈ 1.099 = uniform over the three organelles, 0 = fully specialized)
- **Kuramoto order parameter**: measures synchronization of block dynamics (R > 0.9 = gelation)
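
Two of these monitors are simple enough to sketch in plain Julia (helper names here are hypothetical):

```julia
# Gate entropy in nats: a uniform gate over the 3 organelles gives ln(3) ≈ 1.099,
# matching the value reported at step 1.
entropy(p) = -sum(x -> x <= 0 ? 0.0 : x * log(x), p)
@assert isapprox(entropy([1/3, 1/3, 1/3]), log(3); atol=1e-12)

# Kuramoto order parameter R = |mean(e^{iθ})| over per-block phases θ:
# R ≈ 1 when blocks are synchronized, R ≈ 0 when phases are spread out.
kuramoto_R(θ) = abs(sum(cis, θ) / length(θ))
@assert kuramoto_R(fill(0.7, 6)) ≈ 1.0                   # synchronized
@assert kuramoto_R(range(0, 2π * 5/6; length=6)) < 0.1   # evenly spread
```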

## Comparison with Other Julia SLM Variants

| | [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | **SymbioSLM** |
|---|---|---|---|
| Architecture | Transformer | Monarch Mixer | Symbiogenesis |
| Sequence mixing | 4-head attention | 8-head Monarch + conv | 3 organelles + gate |
| Parameters | 5.04M | 4.98M | ~4.1M |
| Layers | 6 | 8 | 6 |
| Val PPL | **34.5** | 38.4 | TBD |
| Throughput | 26K tok/s | 19K tok/s | 19K tok/s |
| Position encoding | RoPE | None | None |

## Usage

### Generate with Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT
using .JuliaGPT: Lux, CUDA

tok = BPETokenizer("vocab.json", "merges.txt")
device = Lux.gpu_device()
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device)

model = create_model(ModelConfig(;
    arch="symbiogenesis", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    n_monarch_heads=4, conv_kernel_size=4,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
println(text)
```

### OpenAI-Compatible API

The model is served via the [SymbioSLM Space](https://huggingface.co/spaces/LisaMegaWatts/SymbioSLM):

```bash
curl -X POST https://lisamegawatts-symbioslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

Streaming is supported with `"stream": true`.

## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |

## Biological Inspiration

The architecture is named after Lynn Margulis' theory of **symbiogenesis** (1967): the proposal that eukaryotic cells originated through the endosymbiotic fusion of distinct prokaryotic organisms. Mitochondria and chloroplasts retain their own DNA, demonstrating their origin as once-independent organisms that became specialized organelles within a larger cell.

Similarly, each SymbioBlock contains three "organelles" with different mathematical properties (local convolution, global structured mixing, global dense filtering) that are fused into a single functional unit through the learned OrganelleGate. The gate entropy tracks how strongly the network differentiates between organelles — analogous to the degree of specialization achieved through evolutionary integration.

## Citation

```bibtex
@misc{symbioslm2026,
  title={Symbiogenesis: Multi-Organelle Sequence Mixing for Small Language Models},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/SymbioSLM}
}
```

## References

- Margulis, L. (1967). On the origin of mitosing cells. *Journal of Theoretical Biology*, 14(3), 225-274.
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
- Poli, M., et al. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. *ICML 2023*.
- Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. *arXiv:2312.00752*.

## License

MIT