---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- symbiogenesis
- monarch-mixer
- long-convolution
- causal-conv
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
model-index:
- name: SymbioSLM
results:
- task:
type: text-generation
name: Text Generation
dataset:
type: LisaMegaWatts/philosophy-corpus
name: philosophy-corpus
metrics:
- type: perplexity
value: 79.9
name: Val PPL (step 1000)
---

# SymbioSLM

A ~4.1M parameter decoder-only language model using the Symbiogenesis architecture — a novel multi-organelle sequence mixing design inspired by biological endosymbiosis (Margulis, 1967). Implemented entirely in Julia using Lux.jl and trained on classical philosophy texts.
## Architecture
Symbiogenesis replaces softmax attention with three complementary "organelles" per block, fused via a learned per-channel gate:
```
SymbioBlock (x6)
+-- RMSNorm
+-- SymbioSequenceMixer
|   +-- Organelle 1: CausalDepthwiseConv1d (local n-gram patterns, K=4)
|   +-- Organelle 2: Multi-head MonarchMatrix (global sub-quadratic mixing)
|   +-- Organelle 3: LongConv (global dense causal filter)
|   +-- OrganelleGate (per-channel softmax fusion)
+-- RMSNorm
+-- SwiGLU FFN
```
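Schematically, the block is a standard pre-norm residual pair with the organelle mixer in place of attention. A minimal Julia-style sketch (the organelle functions are sketched in "How It Works" below; none of these names are the repository's actual API):

```julia
# Pre-norm residual skeleton of one SymbioBlock (illustrative pseudocode).
function symbio_block(x, blk)                  # x :: (embed_dim, T, batch)
    h = x .+ symbio_mixer(rmsnorm(x, blk.norm1), blk.mixer)
    h .+ swiglu_ffn(rmsnorm(h, blk.norm2), blk.ffn)
end

function symbio_mixer(x, m)
    y1 = causal_depthwise_conv(x, m.conv)      # organelle 1: local n-grams
    y2 = monarch_mix(x, m.monarch)             # organelle 2: global, structured
    y3 = long_conv(x, m.longconv)              # organelle 3: global, dense causal
    organelle_gate(y1, y2, y3, m.gate)         # learned per-channel fusion
end
```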
## How It Works
CausalConv captures local bigram/trigram/4-gram patterns via depthwise convolution (1 kernel per channel, length 4).
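A minimal plain-Julia sketch of this organelle (the actual layer is a Lux kernel and GPU-friendly, which this reference version is not; names and shapes are illustrative):

```julia
# Depthwise causal conv: one length-K kernel per channel; left-padding by
# K - 1 zeros ensures position t only sees tokens <= t.
function causal_depthwise_conv(x::AbstractMatrix, w::AbstractMatrix)  # x :: (d, T), w :: (d, K)
    d, T = size(x); K = size(w, 2)
    xp = hcat(zeros(eltype(x), d, K - 1), x)   # zero left-pad
    [sum(w[c, k] * xp[c, t + k - 1] for k in 1:K) for c in 1:d, t in 1:T]
end

x = randn(Float32, 256, 8)
y = causal_depthwise_conv(x, randn(Float32, 256, 4))   # K = 4, as in the model
```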
Monarch matrices provide global sequence mixing through factored M = P^T * BlockDiag(L1) * P * BlockDiag(L2), achieving 87.5% parameter reduction vs dense mixing (8,192 vs 65,536 params per head at T=256).
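A sketch of one head's factored multiply, assuming T = b² with b = 16 at T = 256; the block-diagonal factors and permutation are named after the formula above:

```julia
using LinearAlgebra

# One Monarch head acting on a length-T sequence (single channel shown).
function monarch_mix(x::AbstractVector, L1, L2, b::Int)
    T = length(x)                                    # T = b^2
    blockmul(L, v) = reduce(vcat, (L[i] * v[(i-1)*b+1:i*b] for i in 1:b))
    perm = vec(permutedims(reshape(1:T, b, b)))      # the permutation P
    y = blockmul(L2, x)                              # BlockDiag(L2) * x
    y = blockmul(L1, y[perm])                        # BlockDiag(L1) * P * ...
    y[invperm(perm)]                                 # P^T * ...
end

b = 16
L1 = [randn(Float32, b, b) for _ in 1:b]             # 16 blocks of 16 x 16
L2 = [randn(Float32, b, b) for _ in 1:b]
y = monarch_mix(randn(Float32, b^2), L1, L2, b)

# Two block-diagonal factors cost 2 * b * b^2 = 8,192 values per head,
# versus 65,536 for a dense 256 x 256 mixer: the 87.5% reduction above.
```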
LongConv learns a full-length (T=256) causal filter per channel, enabling arbitrary position-dependent mixing.
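The same idea in sketch form, using the standard FFT trick for long convolutions (zero-pad to 2T to avoid circular wrap-around); FFTW.jl is assumed here and is not necessarily what the repository uses:

```julia
using FFTW

# Per-channel causal convolution with a full-length filter h.
function long_conv(x::AbstractMatrix, h::AbstractMatrix)   # x, h :: (d, T)
    d, T = size(x)
    pad = zeros(eltype(x), d, T)
    y = irfft(rfft(hcat(x, pad), 2) .* rfft(hcat(h, pad), 2), 2T, 2)
    y[:, 1:T]                                # first T outputs = causal part
end

x = randn(Float32, 256, 256)
h = randn(Float32, 256, 256)                 # one length-256 filter per channel
y = long_conv(x, h)
```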
OrganelleGate fuses all three via per-channel softmax: each of the 256 embedding channels independently learns which organelle to rely on.
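A sketch of the fusion, with `softmax` taken from NNlib (a Lux dependency); shapes and names are illustrative:

```julia
using NNlib: softmax

# Per-channel gate: each of the d channels owns 3 logits, one per organelle.
# 256 channels x 3 logits x 6 blocks = 4,608 parameters, matching the
# breakdown table below.
function organelle_gate(y1, y2, y3, logits)         # yi :: (d, T), logits :: (d, 3)
    w = softmax(logits; dims=2)                     # each row sums to 1
    w[:, 1] .* y1 .+ w[:, 2] .* y2 .+ w[:, 3] .* y3
end
```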
No explicit positional encoding (such as RoPE) is needed — the Monarch matrices and LongConv kernels implicitly learn position-dependent patterns.
## Model Details
| Parameter | Value |
|---|---|
| Architecture | Symbiogenesis (3 organelles + gate) |
| Parameters | ~4.1M |
| Embed dim | 256 |
| Layers | 6 |
| Monarch heads | 4 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| FFN | SwiGLU (hidden=640) |
| Normalization | RMSNorm (pre-norm) |
| Weight tying | Yes (shared input/output embeddings) |
| Precision | Float32 (Float16 is slower at these Monarch block sizes) |
### Parameter Breakdown
| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 12.6% |
| CausalConv (x6) | 6.1K | 0.2% |
| Monarch heads (x6, 4 heads each) | 197K | 4.8% |
| LongConv (x6) | 393K | 9.7% |
| OrganelleGate (x6) | 4.6K | 0.1% |
| SwiGLU FFN (x6) | 2.95M | 72.6% |
| RMSNorm (x13) | 3.3K | <0.1% |
| Total | ~4.1M | 100% |
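These figures can be reproduced from the configuration above; a quick sanity check, assuming bias-free layers (which the totals imply):

```julia
# Back-of-the-envelope check of the parameter breakdown table.
d, T, L, V = 256, 256, 6, 2000           # embed dim, context, layers, vocab

embed    = V * d                         # 512_000   tied input/output embedding
conv     = L * d * 4                     # 6_144     depthwise kernels, K = 4
monarch  = L * 4 * (2 * 16 * 16^2)       # 196_608   4 heads x two 16-block factors
longconv = L * d * T                     # 393_216   one length-T filter per channel
gate     = L * d * 3                     # 4_608     3 logits per channel
ffn      = L * 3 * d * 640               # 2_949_120 SwiGLU gate/up/down matrices
norms    = 13 * d                        # 3_328     2 per block + final norm
total    = embed + conv + monarch + longconv + gate + ffn + norms   # 4_065_024
```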
### Sequence Mixing Efficiency
| | Transformer | Monarch | Symbiogenesis |
|---|---|---|---|
| Seq mixer params/block | 262K | 67K | 100K |
| Reduction vs Transformer | - | 74% | 62% |
| Position encoding | RoPE (separate) | None | None |
## Training
| Setting | Value |
|---|---|
| Dataset | philosophy-corpus |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=1e-3, min_lr=1e-4, cosine decay) |
| Batch size | 32 |
| Hardware | NVIDIA RTX 3060 12GB |
| Throughput | ~19K tok/s (Float32) |
| Framework | Julia + Lux.jl + Zygote.jl + CUDA.jl |
### Training Progress (partial)

Val PPL is exp of the validation loss; gate entropy starts at its uniform value of ln 3 ≈ 1.099.
| Step | Train Loss | Val Loss | Val PPL | Gate Entropy |
|---|---|---|---|---|
| 1 | 17.10 | 17.03 | 24.9M | 1.099 |
| 500 | 6.50 | 4.92 | 137.5 | 1.098 |
| 1,000 | 4.43 | 4.38 | 79.9 | 1.094 |
### Gelation Monitoring
Training includes phase transition detection inspired by polymer physics (see the sketch after this list):
- CUSUM on loss curvature: Detects sudden changes in 2nd derivative of loss curve
- Gate entropy: Tracks organelle specialization (uniform weights over the three organelles give ln 3 ≈ 1.099; fully specialized channels give 0)
- Kuramoto order parameter: Measures synchronization of block dynamics (R > 0.9 = gelation)
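Sketches of the three statistics (names are invented here, and extracting the per-block phases θ for the Kuramoto term is left to the training loop):

```julia
using Statistics

# Gate entropy per channel: -sum(p log p) over the 3 organelle weights,
# averaged over channels. Uniform weights give log(3) ≈ 1.099.
gate_entropy(w) = mean(-sum(w .* log.(w .+ eps(eltype(w))); dims=2))   # w :: (channels, 3)

# One-sided CUSUM over the discrete second derivative (curvature) of the
# loss curve; an alarm fires when the statistic exceeds a chosen threshold.
function cusum_curvature(losses; drift=0.0)
    κ = diff(diff(losses))
    S = zero(κ); s = zero(eltype(κ))
    for (i, x) in enumerate(κ)
        s = max(zero(s), s + x - drift)
        S[i] = s
    end
    S
end

# Kuramoto order parameter: R = |mean(exp(iθ))| over per-block phases θ.
# R ≈ 0 means incoherent blocks; R > 0.9 is read as "gelation" here.
kuramoto_R(θ) = abs(sum(cis.(θ))) / length(θ)
```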
## Comparison with Other Julia SLM Variants
| | JuliaSLM | MonarchSLM | SymbioSLM |
|---|---|---|---|
| Architecture | Transformer | Monarch Mixer | Symbiogenesis |
| Sequence mixing | 4-head attention | 8-head Monarch + conv | 3 organelles + gate |
| Parameters | 5.04M | 4.98M | ~4.1M |
| Layers | 6 | 8 | 6 |
| Val PPL | 34.5 | 38.4 | TBD |
| Throughput | 26K tok/s | 19K tok/s | 19K tok/s |
| Position encoding | RoPE | None | None |
## Usage

### Generate with Julia
```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT
using .JuliaGPT: Lux, CUDA

tok = BPETokenizer("vocab.json", "merges.txt")
device = Lux.gpu_device()
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device)

model = create_model(ModelConfig(;
    arch="symbiogenesis", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    n_monarch_heads=4, conv_kernel_size=4,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
println(text)
```
### OpenAI-Compatible API
The model is served via the SymbioSLM Space:
```bash
curl -X POST https://lisamegawatts-symbioslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```
Streaming is supported with `"stream": true`.
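The endpoint can also be called from Julia; a sketch using HTTP.jl and JSON3.jl (assumptions, not project dependencies), which also assumes the response follows the standard OpenAI shape:

```julia
using HTTP, JSON3

payload = (
    messages = [(role = "user", content = "the nature of")],
    max_tokens = 200, temperature = 0.8, top_k = 40,
)
resp = HTTP.post(
    "https://lisamegawatts-symbioslm.hf.space/v1/chat/completions",
    ["Content-Type" => "application/json"],
    JSON3.write(payload),
)
body = JSON3.read(resp.body)
println(body.choices[1].message.content)   # assumes the OpenAI response shape
```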
## Files
| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |
## Biological Inspiration
The architecture is named after Lynn Margulis' theory of symbiogenesis (1967): the proposal that eukaryotic cells originated through the endosymbiotic fusion of distinct prokaryotic organisms. Mitochondria and chloroplasts retain their own DNA, demonstrating their origin as once-independent organisms that became specialized organelles within a larger cell.
Similarly, each SymbioBlock contains three "organelles" with different mathematical properties (local convolution, global structured mixing, global dense filtering) that are fused into a single functional unit through the learned OrganelleGate. The gate entropy tracks how strongly the network differentiates between organelles — analogous to the degree of specialization achieved through evolutionary integration.
## Citation
```bibtex
@misc{symbioslm2026,
  title={Symbiogenesis: Multi-Organelle Sequence Mixing for Small Language Models},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/SymbioSLM}
}
```
## References
- Margulis, L. (1967). On the Origin of Mitosing Cells. Journal of Theoretical Biology, 14(3), 225-274.
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. NeurIPS 2023.
- Poli, M., et al. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. ICML 2023.
- Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
## License
MIT