---
language:
  - en
license: mit
library_name: lux
tags:
  - julia
  - lux
  - slm
  - philosophy
  - symbiogenesis
  - monarch-mixer
  - long-convolution
  - causal-conv
  - rmsnorm
  - swiglu
  - bpe
  - text-generation
pipeline_tag: text-generation
model-index:
  - name: SymbioSLM
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: LisaMegaWatts/philosophy-corpus
          name: philosophy-corpus
        metrics:
          - type: perplexity
            value: 79.9
            name: Val PPL (step 1000)
---

# SymbioSLM

A ~4.1M-parameter decoder-only language model built on the Symbiogenesis architecture: a novel multi-organelle sequence-mixing design inspired by biological endosymbiosis (Margulis, 1967). Implemented entirely in Julia with Lux.jl and trained on classical philosophy texts.

## Architecture

Symbiogenesis replaces softmax attention with three complementary "organelles" per block, fused via a learned per-channel gate:

```text
SymbioBlock (x6)
+-- RMSNorm
+-- SymbioSequenceMixer
|   +-- Organelle 1: CausalDepthwiseConv1d    (local n-gram patterns, K=4)
|   +-- Organelle 2: Multi-head MonarchMatrix (global sub-quadratic mixing)
|   +-- Organelle 3: LongConv                 (global dense causal filter)
|   +-- OrganelleGate                         (per-channel softmax fusion)
+-- RMSNorm
+-- SwiGLU FFN
```

## How It Works

1. CausalConv captures local bigram/trigram/4-gram patterns via depthwise convolution (one kernel per channel, length 4).
2. Monarch matrices provide global sequence mixing through the factored form `M = P^T * BlockDiag(L1) * P * BlockDiag(L2)`, achieving an 87.5% parameter reduction vs dense mixing (8,192 vs 65,536 params per head at T=256).
3. LongConv learns a full-length (T=256) causal filter per channel, enabling arbitrary position-dependent mixing.
4. OrganelleGate fuses all three via a per-channel softmax: each of the 256 embedding channels independently learns which organelle to rely on.

No positional encoding (RoPE) is needed: the Monarch matrices and LongConv kernels implicitly learn position-dependent patterns.
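
For concreteness, here is a minimal, framework-free Julia sketch of one SymbioSequenceMixer pass following the description above. The actual Lux layer definitions differ, the Monarch organelle is stood in by an identity map, and all names and shapes are illustrative rather than the repo's API.

```julia
# Organelles 1 and 3 share the same causal depthwise convolution; only the kernel length differs.
function causal_depthwise_conv(x, w)          # x: (D, T), w: (D, K)
    D, T = size(x); K = size(w, 2)
    y = zeros(eltype(x), D, T)
    for t in 1:T, k in 1:K
        t - k + 1 >= 1 && (y[:, t] .+= w[:, k] .* x[:, t - k + 1])
    end
    return y
end

# OrganelleGate: per-channel softmax over the three organelle outputs.
function fuse(y1, y2, y3, gate_logits)        # gate_logits: (D, 3), one row per channel
    g = exp.(gate_logits) ./ sum(exp.(gate_logits); dims=2)
    return g[:, 1] .* y1 .+ g[:, 2] .* y2 .+ g[:, 3] .* y3
end

D, T, K = 256, 256, 4
x   = randn(Float32, D, T)
y1  = causal_depthwise_conv(x, randn(Float32, D, K) ./ K)   # local n-gram organelle (K = 4)
y2  = x                                                     # Monarch organelle stands in as identity here
y3  = causal_depthwise_conv(x, randn(Float32, D, T) ./ T)   # global long-conv organelle (K = T)
out = fuse(y1, y2, y3, zeros(Float32, D, 3))                # uniform gate: entropy log(3) ≈ 1.099
```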

## Model Details

| Parameter | Value |
|---|---|
| Architecture | Symbiogenesis (3 organelles + gate) |
| Parameters | ~4.1M |
| Embed dim | 256 |
| Layers | 6 |
| Monarch heads | 4 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| FFN | SwiGLU (hidden=640) |
| Normalization | RMSNorm (pre-norm) |
| Weight tying | Yes (shared input/output embeddings) |
| Precision | Float32 (Float16 slower for Monarch block sizes) |

### Parameter Breakdown

| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 12.6% |
| CausalConv (x6) | 6.1K | 0.2% |
| Monarch heads (x6, 4 heads each) | 197K | 4.8% |
| LongConv (x6) | 393K | 9.7% |
| OrganelleGate (x6) | 4.6K | 0.1% |
| SwiGLU FFN (x6) | 2.95M | 72.6% |
| RMSNorm (x13) | 3.3K | <0.1% |
| Total | ~4.1M | |

### Sequence Mixing Efficiency

| | Transformer | Monarch | Symbiogenesis |
|---|---|---|---|
| Seq mixer params/block | 262K | 67K | 100K |
| Reduction vs Transformer | - | 74% | 62% |
| Position encoding | RoPE (separate) | None | None |
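
A quick back-of-the-envelope check of the per-block numbers above. The composition of the 67K MonarchSLM figure is an assumption based on the comparison table further down; the other terms follow from the per-head and per-organelle sizes quoted in this card.

```julia
d, T, K = 256, 256, 4          # embed dim, context length, causal-conv kernel length

transformer  = 4 * d^2         # Q, K, V, O projections -> 262_144 (262K)

monarch_head = 2 * 16 * 16^2   # two block-diagonal factors, 16 blocks of 16x16 each -> 8_192 per head
monarch_slm  = 8 * monarch_head + d * K   # 8 Monarch heads + depthwise conv -> ~66.6K (the 67K column)

symbio = d * K +               # Organelle 1: causal depthwise conv (~1.0K)
    4 * monarch_head +         # Organelle 2: 4 Monarch heads       (~32.8K)
    d * T +                    # Organelle 3: per-channel long conv (~65.5K)
    3 * d                      # OrganelleGate logits               (~0.8K) -> ~100K total

1 - symbio / transformer       # ≈ 0.62, the 62% reduction in the table
```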

## Training

| Setting | Value |
|---|---|
| Dataset | philosophy-corpus |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=1e-3, min_lr=1e-4, cosine decay) |
| Batch size | 32 |
| Hardware | NVIDIA RTX 3060 12GB |
| Throughput | ~19K tok/s (Float32) |
| Framework | Julia + Lux.jl + Zygote.jl + CUDA.jl |
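
A minimal sketch of the optimizer setup with Optimisers.jl and a hand-rolled cosine schedule matching the settings above. The actual training loop (Zygote gradients over 32 x 256-token batches, checkpointing, gelation monitoring) is omitted, and `train_demo!` with its stand-in parameters and gradients is purely illustrative.

```julia
using Optimisers

# Cosine decay from lr = 1e-3 down to min_lr = 1e-4 over the run.
cosine_lr(step, total; lr_max=1e-3, lr_min=1e-4) =
    lr_min + 0.5 * (lr_max - lr_min) * (1 + cospi(step / total))

function train_demo!(ps; total_steps=1_000)
    opt_state = Optimisers.setup(Optimisers.AdamW(1f-3), ps)
    for step in 1:total_steps
        Optimisers.adjust!(opt_state, cosine_lr(step, total_steps))   # anneal the learning rate
        grads = map(p -> fill!(similar(p), 1f-3), ps)   # stand-in for Zygote.gradient output
        opt_state, ps = Optimisers.update!(opt_state, ps, grads)
    end
    return ps
end

train_demo!((; w = randn(Float32, 4, 4)))   # stand-in parameters; the real ps come from Lux.setup
```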

### Training Progress (partial)

| Step | Train Loss | Val Loss | Val PPL | Gate Entropy |
|---|---|---|---|---|
| 1 | 17.10 | 17.03 | 24.9M | 1.099 |
| 500 | 6.50 | 4.92 | 137.5 | 1.098 |
| 1,000 | 4.43 | 4.38 | 79.9 | 1.094 |
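
The Val PPL column is simply the exponential of the validation cross-entropy loss:

```julia
exp(4.38)    # ≈ 79.8   -> the step-1000 Val PPL above (79.9, to rounding)
exp(17.03)   # ≈ 2.49e7 -> the ~24.9M PPL of the untrained model at step 1
```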

### Gelation Monitoring

Training includes phase transition detection inspired by polymer physics:

- CUSUM on loss curvature: detects sudden changes in the 2nd derivative of the loss curve
- Gate entropy: tracks organelle specialization (1.099 = uniform, 0 = fully specialized)
- Kuramoto order parameter: measures synchronization of block dynamics (R > 0.9 = gelation)
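
Two of these diagnostics sketched in plain Julia, assuming the gates are available as a (channels x organelles) softmax matrix and each block contributes one phase; how phases are extracted from block activations is specific to the training code and not shown here.

```julia
using Statistics

# Gate entropy: mean Shannon entropy of the per-channel organelle gates.
# A (D, 3) matrix of uniform softmax weights gives log(3) ≈ 1.099.
gate_entropy(gates) = mean(-sum(gates .* log.(gates .+ 1f-8); dims=2))

# Kuramoto order parameter: R = |mean(exp(im * θ))| over per-block phases θ.
# R -> 1 means the block dynamics are synchronized (gelation signal: R > 0.9).
kuramoto_R(θs) = abs(sum(cis, θs) / length(θs))

gate_entropy(fill(1f0 / 3, 256, 3))   # ≈ 1.0986, the "uniform" value in the table above
kuramoto_R(fill(0.3, 6))              # = 1.0 for six identical phases
```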

## Comparison with Other Julia SLM Variants

| | JuliaSLM | MonarchSLM | SymbioSLM |
|---|---|---|---|
| Architecture | Transformer | Monarch Mixer | Symbiogenesis |
| Sequence mixing | 4-head attention | 8-head Monarch + conv | 3 organelles + gate |
| Parameters | 5.04M | 4.98M | ~4.1M |
| Layers | 6 | 8 | 6 |
| Val PPL | 34.5 | 38.4 | TBD |
| Throughput | 26K tok/s | 19K tok/s | 19K tok/s |
| Position encoding | RoPE | None | None |

## Usage

### Generate with Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT
using .JuliaGPT: Lux, CUDA

# Load the BPE tokenizer and the trained checkpoint onto the GPU
tok = BPETokenizer("vocab.json", "merges.txt")
device = Lux.gpu_device()
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device)

# Rebuild the Symbiogenesis architecture with the configuration it was trained with
model = create_model(ModelConfig(;
    arch="symbiogenesis", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    n_monarch_heads=4, conv_kernel_size=4,
    ffn_mult=4, context_length=256, weight_tying=true,
))

# Sample 200 new tokens from a prompt
text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
println(text)
```

### OpenAI-Compatible API

The model is served via the SymbioSLM Space:

```bash
curl -X POST https://lisamegawatts-symbioslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

Streaming is supported with `"stream": true`.
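
The same request from Julia, assuming the Space returns standard OpenAI-style JSON. HTTP.jl and JSON3.jl are not dependencies of this repo; this is just a convenience sketch.

```julia
using HTTP, JSON3

resp = HTTP.post(
    "https://lisamegawatts-symbioslm.hf.space/v1/chat/completions",
    ["Content-Type" => "application/json"],
    JSON3.write((;
        messages = [(; role = "user", content = "the nature of")],
        max_tokens = 200, temperature = 0.8, top_k = 40,
    )),
)
println(JSON3.read(resp.body).choices[1].message.content)   # assumes an OpenAI-style response body
```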

## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |
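
`config.toml` can be inspected with Julia's standard-library TOML parser; the key names below are illustrative, see the file itself for the exact schema.

```julia
using TOML

config = TOML.parsefile("config.toml")
# e.g. config["embed_dim"], config["n_layers"], config["context_length"]  (key names assumed)
```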

## Biological Inspiration

The architecture is named after Lynn Margulis' theory of symbiogenesis (1967): the proposal that eukaryotic cells originated through the endosymbiotic fusion of distinct prokaryotic organisms. Mitochondria and chloroplasts retain their own DNA, demonstrating their origin as once-independent organisms that became specialized organelles within a larger cell.

Similarly, each SymbioBlock contains three "organelles" with different mathematical properties (local convolution, global structured mixing, global dense filtering) that are fused into a single functional unit through the learned OrganelleGate. The gate entropy tracks how strongly the network differentiates between organelles — analogous to the degree of specialization achieved through evolutionary integration.

## Citation

```bibtex
@misc{symbioslm2026,
  title={Symbiogenesis: Multi-Organelle Sequence Mixing for Small Language Models},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/SymbioSLM}
}
```

## References

- Margulis, L. (1967). On the origin of mitosing cells. *Journal of Theoretical Biology*, 14(3), 225-274.
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. NeurIPS 2023.
- Poli, M., et al. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. ICML 2023.
- Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.

## License

MIT