---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- symbiogenesis
- monarch-mixer
- long-convolution
- causal-conv
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
model-index:
- name: SymbioSLM
results:
- task:
type: text-generation
name: Text Generation
dataset:
type: LisaMegaWatts/philosophy-corpus
name: philosophy-corpus
metrics:
- type: perplexity
value: 79.9
name: Val PPL (step 1000)
---

# SymbioSLM

A ~4.1M parameter decoder-only language model using the Symbiogenesis architecture — a novel multi-organelle sequence mixing design inspired by biological endosymbiosis (Margulis, 1967). Implemented entirely in Julia using Lux.jl and trained on classical philosophy texts.
## Architecture
Symbiogenesis replaces softmax attention with three complementary "organelles" per block, fused via a learned per-channel gate:
```
SymbioBlock (x6)
+-- RMSNorm
+-- SymbioSequenceMixer
|   +-- Organelle 1: CausalDepthwiseConv1d (local n-gram patterns, K=4)
|   +-- Organelle 2: Multi-head MonarchMatrix (global sub-quadratic mixing)
|   +-- Organelle 3: LongConv (global dense causal filter)
|   +-- OrganelleGate (per-channel softmax fusion)
+-- RMSNorm
+-- SwiGLU FFN
```
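Schematically, the block is a standard pre-norm residual pair with the organelle mixer in place of attention. A minimal Julia-style sketch (the organelle functions are sketched in "How It Works" below; none of these names are the repository's actual API):

```julia
# Pre-norm residual skeleton of one SymbioBlock (illustrative pseudocode).
function symbio_block(x, blk)                  # x :: (embed_dim, T, batch)
    h = x .+ symbio_mixer(rmsnorm(x, blk.norm1), blk.mixer)
    h .+ swiglu_ffn(rmsnorm(h, blk.norm2), blk.ffn)
end

function symbio_mixer(x, m)
    y1 = causal_depthwise_conv(x, m.conv)      # organelle 1: local n-grams
    y2 = monarch_mix(x, m.monarch)             # organelle 2: global, structured
    y3 = long_conv(x, m.longconv)              # organelle 3: global, dense causal
    organelle_gate(y1, y2, y3, m.gate)         # learned per-channel fusion
end
```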
## How It Works
CausalConv captures local bigram/trigram/4-gram patterns via depthwise convolution (1 kernel per channel, length 4).
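A minimal plain-Julia sketch of this organelle (the actual layer is a Lux kernel and GPU-friendly, which this reference version is not; names and shapes are illustrative):

```julia
# Depthwise causal conv: one length-K kernel per channel; left-padding by
# K - 1 zeros ensures position t only sees tokens <= t.
function causal_depthwise_conv(x::AbstractMatrix, w::AbstractMatrix)  # x :: (d, T), w :: (d, K)
    d, T = size(x); K = size(w, 2)
    xp = hcat(zeros(eltype(x), d, K - 1), x)   # zero left-pad
    [sum(w[c, k] * xp[c, t + k - 1] for k in 1:K) for c in 1:d, t in 1:T]
end

x = randn(Float32, 256, 8)
y = causal_depthwise_conv(x, randn(Float32, 256, 4))   # K = 4, as in the model
```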
Monarch matrices provide global sequence mixing through factored M = P^T * BlockDiag(L1) * P * BlockDiag(L2), achieving 87.5% parameter reduction vs dense mixing (8,192 vs 65,536 params per head at T=256).
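A sketch of one head's factored multiply, assuming T = b² with b = 16 at T = 256; the block-diagonal factors and permutation are named after the formula above:

```julia
using LinearAlgebra

# One Monarch head acting on a length-T sequence (single channel shown).
function monarch_mix(x::AbstractVector, L1, L2, b::Int)
    T = length(x)                                    # T = b^2
    blockmul(L, v) = reduce(vcat, (L[i] * v[(i-1)*b+1:i*b] for i in 1:b))
    perm = vec(permutedims(reshape(1:T, b, b)))      # the permutation P
    y = blockmul(L2, x)                              # BlockDiag(L2) * x
    y = blockmul(L1, y[perm])                        # BlockDiag(L1) * P * ...
    y[invperm(perm)]                                 # P^T * ...
end

b = 16
L1 = [randn(Float32, b, b) for _ in 1:b]             # 16 blocks of 16 x 16
L2 = [randn(Float32, b, b) for _ in 1:b]
y = monarch_mix(randn(Float32, b^2), L1, L2, b)

# Two block-diagonal factors cost 2 * b * b^2 = 8,192 values per head,
# versus 65,536 for a dense 256 x 256 mixer: the 87.5% reduction above.
```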
LongConv learns a full-length (T=256) causal filter per channel, enabling arbitrary position-dependent mixing.
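The same idea in sketch form, using the standard FFT trick for long convolutions (zero-pad to 2T to avoid circular wrap-around); FFTW.jl is assumed here and is not necessarily what the repository uses:

```julia
using FFTW

# Per-channel causal convolution with a full-length filter h.
function long_conv(x::AbstractMatrix, h::AbstractMatrix)   # x, h :: (d, T)
    d, T = size(x)
    pad = zeros(eltype(x), d, T)
    y = irfft(rfft(hcat(x, pad), 2) .* rfft(hcat(h, pad), 2), 2T, 2)
    y[:, 1:T]                                # first T outputs = causal part
end

x = randn(Float32, 256, 256)
h = randn(Float32, 256, 256)                 # one length-256 filter per channel
y = long_conv(x, h)
```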
OrganelleGate fuses all three via per-channel softmax: each of the 256 embedding channels independently learns which organelle to rely on.
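A sketch of the fusion, with `softmax` taken from NNlib (a Lux dependency); shapes and names are illustrative:

```julia
using NNlib: softmax

# Per-channel gate: each of the d channels owns 3 logits, one per organelle.
# 256 channels x 3 logits x 6 blocks = 4,608 parameters, matching the
# breakdown table below.
function organelle_gate(y1, y2, y3, logits)         # yi :: (d, T), logits :: (d, 3)
    w = softmax(logits; dims=2)                     # each row sums to 1
    w[:, 1] .* y1 .+ w[:, 2] .* y2 .+ w[:, 3] .* y3
end
```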
No explicit positional encoding (such as RoPE) is needed — the Monarch matrices and LongConv kernels implicitly learn position-dependent patterns.
## Model Details
| Parameter | Value |
|---|---|
| Architecture | Symbiogenesis (3 organelles + gate) |
| Parameters | ~4.1M |
| Embed dim | 256 |
| Layers | 6 |
| Monarch heads | 4 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| FFN | SwiGLU (hidden=640) |
| Normalization | RMSNorm (pre-norm) |
| Weight tying | Yes (shared input/output embeddings) |
| Precision | Float32 (Float16 is slower at these Monarch block sizes) |
### Parameter Breakdown
| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 12.6% |
| CausalConv (x6) | 6.1K | 0.2% |
| Monarch heads (x6, 4 heads each) | 197K | 4.8% |
| LongConv (x6) | 393K | 9.7% |
| OrganelleGate (x6) | 4.6K | 0.1% |
| SwiGLU FFN (x6) | 2.95M | 72.6% |
| RMSNorm (x13) | 3.3K | <0.1% |
| Total | ~4.1M | 100% |
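These figures can be reproduced from the configuration above; a quick sanity check, assuming bias-free layers (which the totals imply):

```julia
# Back-of-the-envelope check of the parameter breakdown table.
d, T, L, V = 256, 256, 6, 2000           # embed dim, context, layers, vocab

embed    = V * d                         # 512_000   tied input/output embedding
conv     = L * d * 4                     # 6_144     depthwise kernels, K = 4
monarch  = L * 4 * (2 * 16 * 16^2)       # 196_608   4 heads x two 16-block factors
longconv = L * d * T                     # 393_216   one length-T filter per channel
gate     = L * d * 3                     # 4_608     3 logits per channel
ffn      = L * 3 * d * 640               # 2_949_120 SwiGLU gate/up/down matrices
norms    = 13 * d                        # 3_328     2 per block + final norm
total    = embed + conv + monarch + longconv + gate + ffn + norms   # 4_065_024
```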
### Sequence Mixing Efficiency
| | Transformer | Monarch | Symbiogenesis |
|---|---|---|---|
| Seq mixer params/block | 262K | 67K | 100K |
| Reduction vs Transformer | - | 74% | 62% |
| Position encoding | RoPE (separate) | None | None |
## Training
| Setting | Value |
|---|---|
| Dataset | philosophy-corpus |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=1e-3, min_lr=1e-4, cosine decay) |
| Batch size | 32 |
| Hardware | NVIDIA RTX 3060 12GB |
| Throughput | ~19K tok/s (Float32) |
| Framework | Julia + Lux.jl + Zygote.jl + CUDA.jl |
### Training Progress (partial)

Val PPL is exp of the validation loss; gate entropy starts at its uniform value of ln 3 ≈ 1.099.
| Step | Train Loss | Val Loss | Val PPL | Gate Entropy |
|---|---|---|---|---|
| 1 | 17.10 | 17.03 | 24.9M | 1.099 |
| 500 | 6.50 | 4.92 | 137.5 | 1.098 |
| 1,000 | 4.43 | 4.38 | 79.9 | 1.094 |
### Gelation Monitoring
Training includes phase transition detection inspired by polymer physics (see the sketch after this list):
- CUSUM on loss curvature: Detects sudden changes in 2nd derivative of loss curve
- Gate entropy: Tracks organelle specialization (uniform weights over the three organelles give ln 3 ≈ 1.099; fully specialized channels give 0)
- Kuramoto order parameter: Measures synchronization of block dynamics (R > 0.9 = gelation)
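Sketches of the three statistics (names are invented here, and extracting the per-block phases θ for the Kuramoto term is left to the training loop):

```julia
using Statistics

# Gate entropy per channel: -sum(p log p) over the 3 organelle weights,
# averaged over channels. Uniform weights give log(3) ≈ 1.099.
gate_entropy(w) = mean(-sum(w .* log.(w .+ eps(eltype(w))); dims=2))   # w :: (channels, 3)

# One-sided CUSUM over the discrete second derivative (curvature) of the
# loss curve; an alarm fires when the statistic exceeds a chosen threshold.
function cusum_curvature(losses; drift=0.0)
    κ = diff(diff(losses))
    S = zero(κ); s = zero(eltype(κ))
    for (i, x) in enumerate(κ)
        s = max(zero(s), s + x - drift)
        S[i] = s
    end
    S
end

# Kuramoto order parameter: R = |mean(exp(iθ))| over per-block phases θ.
# R ≈ 0 means incoherent blocks; R > 0.9 is read as "gelation" here.
kuramoto_R(θ) = abs(sum(cis.(θ))) / length(θ)
```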
## Comparison with Other Julia SLM Variants
| | JuliaSLM | MonarchSLM | SymbioSLM |
|---|---|---|---|
| Architecture | Transformer | Monarch Mixer | Symbiogenesis |
| Sequence mixing | 4-head attention | 8-head Monarch + conv | 3 organelles + gate |
| Parameters | 5.04M | 4.98M | ~4.1M |
| Layers | 6 | 8 | 6 |
| Val PPL | 34.5 | 38.4 | TBD |
| Throughput | 26K tok/s | 19K tok/s | 19K tok/s |
| Position encoding | RoPE | None | None |
## Usage

### Generate with Julia
```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT
using .JuliaGPT: Lux, CUDA

tok = BPETokenizer("vocab.json", "merges.txt")
device = Lux.gpu_device()
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device)

model = create_model(ModelConfig(;
    arch="symbiogenesis", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    n_monarch_heads=4, conv_kernel_size=4,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
println(text)
```
### OpenAI-Compatible API
The model is served via the SymbioSLM Space:
```bash
curl -X POST https://lisamegawatts-symbioslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```
Streaming is supported with `"stream": true`.
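The endpoint can also be called from Julia; a sketch using HTTP.jl and JSON3.jl (assumptions, not project dependencies), which also assumes the response follows the standard OpenAI shape:

```julia
using HTTP, JSON3

payload = (
    messages = [(role = "user", content = "the nature of")],
    max_tokens = 200, temperature = 0.8, top_k = 40,
)
resp = HTTP.post(
    "https://lisamegawatts-symbioslm.hf.space/v1/chat/completions",
    ["Content-Type" => "application/json"],
    JSON3.write(payload),
)
body = JSON3.read(resp.body)
println(body.choices[1].message.content)   # assumes the OpenAI response shape
```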
## Files
| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |
## Biological Inspiration
The architecture is named after Lynn Margulis' theory of symbiogenesis (1967): the proposal that eukaryotic cells originated through the endosymbiotic fusion of distinct prokaryotic organisms. Mitochondria and chloroplasts retain their own DNA, demonstrating their origin as once-independent organisms that became specialized organelles within a larger cell.
Similarly, each SymbioBlock contains three "organelles" with different mathematical properties (local convolution, global structured mixing, global dense filtering) that are fused into a single functional unit through the learned OrganelleGate. The gate entropy tracks how strongly the network differentiates between organelles — analogous to the degree of specialization achieved through evolutionary integration.
## Citation
```bibtex
@misc{symbioslm2026,
  title={Symbiogenesis: Multi-Organelle Sequence Mixing for Small Language Models},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/SymbioSLM}
}
```
## References
- Margulis, L. (1967). On the Origin of Mitosing Cells. Journal of Theoretical Biology, 14(3), 225-274.
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. NeurIPS 2023.
- Poli, M., et al. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. ICML 2023.
- Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
## License
MIT