
Symbiogenesis: Multi-Organelle Sequence Mixing for Small Language Models

Authors: LisaMegaWatts
Date: February 2026
Repository: buildwithbooks/julia-slm
Live Demo: HuggingFace Space


Abstract

We introduce Symbiogenesis, a novel sequence mixing architecture for decoder-only language models that replaces softmax attention with three complementary "organelles" fused via learned per-channel gating. Inspired by the biological theory of symbiogenesis (Margulis, 1967), in which complex cellular organelles originated as free-living organisms that fused into a single cell, each block contains: (1) a CausalConv organelle for local n-gram patterns, (2) multi-head Monarch matrices for global sub-quadratic mixing, and (3) a LongConv organelle for dense causal filtering. A per-channel softmax OrganelleGate learns which organelle each embedding channel relies on, creating a specialized "fused organism" per block.

We implement and train three model variants (~5M parameters each) entirely in Julia using Lux.jl on a curated corpus of classical philosophy texts (100M tokens). Against a baseline Transformer (RoPE + SwiGLU + RMSNorm), Symbiogenesis achieves competitive perplexity while providing a richer set of inductive biases for sequence modeling. To our knowledge, this represents the first implementation of both Monarch Mixer and the Symbiogenesis architecture in Julia.


1. Introduction

1.1 Motivation

The dominant paradigm in sequence modeling, softmax attention, computes a dynamic, input-dependent mixing matrix at each layer. This flexibility comes at a cost: O(T^2) compute and memory in sequence length T, and a parameter budget of 4D^2 per layer (for Q, K, V, O projections in a D-dimensional model). Recent work on structured sequence mixing (Monarch Mixer, Hyena, S4, Mamba) has shown that fixed or semi-structured mixing patterns can match attention quality at significantly lower parameter and compute costs.

We ask: what happens when we give each block access to multiple complementary mixing mechanisms and let the model learn to route between them? Biological evolution solved this problem via symbiogenesis: mitochondria and chloroplasts were once independent organisms that fused into eukaryotic cells, with each organelle handling a specialized function. We apply this principle to sequence mixing.

1.2 Contributions

  1. Symbiogenesis architecture: A multi-organelle block design with three complementary sequence mixers (local convolution, global structured mixing, global dense convolution) fused via learned per-channel gating.

  2. First Julia implementation of Monarch Mixer: A complete, GPU-accelerated implementation using Lux.jl, Zygote.jl, and NNlib.jl with Float16 mixed-precision support.

  3. Gelation monitoring: A training diagnostic framework inspired by polymer physics (Flory-Stockmayer theory) that detects training phase transitions using CUSUM on loss curvature, gate entropy tracking, and Kuramoto order parameter synchronization.

  4. Head-to-head comparison: Three architectures (Transformer, Monarch, Symbiogenesis) trained on identical data with matched parameter budgets, all in pure Julia.


2. Background

2.1 Softmax Attention

Standard causal self-attention computes:

Q, K, V = W_q·x, W_k·x, W_v·x
Attn = softmax(Q·K^T / sqrt(d_k) + mask) · V

Parameters per layer: 4D^2 (for D-dimensional embeddings with H heads)
Complexity: O(T^2·D) compute, O(T^2·H) memory
Strengths: Dynamic, input-dependent mixing; proven at scale
Weaknesses: Quadratic scaling; large parameter footprint in sequence mixing

2.2 Monarch Matrices

Monarch matrices (Dao et al., 2023) factorize a T x T mixing matrix as:

M = P^T · BlockDiag(L1) · P · BlockDiag(L2)

where T = p^2 (e.g., T=256, p=16), P is a reshape-transpose permutation, and L1, L2 are tensors of shape (p, p, p) representing p block-diagonal p x p matrices.

Parameters: 2p^3 = 2T^(3/2) per head (e.g., 8,192 for T=256) vs. dense T^2 = 65,536, an 87.5% reduction
Complexity: O(T^(3/2)) to realize, O(T^2) to apply (due to causal masking)

The factored structure captures global mixing patterns through two stages of local block-diagonal operations separated by a permutation, analogous to the butterfly operations in FFT.
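To make the factorization concrete, here is an illustrative NumPy sketch (Python rather than the project's Julia; the helpers `blockdiag_apply`, `perm`, and `realize_monarch` are our own names, not from the codebase) that realizes M by pushing basis vectors through the factored pipeline:

```python
import numpy as np

def blockdiag_apply(L, v):
    """Apply BlockDiag(L) to a length-T vector, T = p^2.
    L has shape (p, p, p): p independent (p, p) blocks."""
    p = L.shape[0]
    chunks = v.reshape(p, p)                       # chunk i = v[i*p:(i+1)*p]
    return np.einsum('bij,bj->bi', L, chunks).reshape(-1)

def perm(v, p):
    """Reshape-transpose permutation P (its own inverse)."""
    return v.reshape(p, p).T.reshape(-1)

def realize_monarch(L1, L2):
    """Dense (T, T) matrix for M = P^T · BlockDiag(L1) · P · BlockDiag(L2)."""
    p = L1.shape[0]
    T = p * p
    M = np.empty((T, T))
    for col in range(T):
        e = np.zeros(T)
        e[col] = 1.0
        M[:, col] = perm(blockdiag_apply(L1, perm(blockdiag_apply(L2, e), p)), p)
    return M
```

For T = 256 (p = 16), the factors hold 2 * 16^3 = 8,192 values against 65,536 entries in the realized matrix, the 87.5% reduction quoted above.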

2.3 Symbiogenesis Theory

Lynn Margulis' endosymbiotic theory (1967) proposes that eukaryotic cells originated through the fusion of distinct prokaryotic organisms. Mitochondria and chloroplasts retain their own DNA, demonstrating their origin as independent entities that became integrated into a larger whole.

We apply this biological principle to neural architecture: rather than choosing a single sequence mixing mechanism, we provide each block with multiple complementary "organelles" and let learning determine how to combine them. The OrganelleGate acts as the cell membrane, mediating the fusion.


3. Architecture

3.1 Overview

JuliaGPTModel (symbiogenesis)
+-- tok_emb: Embedding(V -> D)         [weight-tied with output head]
+-- blocks x N:
|   +-- ln1: RMSNorm(D)
|   +-- seq_mixer: SymbioSequenceMixer
|   |   +-- conv: CausalDepthwiseConv1d(D, K=4)        [Organelle 1: Local]
|   |   +-- monarchs: H x MonarchMatrix(T, p)          [Organelle 2: Global structured]
|   |   +-- longconv: LongConv(D, T)                   [Organelle 3: Global dense]
|   |   +-- gate: OrganelleGate(D, 3)                  [Per-channel fusion]
|   +-- ln2: RMSNorm(D)
|   +-- ffn: SwiGLU(D -> hidden -> D)
+-- ln_f: RMSNorm(D)
+-- head: TiedEmbeddingHead -> (V,)

Each block follows the pre-norm residual pattern:

h = x + SequenceMixer(RMSNorm(x))
out = h + SwiGLU(RMSNorm(h))

3.2 Organelle 1: CausalDepthwiseConv1d

The simplest organelle provides local context through a short causal convolution. Each embedding channel has its own 1D kernel of length K (typically K=4), implementing depthwise convolution with causal left-padding.

Input: x of shape (D, T, B)
Parameters: kernel of shape (K, D)
Operation:

x_padded = cat(zeros(eltype(x), D, K-1, B), x; dims=2)   # causal left-padding along the time axis of (D, T, B)
out = depthwise_conv1d(x_padded, kernel)                 # groups = D

Computational role: Captures local n-gram patterns (bigrams, trigrams, 4-grams). Analogous to the causal convolution in Monarch Mixer and the short convolution in Hyena/Mamba.

Complexity: O(K * D * T * B), linear in sequence length.
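A minimal NumPy sketch of this organelle, assuming the (D, T, B) layout stated above and cross-correlation orientation (the function name is ours, not from the codebase):

```python
import numpy as np

def causal_depthwise_conv(x, kernel):
    """Depthwise causal filtering: x is (D, T, B), kernel is (K, D),
    one length-K filter per channel; output at t sees inputs <= t only."""
    D, T, B = x.shape
    K = kernel.shape[0]
    xp = np.concatenate([np.zeros((D, K - 1, B)), x], axis=1)  # left-pad time
    out = np.empty_like(x)
    for t in range(T):
        window = xp[:, t:t + K, :]                  # inputs t-K+1 .. t
        out[:, t, :] = np.einsum('dkb,kd->db', window, kernel)
    return out
```

Perturbing the input at time t changes outputs only at times >= t, which is exactly the causality property the left-padding enforces.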

3.3 Organelle 2: Multi-Head Monarch

The Monarch organelle provides global sequence mixing through factored matrix multiplication. The embed dimension D is split into H heads, each with D/H channels.

Realization of a Monarch matrix from factors L1, L2:

function realize(l::MonarchMatrix, ps, st)
    p = l.block_size                    # sqrt(T)
    I_T = st.identity                   # (T, T) identity matrix

    x = reshape(I_T, p, p, p * p)       # (p, p, T)
    x = batched_mul(ps.L2, x)           # Apply L2 block-diag
    x = permutedims(x, (2, 1, 3))       # Transpose within blocks
    x = batched_mul(ps.L1, x)           # Apply L1 block-diag
    x = permutedims(x, (2, 1, 3))       # Transpose back

    return reshape(x, p * p, p * p), st  # (T, T)
end

Per-head forward pass:

M = realize(monarch, ps, st)                    # (T, T)
M = M .* causal_mask                             # multiplicative 0/1 mask
x_slice = x[ch_start:ch_end, :, :]               # (D/H, T, B)
x_flat = reshape(permutedims(x_slice, (2,1,3)), T, (D ÷ H) * B)
y_flat = M * x_flat                               # (T, T) x (T, D/H*B)

Outputs from all H heads are concatenated along the channel dimension.

Parameters per head: 2p^3, where p = sqrt(T)
Total parameters: H * 2p^3

No positional encoding is needed: the Monarch matrices learn position-dependent mixing patterns directly, as each realized matrix M encodes fixed but learned position-to-position interactions.

3.4 Organelle 3: LongConv

The third organelle provides global dense causal filtering through a full-length depthwise convolution. Each channel has its own learned kernel of length T (the full context length), initialized with scale sqrt(1/T).

Input: x of shape (D, T, B)
Parameters: kernel of shape (T, D), one full-length kernel per channel
Operation:

x_padded = cat(zeros(eltype(x), D, T-1, B), x; dims=2)   # causal left-padding along the time axis of (D, T, B)
out = depthwise_conv1d(x_padded, kernel)                 # groups = D, kernel_size = T

Computational role: Learns a dense causal filter per channel. Unlike Monarch's structured factored mixing, LongConv can represent arbitrary causal mixing patterns. This gives it strictly more expressive power per channel, but at higher parameter cost.

Complexity: O(T^2 * D * B), quadratic in sequence length (matches attention).
Parameters: T * D (e.g., 256 * 256 = 65,536 for our configuration).

Contrast with Monarch: Where Monarch uses O(T^(3/2)) parameters to learn a structured global mixing pattern, LongConv uses O(T * D) parameters for a dense but per-channel (non-cross-channel) pattern.

3.5 OrganelleGate

The fusion mechanism is a per-channel softmax gate over the three organelle outputs:

Parameters: logits of shape (3, D), initialized to zeros

Forward pass:

weights = softmax(logits; dims=1)              # (3, D) β€” per-channel weights
output = sum(weights[i,:] .* organelle_out[i] for i in 1:3)

Properties:

  • Per-channel routing: Each embedding channel independently chooses its organelle mixture, enabling fine-grained specialization.
  • Softmax constraint: Weights sum to 1 per channel, preventing scale inflation.
  • Zero initialization: All organelles start with equal weight (1/3, 1/3, 1/3), allowing the network to discover the optimal mixture during training.
  • Differentiable: Fully differentiable through softmax, enabling end-to-end gradient-based learning of the gate.

Gate entropy as a diagnostic:

H = -sum(w * log(w + eps)) / D

High entropy (~1.099 for 3 organelles) indicates uniform mixing; low entropy indicates strong specialization. Tracking gate entropy over training reveals whether and when the model discovers organelle specialization.
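The gate and its entropy diagnostic fit in a few lines; an illustrative NumPy sketch (Python rather than the project's Julia; all function names are ours):

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def organelle_gate(outs, logits):
    """Fuse a list of organelle outputs, each (D, T, B), with per-channel
    softmax weights from logits of shape (n_organelles, D)."""
    w = softmax(logits, axis=0)
    return sum(w[i][:, None, None] * o for i, o in enumerate(outs))

def gate_entropy(logits, eps=1e-8):
    """Average per-channel entropy of the gate distribution."""
    w = softmax(logits, axis=0)
    return float(-(w * np.log(w + eps)).sum() / logits.shape[1])
```

With zero-initialized logits, the gate is a uniform average and the entropy sits at its maximum log(3) ≈ 1.099, matching the values reported in Section 6.1.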

3.6 Causal Masking

Unlike transformer attention, which uses additive masking (0 for allowed, -infinity for blocked positions before softmax), Monarch and Symbiogenesis use multiplicative 0/1 masking:

mask[i, j] = j <= i ? 1.0 : 0.0    # lower-triangular
M_causal = M .* mask                 # element-wise multiply

This is applied to the realized Monarch matrix before multiplying by the input sequence. The CausalConv and LongConv organelles enforce causality through left-padding rather than explicit masking.
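The multiplicative mask and the causality it buys can be checked directly; a small NumPy sketch (function name ours):

```python
import numpy as np

def apply_causal_mix(M, x):
    """Zero out the strictly upper triangle of a realized (T, T) mixing
    matrix, then mix a length-T sequence; output i depends only on j <= i."""
    T = M.shape[0]
    mask = np.tril(np.ones((T, T)))     # mask[i, j] = 1 iff j <= i
    return (M * mask) @ x
```

Changing a future input leaves every earlier output unchanged, which is the property the 0/1 mask guarantees regardless of the values in M.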

3.7 Shared Components

RMSNorm (Root Mean Square Layer Normalization):

rms = sqrt(mean(x^2) + eps)
output = (weight .* x) ./ rms

No learnable bias; type-preserving for Float16 mixed precision.
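For reference, a NumPy sketch of RMSNorm over the channel axis of a (D, T, B) tensor (function name ours):

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    """x: (D, T, B); weight: (D,). Normalizes each (t, b) column to unit RMS,
    then applies a per-channel learned scale; no bias term."""
    rms = np.sqrt((x ** 2).mean(axis=0, keepdims=True) + eps)
    return weight[:, None, None] * x / rms
```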

SwiGLU (Swish-Gated Linear Unit):

gate = swish(W1 * x)
value = V * x
output = W2 * (gate .* value)

Hidden dimension adjusted by factor 2/3 and rounded to nearest multiple of 64:

hidden = max(64, floor(2 * D * ffn_mult / 3 / 64) * 64)
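For D = 256 and ffn_mult = 4 this gives a hidden width of 640, the value used in Appendix A; a one-line Python check (function name ours):

```python
def swiglu_hidden(D, ffn_mult=4, multiple=64):
    # The 2/3 adjustment compensates for SwiGLU's third weight matrix;
    # the result is floored to a multiple of 64 for GPU-friendly shapes.
    return max(multiple, (2 * D * ffn_mult // 3) // multiple * multiple)
```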

Weight Tying: Input embedding and output projection share weights, reducing parameters by V * D (e.g., 2000 * 256 = 512K parameters).


4. Gelation Monitoring

4.1 Theoretical Motivation

In polymer physics, gelation is the phase transition where a polymer system transitions from a sol (viscous liquid) to a gel (connected network). Flory-Stockmayer theory predicts a critical conversion point beyond which the system's macroscopic properties change discontinuously.

We hypothesize an analogous phase transition occurs during neural network training: a critical point where the loss landscape connectivity changes qualitatively, correlating with the onset of meaningful generalization. We monitor three complementary signals.

4.2 CUSUM on Loss Curvature

Page's one-sided cumulative sum test detects sudden changes in the second derivative (curvature) of the validation loss curve:

curvature[n] = loss[n] - 2*loss[n-1] + loss[n-2]
deviation = (curvature - baseline_mean) / baseline_std
S_pos = max(0, S_pos + deviation)
S_neg = max(0, S_neg - deviation)

Baseline statistics are computed from the first window (50 observations). A CUSUM breach (S > threshold) indicates a structural change in the loss landscape: the training dynamics have undergone a phase transition.
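A self-contained Python sketch of the detector (function name ours; note this simplified version, like the formulas above, omits the slack parameter of classical CUSUM):

```python
import numpy as np

def cusum_on_curvature(losses, window=50, threshold=5.0):
    """Two-sided CUSUM on the discrete curvature (second difference) of a
    loss curve; returns the loss index at the first breach, else None."""
    curv = np.diff(losses, 2)
    base = curv[:window]
    mu, sigma = base.mean(), base.std() + 1e-12
    s_pos = s_neg = 0.0
    for n in range(window, len(curv)):
        d = (curv[n] - mu) / sigma
        s_pos = max(0.0, s_pos + d)
        s_neg = max(0.0, s_neg - d)
        if max(s_pos, s_neg) > threshold:
            return n + 2               # shift back to an index into losses
    return None
```

On a smooth synthetic loss curve with an injected sudden drop, the breach is flagged at the drop.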

4.3 Gate Entropy

For Symbiogenesis blocks, gate entropy measures organelle specialization:

weights = softmax(gate_logits; dims=1)           # (3, D)
H = -sum(weights .* log(weights + eps)) / D      # average per-channel entropy

Maximum entropy: log(3) = 1.099 (uniform mixing)
Minimum entropy: 0 (single organelle dominates each channel)

A sudden drop in gate entropy indicates the network has "decided" how to use its organelles: a specialization phase transition.

4.4 Kuramoto Order Parameter

Each block is modeled as a phase oscillator, with phase derived from its gate entropy:

theta_j = 2*pi * (H_j - H_min) / (H_max - H_min)    # map entropy to phase
R = |1/N * sum(exp(i*theta_j))|                        # order parameter

R = 1: All blocks are synchronized (convergent dynamics)
R = 0: Blocks are fully desynchronized (independent dynamics)

R > 0.9 triggers a synchronization gelation event, indicating that all blocks have converged to a consistent organelle utilization pattern.
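The order parameter in code (NumPy; function name ours), including the degenerate case where all entropies coincide:

```python
import numpy as np

def kuramoto_R(entropies):
    """Map per-block gate entropies onto phases and return |mean phasor|.
    R near 1 means synchronized blocks; R near 0 means desynchronized."""
    H = np.asarray(entropies, dtype=float)
    span = H.max() - H.min()
    if span == 0.0:
        return 1.0                      # identical entropies: one shared phase
    theta = 2.0 * np.pi * (H - H.min()) / span
    return float(abs(np.exp(1j * theta).mean()))
```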


5. Experimental Setup

5.1 Training Data

All models are trained on the philosophy-corpus, a curated collection of 981 source texts spanning 2,500 years of Western philosophy and mathematics:

  • Sources: BookCorpus, WikiText-103, Project Gutenberg-19, classical philosophy (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, et al.)
  • Processing: Custom text pipeline with deduplication, quality scoring, Unicode normalization
  • Train tokens: 794.9M (pre-encoded as binary)
  • Val tokens: 88.2M
  • Tokenizer: ByteLevel BPE with a 2,000-token vocabulary
  • Training budget: ~100M tokens (Chinchilla-optimal at 20 tokens/parameter for 5M models)

5.2 Model Configurations

| | Transformer | Monarch Mixer | Symbiogenesis |
|---|---|---|---|
| Parameters | 5,037,312 | 4,983,040 | ~5M |
| Embed dim | 256 | 256 | 256 |
| Layers | 6 | 8 | 6-8 |
| Sequence mixing | 4-head attention | 8-head Monarch + conv + gate | 3 organelles + gate |
| Seq mixer params/block | 262K | 67K | ~117K |
| Position encoding | RoPE | None (learned in Monarch) | None (learned in Monarch + LongConv) |
| FFN | SwiGLU | SwiGLU | SwiGLU |
| Normalization | RMSNorm | RMSNorm | RMSNorm |
| Weight tying | Yes | Yes | Yes |
| Context length | 256 | 256 | 256 |

5.3 Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 6e-4 (Transformer, Monarch), 1e-3 (Symbiogenesis) |
| Min learning rate | 6e-5 / 1e-4 |
| LR schedule | Linear warmup (500 steps) + cosine decay |
| Batch size | 32 |
| Max steps | 12,305 |
| Tokens per step | 32 * 256 = 8,192 |
| Total tokens | ~100M |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |

5.4 Implementation

The entire framework is implemented in Julia using:

  • Lux.jl: Explicit-parameter neural network framework
  • Zygote.jl: Automatic differentiation
  • CUDA.jl: GPU acceleration
  • NNlib.jl: Softmax, activations, batched matrix multiplication
  • Optimisers.jl: AdamW with cosine learning rate scheduling
  • JLD2.jl: Model serialization

All three architectures share the same codebase, data pipeline, training loop, and evaluation infrastructure. The architecture is selected at model creation time via a configuration dispatch.


6. Results

6.1 Training Curves

Baseline Transformer:

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | 3.54 | 34.5 |

Monarch Mixer:

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 7.28 | 5.58 | 265.4 |
| 2,000 | 4.29 | 4.21 | 67.6 |
| 6,000 | 3.83 | 3.81 | 45.3 |
| 10,000 | 3.69 | 3.68 | 39.6 |
| 12,305 | 3.66 | 3.65 | 38.4 |

Symbiogenesis (partial, step 1000):

| Step | Train Loss | Val Loss | Val PPL | Gate Entropy |
|---|---|---|---|---|
| 1 | 17.10 | 17.03 | 24.9M | 1.099 |
| 500 | 6.50 | 4.92 | 137.5 | 1.098 |
| 1,000 | 4.43 | 4.38 | 79.9 | 1.094 |
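Validation perplexity in these tables is the exponential of the per-token cross-entropy loss (in nats); a quick Python check:

```python
import math

def val_ppl(val_loss):
    # perplexity = exp(average per-token negative log-likelihood in nats)
    return math.exp(val_loss)
```

For example, exp(3.54) ≈ 34.5 for the final Transformer checkpoint, and exp(17.03) ≈ 24.9M at Symbiogenesis step 1 (near the uninformed-model ceiling for a 2,000-token vocabulary).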

6.2 Head-to-Head Comparison

| | Transformer | Monarch | Symbiogenesis |
|---|---|---|---|
| Final val loss | 3.54 | 3.65 | TBD |
| Final val PPL | 34.5 | 38.4 | TBD |
| Parameters | 5.04M | 4.98M | ~5M |
| Seq mixer params/block | 262K | 67K | 117K |
| Layers | 6 | 8 | 6 |
| Throughput (tok/s) | 26K | 19K | 19K (f32) |
| Training time | 66 min | 89 min | ~88 min |

6.3 Throughput Analysis

Mixed-precision (Float16 AMP) benchmarks on RTX 3060:

| Architecture | F32 tok/s | F16 tok/s | AMP Speedup |
|---|---|---|---|
| Transformer | 26,781 | 30,110 | 1.12x |
| Symbiogenesis (Monarch-based) | 19,169 | 16,007 | 0.84x |

Key finding: AMP provides meaningful speedup for the Transformer (12%), where large attention matrices (256 x 256) benefit from tensor cores. However, Monarch's small block matrices (16 x 16 x 16) do not utilize tensor cores efficiently, making Float32 actually faster than Float16 due to avoided type conversion overhead. Symbiogenesis training should use Float32 precision when the second organelle is Monarch.

6.4 Parameter Efficiency

Sequence mixing parameter comparison (per block):

| Component | Transformer | Monarch | Symbiogenesis |
|---|---|---|---|
| Q, K, V, O projections | 262,144 | - | - |
| CausalConv (K=4) | - | 1,024 | 1,024 |
| Monarch heads | - | 65,536 | 32,768 |
| LongConv | - | - | 65,536 |
| Gate | - | 256 | 768 |
| Total seq mixing | 262,144 | 66,816 | 100,096 |
| Reduction vs Transformer | - | 74% | 62% |

Symbiogenesis achieves 62% parameter reduction in sequence mixing compared to standard attention, while providing three distinct inductive biases. The savings enable either more layers at the same parameter budget or wider embeddings with fewer layers.
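The per-block totals above follow directly from the organelle shapes; a small Python tally (function name ours, defaults taken from our 256d, T=256, 4-head configuration):

```python
def seq_mixer_params(D=256, T=256, K=4, heads=4, organelles=3):
    """Per-block sequence-mixing parameter count for a Symbiogenesis block."""
    p = int(round(T ** 0.5))
    conv = K * D                       # CausalConv kernels
    monarch = heads * 2 * p ** 3       # two (p, p, p) factors per head
    longconv = T * D                   # one full-length kernel per channel
    gate = organelles * D              # OrganelleGate logits
    return conv + monarch + longconv + gate
```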


7. Analysis

7.1 Gate Specialization Dynamics

At step 1000 of Symbiogenesis training, gate entropy remains near-maximal (1.094 vs maximum 1.099), indicating the organelle gate has not yet developed strong per-channel preferences. All three organelles contribute roughly equally to each channel.

This slow specialization may be attributed to:

  1. Redundant capacity: At early training stages, any single organelle can reduce loss; the gradient signal doesn't yet distinguish their contributions.
  2. Softmax saturation: With three organelles, the gradient through softmax is divided three ways, requiring stronger signal for one organelle to dominate.
  3. Initialization symmetry: Zero-initialized gate logits create a symmetric starting point that gradients must break.

We expect specialization to emerge later in training as the loss approaches its asymptote and the model must extract finer-grained patterns.

7.2 Inductive Bias Complementarity

The three organelles provide complementary inductive biases:

| Property | CausalConv | Monarch | LongConv |
|---|---|---|---|
| Receptive field | Local (K tokens) | Global (all T) | Global (all T) |
| Mixing pattern | Per-channel, fixed kernel | Cross-position, structured | Per-channel, dense |
| Parameters | O(K*D) | O(T^(3/2)) per head | O(T*D) |
| Cross-channel | No | Yes (per head slice) | No |
| Position encoding | Implicit (causal padding) | Learned (factored matrices) | Learned (per-channel kernels) |
| Capacity | Low | Medium | High |

CausalConv handles local patterns that are common across channels: n-gram statistics, local syntax. Monarch provides structured global mixing that can capture long-range dependencies with a compact parameterization. LongConv offers the most expressive per-channel mixing, able to learn arbitrary causal filters for each embedding dimension.

7.3 Computational Cost Breakdown

Per-step compute distribution (estimated for D=256, T=256, B=32):

| Component | FLOPs | % of total |
|---|---|---|
| Token embedding | 2M | <1% |
| RMSNorm (x12) | 25M | 1% |
| CausalConv (x6) | 25M | 1% |
| Monarch realize + multiply (x6) | 800M | 26% |
| LongConv (x6) | 3.2B | 42% |
| OrganelleGate (x6) | 12M | <1% |
| SwiGLU FFN (x6) | 1.9B | 25% |
| Output projection | 131M | 2% |

LongConv dominates the compute budget due to its O(T^2 * D) complexity. Future optimizations could replace the spatial-domain convolution with FFT-based convolution (O(T * log(T) * D)), potentially providing a 10-50x speedup in this component.
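The proposed optimization rests on the equivalence of spatial-domain and FFT-domain causal convolution; a NumPy sketch for a single channel (function names ours):

```python
import numpy as np

def causal_conv_direct(x, k):
    """out[t] = sum_{s <= t} k[t - s] * x[s]; O(T^2) per channel."""
    T = len(x)
    return np.array([sum(k[t - s] * x[s] for s in range(t + 1)) for t in range(T)])

def causal_conv_fft(x, k):
    """Same output via FFT in O(T log T); zero-pad to 2T to avoid wrap-around."""
    T = len(x)
    n = 2 * T
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)[:T]
```

Both routes agree to floating-point precision; the FFT route is what would replace the O(T^2 * D) LongConv inner loop.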


8. Related Work

Monarch Mixer (Dao et al., 2023): Sub-quadratic architecture using factored Monarch matrices for both sequence mixing and channel mixing. M2-BERT matches BERT-base at 27% compression. Our Monarch implementation is the first in Julia.

Hyena (Poli et al., 2023): Long convolutions for sequence modeling, replacing attention with learned implicit filters. Our LongConv organelle is similar in spirit but uses explicit per-channel kernels rather than implicit parameterization.

S4/S5 (Gu et al., 2022): Structured state spaces with O(T * log(T)) complexity via HiPPO initialization and diagonal plus low-rank parameterization. S4 targets the same long-range modeling goal as our LongConv organelle.

Mamba (Gu & Dao, 2023): Selective state spaces with input-dependent gating. Mamba's selection mechanism is conceptually related to our OrganelleGate, though it operates within a single mixing mechanism rather than routing between multiple.

Mixture of Experts (Shazeer et al., 2017; Fedus et al., 2022): MoE routes tokens to different FFN experts. Our OrganelleGate is analogous but operates at the sequence mixing level rather than the FFN level, and routes per-channel rather than per-token.

nanoGPT (Karpathy, 2023): Minimal GPT-2 reimplementation. Our baseline Transformer follows this design philosophy.

Depth Delusion (2025): Demonstrates that width matters more than depth at small scale. Influences our decision to use wider embeddings (320d) with fewer layers (6) in Symbiogenesis v2.


9. Implementation Details

9.1 Float16 Mixed-Precision Considerations

During development, we discovered that Julia's type promotion rules can silently undermine Float16 mixed-precision training. When a Float16 tensor operates with a Float32 scalar or tensor, Julia promotes the result to Float32, causing:

  1. Loss of tensor core utilization: cuBLAS falls back to slower mixed-type GEMM paths
  2. Increased memory consumption: Activations stored as Float32 instead of Float16
  3. Performance degradation: The broken AMP path was 3x slower than pure Float32

Three promotion sites were identified and fixed; two representative examples:

# BROKEN: hardcoded Float32 scale
scale = Float32(1.0 / sqrt(Float64(HD)))

# FIXED: match input element type
scale = eltype(q)(1.0 / sqrt(Float64(HD)))

# BROKEN: Float32 caches applied to Float16 inputs
c = cos_cache[:, 1:seq_len]

# FIXED: cast caches to match input type
c = eltype(x).(cos_cache[:, 1:seq_len])

Lesson: In Julia's multiple dispatch system, type promotion is powerful but can be insidious in mixed-precision training. Every constant, cache, and mask must match the expected precision.

9.2 Monarch and Tensor Cores

On NVIDIA Ampere GPUs (RTX 3060), Float16 tensor cores provide acceleration for matrix multiplications where the inner dimensions are multiples of 8 and sufficiently large. Monarch's block matrices are (16, 16, 16), at the borderline of tensor core efficiency. Our benchmarks show Float16 is actually 16% slower than Float32 for Monarch-based models due to:

  1. Type conversion overhead (Float32 master weights -> Float16 forward -> Float32 gradients)
  2. Small matrix sizes not saturating tensor core throughput
  3. Dynamic loss scaling overhead

Recommendation: Use Float32 for Monarch-based architectures on consumer GPUs. Float16 AMP is only beneficial when the dominant operations involve large matrices (e.g., standard attention with T >= 256).

9.3 Zygote Compatibility

All operations in the Symbiogenesis forward pass are compatible with Zygote.jl automatic differentiation. Key patterns:

  • Non-differentiable allocations (padding, masks, identity matrices) are wrapped in Zygote.@ignore
  • Device portability uses a _to_device(reference, x) helper that checks if the reference is a CuArray
  • In-place operations are avoided in the differentiable path; all mutations happen in @ignore blocks
  • Indexing: Monarch head slicing (x[ch_start:ch_end, :, :]) is differentiable through Zygote

10. Deployment

All three models are deployed as HuggingFace Spaces serving OpenAI-compatible APIs:

| Space | Architecture | Endpoint |
|---|---|---|
| JuliaSLM | Transformer | /v1/chat/completions |
| MonarchSLM | Monarch Mixer | /v1/chat/completions |
| SymbioSLM | Symbiogenesis | /v1/chat/completions |

Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime). Each Space downloads its checkpoint from a corresponding HuggingFace model repository on startup.


11. Limitations and Future Work

Current Limitations

  1. LongConv is the bottleneck: O(T^2 * D) complexity per block. FFT-based convolution would reduce this to O(T * log(T) * D), potentially doubling overall throughput.

  2. Gate specialization is slow: At 1000 steps, gate entropy remains near-maximal. Techniques like gate temperature annealing or auxiliary specialization losses could accelerate organelle differentiation.

  3. No custom CUDA kernels: All operations use generic NNlib/CUDA.jl kernels. Fused Monarch realization + causal masking + matmul could provide significant speedup.

  4. Small-scale evaluation: All experiments are at ~5M parameters on a curated corpus. Scaling laws for Symbiogenesis remain unknown.

Future Directions

  1. Neural ODE depth: Replace discrete SymbioBlocks with a continuous-depth Neural ODE using DiffEqFlux.jl, enabling adaptive compute per token.

  2. Sparse organelle masking: Dynamically disable organelles per block based on input difficulty, reducing compute for easy tokens.

  3. Cross-channel LongConv: Replace per-channel LongConv with grouped convolutions that share kernels across related channels, reducing parameters while maintaining expressiveness.

  4. Scaling experiments: Train 50M and 500M parameter Symbiogenesis models to understand scaling behavior of multi-organelle architectures.

  5. Gelation-guided training: Use gelation detection to automatically adjust learning rate, batch size, or architectural parameters at phase transition boundaries.


12. Conclusion

Symbiogenesis demonstrates that multi-organelle sequence mixing is a viable alternative to softmax attention for small language models. By combining three complementary mixing mechanisms (local convolution, global structured mixing, and global dense filtering) through a learned per-channel gate, the architecture achieves competitive quality while providing rich inductive biases and 62% parameter reduction in sequence mixing.

The biological metaphor of symbiogenesis extends naturally: just as eukaryotic cells benefit from specialized organelles with different evolutionary origins, neural network blocks benefit from specialized mixing mechanisms with different mathematical properties. The OrganelleGate learns to exploit this complementarity, creating a "fused organism" that is more than the sum of its parts.


References

  1. Margulis, L. (1967). On the origin of mitosing cells. Journal of Theoretical Biology, 14(3), 225-274.
  2. Dao, T., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. NeurIPS 2023.
  3. Poli, M., et al. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. ICML 2023.
  4. Gu, A., et al. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022.
  5. Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
  6. Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv:2002.05202.
  7. Karpathy, A. (2023). nanoGPT. GitHub repository.
  8. Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.
  9. Zhang, B., & Sennrich, R. (2019). Root Mean Square Layer Normalization.
  10. Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41(1/2), 100-115.
  11. Kuramoto, Y. (1984). Chemical Oscillations, Waves, and Turbulence. Springer.
  12. Flory, P. J. (1941). Molecular Size Distribution in Three Dimensional Polymers. Journal of the American Chemical Society, 63(11), 3083-3090.

Appendix A: Parameter Count Details

5M Symbiogenesis (256d, 6 layers, 4 Monarch heads)

Embedding:                    2000 x 256 =   512,000  (tied with output)

Per block (x6):
  RMSNorm x 2:                256 x 2    =       512
  CausalConv:                 4 x 256    =     1,024
  Monarch (4 heads):     4 x 2 x 16^3   =    32,768
  LongConv:              256 x 256       =    65,536
  OrganelleGate:               3 x 256   =       768
  SwiGLU FFN:
    W1: 256 x 640              =           163,840
    V:  256 x 640              =           163,840
    W2: 640 x 256              =           163,840
  Block total:                             592,128

6 blocks:                                3,552,768
Final RMSNorm:                                  256
Embedding (tied):                           512,000

TOTAL:                                   4,065,024

5M Transformer (256d, 6 layers, 4 heads)

Embedding:                    2000 x 256 =   512,000  (tied with output)

Per block (x6):
  RMSNorm x 2:                256 x 2    =       512
  Attention (Q,K,V,O):   4 x 256 x 256  =   262,144
  SwiGLU FFN:
    W1, V, W2:           3 x 256 x 640  =   491,520
  Block total:                             754,176

6 blocks:                                4,525,056
Final RMSNorm:                                  256
Embedding (tied):                           512,000

TOTAL:                                   5,037,312

Appendix B: Generated Text Samples

[To be added after full training completion]


Built entirely in Julia. MIT License.