# Symbiogenesis: Multi-Organelle Sequence Mixing for Small Language Models

**Authors:** LisaMegaWatts
**Date:** February 2026
**Repository:** [buildwithbooks/julia-slm](https://github.com/buildwithbooks/julia-slm)
**Live Demo:** [HuggingFace Space](https://huggingface.co/spaces/LisaMegaWatts/SymbioSLM)

---

## Abstract

We introduce Symbiogenesis, a novel sequence mixing architecture for decoder-only language models that replaces softmax attention with three complementary "organelles" fused via learned per-channel gating. Inspired by the biological theory of symbiogenesis (Margulis, 1967) — where complex cellular organelles originated as free-living organisms that fused into a single cell — each block contains: (1) a CausalConv organelle for local n-gram patterns, (2) multi-head Monarch matrices for global sub-quadratic mixing, and (3) a LongConv organelle for dense causal filtering. A per-channel softmax OrganelleGate learns which organelle each embedding channel relies on, creating a specialized "fused organism" per block.

We implement and train three model variants (~5M parameters each) entirely in Julia using Lux.jl on a curated corpus of classical philosophy texts (100M tokens). Against a baseline Transformer (RoPE + SwiGLU + RMSNorm), Symbiogenesis achieves competitive perplexity while providing a richer set of inductive biases for sequence modeling. To our knowledge, this represents the first implementation of both Monarch Mixer and the Symbiogenesis architecture in Julia.

---

## 1. Introduction

### 1.1 Motivation

The dominant paradigm in sequence modeling — softmax attention — computes a dynamic, input-dependent mixing matrix at each layer. This flexibility comes at a cost: O(T^2) compute and memory in sequence length T, and a parameter budget of 4D^2 per layer (for Q, K, V, O projections in a D-dimensional model). Recent work on structured sequence mixing (Monarch Mixer, Hyena, S4, Mamba) has shown that fixed or semi-structured mixing patterns can match attention quality at significantly lower parameter and compute costs.

We ask: **what happens when we give each block access to multiple complementary mixing mechanisms and let the model learn to route between them?** Biological evolution solved this problem via symbiogenesis — mitochondria and chloroplasts were once independent organisms that fused into eukaryotic cells, with each organelle handling a specialized function. We apply this principle to sequence mixing.

### 1.2 Contributions

1. **Symbiogenesis architecture**: A multi-organelle block design with three complementary sequence mixers (local convolution, global structured mixing, global dense convolution) fused via learned per-channel gating.

2. **First Julia implementation of Monarch Mixer**: A complete, GPU-accelerated implementation using Lux.jl, Zygote.jl, and NNlib.jl with Float16 mixed-precision support.

3. **Gelation monitoring**: A training diagnostic framework inspired by polymer physics (Flory-Stockmayer theory) that detects training phase transitions using CUSUM on loss curvature, gate entropy tracking, and Kuramoto order parameter synchronization.

4. **Head-to-head comparison**: Three architectures (Transformer, Monarch, Symbiogenesis) trained on identical data with matched parameter budgets, all in pure Julia.

---

## 2. Background

### 2.1 Softmax Attention

Standard causal self-attention computes:

```
Q, K, V = W_q·x, W_k·x, W_v·x
Attn = softmax(Q·K^T / sqrt(d_k) + mask) · V
```

**Parameters per layer:** 4D^2 (for D-dimensional embeddings with H heads)
**Complexity:** O(T^2·D) compute, O(T^2·H) memory
**Strengths:** Dynamic, input-dependent mixing; proven at scale
**Weaknesses:** Quadratic scaling; large parameter footprint in sequence mixing

### 2.2 Monarch Matrices

Monarch matrices (Dao et al., 2022) factorize a T x T mixing matrix as:

```
M = P^T · BlockDiag(L1) · P · BlockDiag(L2)
```

where T = p^2 (e.g., T=256, p=16), P is a reshape-transpose permutation, and L1, L2 are tensors of shape (p, p, p), each representing a block-diagonal matrix with p blocks of size p x p.

**Parameters:** 2p^3 = 2T^(3/2) per head (e.g., 8,192 for T=256)
**vs. Dense:** T^2 = 65,536 — an **87.5% reduction**
**Complexity:** O(T^(3/2)) per sequence in factored form; here the matrix is realized explicitly and applied densely at O(T^2) cost, as required for causal masking

The factored structure captures global mixing patterns through two stages of local block-diagonal operations separated by a permutation, analogous to the butterfly operations in FFT.
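
As a toy illustration, the factorization can be materialized directly from random block factors. The shapes follow the text; `p = 4` (T = 16) and the random factors are illustrative only (the models use T = 256, p = 16):

```julia
# Toy materialization of M = P' * BlockDiag(L1) * P * BlockDiag(L2) for T = p^2.
function block_diag(L::Array{Float64,3})
    p = size(L, 1)
    M = zeros(p * p, p * p)
    for b in 1:p
        M[(b-1)*p+1:b*p, (b-1)*p+1:b*p] = L[:, :, b]   # place block b on the diagonal
    end
    return M
end

# Reshape-transpose permutation: sends flat index (i-1)*p + j to (j-1)*p + i.
function perm_matrix(p::Int)
    P = zeros(p * p, p * p)
    for i in 1:p, j in 1:p
        P[(j-1)*p + i, (i-1)*p + j] = 1.0
    end
    return P
end

p = 4                                    # toy T = 16
L1, L2 = randn(p, p, p), randn(p, p, p)
P = perm_matrix(p)
M = P' * block_diag(L1) * P * block_diag(L2)

@assert size(M) == (p^2, p^2)
@assert 2 * p^3 < (p^2)^2                # 128 factored parameters vs 256 dense
```

Even at this toy size the factored parameter count (2p^3 = 128) undercuts the dense T^2 = 256, and the gap widens as T grows.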

### 2.3 Symbiogenesis Theory

Lynn Margulis' endosymbiotic theory (1967) proposes that eukaryotic cells originated through the fusion of distinct prokaryotic organisms. Mitochondria and chloroplasts retain their own DNA, demonstrating their origin as independent entities that became integrated into a larger whole.

We apply this biological principle to neural architecture: rather than choosing a single sequence mixing mechanism, we provide each block with multiple complementary "organelles" and let learning determine how to combine them. The OrganelleGate acts as the cell membrane, mediating the fusion.

---

## 3. Architecture

### 3.1 Overview

```
JuliaGPTModel (symbiogenesis)
+-- tok_emb: Embedding(V -> D)         [weight-tied with output head]
+-- blocks x N:
|   +-- ln1: RMSNorm(D)
|   +-- seq_mixer: SymbioSequenceMixer
|   |   +-- conv: CausalDepthwiseConv1d(D, K=4)        [Organelle 1: Local]
|   |   +-- monarchs: H x MonarchMatrix(T, p)          [Organelle 2: Global structured]
|   |   +-- longconv: LongConv(D, T)                   [Organelle 3: Global dense]
|   |   +-- gate: OrganelleGate(D, 3)                  [Per-channel fusion]
|   +-- ln2: RMSNorm(D)
|   +-- ffn: SwiGLU(D -> hidden -> D)
+-- ln_f: RMSNorm(D)
+-- head: TiedEmbeddingHead -> (V,)
```

Each block follows the pre-norm residual pattern:

```
h = x + SequenceMixer(RMSNorm(x))
out = h + SwiGLU(RMSNorm(h))
```

### 3.2 Organelle 1: CausalDepthwiseConv1d

The simplest organelle provides local context through a short causal convolution. Each embedding channel has its own 1D kernel of length K (typically K=4), implementing depthwise convolution with causal left-padding.

**Input:** x of shape (D, T, B)
**Parameters:** kernel of shape (K, D)
**Operation:**
```
x_padded = cat(zeros(K-1, D, B), x; dims=1)   # causal left-padding
out = depthwise_conv1d(x_padded, kernel)        # groups = D
```

**Computational role:** Captures local n-gram patterns (bigrams, trigrams, 4-grams). Analogous to the causal convolution in Monarch Mixer and the short convolution in Hyena/Mamba.

**Complexity:** O(K * D * T * B) — linear in sequence length.
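
A loop-level sketch of the same operation (plain loops instead of NNlib's grouped convolution); the impulse test at the end illustrates causality:

```julia
# Minimal causal depthwise convolution: each channel d has its own length-K
# kernel, and output position t only sees inputs at t-K+1 .. t.
function causal_depthwise_conv(x::Array{Float64,3}, kernel::Matrix{Float64})
    D, T, B = size(x)
    K = size(kernel, 1)
    y = zeros(D, T, B)
    for b in 1:B, t in 1:T, d in 1:D
        for k in 1:K
            src = t - (K - k)                       # kernel[K, d] hits the current token
            src >= 1 && (y[d, t, b] += kernel[k, d] * x[d, src, b])
        end
    end
    return y
end

D, T, B, K = 4, 8, 1, 4
x = zeros(D, T, B); x[:, 5, 1] .= 1.0               # impulse at position 5
y = causal_depthwise_conv(x, ones(K, D))

@assert all(y[:, 1:4, 1] .== 0)                     # nothing leaks backward in time
@assert all(y[:, 5:8, 1] .== 1)                     # impulse spreads forward only, K steps
```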

### 3.3 Organelle 2: Multi-Head Monarch

The Monarch organelle provides global sequence mixing through factored matrix multiplication. The embed dimension D is split into H heads, each with D/H channels.

**Realization** of a Monarch matrix from factors L1, L2:

```julia
function realize(l::MonarchMatrix, ps, st)
    p = l.block_size                    # sqrt(T)
    I_T = st.identity                   # (T, T) identity matrix

    x = reshape(I_T, p, p, p * p)       # (p, p, T)
    x = batched_mul(ps.L2, x)           # Apply L2 block-diag
    x = permutedims(x, (2, 1, 3))       # Transpose within blocks
    x = batched_mul(ps.L1, x)           # Apply L1 block-diag
    x = permutedims(x, (2, 1, 3))       # Transpose back

    return reshape(x, p * p, p * p), st  # (T, T)
end
```

**Per-head forward pass:**

```julia
M = realize(monarch, ps, st)                    # (T, T)
M = M .* causal_mask                             # multiplicative 0/1 mask
x_slice = x[ch_start:ch_end, :, :]               # (D/H, T, B)
x_flat = reshape(permutedims(x_slice, (2,1,3)), T, D/H * B)
y_flat = M * x_flat                               # (T, T) x (T, D/H*B)
```

Outputs from all H heads are concatenated along the channel dimension.

**Parameters per head:** 2p^3 where p = sqrt(T)
**Total parameters:** H * 2p^3

**No positional encoding needed** — the Monarch matrices learn position-dependent mixing patterns directly, as each realized matrix M encodes fixed but learned position-to-position interactions.

### 3.4 Organelle 3: LongConv

The third organelle provides global dense causal filtering through a full-length depthwise convolution. Each channel has its own learned kernel of length T (the full context length), initialized with scale sqrt(1/T).

**Input:** x of shape (D, T, B)
**Parameters:** kernel of shape (T, D) — one full-length kernel per channel
**Operation:**
```
x_padded = cat(zeros(T-1, D, B), x; dims=1)    # causal left-padding
out = depthwise_conv1d(x_padded, kernel)         # groups = D, kernel_size = T
```

**Computational role:** Learns a dense causal filter per channel. Unlike Monarch's structured factored mixing, LongConv can represent arbitrary causal mixing patterns. This gives it strictly more expressive power per channel, but at higher parameter cost.

**Complexity:** O(T^2 * D * B) — quadratic in sequence length (matches attention).
**Parameters:** T * D (e.g., 256 * 256 = 65,536 for our configuration).

**Contrast with Monarch:** Where Monarch uses O(T^(3/2)) parameters to learn a structured global mixing pattern, LongConv uses O(T * D) parameters for a dense but per-channel (non-cross-channel) pattern.

### 3.5 OrganelleGate

The fusion mechanism is a per-channel softmax gate over the three organelle outputs:

**Parameters:** logits of shape (3, D), initialized to zeros

**Forward pass:**
```julia
weights = softmax(logits; dims=1)              # (3, D) — per-channel weights
output = sum(weights[i,:] .* organelle_out[i] for i in 1:3)
```

**Properties:**
- **Per-channel routing:** Each embedding channel independently chooses its organelle mixture, enabling fine-grained specialization.
- **Softmax constraint:** Weights sum to 1 per channel, preventing scale inflation.
- **Zero initialization:** All organelles start with equal weight (1/3, 1/3, 1/3), allowing the network to discover the optimal mixture during training.
- **Differentiable:** Fully differentiable through softmax, enabling end-to-end gradient-based learning of the gate.

**Gate entropy** as a diagnostic:
```
H = -sum(w * log(w + eps)) / D
```
High entropy (~1.099 for 3 organelles) indicates uniform mixing; low entropy indicates strong specialization. Tracking gate entropy over training reveals whether and when the model discovers organelle specialization.
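
The gate forward pass and the entropy diagnostic fit in a few lines. Shapes follow the text; the toy organelle outputs are illustrative:

```julia
# Per-channel softmax gate over three organelle outputs, plus the entropy diagnostic.
function softmax_cols(z)
    e = exp.(z .- maximum(z; dims = 1))
    return e ./ sum(e; dims = 1)
end

D, T = 8, 4
logits = zeros(3, D)                           # zero init -> uniform 1/3 weights
w = softmax_cols(logits)                       # (3, D) per-channel weights
organelles = [randn(D, T) for _ in 1:3]        # toy organelle outputs, each (D, T)
fused = sum(w[i, :] .* organelles[i] for i in 1:3)

gate_entropy(w) = -sum(w .* log.(w .+ 1e-9)) / size(w, 2)

@assert size(fused) == (D, T)
@assert isapprox(gate_entropy(w), log(3); atol = 1e-4)      # uniform mixing ≈ 1.099

peaked = softmax_cols(vcat(fill(10.0, 1, D), zeros(2, D)))  # one organelle dominates
@assert gate_entropy(peaked) < 0.01                          # near-zero entropy
```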

### 3.6 Causal Masking

Unlike transformer attention, which uses additive masking (0 for allowed, -infinity for blocked positions before softmax), Monarch and Symbiogenesis use **multiplicative 0/1 masking**:

```julia
mask[i, j] = j <= i ? 1.0 : 0.0    # lower-triangular
M_causal = M .* mask                 # element-wise multiply
```

This is applied to the realized Monarch matrix before multiplying by the input sequence. The CausalConv and LongConv organelles enforce causality through left-padding rather than explicit masking.

### 3.7 Shared Components

**RMSNorm** (Root Mean Square Layer Normalization):
```
rms = sqrt(mean(x^2) + eps)
output = (weight .* x) ./ rms
```
No learnable bias; type-preserving for Float16 mixed precision.

**SwiGLU** (Swish-Gated Linear Unit):
```
gate = swish(W1 * x)
value = V * x
output = W2 * (gate .* value)
```
The hidden dimension is scaled by a factor of 2/3 and rounded down to a multiple of 64:
```
hidden = max(64, floor(2 * D * ffn_mult / 3 / 64) * 64)
```
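
Plugging in this report's configuration reproduces the 256 -> 640 hidden width used in Appendix A; `ffn_mult = 4` is an assumed expansion multiplier consistent with that width:

```julia
# Worked instance of the hidden-width rule above. `ffn_mult = 4` is an assumption;
# it reproduces the 256 -> 640 SwiGLU width in Appendix A.
swiglu_hidden(D, ffn_mult) = max(64, (2 * D * ffn_mult) ÷ 3 ÷ 64 * 64)

@assert swiglu_hidden(256, 4) == 640
@assert swiglu_hidden(64, 4) == 128    # small widths still land on a multiple of 64
```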

**Weight Tying:** Input embedding and output projection share weights, reducing parameters by V * D (e.g., 2000 * 256 = 512K parameters).

---

## 4. Gelation Monitoring

### 4.1 Theoretical Motivation

In polymer physics, gelation is the phase transition where a polymer system transitions from a sol (viscous liquid) to a gel (connected network). Flory-Stockmayer theory predicts a critical conversion point beyond which the system's macroscopic properties change discontinuously.

We hypothesize an analogous phase transition occurs during neural network training: a critical point where the loss landscape connectivity changes qualitatively, correlating with the onset of meaningful generalization. We monitor three complementary signals.

### 4.2 CUSUM on Loss Curvature

Page's one-sided cumulative sum test detects sudden changes in the second derivative (curvature) of the validation loss curve:

```
curvature[n] = loss[n] - 2*loss[n-1] + loss[n-2]
deviation = (curvature - baseline_mean) / baseline_std
S_pos = max(0, S_pos + deviation)
S_neg = max(0, S_neg - deviation)
```

Baseline statistics are computed from the first window (50 observations). A CUSUM breach (S > threshold) indicates a structural change in the loss landscape — the training dynamics have undergone a phase transition.
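
The detector can be sketched end-to-end as follows; `window` and `threshold` are assumed values, and the synthetic loss curves are illustrative:

```julia
# Page's one-sided CUSUM on the discrete curvature of a loss curve.
function cusum_breaches(losses::Vector{Float64}; window::Int = 50, threshold::Float64 = 10.0)
    curv = [losses[n] - 2 * losses[n-1] + losses[n-2] for n in 3:length(losses)]
    base = curv[1:window]                          # baseline from the first window
    mu = sum(base) / window
    sigma = sqrt(sum((base .- mu) .^ 2) / window)
    s_pos = s_neg = 0.0
    breaches = Int[]
    for (i, c) in enumerate(curv)
        dev = (c - mu) / (sigma + 1e-12)
        s_pos = max(0.0, s_pos + dev)              # one-sided accumulators
        s_neg = max(0.0, s_neg - dev)
        (s_pos > threshold || s_neg > threshold) && push!(breaches, i + 2)
    end
    return breaches
end

# A gently decaying, mildly oscillating curve stays quiet; an abrupt 0.5 drop
# at step 120 changes the curvature statistics and triggers a breach.
smooth = [4.0 - 0.005n + 0.01 * sin(0.7n) for n in 1:200]
kinked = copy(smooth); kinked[120:end] .-= 0.5

@assert isempty(cusum_breaches(smooth))
@assert !isempty(cusum_breaches(kinked))
```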

### 4.3 Gate Entropy

For Symbiogenesis blocks, gate entropy measures organelle specialization:

```
weights = softmax(gate_logits; dims=1)           # (3, D)
H = -sum(weights .* log(weights + eps)) / D      # average per-channel entropy
```

**Maximum entropy:** log(3) = 1.099 (uniform mixing)
**Minimum entropy:** 0 (single organelle dominates each channel)

A sudden drop in gate entropy indicates the network has "decided" how to use its organelles — a specialization phase transition.

### 4.4 Kuramoto Order Parameter

Each block is modeled as a phase oscillator, with phase derived from its gate entropy:

```
theta_j = 2*pi * (H_j - H_min) / (H_max - H_min)    # map entropy to phase
R = |1/N * sum(exp(i*theta_j))|                        # order parameter
```

**R = 1:** All blocks are synchronized (convergent dynamics)
**R = 0:** Blocks are fully desynchronized (independent dynamics)

R > 0.9 triggers a synchronization gelation event, indicating that all blocks have converged to a consistent organelle utilization pattern.
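
The order parameter over per-block gate entropies is a one-liner plus the entropy-to-phase mapping from the text:

```julia
# Kuramoto order parameter R over per-block gate entropies.
function kuramoto_R(entropies)
    H_min, H_max = extrema(entropies)
    H_max == H_min && return 1.0                          # identical entropies: synchronized
    theta = 2pi .* (entropies .- H_min) ./ (H_max - H_min)  # map entropy to phase
    return abs(sum(exp.(im .* theta)) / length(theta))
end

@assert kuramoto_R(fill(1.05, 6)) == 1.0                    # all blocks identical
@assert kuramoto_R([0.0, 0.25, 0.5, 0.75] .* log(3)) < 0.5  # spread-out phases, low R
```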

---

## 5. Experimental Setup

### 5.1 Training Data

All models are trained on the [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) — a curated collection of 981 source texts spanning 2,500 years of Western philosophy and mathematics:

- **Sources:** BookCorpus, WikiText-103, PG-19 (Project Gutenberg), classical philosophy (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, et al.)
- **Processing:** Custom text pipeline with deduplication, quality scoring, Unicode normalization
- **Train tokens:** 794.9M (pre-encoded as binary)
- **Val tokens:** 88.2M
- **Tokenizer:** Byte-level BPE with a 2,000-token vocabulary
- **Training budget:** ~100M tokens (Chinchilla-optimal at 20 tokens/parameter for 5M models)

### 5.2 Model Configurations

| | Transformer | Monarch Mixer | Symbiogenesis |
|---|---|---|---|
| **Parameters** | 5,037,312 | 4,983,040 | ~5M |
| **Embed dim** | 256 | 256 | 256 |
| **Layers** | 6 | 8 | 6-8 |
| **Sequence mixing** | 4-head attention | 8-head Monarch + conv + gate | 3 organelles + gate |
| **Seq mixer params/block** | 262K | 67K | ~100K |
| **Position encoding** | RoPE | None (learned in Monarch) | None (learned in Monarch + LongConv) |
| **FFN** | SwiGLU | SwiGLU | SwiGLU |
| **Normalization** | RMSNorm | RMSNorm | RMSNorm |
| **Weight tying** | Yes | Yes | Yes |
| **Context length** | 256 | 256 | 256 |

### 5.3 Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 6e-4 (Transformer, Monarch), 1e-3 (Symbiogenesis) |
| Min learning rate | 6e-5 / 1e-4 |
| LR schedule | Linear warmup (500 steps) + cosine decay |
| Batch size | 32 |
| Max steps | 12,305 |
| Tokens per step | 32 * 256 = 8,192 |
| Total tokens | ~100M |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
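
A quick spot-check of the budget arithmetic in the table:

```julia
# Tokens per step and total token budget from the training configuration.
batch, ctx, steps = 32, 256, 12_305
@assert batch * ctx == 8_192                                 # tokens per step
@assert isapprox(batch * ctx * steps, 1.0e8; rtol = 0.01)    # ~100M total tokens
```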

### 5.4 Implementation

The entire framework is implemented in Julia using:

- **Lux.jl** — Explicit-parameter neural network framework
- **Zygote.jl** — Automatic differentiation
- **CUDA.jl** — GPU acceleration
- **NNlib.jl** — Softmax, activations, batched matrix multiplication
- **Optimisers.jl** — AdamW with cosine learning rate scheduling
- **JLD2.jl** — Model serialization

All three architectures share the same codebase, data pipeline, training loop, and evaluation infrastructure. The architecture is selected at model creation time via a configuration dispatch.

---

## 6. Results

### 6.1 Training Curves

**Baseline Transformer:**

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | **3.54** | **34.5** |

**Monarch Mixer:**

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 7.28 | 5.58 | 265.4 |
| 2,000 | 4.29 | 4.21 | 67.6 |
| 6,000 | 3.83 | 3.81 | 45.3 |
| 10,000 | 3.69 | 3.68 | 39.6 |
| 12,305 | 3.66 | **3.65** | **38.4** |

**Symbiogenesis (partial, step 1000):**

| Step | Train Loss | Val Loss | Val PPL | Gate Entropy |
|---|---|---|---|---|
| 1 | 17.10 | 17.03 | 24.9M | 1.099 |
| 500 | 6.50 | 4.92 | 137.5 | 1.098 |
| 1,000 | 4.43 | **4.38** | **79.9** | 1.094 |

### 6.2 Head-to-Head Comparison

| | Transformer | Monarch | Symbiogenesis |
|---|---|---|---|
| Final val loss | **3.54** | 3.65 | TBD |
| Final val PPL | **34.5** | 38.4 | TBD |
| Parameters | 5.04M | 4.98M | ~5M |
| Seq mixer params/block | 262K | **67K** | 100K |
| Layers | 6 | 8 | 6 |
| Throughput (tok/s) | **26K** | 19K | 19K (f32) |
| Training time | **66 min** | 89 min | ~88 min |

### 6.3 Throughput Analysis

Mixed-precision (Float16 AMP) benchmarks on RTX 3060:

| Architecture | F32 tok/s | F16 tok/s | AMP Speedup |
|---|---|---|---|
| Transformer | 26,781 | **30,110** | **1.12x** |
| Symbiogenesis (Monarch-based) | **19,169** | 16,007 | 0.84x |

**Key finding:** AMP provides a meaningful speedup for the Transformer (12%), whose large attention matrices (256 x 256) benefit from tensor cores. Monarch's small block matrices (16 x 16 x 16), however, do not use tensor cores efficiently, so Float32 is actually faster than Float16 once type-conversion overhead is accounted for. Symbiogenesis models, whose global-mixing organelle is Monarch-based, should therefore train in Float32.

### 6.4 Parameter Efficiency

Sequence mixing parameter comparison (per block):

| Component | Transformer | Monarch | Symbiogenesis |
|---|---|---|---|
| Q, K, V, O projections | 262,144 | - | - |
| CausalConv (K=4) | - | 1,024 | 1,024 |
| Monarch heads | - | 65,536 | 32,768 |
| LongConv | - | - | 65,536 |
| Gate | - | 256 | 768 |
| **Total seq mixing** | **262,144** | **66,816** | **100,096** |
| **Reduction vs Transformer** | - | **74%** | **62%** |

Symbiogenesis achieves 62% parameter reduction in sequence mixing compared to standard attention, while providing three distinct inductive biases. The savings enable either more layers at the same parameter budget or wider embeddings with fewer layers.
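
The per-block totals above can be re-derived from the organelle shapes in Section 3. Head counts follow Appendix A and Section 5.2; the (1, D) gate for the Monarch model is inferred from the 256 figure in the table:

```julia
# Re-deriving the per-block sequence-mixing parameter counts
# (D = 256, T = 256, p = 16, K = 4; 8 Monarch heads in the Monarch model,
# 4 in Symbiogenesis).
D, T, p, K = 256, 256, 16, 4
conv = K * D                   # CausalConv kernel (K, D)
head = 2 * p^3                 # one Monarch head: two (p, p, p) factors
long = T * D                   # LongConv kernel (T, D)
attn = 4 * D^2                 # Q, K, V, O projections

@assert attn == 262_144
@assert conv + 8 * head + 1 * D == 66_816          # Monarch Mixer block
@assert conv + 4 * head + long + 3 * D == 100_096  # Symbiogenesis block
```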

---

## 7. Analysis

### 7.1 Gate Specialization Dynamics

At step 1000 of Symbiogenesis training, gate entropy remains near-maximal (1.094 vs maximum 1.099), indicating the organelle gate has not yet developed strong per-channel preferences. All three organelles contribute roughly equally to each channel.

This slow specialization may be attributed to:

1. **Redundant capacity:** At early training stages, any single organelle can reduce loss — the gradient signal doesn't yet distinguish their contributions.
2. **Softmax saturation:** With three organelles, the gradient through softmax is divided three ways, requiring stronger signal for one organelle to dominate.
3. **Initialization symmetry:** Zero-initialized gate logits create a symmetric starting point that gradients must break.

We expect specialization to emerge later in training as the loss approaches its asymptote and the model must extract finer-grained patterns.

### 7.2 Inductive Bias Complementarity

The three organelles provide complementary inductive biases:

| Property | CausalConv | Monarch | LongConv |
|---|---|---|---|
| Receptive field | Local (K tokens) | Global (all T) | Global (all T) |
| Mixing pattern | Per-channel, fixed kernel | Cross-position, structured | Per-channel, dense |
| Parameters | O(K*D) | O(T^(3/2)) per head | O(T*D) |
| Cross-channel | No | Yes (per head slice) | No |
| Position encoding | Implicit (causal padding) | Learned (factored matrices) | Learned (per-channel kernels) |
| Capacity | Low | Medium | High |

**CausalConv** handles local patterns that are common across channels — n-gram statistics, local syntax. **Monarch** provides structured global mixing that can capture long-range dependencies with a compact parameterization. **LongConv** offers the most expressive per-channel mixing, able to learn arbitrary causal filters for each embedding dimension.

### 7.3 Computational Cost Breakdown

Per-step compute distribution (estimated for D=256, T=256, B=32):

| Component | FLOPs | % of total |
|---|---|---|
| Token embedding | 2M | <1% |
| RMSNorm (x12) | 25M | <1% |
| CausalConv (x6) | 25M | <1% |
| Monarch realize + multiply (x6) | 800M | 13% |
| LongConv (x6) | 3.2B | **53%** |
| OrganelleGate (x6) | 12M | <1% |
| SwiGLU FFN (x6) | 1.9B | 31% |
| Output projection | 131M | 2% |

**LongConv dominates** the compute budget due to its O(T^2 * D) complexity. Future optimizations could replace the spatial-domain convolution with FFT-based convolution (O(T * log(T) * D)), potentially providing a 10-50x speedup in this component.

---

## 8. Related Work

**Monarch Mixer** (Fu et al., 2023): Sub-quadratic architecture using factored Monarch matrices for both sequence mixing and channel mixing. M2-BERT matches BERT-base at 27% compression. Our Monarch implementation is the first in Julia.

**Hyena** (Poli et al., 2023): Long convolutions for sequence modeling, replacing attention with learned implicit filters. Our LongConv organelle is similar in spirit but uses explicit per-channel kernels rather than implicit parameterization.

**S4/S5** (Gu et al., 2022): Structured state spaces with O(T * log(T)) complexity via HiPPO initialization and diagonal plus low-rank parameterization. S4 targets the same long-range modeling goal as our LongConv organelle.

**Mamba** (Gu & Dao, 2023): Selective state spaces with input-dependent gating. Mamba's selection mechanism is conceptually related to our OrganelleGate, though it operates within a single mixing mechanism rather than routing between multiple.

**Mixture of Experts** (Shazeer et al., 2017; Fedus et al., 2022): MoE routes tokens to different FFN experts. Our OrganelleGate is analogous but operates at the sequence mixing level rather than the FFN level, and routes per-channel rather than per-token.

**nanoGPT** (Karpathy, 2023): Minimal GPT-2 reimplementation. Our baseline Transformer follows this design philosophy.

**Depth Delusion** (2025): Demonstrates that width matters more than depth at small scale. This influenced our decision to use wider embeddings (320d) with fewer layers (6) in Symbiogenesis v2.

---

## 9. Implementation Details

### 9.1 Float16 Mixed-Precision Considerations

During development, we discovered that Julia's type promotion rules can silently undermine Float16 mixed-precision training. When a Float16 tensor operates with a Float32 scalar or tensor, Julia promotes the result to Float32, causing:

1. **Loss of tensor core utilization:** cuBLAS falls back to slower mixed-type GEMM paths
2. **Increased memory consumption:** Activations stored as Float32 instead of Float16
3. **Performance degradation:** The broken AMP path was **3x slower** than pure Float32

Three promotion sites were identified and fixed:

```julia
# BROKEN: hardcoded Float32 scale
scale = Float32(1.0 / sqrt(Float64(HD)))

# FIXED: match input element type
scale = eltype(q)(1.0 / sqrt(Float64(HD)))
```

```julia
# BROKEN: Float32 caches applied to Float16 inputs
c = cos_cache[:, 1:seq_len]

# FIXED: cast caches to match input type
c = eltype(x).(cos_cache[:, 1:seq_len])
```

**Lesson:** In Julia's multiple dispatch system, type promotion is powerful but can be insidious in mixed-precision training. Every constant, cache, and mask must match the expected precision.
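
The hazard is easy to reproduce in isolation; this toy example (not from the codebase) shows the silent widening:

```julia
# Mixing a Float32 scalar into a Float16 computation silently widens the result.
x = ones(Float16, 4)

bad  = Float32(0.5) .* x     # promotes the whole result to Float32
good = eltype(x)(0.5) .* x   # scalar cast to the input type: stays Float16

@assert eltype(bad)  == Float32
@assert eltype(good) == Float16
```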

### 9.2 Monarch and Tensor Cores

On NVIDIA Ampere GPUs (RTX 3060), Float16 tensor cores provide acceleration for matrix multiplications where the inner dimensions are multiples of 8 and sufficiently large. Monarch's block matrices are (16, 16, 16) — at the borderline of tensor core efficiency. Our benchmarks show Float16 is actually **16% slower** than Float32 for Monarch-based models due to:

1. Type conversion overhead (Float32 master weights -> Float16 forward -> Float32 gradients)
2. Small matrix sizes not saturating tensor core throughput
3. Dynamic loss scaling overhead

**Recommendation:** Use Float32 for Monarch-based architectures on consumer GPUs. Float16 AMP is only beneficial when the dominant operations involve large matrices (e.g., standard attention with T >= 256).

### 9.3 Zygote Compatibility

All operations in the Symbiogenesis forward pass are compatible with Zygote.jl automatic differentiation. Key patterns:

- **Non-differentiable allocations** (padding, masks, identity matrices) are wrapped in `Zygote.@ignore`
- **Device portability** uses a `_to_device(reference, x)` helper that checks if the reference is a CuArray
- **In-place operations** are avoided in the differentiable path; all mutations happen in `@ignore` blocks
- **Indexing:** Monarch head slicing (`x[ch_start:ch_end, :, :]`) is differentiable through Zygote

---

## 10. Deployment

All three models are deployed as HuggingFace Spaces serving OpenAI-compatible APIs:

| Space | Architecture | Endpoint |
|---|---|---|
| [JuliaSLM](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM) | Transformer | `/v1/chat/completions` |
| [MonarchSLM](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM) | Monarch Mixer | `/v1/chat/completions` |
| [SymbioSLM](https://huggingface.co/spaces/LisaMegaWatts/SymbioSLM) | Symbiogenesis | `/v1/chat/completions` |

Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime). Each Space downloads its checkpoint from a corresponding HuggingFace model repository on startup.

---

## 11. Limitations and Future Work

### Current Limitations

1. **LongConv is the bottleneck:** O(T^2 * D) complexity per block. FFT-based convolution would reduce this to O(T * log(T) * D), potentially doubling overall throughput.

2. **Gate specialization is slow:** At 1000 steps, gate entropy remains near-maximal. Techniques like gate temperature annealing or auxiliary specialization losses could accelerate organelle differentiation.

3. **No custom CUDA kernels:** All operations use generic NNlib/CUDA.jl kernels. Fused Monarch realization + causal masking + matmul could provide significant speedup.

4. **Small scale evaluation:** All experiments are at ~5M parameters on a curated corpus. Scaling laws for Symbiogenesis remain unknown.

### Future Directions

1. **Neural ODE depth:** Replace discrete SymbioBlocks with a continuous-depth Neural ODE using DiffEqFlux.jl, enabling adaptive compute per token.

2. **Sparse organelle masking:** Dynamically disable organelles per block based on input difficulty, reducing compute for easy tokens.

3. **Cross-channel LongConv:** Replace per-channel LongConv with grouped convolutions that share kernels across related channels, reducing parameters while maintaining expressiveness.

4. **Scaling experiments:** Train 50M and 500M parameter Symbiogenesis models to understand scaling behavior of multi-organelle architectures.

5. **Gelation-guided training:** Use gelation detection to automatically adjust learning rate, batch size, or architectural parameters at phase transition boundaries.

---

## 12. Conclusion

Symbiogenesis demonstrates that multi-organelle sequence mixing is a viable alternative to softmax attention for small language models. By combining three complementary mixing mechanisms — local convolution, global structured mixing, and global dense filtering — through a learned per-channel gate, the architecture achieves competitive quality while providing rich inductive biases and 62% parameter reduction in sequence mixing.

The biological metaphor of symbiogenesis extends naturally: just as eukaryotic cells benefit from specialized organelles with different evolutionary origins, neural network blocks benefit from specialized mixing mechanisms with different mathematical properties. The OrganelleGate learns to exploit this complementarity, creating a "fused organism" that is more than the sum of its parts.

---

## References

1. Margulis, L. (1967). On the origin of mitosing cells. *Journal of Theoretical Biology*, 14(3), 225-274.
2. Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
3. Poli, M., et al. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. *ICML 2023*.
4. Gu, A., et al. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. *ICLR 2022*.
5. Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. *arXiv:2312.00752*.
6. Shazeer, N. (2020). GLU Variants Improve Transformer. *arXiv:2002.05202*.
7. Karpathy, A. (2023). nanoGPT. GitHub repository.
8. Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. *arXiv:2104.09864*.
9. Zhang, B., & Sennrich, R. (2019). Root Mean Square Layer Normalization. *NeurIPS 2019*.
10. Page, E. S. (1954). Continuous inspection schemes. *Biometrika*, 41(1/2), 100-115.
11. Kuramoto, Y. (1984). *Chemical Oscillations, Waves, and Turbulence*. Springer.
12. Flory, P. J. (1941). Molecular Size Distribution in Three Dimensional Polymers. *Journal of the American Chemical Society*, 63(11), 3083-3090.

---

## Appendix A: Parameter Count Details

### 5M Symbiogenesis (256d, 6 layers, 4 Monarch heads)

```
Embedding:                    2000 x 256 =   512,000  (tied with output)

Per block (x6):
  RMSNorm x 2:                256 x 2    =       512
  CausalConv:                 4 x 256    =     1,024
  Monarch (4 heads):     4 x 2 x 16^3   =    32,768
  LongConv:              256 x 256       =    65,536
  OrganelleGate:               3 x 256   =       768
  SwiGLU FFN:
    W1: 256 x 640              =           163,840
    V:  256 x 640              =           163,840
    W2: 640 x 256              =           163,840
  Block total:                             592,128

6 blocks:                                3,552,768
Final RMSNorm:                                  256
Embedding (tied):                           512,000

TOTAL:                                   4,065,024
```

### 5M Transformer (256d, 6 layers, 4 heads)

```
Embedding:                    2000 x 256 =   512,000  (tied with output)

Per block (x6):
  RMSNorm x 2:                256 x 2    =       512
  Attention (Q,K,V,O):   4 x 256 x 256  =   262,144
  SwiGLU FFN:
    W1, V, W2:           3 x 256 x 640  =   491,520
  Block total:                             754,176

6 blocks:                                4,525,056
Final RMSNorm:                                  256
Embedding (tied):                           512,000

TOTAL:                                   5,037,312
```

## Appendix B: Generated Text Samples

*[To be added after full training completion]*

---

*Built entirely in Julia. MIT License.*