---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- monarch-mixer
- sub-quadratic
- structured-matrix
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: MonarchSLM
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: LisaMegaWatts/philosophy-corpus
      name: philosophy-corpus
    metrics:
    - type: perplexity
      value: 38.4
      name: Val PPL
    - type: loss
      value: 3.65
      name: Val Loss
---

# MonarchSLM

A 4.98M parameter decoder-only Monarch Mixer model trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. To our knowledge, this is the **first Monarch Mixer implementation in Julia**.

Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence mixing architectures.

## Model Family

MonarchSLM is the **Monarch Mixer variant** in a family of three architectures trained on the same data with matched parameter budgets:

| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| **MonarchSLM** | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

## Architecture

```
JuliaGPTModel (monarch)
+-- tok_emb: Embedding(2000 -> 256) [weight-tied with output head]
+-- blocks x 8:
|   +-- ln1: RMSNorm(256)
|   +-- seq_mixer: MonarchSequenceMixer
|   |   +-- conv: CausalDepthwiseConv1d(256, kernel=4)
|   |   +-- monarchs: 8 x MonarchMatrix(T=256, p=16)
|   |   |   +-- L1: (16, 16, 16)  # block-diagonal factor 1
|   |   |   +-- L2: (16, 16, 16)  # block-diagonal factor 2
|   |   +-- gate: LearnedGate(256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```

### How Monarch Sequence Mixing Works

Monarch matrices (Dao et al., 2023) factorize a T x T mixing matrix as:

```
M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
```

where T = p^2 (here T = 256, p = 16), P is a reshape-transpose permutation, and L1, L2 are (p, p, p) tensors, each holding the p learned p x p blocks of a block-diagonal matrix.

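A minimal sketch of this factorization, written in NumPy for brevity rather than the model's Julia, at toy sizes (p = 4, T = 16 instead of the model's p = 16, T = 256; random arrays stand in for the learned factors):

```python
import numpy as np

p = 4
T = p * p  # T = p^2 (the model uses p = 16, T = 256)

rng = np.random.default_rng(0)
L1 = rng.standard_normal((p, p, p))  # p blocks of p x p: block-diagonal factor 1
L2 = rng.standard_normal((p, p, p))  # p blocks of p x p: block-diagonal factor 2

def block_diag(L):
    """Assemble a (T, T) block-diagonal matrix from p blocks of shape (p, p)."""
    out = np.zeros((T, T))
    for i in range(p):
        out[i * p:(i + 1) * p, i * p:(i + 1) * p] = L[i]
    return out

# P is the reshape-transpose permutation: x -> reshape(p, p) -> transpose -> flatten
P = np.zeros((T, T))
for idx in range(T):
    i, j = divmod(idx, p)
    P[j * p + i, idx] = 1.0

# M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
M = P.T @ block_diag(L1) @ P @ block_diag(L2)

assert M.shape == (T, T)
assert np.allclose(P @ P.T, np.eye(T))  # P is a permutation matrix
```

The factors hold only 2 x p^3 = 2 x T^(3/2) parameters (128 here) versus T^2 = 256 for a dense mixing matrix, which is where the sub-quadratic parameter counts in the tables below come from.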
**Per-head forward pass:**

1. Realize the T x T mixing matrix M from the learned factors L1, L2
2. Apply a multiplicative 0/1 causal mask (lower triangular)
3. Multiply: each head's channel slice (32 channels) is mixed across the sequence dimension
4. Run a short causal convolution (kernel=4) for complementary local n-gram context
5. Combine the conv and Monarch outputs via a learned sigmoid gate

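The five steps above can be sketched in NumPy at toy sizes (random arrays play the role of the learned Monarch matrices, conv kernel, and gate, and the gate's blending convention is an assumption; the real model uses T = 256, D = 256, 8 heads):

```python
import numpy as np

T, D, H, K = 16, 8, 2, 4     # toy sequence length, channels, heads, conv kernel
dh = D // H                  # channels per head (32 in the real model)
rng = np.random.default_rng(1)

x = rng.standard_normal((T, D))

# Step 1: realized T x T mixing matrices, one per head (random stand-ins here)
M = rng.standard_normal((H, T, T))

# Step 2: multiplicative 0/1 causal mask (lower triangular)
mask = np.tril(np.ones((T, T)))

# Step 3: mix each head's channel slice across the sequence dimension
mixed = np.empty_like(x)
for h in range(H):
    mixed[:, h * dh:(h + 1) * dh] = (M[h] * mask) @ x[:, h * dh:(h + 1) * dh]

# Step 4: short causal depthwise conv for local n-gram context
kernel = rng.standard_normal((K, D))
padded = np.vstack([np.zeros((K - 1, D)), x])  # left-pad: position t sees t-K+1..t
conv = np.stack([(padded[t:t + K] * kernel).sum(axis=0) for t in range(T)])

# Step 5: blend the two paths with a per-channel sigmoid gate
g = 1.0 / (1.0 + np.exp(-rng.standard_normal(D)))
out = g * mixed + (1.0 - g) * conv
assert out.shape == (T, D)
```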
**No positional encoding needed** — the Monarch matrices learn position-dependent mixing patterns directly.

### Key Differences from Transformer

| Property | Transformer | Monarch Mixer |
|---|---|---|
| Sequence mixing | Dynamic (input-dependent attention) | Fixed (learned mixing matrices) |
| Position encoding | RoPE (separate) | None (implicit in Monarch matrices) |
| Complexity | O(T^2 * D) | O(T^(3/2)) realize + O(T^2) apply |
| Seq mixer params/block | 262K | **67K** (74% reduction) |
| Layers (same param budget) | 6 | **8** (extra layers from param savings) |

### Parameter Efficiency

The 74% reduction in sequence mixing parameters (67K vs 262K per block) enables 2 extra layers at the same total parameter budget:

| Component | Params per block |
|---|---|
| CausalDepthwiseConv1d (K=4) | 1,024 |
| 8 x MonarchMatrix (2 x 16^3 each) | 65,536 |
| LearnedGate | 256 |
| **Total sequence mixing** | **66,816** |
| SwiGLU FFN | 491,520 |
| RMSNorm x 2 | 512 |
| **Block total** | 558,848 |

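The per-block arithmetic above can be verified directly; the 3 x D x F in the SwiGLU line assumes the usual bias-free gate, up, and down projections, which matches the table's 491,520:

```python
# Per-block parameter arithmetic for D=256, kernel K=4, 8 heads, p=16, FFN hidden F=640
D, K, H, p, F = 256, 4, 8, 16, 640

conv = D * K                     # depthwise conv: one K-tap filter per channel
monarch = H * 2 * p**3           # 8 heads x two (p, p, p) factors
gate = D                         # one gate weight per channel
seq_mix = conv + monarch + gate

swiglu = 3 * D * F               # gate, up, and down projections (assumed bias-free)
norms = 2 * D                    # two RMSNorm scale vectors

assert seq_mix == 66_816
assert seq_mix + swiglu + norms == 558_848
```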
## Model Details

| Parameter | Value |
|---|---|
| Total parameters | 4,983,040 |
| Embedding dim | 256 |
| Layers | 8 |
| Monarch heads | 8 |
| Channels per head | 32 |
| Block size (p) | 16 (T = p^2 = 256) |
| Conv kernel size | 4 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | None (learned in Monarch matrices) |
| Weight tying | Yes |

## Training

| Setting | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 89 minutes |
| Throughput | ~19K tok/s |

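As a back-of-the-envelope check on the "~100M" figure (pure arithmetic from the rows above, not from the training logs):

```python
# steps x batch x context tokens, against the ~100M / 20 tok-per-param claims
steps, batch, ctx = 12_305, 32, 256
params = 4_983_040

tokens = steps * batch * ctx
assert tokens == 100_802_560          # ~100M train tokens
assert round(tokens / params) == 20   # roughly Chinchilla's 20 tokens per parameter
```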
### Training Curves

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 7.28 | 5.58 | 265.4 |
| 2,000 | 4.29 | 4.21 | 67.6 |
| 6,000 | 3.83 | 3.81 | 45.3 |
| 10,000 | 3.69 | 3.68 | 39.6 |
| 12,305 | 3.66 | **3.65** | **38.4** |

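Validation perplexity here is exp(validation loss), which the table's pairs confirm to within rounding:

```python
import math

# (val loss, val PPL) pairs from the training-curves table; PPL = exp(loss)
pairs = [(5.58, 265.4), (4.21, 67.6), (3.81, 45.3), (3.68, 39.6), (3.65, 38.4)]
for loss, ppl in pairs:
    assert abs(math.exp(loss) - ppl) / ppl < 0.01  # agrees to within 1%
```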
### Key Findings

- Monarch Mixer reaches **89% of the baseline Transformer's quality** (val PPL 38.4 vs 34.5) at the same parameter budget
- The ~4x parameter reduction in sequence mixing (67K vs 262K per block) enables 2 extra layers
- The model learns coherent language generation using only fixed learned mixing patterns — no dynamic attention
- Throughput is 27% lower than the Transformer's due to Monarch matrix realization overhead
- Both models generate coherent English with dialogue, grammar, and philosophical content

## Relationship to Symbiogenesis

MonarchSLM's Monarch matrices serve as one of three "organelles" in the [Symbiogenesis](https://huggingface.co/LisaMegaWatts/SymbioSLM) architecture. In Symbiogenesis, Monarch provides the global sub-quadratic mixing component alongside CausalConv (local patterns) and LongConv (dense causal filtering), all fused via a learned per-channel OrganelleGate.

The biological metaphor: MonarchSLM is like a prokaryote — a single-organelle organism. SymbioSLM is the eukaryote — multiple organelles fused into one cell.

## Implementation

Built entirely in Julia:

- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — `batched_mul` for Monarch realization, softmax, activations

Monarch matrix realization uses `NNlib.batched_mul` for the block-diagonal matrix multiplications, making it fully differentiable through Zygote.

Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).

## Usage

### OpenAI-Compatible API

Served via the [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM):

```bash
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="monarch", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=8, n_heads=4, head_dim=64,
    n_monarch_heads=8, conv_kernel_size=4,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```

## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format, 74MB) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |

## Provenance

- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: Monarch Mixer (Dao et al., 2023), adapted for Julia/Lux.jl
- **First Julia implementation** of Monarch Mixer sequence mixing

## References

- Dao, T., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
- Karpathy, A. (2023). nanoGPT. GitHub repository.

## Citation

```bibtex
@misc{monarchslm2026,
  title={MonarchSLM: A Monarch Mixer Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MonarchSLM}
}
```

## License

MIT