---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- monarch-mixer
- sub-quadratic
- structured-matrix
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: MonarchSLM
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: LisaMegaWatts/philosophy-corpus
      name: philosophy-corpus
    metrics:
    - type: perplexity
      value: 38.4
      name: Val PPL
    - type: loss
      value: 3.65
      name: Val Loss
---

# MonarchSLM

A 4.98M-parameter decoder-only Monarch Mixer model trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. To our knowledge, this is the **first Monarch Mixer implementation in Julia**.

Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence mixing architectures.

## Model Family

MonarchSLM is the **Monarch Mixer variant** in a family of three architectures trained on the same data with matched parameter budgets:

| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| **MonarchSLM** | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

## Architecture

```
JuliaGPTModel (monarch)
+-- tok_emb: Embedding(2000 -> 256)        [weight-tied with output head]
+-- blocks x 8:
|   +-- ln1: RMSNorm(256)
|   +-- seq_mixer: MonarchSequenceMixer
|   |   +-- conv: CausalDepthwiseConv1d(256, kernel=4)
|   |   +-- monarchs: 8 x MonarchMatrix(T=256, p=16)
|   |   |   +-- L1: (16, 16, 16)   # block-diagonal factor 1
|   |   |   +-- L2: (16, 16, 16)   # block-diagonal factor 2
|   |   +-- gate: LearnedGate(256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```

### How Monarch Sequence Mixing Works

Monarch matrices (Dao et al., 2022) factorize a T x T mixing matrix as:

```
M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
```

where T = p^2 (here T = 256, p = 16), P is a reshape-transpose permutation, and L1, L2 are (p, p, p) tensors each holding p block-diagonal p x p matrices.

**Per-head forward pass:**

1. Realize the T x T mixing matrix M from the learned factors L1, L2
2. Apply a multiplicative 0/1 causal mask (lower triangular)
3. Multiply: each head's channel slice (32 channels) is mixed across the sequence dimension
4. A short causal convolution (kernel=4) provides complementary local n-gram context
5. Conv and Monarch outputs are combined via a learned sigmoid gate

**No positional encoding is needed** — the Monarch matrices learn position-dependent mixing patterns directly.

### Key Differences from Transformer

| Property | Transformer | Monarch Mixer |
|---|---|---|
| Sequence mixing | Dynamic (input-dependent attention) | Fixed (learned mixing matrices) |
| Position encoding | RoPE (separate) | None (implicit in Monarch matrices) |
| Complexity | O(T^2 * D) | O(T^(3/2)) realize + O(T^2) apply |
| Seq mixer params/block | 262K | **67K** (74% reduction) |
| Layers (same param budget) | 6 | **8** (extra layers from param savings) |

### Parameter Efficiency

The 74% reduction in sequence-mixing parameters (67K vs 262K per block) frees enough of the budget for 2 extra layers at the same total parameter count:

| Component | Params per block |
|---|---|
| CausalDepthwiseConv1d (K=4) | 1,024 |
| 8 x MonarchMatrix (2 x 16^3 each) | 65,536 |
| LearnedGate | 256 |
| **Total sequence mixing** | **66,816** |
| SwiGLU FFN | 491,520 |
| RMSNorm x 2 | 512 |
| **Block total** | 558,848 |

## Model Details

| Parameter | Value |
|---|---|
| Total parameters | 4,983,040 |
| Embedding dim | 256 |
| Layers | 8 |
| Monarch heads | 8 |
| Channels per head | 32 |
| Block size (p) | 16 (T = p^2 = 256) |
| Conv kernel size | 4 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | None (learned in Monarch matrices) |
| Weight tying | Yes |

## Training

| Setting | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 89 minutes |
| Throughput | ~19K tok/s |

### Training Curves

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 7.28 | 5.58 | 265.4 |
| 2,000 | 4.29 | 4.21 | 67.6 |
| 6,000 | 3.83 | 3.81 | 45.3 |
| 10,000 | 3.69 | 3.68 | 39.6 |
| 12,305 | 3.66 | **3.65** | **38.4** |

### Key Findings

- Monarch Mixer reaches **89% of the baseline Transformer's quality** (val PPL 38.4 vs 34.5) at the same parameter budget
- The ~4x parameter reduction in sequence mixing (67K vs 262K per block) pays for 2 extra layers
- The model learns coherent language generation using only fixed learned mixing patterns — no dynamic attention
- Throughput is 27% lower than the Transformer's due to Monarch matrix realization overhead
- Both models generate coherent English with dialogue, grammar, and philosophical content

## Relationship to Symbiogenesis

MonarchSLM's Monarch matrices serve as one of three "organelles" in the [Symbiogenesis](https://huggingface.co/LisaMegaWatts/SymbioSLM) architecture. In Symbiogenesis, Monarch provides the global sub-quadratic mixing component alongside CausalConv (local patterns) and LongConv (dense causal filtering), all fused via a learned per-channel OrganelleGate.

The biological metaphor: MonarchSLM is like a prokaryote — a single-organelle organism.
SymbioSLM is the eukaryote — multiple organelles fused into one cell.

## Implementation

Built entirely in Julia:

- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — `batched_mul` for Monarch realization, softmax, activations

Monarch matrix realization uses `NNlib.batched_mul` for the block-diagonal matrix multiplications, making it fully differentiable through Zygote. Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).

## Usage

### OpenAI-Compatible API

Served via the [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM):

```bash
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT
using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="monarch",
    vocab_size=vocab_size(tok),
    embed_dim=256,
    n_layers=8,
    n_heads=4,
    head_dim=64,
    n_monarch_heads=8,
    conv_kernel_size=4,
    ffn_mult=4,
    context_length=256,
    weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```

## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format, 74MB) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |

## Provenance

- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: Monarch Mixer (Fu et al., 2023), adapted for Julia/Lux.jl
- **First Julia implementation** of Monarch Mixer sequence mixing

## References

- Dao, T., et al. (2022). Monarch: Expressive Structured Matrices for Efficient and Accurate Training. *ICML 2022*.
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
- Karpathy, A. (2023). nanoGPT. GitHub repository.

## Citation

```bibtex
@misc{monarchslm2026,
  title={MonarchSLM: A Monarch Mixer Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MonarchSLM}
}
```

## License

MIT
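## Appendix: Monarch Realization Sketch

The factorization M = P^T * BlockDiag(L1) * P * BlockDiag(L2) described above can be made concrete in a few lines of base Julia. This is an illustrative sketch under stated assumptions, not the repository's implementation: `blockdiag`, `perm_matrix`, and `realize_monarch` are hypothetical names, a toy p = 4 is used (the model uses p = 16, T = 256), and the real code batches these products with `NNlib.batched_mul` rather than materializing dense block-diagonal matrices.

```julia
# Illustrative sketch of Monarch matrix realization (hypothetical helper names).

# Materialize a T x T block-diagonal matrix from p blocks of size p x p.
function blockdiag(L::Array{Float64,3})
    p = size(L, 1)
    B = zeros(p * p, p * p)
    for b in 1:p
        r = (b - 1) * p + 1 : b * p
        B[r, r] = L[:, :, b]
    end
    return B
end

# The reshape-transpose permutation P: index q*p + r maps to r*p + q.
function perm_matrix(p::Int)
    T = p * p
    P = zeros(T, T)
    for i in 0:T-1
        q, r = divrem(i, p)
        P[r * p + q + 1, i + 1] = 1.0
    end
    return P
end

# M = P^T * BlockDiag(L1) * P * BlockDiag(L2), followed by a multiplicative
# 0/1 causal mask that keeps only the lower triangle.
function realize_monarch(L1, L2)
    p = size(L1, 1)
    T = p * p
    P = perm_matrix(p)
    M = P' * blockdiag(L1) * P * blockdiag(L2)
    return [i >= j ? M[i, j] : 0.0 for i in 1:T, j in 1:T]
end

p = 4                         # toy block size; the model uses p = 16
L1 = randn(p, p, p)
L2 = randn(p, p, p)
M = realize_monarch(L1, L2)   # (16, 16) causal mixing matrix
x = randn(p * p)              # one channel's values across the sequence
y = M * x                     # mixed across the sequence dimension
```

Mixing a whole head is then `M * X` for its 32-channel slice `X`; since the strictly upper triangle of `M` is zero, position t never sees positions after it.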
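## Appendix: Conv-and-Gate Sketch

Steps 4 and 5 of the per-head forward pass (the kernel-4 causal depthwise convolution and the learned sigmoid gate) can be sketched the same way. This is again an assumption-laden illustration, not the repo's API: `causal_depthwise_conv`, the (channels, time) layout, and the per-channel gate-logit shape are guesses for exposition.

```julia
# Illustrative sketch: causal depthwise conv + learned sigmoid gate
# (hypothetical names and shapes, not the repository's code).
sigmoid(z) = 1 / (1 + exp(-z))

# Each channel has its own length-K filter and sees only the current step
# and the K-1 previous steps (implicit left zero-padding).
function causal_depthwise_conv(x::Matrix{Float64}, w::Matrix{Float64})
    D, T = size(x)            # (channels, time)
    K = size(w, 2)
    y = zeros(D, T)
    for t in 1:T, k in 1:K
        s = t - (K - k)       # tap k = K is the current step
        if s >= 1
            y[:, t] .+= w[:, k] .* x[:, s]
        end
    end
    return y
end

D, T, K = 8, 16, 4
x = randn(D, T)               # sequence of D-channel activations
w = randn(D, K)               # learned depthwise filters
g = randn(D)                  # learned per-channel gate logits (assumed shape)
conv_out = causal_depthwise_conv(x, w)
monarch_out = randn(D, T)     # stand-in for the Monarch path's output
y = sigmoid.(g) .* conv_out .+ (1 .- sigmoid.(g)) .* monarch_out
```

The gate lets each channel interpolate between local n-gram context from the conv path and global mixing from the Monarch path.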