LisaMegaWatts committed
Commit 3bf5aa2 · verified · 1 Parent(s): e4c3c9e

Add model card with architecture details, provenance, and training metrics

Files changed (1):
  1. README.md +201 -114

README.md CHANGED
@@ -1,176 +1,263 @@
  ---
  language:
- - en
- library_name: julia
  license: mit
- pipeline_tag: text-generation
  tags:
- - philosophy
- - classical-texts
- - julia
- - lux
- - bpe
- - monarch-mixer
- - rmsnorm
- - swiglu
- - small-language-model
- - openai-compatible
- - chinchilla
- - sub-quadratic
  datasets:
- - LisaMegaWatts/philosophy-corpus
  ---
- # MonarchSLM — Inference Server Artifacts
-
- Serving-ready artifacts for the [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM), an OpenAI-compatible inference endpoint for the 5M parameter Monarch Mixer model.
-
- For full training details, loss curves, architecture comparison, and code see the canonical model repo: **[LisaMegaWatts/julia-slm](https://huggingface.co/LisaMegaWatts/julia-slm)**.
-
- ## Model Summary
-
- A 4,983,040 parameter decoder-only model using **Monarch Mixer** sequence mixing (Dao et al., 2023) trained to Chinchilla-optimal (100M tokens at 20 tokens/param) on classical philosophy and liberal arts texts. First known Julia implementation of Monarch Mixer for language modeling.
-
- ### Architecture
-
  ```
- JuliaGPTModel
- ├── tok_emb: Embedding(2000 → 256)      # weight-tied with output head
- ├── blocks × 8:
- │   ├── ln1: RMSNorm(256)
- │   ├── seq_mixer: MonarchSequenceMixer
- │   │   ├── conv: CausalDepthwiseConv1d(256, kernel=4)
- │   │   ├── monarchs × 8: MonarchMatrix(256, L1/L2 ∈ ℝ^{16×16×16})
- │   │   └── gate: LearnedGate(256)
- │   ├── ln2: RMSNorm(256)
- │   └── ffn: SwiGLU(256 → 640 → 256)
- ├── ln_f: RMSNorm(256)
- └── head: TiedEmbeddingHead (2000,)     # shares tok_emb weights
  ```
- ### Monarch Matrix
-
- A Monarch matrix of size T×T (T=p²=256, p=16) factorizes as:
-
  ```
- M = P · BlockDiag(L1) · P · BlockDiag(L2)
  ```
-
- - L1, L2: p block-diagonal matrices of size p×p
- - P: reshape-transpose permutation
- - **Parameters per head**: 2p³ = 8,192 (vs 65,536 for dense T²)
- | Component | Detail |
  |---|---|
- | Parameters | 4,983,040 |
  | Embedding dim | 256 |
  | Layers | 8 |
- | Monarch heads | 8 (each mixing 32 channels over 256 positions) |
- | Conv kernel | 4 (causal depthwise) |
- | FFN multiplier | 4x (SwiGLU, hidden 640) |
  | Context length | 256 tokens |
- | Normalization | RMSNorm (pre-norm) |
  | Weight tying | Yes |
- | Bias | None |
-
- ### Training
-
- | Metric | Value |
  |---|---|
- | Optimizer | AdamW (lr=6e-4, min_lr=6e-5, wd=0.1) |
- | Schedule | Cosine decay with 500-step warmup |
- | Precision | Mixed F16/F32 |
  | Batch size | 32 |
- | Training steps | 12,305 |
- | Tokens processed | ~100M |
- | Training time | 89 min on RTX 3060 12GB |
  | Throughput | ~19K tok/s |
- | Final val loss | 3.65 |
- | Final val PPL | 38.4 |
- ### Loss Curve
-
  | Step | Train Loss | Val Loss | Val PPL |
- |---|---|---|---|
- | 500 | 6.31 | 5.26 | 192.4 |
- | 2,000 | 4.15 | 4.15 | 63.4 |
- | 6,000 | 3.77 | 3.79 | 44.3 |
- | 10,000 | 3.62 | 3.67 | 39.3 |
- | 12,305 | 3.62 | 3.65 | 38.4 |
-
- ### Comparison with Transformer Baseline
-
- | Metric | Transformer | Monarch Mixer |
- |---|---|---|
- | Parameters | 5,037,312 | 4,983,040 |
- | Blocks | 6 | 8 |
- | Val Loss | **3.54** | 3.65 |
- | Val PPL | **34.5** | 38.4 |
- | Training time | 66 min | 89 min |
- | Seq mixing params/block | 262K | 67K (4x fewer) |
-
- Monarch reaches **94% of baseline quality** while using **4x fewer parameters per block** in sequence mixing, enabling 8 blocks instead of 6.
-
- ### Tokenizer
-
- ByteLevel BPE with 2,000 subword tokens, trained on the philosophy corpus. Tokenizer files (`vocab.json`, `merges.txt`) are sourced from the [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) dataset.
-
- ### Training Data
-
- [LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus): 981 source texts (BookCorpus, WikiText-103, PG-19, classical philosophy) processed through a custom text pipeline with deduplication and quality scoring.
-
- - **Train tokens**: 794.9M (pre-encoded as `train.bin`)
- - **Val tokens**: 88.2M (pre-encoded as `val.bin`)
- - **Sources**: Aristotle, Plato, Cicero, Seneca, Marcus Aurelius, Epictetus, Euclid, Kant, Spinoza, Nietzsche, and more
- ## Files
-
- | File | Description |
- |---|---|
- | `final.jld2` | Model parameters (JLD2 format, 74MB) |
- | `config.toml` | Architecture config (5m-monarch) |
- | `vocab.json` | BPE vocabulary (2000 tokens, dict format) |
- | `merges.txt` | BPE merge rules |
-
- ## Inference API
-
- The [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM) serves this model via an OpenAI-compatible API with SSE streaming, temperature, top-k, and top-p sampling. CPU-only inference using pure NNlib (no Lux dependency at runtime).
-
  ```bash
- # Streaming
  curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
    -H "Content-Type: application/json" \
-   -d '{"messages": [{"role": "user", "content": "the nature of"}], "stream": true, "temperature": 0.8, "top_k": 40}'
-
- # Non-streaming
- curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{"messages": [{"role": "user", "content": "the nature of"}], "max_tokens": 200}'
  ```
-
- ### Endpoints
-
- - `GET /` — Health check and model info
- - `GET /v1/models` — List available models
- - `POST /v1/chat/completions` — Generate text (streaming + non-streaming)
-
- ## Framework
-
- Built with:
- - [Lux.jl](https://github.com/LuxDL/Lux.jl) — Explicit-parameter neural networks (training)
- - [NNlib.jl](https://github.com/FluxML/NNlib.jl) — batched_mul, softmax, activations (inference)
- - [Zygote.jl](https://github.com/FluxML/Zygote.jl) — Automatic differentiation (training)
- - [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) — GPU acceleration (training)
-
  ## References
-
- - [Monarch Mixer (Dao et al., 2023)](https://arxiv.org/abs/2310.12109) — Sub-quadratic GEMM-based architecture
- - [Chinchilla (Hoffmann et al., 2022)](https://arxiv.org/abs/2203.15556) — Compute-optimal training scaling
-
- ## Related
-
- - **[LisaMegaWatts/julia-slm](https://huggingface.co/LisaMegaWatts/julia-slm)** — Canonical model repo (both transformer and monarch variants)
- - **[LisaMegaWatts/JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM)** — Transformer variant inference artifacts
- - **[JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM)** — Transformer inference endpoint
- - **[MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM)** — This model's inference endpoint
- - **[LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus)** — Training dataset + tokenizer
  ---
  language:
+ - en
  license: mit
+ library_name: lux
  tags:
+ - julia
+ - lux
+ - slm
+ - philosophy
+ - monarch-mixer
+ - sub-quadratic
+ - structured-matrix
+ - rmsnorm
+ - swiglu
+ - bpe
+ - text-generation
+ pipeline_tag: text-generation
  datasets:
+ - LisaMegaWatts/philosophy-corpus
+ model-index:
+ - name: MonarchSLM
+   results:
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       type: LisaMegaWatts/philosophy-corpus
+       name: philosophy-corpus
+     metrics:
+     - type: perplexity
+       value: 38.4
+       name: Val PPL
+     - type: loss
+       value: 3.65
+       name: Val Loss
  ---
+ # MonarchSLM
+
+ A 4.98M parameter decoder-only Monarch Mixer model trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. To our knowledge, this is the **first Monarch Mixer implementation in Julia**.
+
+ Part of the [Julia SLM](https://github.com/buildwithbooks/julia-slm) family of models exploring alternative sequence-mixing architectures.
+
+ ## Model Family
+
+ MonarchSLM is the **Monarch Mixer variant** in a family of three architectures trained on the same data with matched parameter budgets:
+
+ | Model | Architecture | Sequence Mixing | Val PPL | Params |
+ |---|---|---|---|---|
+ | [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
+ | **MonarchSLM** | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
+ | [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |
+
+ ## Architecture
+
  ```
+ JuliaGPTModel (monarch)
+ +-- tok_emb: Embedding(2000 -> 256)        [weight-tied with output head]
+ +-- blocks x 8:
+ |   +-- ln1: RMSNorm(256)
+ |   +-- seq_mixer: MonarchSequenceMixer
+ |   |   +-- conv: CausalDepthwiseConv1d(256, kernel=4)
+ |   |   +-- monarchs: 8 x MonarchMatrix(T=256, p=16)
+ |   |   |   +-- L1: (16, 16, 16)   # block-diagonal factor 1
+ |   |   |   +-- L2: (16, 16, 16)   # block-diagonal factor 2
+ |   |   +-- gate: LearnedGate(256)
+ |   +-- ln2: RMSNorm(256)
+ |   +-- ffn: SwiGLU(256 -> 640 -> 256)
+ +-- ln_f: RMSNorm(256)
+ +-- head: TiedEmbeddingHead -> (2000,)
  ```
+
+ ### How Monarch Sequence Mixing Works
+
+ Monarch matrices (Fu et al., 2023) factorize a T x T mixing matrix as:
+
  ```
+ M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
  ```
+
+ where T = p^2 (T=256, p=16), P is a reshape-transpose permutation, and L1, L2 are (p, p, p) tensors of p block-diagonal p x p matrices.
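
The factorization can be checked numerically. This NumPy sketch (illustration only, not the repo's Julia code; random stand-ins for the learned L1/L2) realizes M and confirms the per-head parameter count of 2p^3 = 8,192:

```python
import numpy as np

p, T = 16, 256  # T = p^2, as in the model
rng = np.random.default_rng(0)

# L1, L2: p blocks of p x p each -> two (p, p, p) learned factor tensors
L1 = rng.standard_normal((p, p, p))
L2 = rng.standard_normal((p, p, p))

def block_diag(L):
    """Assemble p blocks of p x p into a block-diagonal T x T matrix."""
    M = np.zeros((p * p, p * p))
    for i in range(p):
        M[i * p:(i + 1) * p, i * p:(i + 1) * p] = L[i]
    return M

# P: the reshape-transpose permutation, built by permuting identity rows
P = np.eye(T).reshape(p, p, T).transpose(1, 0, 2).reshape(T, T)

# M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
M = P.T @ block_diag(L1) @ P @ block_diag(L2)

assert M.shape == (T, T)
assert L1.size + L2.size == 2 * p**3 == 8192  # learned params per head
```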
+
+ **Per-head forward pass:**
+
+ 1. Realize the T x T mixing matrix M from the learned factors L1, L2
+ 2. Apply a multiplicative 0/1 causal mask (lower triangular)
+ 3. Multiply: each head's channel slice (32 channels) is mixed across the sequence dimension
+ 4. A short causal convolution (kernel=4) provides complementary local n-gram context
+ 5. Combine the conv and Monarch outputs via a learned sigmoid gate
+
+ **No positional encoding needed** — the Monarch matrices learn position-dependent mixing patterns directly.
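
A minimal NumPy sketch of steps 1-5 for a single head. The shapes follow the description above, but the exact conv parameterization and gating form are assumptions, not the repo's Julia implementation:

```python
import numpy as np

T, C, K = 256, 32, 4   # sequence length, channels per head, conv kernel
rng = np.random.default_rng(1)

x = rng.standard_normal((T, C))    # one head's channel slice
M = rng.standard_normal((T, T))    # stand-in for the realized Monarch matrix (step 1)

# Step 2: multiplicative 0/1 causal mask (lower triangular)
M_causal = M * np.tril(np.ones((T, T)))

# Step 3: mix across the sequence dimension
mixed = M_causal @ x               # (T, C)

# Step 4: short causal depthwise conv for local n-gram context
w = rng.standard_normal((K, C))                 # one kernel tap per channel
x_pad = np.vstack([np.zeros((K - 1, C)), x])    # left-pad so output stays causal
conv = sum(w[k] * x_pad[K - 1 + np.arange(T) - k] for k in range(K))

# Step 5: learned per-channel sigmoid gate blends conv and Monarch outputs
g = 1.0 / (1.0 + np.exp(-rng.standard_normal(C)))
out = g * mixed + (1.0 - g) * conv
assert out.shape == (T, C)
```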
+
+ ### Key Differences from Transformer
+
+ | Property | Transformer | Monarch Mixer |
+ |---|---|---|
+ | Sequence mixing | Dynamic (input-dependent attention) | Fixed (learned mixing matrices) |
+ | Position encoding | RoPE (separate) | None (implicit in Monarch matrices) |
+ | Complexity | O(T^2 * D) | O(T^(3/2)) realize + O(T^2) apply |
+ | Seq mixer params/block | 262K | **67K** (74% reduction) |
+ | Layers (same param budget) | 6 | **8** (extra layers from param savings) |
+
+ ### Parameter Efficiency
+
+ The 74% reduction in sequence mixing parameters (67K vs 262K per block) enables 2 extra layers at the same total parameter budget:
+
+ | Component | Params per block |
+ |---|---|
+ | CausalDepthwiseConv1d (K=4) | 1,024 |
+ | 8 x MonarchMatrix (2 x 16^3 each) | 65,536 |
+ | LearnedGate | 256 |
+ | **Total sequence mixing** | **66,816** |
+ | SwiGLU FFN | 491,520 |
+ | RMSNorm x 2 | 512 |
+ | **Block total** | **558,848** |
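
As a sanity check, the tally above reproduces the reported 4,983,040 total (assuming the tied embedding is counted once and the final RMSNorm contributes 256 params):

```python
# Per-block parameter tally for the Monarch Mixer block
conv = 256 * 4              # CausalDepthwiseConv1d: one kernel-4 tap set per channel
monarch = 8 * 2 * 16**3     # 8 heads x two (16, 16, 16) factor tensors
gate = 256                  # LearnedGate, one weight per channel
seq_mixing = conv + monarch + gate
assert seq_mixing == 66_816

ffn = 3 * 256 * 640         # SwiGLU: gate, up, and down projections, no bias
norms = 2 * 256             # two RMSNorms (scale only)
block = seq_mixing + ffn + norms
assert block == 558_848

# Whole model: tied embedding counted once, 8 blocks, final RMSNorm
total = 2000 * 256 + 8 * block + 256
assert total == 4_983_040
```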
+
+ ## Model Details
+
+ | Parameter | Value |
  |---|---|
+ | Total parameters | 4,983,040 |
  | Embedding dim | 256 |
  | Layers | 8 |
+ | Monarch heads | 8 |
+ | Channels per head | 32 |
+ | Block size (p) | 16 (T = p^2 = 256) |
+ | Conv kernel size | 4 |
+ | FFN hidden dim | 640 |
  | Context length | 256 tokens |
+ | Vocabulary | 2,000 (ByteLevel BPE) |
+ | Position encoding | None (learned in Monarch matrices) |
  | Weight tying | Yes |
+
+ ## Training
+
+ | Setting | Value |
  |---|---|
+ | Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
+ | Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
+ | Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
+ | Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
+ | Warmup | 500 steps (linear) |
+ | Max steps | 12,305 |
  | Batch size | 32 |
+ | Gradient clipping | 1.0 (global norm) |
+ | Precision | Float16 AMP (Float32 master weights) |
+ | Hardware | NVIDIA RTX 3060 12GB |
+ | Training time | 89 minutes |
  | Throughput | ~19K tok/s |
+
+ ### Training Curves
+
  | Step | Train Loss | Val Loss | Val PPL |
+ |---|---|---|---|
+ | 500 | 7.28 | 5.58 | 265.4 |
+ | 2,000 | 4.29 | 4.21 | 67.6 |
+ | 6,000 | 3.83 | 3.81 | 45.3 |
+ | 10,000 | 3.69 | 3.68 | 39.6 |
+ | 12,305 | 3.66 | **3.65** | **38.4** |
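
Val PPL is simply the exponential of the val loss (assuming the loss is mean per-token cross-entropy in nats), which the table rows confirm to within rounding:

```python
import math

# perplexity = exp(mean cross-entropy loss in nats)
curve = [(5.58, 265.4), (4.21, 67.6), (3.81, 45.3), (3.68, 39.6), (3.65, 38.4)]
for loss, ppl in curve:
    # reported PPL agrees with exp(loss) to within ~1% (loss is rounded to 2 dp)
    assert abs(math.exp(loss) - ppl) / ppl < 0.01
```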
+
+ ### Key Findings
+
+ - Monarch Mixer achieves **89% of the baseline Transformer quality** (Val PPL 38.4 vs 34.5) at the same parameter budget
+ - The 4x parameter reduction in sequence mixing (67K vs 262K per block) enables 2 extra layers
+ - The model learns coherent language generation using only fixed learned mixing patterns — no dynamic attention
+ - Throughput is 27% lower than the Transformer's due to Monarch matrix realization overhead
+ - Both models generate coherent English with dialogue, grammar, and philosophical content
+
+ ## Relationship to Symbiogenesis
+
+ MonarchSLM's Monarch matrices serve as one of three "organelles" in the [Symbiogenesis](https://huggingface.co/LisaMegaWatts/SymbioSLM) architecture. In Symbiogenesis, Monarch provides the global sub-quadratic mixing component alongside CausalConv (local patterns) and LongConv (dense causal filtering), all fused via a learned per-channel OrganelleGate.
+
+ The biological metaphor: MonarchSLM is the prokaryote, a single-organelle organism; SymbioSLM is the eukaryote, multiple organelles fused into one cell.
+
+ ## Implementation
+
+ Built entirely in Julia:
+
+ - **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
+ - **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
+ - **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
+ - **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — batched_mul for Monarch realization, softmax, activations
+
+ Monarch matrix realization uses `NNlib.batched_mul` for the block-diagonal matrix multiplications, making it fully differentiable through Zygote.
+
+ Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
+
+ ## Usage
+
+ ### OpenAI-Compatible API
+
+ Served via the [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM):
+
  ```bash
  curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
    -H "Content-Type: application/json" \
+   -d '{
+         "messages": [{"role": "user", "content": "the nature of"}],
+         "max_tokens": 200,
+         "temperature": 0.8,
+         "top_k": 40
+       }'
+ ```
+
+ ### Load in Julia
+
+ ```julia
+ using Pkg; Pkg.activate("julia-slm")
+ include("src/JuliaGPT.jl")
+ using .JuliaGPT; using .JuliaGPT: Lux
+
+ tok = BPETokenizer("vocab.json", "merges.txt")
+ ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())
+
+ model = create_model(ModelConfig(;
+     arch="monarch", vocab_size=vocab_size(tok),
+     embed_dim=256, n_layers=8, n_heads=4, head_dim=64,
+     n_monarch_heads=8, conv_kernel_size=4,
+     ffn_mult=4, context_length=256, weight_tying=true,
+ ))
+
+ text = generate(model, ps, st, tok, "the nature of ";
+     max_new_tokens=200, temperature=0.8, top_k=40)
  ```
+
+ ## Files
+
+ | File | Description |
+ |---|---|
+ | `final.jld2` | Trained model parameters (JLD2 format, 74MB) |
+ | `config.toml` | Model architecture configuration |
+ | `vocab.json` | BPE vocabulary (2000 tokens) |
+ | `merges.txt` | BPE merge rules |
+
+ ## Provenance
+
+ - **Author**: LisaMegaWatts
+ - **Training code**: [buildwithbooks/julia-slm](https://github.com/buildwithbooks/julia-slm)
+ - **Data pipeline**: [buildwithbooks/text-pipeline](https://github.com/buildwithbooks/text-pipeline)
+ - **Training date**: February 2026
+ - **Architecture reference**: Monarch Mixer (Fu et al., 2023), adapted for Julia/Lux.jl
+ - **First Julia implementation** of Monarch Mixer sequence mixing
+
  ## References
+
+ - Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
+ - Karpathy, A. (2023). nanoGPT. GitHub repository.
+
+ ## Citation
+
+ ```bibtex
+ @misc{monarchslm2026,
+   title={MonarchSLM: A Monarch Mixer Language Model in Pure Julia},
+   author={LisaMegaWatts},
+   year={2026},
+   url={https://huggingface.co/LisaMegaWatts/MonarchSLM}
+ }
+ ```
+
+ ## License
+
+ MIT