---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- monarch-mixer
- sub-quadratic
- structured-matrix
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: MonarchSLM
results:
- task:
type: text-generation
name: Text Generation
dataset:
type: LisaMegaWatts/philosophy-corpus
name: philosophy-corpus
metrics:
- type: perplexity
value: 38.4
name: Val PPL
- type: loss
value: 3.65
name: Val Loss
---
# MonarchSLM
A 4.98M parameter decoder-only Monarch Mixer model trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. To our knowledge, this is the **first Monarch Mixer implementation in Julia**.
Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence mixing architectures.
## Model Family
MonarchSLM is the **Monarch Mixer variant** in a family of three architectures trained on the same data with matched parameter budgets:
| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| **MonarchSLM** | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |
## Architecture
```
JuliaGPTModel (monarch)
+-- tok_emb: Embedding(2000 -> 256) [weight-tied with output head]
+-- blocks x 8:
| +-- ln1: RMSNorm(256)
| +-- seq_mixer: MonarchSequenceMixer
| | +-- conv: CausalDepthwiseConv1d(256, kernel=4)
| | +-- monarchs: 8 x MonarchMatrix(T=256, p=16)
| | | +-- L1: (16, 16, 16) # block-diagonal factor 1
| | | +-- L2: (16, 16, 16) # block-diagonal factor 2
| | +-- gate: LearnedGate(256)
| +-- ln2: RMSNorm(256)
| +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```
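The RMSNorm and SwiGLU pieces above can be sketched in plain Julia. This is a minimal standalone Float64 reference with hypothetical names, not the actual Lux.jl implementation; the learnable RMSNorm scale is omitted for brevity.

```julia
# RMSNorm: scale each column (token) to unit root-mean-square.
rmsnorm(x; eps=1e-6) = x ./ sqrt.(sum(abs2, x; dims=1) ./ size(x, 1) .+ eps)

swish(z) = z ./ (1 .+ exp.(-z))   # SiLU, the "S" in SwiGLU

# SwiGLU FFN: D -> H (gated) -> D, three weight matrices, no biases.
struct SwiGLU; W1; W3; W2; end
(f::SwiGLU)(x) = f.W2 * (swish(f.W1 * x) .* (f.W3 * x))

D, H, T = 256, 640, 8                         # embed dim, FFN hidden dim, toy length
ffn = SwiGLU(randn(H, D), randn(H, D), randn(D, H))
x = randn(D, T)                               # (channels, sequence)

h = x .+ ffn(rmsnorm(x))                      # pre-norm residual FFN sub-block
```

The gated form is why the hidden dim is 640 rather than the usual 4x expansion: SwiGLU spends parameters on three matrices instead of two at the same budget.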
### How Monarch Sequence Mixing Works
Monarch matrices (Dao et al., 2022) factorize a T x T mixing matrix as:
```
M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
```
where T = p^2 (here T = 256, p = 16), P is a reshape-transpose permutation, and L1 and L2 are (p, p, p) tensors, each holding p dense p x p blocks that form one block-diagonal factor.
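A dense realization of this factorization can be sketched directly from the formula. This is a standalone Float64 reference with random stand-in factors; the actual implementation batches the block-diagonal products with `NNlib.batched_mul`.

```julia
using LinearAlgebra

p = 16; T = p^2                       # T = 256

# Assemble a dense T x T block-diagonal matrix from p blocks of size p x p.
function blockdiag(L::Array{Float64,3})
    B = zeros(T, T)
    for b in 1:p
        r = (b - 1) * p + 1 : b * p
        B[r, r] = L[:, :, b]
    end
    B
end

# Reshape-transpose permutation: element (i, j) of a p x p grid moves to (j, i).
P = zeros(T, T)
for i in 0:p-1, j in 0:p-1
    P[j * p + i + 1, i * p + j + 1] = 1.0
end

L1, L2 = randn(p, p, p), randn(p, p, p)   # stand-ins for the learned factors
M = P' * blockdiag(L1) * P * blockdiag(L2)
```

Each factor touches only p^3 = T^(3/2) entries, which is where the sub-quadratic parameter count comes from.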
**Per-head forward pass:**
1. Realize the T x T mixing matrix M from learned factors L1, L2
2. Apply a multiplicative 0/1 causal mask (lower triangular)
3. Multiply: each head's channel slice (32 channels) is mixed across the sequence dimension
4. A short causal convolution (kernel=4) provides complementary local n-gram context
5. Conv and Monarch outputs are combined via a learned sigmoid gate
**No positional encoding needed** — the Monarch matrices learn position-dependent mixing patterns directly.
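Steps 2-5 can be sketched given one head's realized matrix. All inputs here are random stand-ins, and which branch receives `g` versus `1 - g` is an assumption, not confirmed by the source.

```julia
using LinearAlgebra

T, d_head = 256, 32
M        = randn(T, T)       # stand-in for one head's realized Monarch matrix
x        = randn(T, d_head)  # that head's 32-channel slice, (sequence, channels)
y_conv   = randn(T, d_head)  # stand-in for the causal depthwise conv output
g_logits = randn(d_head)     # hypothetical learned per-channel gate logits

Mc = M .* tril(ones(T, T))       # step 2: multiplicative 0/1 causal mask
y_monarch = Mc * x               # step 3: mix each channel across the sequence
g = 1 ./ (1 .+ exp.(-g_logits))  # learned sigmoid gate
y = y_conv .* g' .+ y_monarch .* (1 .- g')   # step 5: gated combination
```

Because `Mc` is lower triangular, position t only mixes in positions 1..t, so causality holds even though M itself is dense.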
### Key Differences from Transformer
| Property | Transformer | Monarch Mixer |
|---|---|---|
| Sequence mixing | Dynamic (input-dependent attention) | Fixed (learned mixing matrices) |
| Position encoding | RoPE (separate) | None (implicit in Monarch matrices) |
| Complexity | O(T^2 * D) | O(T^(3/2)) realize + O(T^2) apply |
| Seq mixer params/block | 262K | **67K** (74% reduction) |
| Layers (same param budget) | 6 | **8** (extra layers from param savings) |
### Parameter Efficiency
The 74% reduction in sequence mixing parameters (67K vs 262K per block) enables 2 extra layers at the same total parameter budget:
| Component | Params per block |
|---|---|
| CausalDepthwiseConv1d (K=4) | 1,024 |
| 8 x MonarchMatrix (2 x 16^3 each) | 65,536 |
| LearnedGate | 256 |
| **Total sequence mixing** | **66,816** |
| SwiGLU FFN | 491,520 |
| RMSNorm x 2 | 512 |
| **Block total** | 558,848 |
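The counts in the table follow directly from the hyperparameters; a quick arithmetic check:

```julia
D, K, nheads, p, H = 256, 4, 8, 16, 640  # embed dim, conv kernel, heads, block size, FFN hidden

conv    = D * K                 # depthwise conv: one K-tap filter per channel
monarch = nheads * 2 * p^3      # two (p, p, p) factors per head
gate    = D                     # one gate weight per channel
seqmix  = conv + monarch + gate # total sequence mixing

ffn   = 3 * D * H               # SwiGLU: W1, W3 (D -> H) and W2 (H -> D)
norms = 2 * D                   # two RMSNorm scales

block_total = seqmix + ffn + norms
```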
## Model Details
| Parameter | Value |
|---|---|
| Total parameters | 4,983,040 |
| Embedding dim | 256 |
| Layers | 8 |
| Monarch heads | 8 |
| Channels per head | 32 |
| Block size (p) | 16 (T = p^2 = 256) |
| Conv kernel size | 4 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | None (learned in Monarch matrices) |
| Weight tying | Yes |
## Training
| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 89 minutes |
| Throughput | ~19K tok/s |
### Training Curves
| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 7.28 | 5.58 | 265.4 |
| 2,000 | 4.29 | 4.21 | 67.6 |
| 6,000 | 3.83 | 3.81 | 45.3 |
| 10,000 | 3.69 | 3.68 | 39.6 |
| 12,305 | 3.66 | **3.65** | **38.4** |
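The Val PPL column is exp(Val Loss); the small discrepancies come from the losses being rounded to two decimals:

```julia
val_losses = [5.58, 4.21, 3.81, 3.68, 3.65]
val_ppls   = exp.(val_losses)   # ≈ [265.1, 67.4, 45.2, 39.6, 38.5]
```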
### Key Findings
- Monarch Mixer achieves **89% of the baseline Transformer quality** at the same parameter budget
- The 4x parameter reduction in sequence mixing (67K vs 262K per block) enables 2 extra layers
- The model learns coherent language generation using only fixed learned mixing patterns — no dynamic attention
- Throughput is 27% lower than Transformer due to Monarch matrix realization overhead
- Both models generate coherent English with dialogue, grammar, and philosophical content
## Relationship to Symbiogenesis
MonarchSLM's Monarch matrices serve as one of three "organelles" in the [Symbiogenesis](https://huggingface.co/LisaMegaWatts/SymbioSLM) architecture. In Symbiogenesis, Monarch provides the global sub-quadratic mixing component alongside CausalConv (local patterns) and LongConv (dense causal filtering), all fused via a learned per-channel OrganelleGate.
The biological metaphor: MonarchSLM is like a prokaryote — a single-organelle organism. SymbioSLM is the eukaryote — multiple organelles fused into one cell.
## Implementation
Built entirely in Julia:
- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — batched_mul for Monarch realization, softmax, activations
Monarch matrix realization uses `NNlib.batched_mul` for the block-diagonal matrix multiplications, making it fully differentiable through Zygote.
Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
## Usage
### OpenAI-Compatible API
Served via [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM):
```bash
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "the nature of"}],
"max_tokens": 200,
"temperature": 0.8,
"top_k": 40
}'
```
### Load in Julia
```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux
tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())
model = create_model(ModelConfig(;
arch="monarch", vocab_size=vocab_size(tok),
embed_dim=256, n_layers=8, n_heads=4, head_dim=64,
n_monarch_heads=8, conv_kernel_size=4,
ffn_mult=4, context_length=256, weight_tying=true,
))
text = generate(model, ps, st, tok, "the nature of ";
max_new_tokens=200, temperature=0.8, top_k=40)
```
## Files
| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format, 74MB) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |
## Provenance
- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: Monarch Mixer (Fu et al., 2023), adapted for Julia/Lux.jl
- **First Julia implementation** of Monarch Mixer sequence mixing
## References
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
- Dao, T., et al. (2022). Monarch: Expressive Structured Matrices for Efficient and Accurate Training. *ICML 2022*.
- Karpathy, A. (2023). nanoGPT. GitHub repository.
## Citation
```bibtex
@misc{monarchslm2026,
title={MonarchSLM: A Monarch Mixer Language Model in Pure Julia},
author={LisaMegaWatts},
year={2026},
url={https://huggingface.co/LisaMegaWatts/MonarchSLM}
}
```
## License
MIT