---
language:
  - en
license: mit
library_name: lux
tags:
  - julia
  - lux
  - slm
  - philosophy
  - monarch-mixer
  - sub-quadratic
  - structured-matrix
  - rmsnorm
  - swiglu
  - bpe
  - text-generation
pipeline_tag: text-generation
datasets:
  - LisaMegaWatts/philosophy-corpus
model-index:
  - name: MonarchSLM
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: LisaMegaWatts/philosophy-corpus
          name: philosophy-corpus
        metrics:
          - type: perplexity
            value: 38.4
            name: Val PPL
          - type: loss
            value: 3.65
            name: Val Loss
---

# MonarchSLM

A 4.98M parameter decoder-only Monarch Mixer model trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. To our knowledge, this is the **first Monarch Mixer implementation in Julia**.

Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence mixing architectures.

## Model Family

MonarchSLM is the **Monarch Mixer variant** in a family of three architectures trained on the same data with matched parameter budgets:

| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| **MonarchSLM** | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

## Architecture

```
JuliaGPTModel (monarch)
+-- tok_emb: Embedding(2000 -> 256)     [weight-tied with output head]
+-- blocks x 8:
|   +-- ln1: RMSNorm(256)
|   +-- seq_mixer: MonarchSequenceMixer
|   |   +-- conv: CausalDepthwiseConv1d(256, kernel=4)
|   |   +-- monarchs: 8 x MonarchMatrix(T=256, p=16)
|   |   |   +-- L1: (16, 16, 16)  # block-diagonal factor 1
|   |   |   +-- L2: (16, 16, 16)  # block-diagonal factor 2
|   |   +-- gate: LearnedGate(256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```

### How Monarch Sequence Mixing Works

Monarch matrices (Dao et al., 2022) factorize a T x T mixing matrix as:

```
M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
```

where T = p^2 (T=256, p=16), P is the reshape-transpose permutation, and L1, L2 are (p, p, p) tensors, each stacking the p dense p x p blocks of one block-diagonal factor.

**Per-head forward pass:**

1. Realize the T x T mixing matrix M from learned factors L1, L2
2. Apply a multiplicative 0/1 causal mask (lower triangular)
3. Multiply: each head's channel slice (32 channels) is mixed across the sequence dimension
4. A short causal convolution (kernel=4) provides complementary local n-gram context
5. Conv and Monarch outputs are combined via a learned sigmoid gate

**No positional encoding needed** — the Monarch matrices learn position-dependent mixing patterns directly.
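
As a toy illustration of the factorization above, here is a pure-Python sketch that realizes a dense mixing matrix from the two block-diagonal factors, using T = 16, p = 4. All names and values here are illustrative, not the repository's API; the actual model uses T = 256, p = 16 and `NNlib.batched_mul`.

```python
# Toy realization of M = P^T * BlockDiag(L1) * P * BlockDiag(L2) in pure Python.
p = 4
T = p * p

def block_diag(blocks):
    """Expand p blocks of size p x p into a T x T block-diagonal matrix."""
    M = [[0.0] * T for _ in range(T)]
    for b in range(p):
        for i in range(p):
            for j in range(p):
                M[b * p + i][b * p + j] = blocks[b][i][j]
    return M

def reshape_transpose_perm():
    """Permutation matrix sending flat index i*p + j to j*p + i."""
    P = [[0.0] * T for _ in range(T)]
    for i in range(p):
        for j in range(p):
            P[j * p + i][i * p + j] = 1.0
    return P

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(T)) for j in range(T)]
            for i in range(T)]

def transpose(A):
    return [list(row) for row in zip(*A)]

# "Learned" factors: p blocks of p x p each (arbitrary values for the sketch)
L1 = [[[(b + i * j) % 3 - 1.0 for j in range(p)] for i in range(p)] for b in range(p)]
L2 = [[[(b * i + j) % 3 - 1.0 for j in range(p)] for i in range(p)] for b in range(p)]

P = reshape_transpose_perm()
M = matmul(matmul(transpose(P), block_diag(L1)), matmul(P, block_diag(L2)))
# M is a dense T x T mixing matrix parameterized by only 2 * p^3 = 128
# values, versus T^2 = 256 for an unstructured matrix.
```

In the model, the causal mask is applied to the realized M before it mixes each head's channel slice along the sequence dimension.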

### Key Differences from Transformer

| Property | Transformer | Monarch Mixer |
|---|---|---|
| Sequence mixing | Dynamic (input-dependent attention) | Fixed (learned mixing matrices) |
| Position encoding | RoPE (separate) | None (implicit in Monarch matrices) |
| Complexity | O(T^2 * D) | O(T^(3/2)) realize + O(T^2) apply |
| Seq mixer params/block | 262K | **67K** (74% reduction) |
| Layers (same param budget) | 6 | **8** (extra layers from param savings) |

### Parameter Efficiency

The 74% reduction in sequence mixing parameters (67K vs 262K per block) enables 2 extra layers at the same total parameter budget:

| Component | Params per block |
|---|---|
| CausalDepthwiseConv1d (K=4) | 1,024 |
| 8 x MonarchMatrix (2 x 16^3 each) | 65,536 |
| LearnedGate | 256 |
| **Total sequence mixing** | **66,816** |
| SwiGLU FFN | 491,520 |
| RMSNorm x 2 | 512 |
| **Block total** | 558,848 |
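
The table can be reproduced with back-of-the-envelope arithmetic. This sketch assumes no bias terms (consistent with the totals above) and the dimensions listed in the Model Details section:

```python
# embed dim, conv kernel, Monarch block size, Monarch heads, FFN hidden dim
D, K, p, heads, H = 256, 4, 16, 8, 640

conv = D * K                      # depthwise conv: one length-K filter per channel -> 1,024
monarch = heads * 2 * p**3        # 8 heads x (L1 + L2), each a (p, p, p) tensor -> 65,536
gate = D                          # one learned gate value per channel -> 256
seq_mix = conv + monarch + gate   # -> 66,816

swiglu = 3 * D * H                # gate, up, and down projections -> 491,520
norms = 2 * D                     # two RMSNorm scale vectors -> 512
block = seq_mix + swiglu + norms  # -> 558,848

# 8 blocks + tied embedding (2000 x 256) + final RMSNorm -> 4,983,040
total = 8 * block + 2000 * D + D
```

The `total` line matches the 4,983,040 parameters reported in Model Details, which is a useful cross-check that the per-block breakdown is complete.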

## Model Details

| Parameter | Value |
|---|---|
| Total parameters | 4,983,040 |
| Embedding dim | 256 |
| Layers | 8 |
| Monarch heads | 8 |
| Channels per head | 32 |
| Block size (p) | 16 (T = p^2 = 256) |
| Conv kernel size | 4 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | None (learned in Monarch matrices) |
| Weight tying | Yes |

## Training

| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 89 minutes |
| Throughput | ~19K tok/s |
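
The ~100M token figure follows directly from the schedule. A quick sanity check, assuming every step processes a full batch of full-length sequences:

```python
steps, batch, ctx = 12_305, 32, 256
tokens_per_step = batch * ctx            # 8,192 tokens per optimizer step
total_tokens = steps * tokens_per_step   # 100,802,560 (~100M)

# Chinchilla-style target of 20 tokens per parameter:
chinchilla_target = 20 * 4_983_040       # 99,660,800
```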

### Training Curves

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 7.28 | 5.58 | 265.4 |
| 2,000 | 4.29 | 4.21 | 67.6 |
| 6,000 | 3.83 | 3.81 | 45.3 |
| 10,000 | 3.69 | 3.68 | 39.6 |
| 12,305 | 3.66 | **3.65** | **38.4** |
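
The Val PPL column is simply the exponential of the validation cross-entropy loss, so the two columns can be checked against each other:

```python
import math

# Perplexity = exp(mean next-token cross-entropy loss).
# exp(3.65) is roughly 38.5, matching the reported 38.4 up to rounding.
for loss in (5.58, 4.21, 3.65):
    ppl = math.exp(loss)
```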

### Key Findings

- Monarch Mixer achieves **89% of the baseline Transformer quality** at the same parameter budget
- The 4x parameter reduction in sequence mixing (67K vs 262K per block) enables 2 extra layers
- The model learns coherent language generation using only fixed learned mixing patterns — no dynamic attention
- Throughput is 27% lower than Transformer due to Monarch matrix realization overhead
- Both models generate coherent English with dialogue, grammar, and philosophical content

## Relationship to Symbiogenesis

MonarchSLM's Monarch matrices serve as one of three "organelles" in the [Symbiogenesis](https://huggingface.co/LisaMegaWatts/SymbioSLM) architecture. In Symbiogenesis, Monarch provides the global sub-quadratic mixing component alongside CausalConv (local patterns) and LongConv (dense causal filtering), all fused via a learned per-channel OrganelleGate.

The biological metaphor: MonarchSLM is like a prokaryote — a single-organelle organism. SymbioSLM is the eukaryote — multiple organelles fused into one cell.

## Implementation

Built entirely in Julia:

- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — batched_mul for Monarch realization, softmax, activations

Monarch matrix realization uses `NNlib.batched_mul` for the block-diagonal matrix multiplications, making it fully differentiable through Zygote.

Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).

## Usage

### OpenAI-Compatible API

Served via [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM):

```bash
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="monarch", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=8, n_heads=4, head_dim=64,
    n_monarch_heads=8, conv_kernel_size=4,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```

## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format, 74MB) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |

## Provenance

- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: Monarch Mixer (Fu et al., 2023), adapted for Julia/Lux.jl
- **First Julia implementation** of Monarch Mixer sequence mixing

## References

- Dao, T., et al. (2022). Monarch: Expressive Structured Matrices for Efficient and Accurate Training. *ICML 2022*.
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
- Karpathy, A. (2023). nanoGPT. GitHub repository.

## Citation

```bibtex
@misc{monarchslm2026,
  title={MonarchSLM: A Monarch Mixer Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MonarchSLM}
}
```

## License

MIT