LisaMegaWatts committed
Commit 3bf5aa2 · verified · 1 Parent(s): e4c3c9e

Add model card with architecture details, provenance, and training metrics

Files changed (1):
  1. README.md +201 -114

README.md CHANGED
@@ -1,176 +1,263 @@
  ---
  language:
- - en
- library_name: julia
  license: mit
- pipeline_tag: text-generation
  tags:
- - philosophy
- - classical-texts
- - julia
- - lux
- - bpe
- - monarch-mixer
- - rmsnorm
- - swiglu
- - small-language-model
- - openai-compatible
- - chinchilla
- - sub-quadratic
  datasets:
- - LisaMegaWatts/philosophy-corpus
  ---
- # MonarchSLM — Inference Server Artifacts
-
- Serving-ready artifacts for the [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM), an OpenAI-compatible inference endpoint for the 5M parameter Monarch Mixer model.
-
- For full training details, loss curves, architecture comparison, and code see the canonical model repo: **[LisaMegaWatts/julia-slm](https://huggingface.co/LisaMegaWatts/julia-slm)**.
-
- ## Model Summary
-
- A 4,983,040 parameter decoder-only model using **Monarch Mixer** sequence mixing (Dao et al., 2023) trained to Chinchilla-optimal (100M tokens at 20 tokens/param) on classical philosophy and liberal arts texts. First known Julia implementation of Monarch Mixer for language modeling.
-
- ### Architecture
-
  ```
- JuliaGPTModel
- ├── tok_emb: Embedding(2000 → 256)      # weight-tied with output head
- ├── blocks × 8:
- │   ├── ln1: RMSNorm(256)
- │   ├── seq_mixer: MonarchSequenceMixer
- │   │   ├── conv: CausalDepthwiseConv1d(256, kernel=4)
- │   │   ├── monarchs × 8: MonarchMatrix(256, L1/L2 ∈ ℝ^{16×16×16})
- │   │   └── gate: LearnedGate(256)
- │   ├── ln2: RMSNorm(256)
- │   └── ffn: SwiGLU(256 → 640 → 256)
- ├── ln_f: RMSNorm(256)
- └── head: TiedEmbeddingHead (2000,)     # shares tok_emb weights
  ```
- ### Monarch Matrix
-
- A Monarch matrix of size T×T (T=p²=256, p=16) factorizes as:
-
  ```
- M = P · BlockDiag(L1) · P · BlockDiag(L2)
  ```
-
- - L1, L2: p block-diagonal matrices of size p×p
- - P: reshape-transpose permutation
- - **Parameters per head**: 2p³ = 8,192 (vs 65,536 for dense T²)
- | Component | Detail |
  |---|---|
- | Parameters | 4,983,040 |
  | Embedding dim | 256 |
  | Layers | 8 |
- | Monarch heads | 8 (each mixing 32 channels over 256 positions) |
- | Conv kernel | 4 (causal depthwise) |
- | FFN multiplier | 4x (SwiGLU, hidden 640) |
  | Context length | 256 tokens |
- | Normalization | RMSNorm (pre-norm) |
  | Weight tying | Yes |
- | Bias | None |
-
- ### Training
-
- | Metric | Value |
  |---|---|
- | Optimizer | AdamW (lr=6e-4, min_lr=6e-5, wd=0.1) |
- | Schedule | Cosine decay with 500-step warmup |
- | Precision | Mixed F16/F32 |
  | Batch size | 32 |
- | Training steps | 12,305 |
- | Tokens processed | ~100M |
- | Training time | 89 min on RTX 3060 12GB |
  | Throughput | ~19K tok/s |
- | Final val loss | 3.65 |
- | Final val PPL | 38.4 |
- ### Loss Curve
-
  | Step | Train Loss | Val Loss | Val PPL |
- |---|---|---|---|
- | 500 | 6.31 | 5.26 | 192.4 |
- | 2,000 | 4.15 | 4.15 | 63.4 |
- | 6,000 | 3.77 | 3.79 | 44.3 |
- | 10,000 | 3.62 | 3.67 | 39.3 |
- | 12,305 | 3.62 | 3.65 | 38.4 |
-
- ### Comparison with Transformer Baseline
-
- | Metric | Transformer | Monarch Mixer |
- |---|---|---|
- | Parameters | 5,037,312 | 4,983,040 |
- | Blocks | 6 | 8 |
- | Val Loss | **3.54** | 3.65 |
- | Val PPL | **34.5** | 38.4 |
- | Training time | 66 min | 89 min |
- | Seq mixing params/block | 262K | 67K (4x fewer) |
-
- Monarch reaches **94% of baseline quality** while using **4x fewer parameters per block** in sequence mixing, enabling 8 blocks instead of 6.
-
- ### Tokenizer
-
- ByteLevel BPE with 2,000 subword tokens, trained on the philosophy corpus. Tokenizer files (`vocab.json`, `merges.txt`) are sourced from the [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) dataset.
-
- ### Training Data
-
- [LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus): 981 source texts (BookCorpus, WikiText-103, PG-19, classical philosophy) processed through a custom text pipeline with deduplication and quality scoring.
-
- - **Train tokens**: 794.9M (pre-encoded as `train.bin`)
- - **Val tokens**: 88.2M (pre-encoded as `val.bin`)
- - **Sources**: Aristotle, Plato, Cicero, Seneca, Marcus Aurelius, Epictetus, Euclid, Kant, Spinoza, Nietzsche, and more
- ## Files
-
- | File | Description |
- |---|---|
- | `final.jld2` | Model parameters (JLD2 format, 74MB) |
- | `config.toml` | Architecture config (5m-monarch) |
- | `vocab.json` | BPE vocabulary (2000 tokens, dict format) |
- | `merges.txt` | BPE merge rules |
-
- ## Inference API
-
- The [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM) serves this model via an OpenAI-compatible API with SSE streaming, temperature, top-k, and top-p sampling. CPU-only inference using pure NNlib (no Lux dependency at runtime).
-
  ```bash
- # Streaming
  curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
    -H "Content-Type: application/json" \
-   -d '{"messages": [{"role": "user", "content": "the nature of"}], "stream": true, "temperature": 0.8, "top_k": 40}'
-
- # Non-streaming
- curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{"messages": [{"role": "user", "content": "the nature of"}], "max_tokens": 200}'
  ```
-
- ### Endpoints
-
- - `GET /` — Health check and model info
- - `GET /v1/models` — List available models
- - `POST /v1/chat/completions` — Generate text (streaming + non-streaming)
-
- ## Framework
-
- Built with:
- - [Lux.jl](https://github.com/LuxDL/Lux.jl) — Explicit-parameter neural networks (training)
- - [NNlib.jl](https://github.com/FluxML/NNlib.jl) — batched_mul, softmax, activations (inference)
- - [Zygote.jl](https://github.com/FluxML/Zygote.jl) — Automatic differentiation (training)
- - [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) — GPU acceleration (training)
-
  ## References
-
- - [Monarch Mixer (Dao et al., 2023)](https://arxiv.org/abs/2310.12109) — Sub-quadratic GEMM-based architecture
- - [Chinchilla (Hoffmann et al., 2022)](https://arxiv.org/abs/2203.15556) — Compute-optimal training scaling
-
- ## Related
-
- - **[LisaMegaWatts/julia-slm](https://huggingface.co/LisaMegaWatts/julia-slm)** — Canonical model repo (both transformer and monarch variants)
- - **[LisaMegaWatts/JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM)** — Transformer variant inference artifacts
- - **[JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM)** — Transformer inference endpoint
- - **[MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM)** — This model's inference endpoint
- - **[LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus)** — Training dataset + tokenizer
  ---
  language:
+ - en
  license: mit
+ library_name: lux
  tags:
+ - julia
+ - lux
+ - slm
+ - philosophy
+ - monarch-mixer
+ - sub-quadratic
+ - structured-matrix
+ - rmsnorm
+ - swiglu
+ - bpe
+ - text-generation
+ pipeline_tag: text-generation
  datasets:
+ - LisaMegaWatts/philosophy-corpus
+ model-index:
+ - name: MonarchSLM
+   results:
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       type: LisaMegaWatts/philosophy-corpus
+       name: philosophy-corpus
+     metrics:
+     - type: perplexity
+       value: 38.4
+       name: Val PPL
+     - type: loss
+       value: 3.65
+       name: Val Loss
  ---
+ # MonarchSLM
+
+ A 4.98M parameter decoder-only Monarch Mixer model trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. To our knowledge, this is the **first Monarch Mixer implementation in Julia**.
+
+ Part of the [Julia SLM](https://github.com/buildwithbooks/julia-slm) family of models exploring alternative sequence-mixing architectures.
+
+ ## Model Family
+
+ MonarchSLM is the **Monarch Mixer variant** in a family of three architectures trained on the same data with matched parameter budgets:
+
+ | Model | Architecture | Sequence Mixing | Val PPL | Params |
+ |---|---|---|---|---|
+ | [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
+ | **MonarchSLM** | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
+ | [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |
+
+ ## Architecture
+
  ```
+ JuliaGPTModel (monarch)
+ +-- tok_emb: Embedding(2000 -> 256)        [weight-tied with output head]
+ +-- blocks x 8:
+ |   +-- ln1: RMSNorm(256)
+ |   +-- seq_mixer: MonarchSequenceMixer
+ |   |   +-- conv: CausalDepthwiseConv1d(256, kernel=4)
+ |   |   +-- monarchs: 8 x MonarchMatrix(T=256, p=16)
+ |   |   |   +-- L1: (16, 16, 16)   # block-diagonal factor 1
+ |   |   |   +-- L2: (16, 16, 16)   # block-diagonal factor 2
+ |   |   +-- gate: LearnedGate(256)
+ |   +-- ln2: RMSNorm(256)
+ |   +-- ffn: SwiGLU(256 -> 640 -> 256)
+ +-- ln_f: RMSNorm(256)
+ +-- head: TiedEmbeddingHead -> (2000,)
  ```
+
+ ### How Monarch Sequence Mixing Works
+
+ Monarch matrices (Fu et al., 2023) factorize a T x T mixing matrix as:
+
  ```
+ M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
  ```
+
+ where T = p^2 (T=256, p=16), P is a reshape-transpose permutation, and L1, L2 are (p, p, p) tensors of p block-diagonal p x p matrices.
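
The factorization can be checked numerically. This NumPy sketch (illustration only, not the repo's Julia code; random stand-ins for the learned L1/L2) realizes M and confirms the per-head parameter count of 2p^3 = 8,192:

```python
import numpy as np

p, T = 16, 256  # T = p^2, as in the model
rng = np.random.default_rng(0)

# L1, L2: p blocks of p x p each -> two (p, p, p) learned factor tensors
L1 = rng.standard_normal((p, p, p))
L2 = rng.standard_normal((p, p, p))

def block_diag(L):
    """Assemble p blocks of p x p into a block-diagonal T x T matrix."""
    M = np.zeros((p * p, p * p))
    for i in range(p):
        M[i * p:(i + 1) * p, i * p:(i + 1) * p] = L[i]
    return M

# P: the reshape-transpose permutation, built by permuting identity rows
P = np.eye(T).reshape(p, p, T).transpose(1, 0, 2).reshape(T, T)

# M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
M = P.T @ block_diag(L1) @ P @ block_diag(L2)

assert M.shape == (T, T)
assert L1.size + L2.size == 2 * p**3 == 8192  # learned params per head
```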
+
+ **Per-head forward pass:**
+
+ 1. Realize the T x T mixing matrix M from the learned factors L1, L2
+ 2. Apply a multiplicative 0/1 causal mask (lower triangular)
+ 3. Multiply: each head's channel slice (32 channels) is mixed across the sequence dimension
+ 4. A short causal convolution (kernel=4) provides complementary local n-gram context
+ 5. Combine the conv and Monarch outputs via a learned sigmoid gate
+
+ **No positional encoding needed** — the Monarch matrices learn position-dependent mixing patterns directly.
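
A minimal NumPy sketch of steps 1-5 for a single head. The shapes follow the description above, but the exact conv parameterization and gating form are assumptions, not the repo's Julia implementation:

```python
import numpy as np

T, C, K = 256, 32, 4   # sequence length, channels per head, conv kernel
rng = np.random.default_rng(1)

x = rng.standard_normal((T, C))    # one head's channel slice
M = rng.standard_normal((T, T))    # stand-in for the realized Monarch matrix (step 1)

# Step 2: multiplicative 0/1 causal mask (lower triangular)
M_causal = M * np.tril(np.ones((T, T)))

# Step 3: mix across the sequence dimension
mixed = M_causal @ x               # (T, C)

# Step 4: short causal depthwise conv for local n-gram context
w = rng.standard_normal((K, C))                 # one kernel tap per channel
x_pad = np.vstack([np.zeros((K - 1, C)), x])    # left-pad so output stays causal
conv = sum(w[k] * x_pad[K - 1 + np.arange(T) - k] for k in range(K))

# Step 5: learned per-channel sigmoid gate blends conv and Monarch outputs
g = 1.0 / (1.0 + np.exp(-rng.standard_normal(C)))
out = g * mixed + (1.0 - g) * conv
assert out.shape == (T, C)
```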
+
+ ### Key Differences from Transformer
+
+ | Property | Transformer | Monarch Mixer |
+ |---|---|---|
+ | Sequence mixing | Dynamic (input-dependent attention) | Fixed (learned mixing matrices) |
+ | Position encoding | RoPE (separate) | None (implicit in Monarch matrices) |
+ | Complexity | O(T^2 * D) | O(T^(3/2)) realize + O(T^2) apply |
+ | Seq mixer params/block | 262K | **67K** (74% reduction) |
+ | Layers (same param budget) | 6 | **8** (extra layers from param savings) |
+
+ ### Parameter Efficiency
+
+ The 74% reduction in sequence mixing parameters (67K vs 262K per block) enables 2 extra layers at the same total parameter budget:
+
+ | Component | Params per block |
+ |---|---|
+ | CausalDepthwiseConv1d (K=4) | 1,024 |
+ | 8 x MonarchMatrix (2 x 16^3 each) | 65,536 |
+ | LearnedGate | 256 |
+ | **Total sequence mixing** | **66,816** |
+ | SwiGLU FFN | 491,520 |
+ | RMSNorm x 2 | 512 |
+ | **Block total** | **558,848** |
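
As a sanity check, the tally above reproduces the reported 4,983,040 total (assuming the tied embedding is counted once and the final RMSNorm contributes 256 params):

```python
# Per-block parameter tally for the Monarch Mixer block
conv = 256 * 4              # CausalDepthwiseConv1d: one kernel-4 tap set per channel
monarch = 8 * 2 * 16**3     # 8 heads x two (16, 16, 16) factor tensors
gate = 256                  # LearnedGate, one weight per channel
seq_mixing = conv + monarch + gate
assert seq_mixing == 66_816

ffn = 3 * 256 * 640         # SwiGLU: gate, up, and down projections, no bias
norms = 2 * 256             # two RMSNorms (scale only)
block = seq_mixing + ffn + norms
assert block == 558_848

# Whole model: tied embedding counted once, 8 blocks, final RMSNorm
total = 2000 * 256 + 8 * block + 256
assert total == 4_983_040
```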
+
+ ## Model Details
+
+ | Parameter | Value |
  |---|---|
+ | Total parameters | 4,983,040 |
  | Embedding dim | 256 |
  | Layers | 8 |
+ | Monarch heads | 8 |
+ | Channels per head | 32 |
+ | Block size (p) | 16 (T = p^2 = 256) |
+ | Conv kernel size | 4 |
+ | FFN hidden dim | 640 |
  | Context length | 256 tokens |
+ | Vocabulary | 2,000 (ByteLevel BPE) |
+ | Position encoding | None (learned in Monarch matrices) |
  | Weight tying | Yes |
+
+ ## Training
+
+ | Setting | Value |
  |---|---|
+ | Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
+ | Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
+ | Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
+ | Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
+ | Warmup | 500 steps (linear) |
+ | Max steps | 12,305 |
  | Batch size | 32 |
+ | Gradient clipping | 1.0 (global norm) |
+ | Precision | Float16 AMP (Float32 master weights) |
+ | Hardware | NVIDIA RTX 3060 12GB |
+ | Training time | 89 minutes |
  | Throughput | ~19K tok/s |
+
+ ### Training Curves
+
  | Step | Train Loss | Val Loss | Val PPL |
+ |---|---|---|---|
+ | 500 | 7.28 | 5.58 | 265.4 |
+ | 2,000 | 4.29 | 4.21 | 67.6 |
+ | 6,000 | 3.83 | 3.81 | 45.3 |
+ | 10,000 | 3.69 | 3.68 | 39.6 |
+ | 12,305 | 3.66 | **3.65** | **38.4** |
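
Val PPL is simply the exponential of the val loss (assuming the loss is mean per-token cross-entropy in nats), which the table rows confirm to within rounding:

```python
import math

# perplexity = exp(mean cross-entropy loss in nats)
curve = [(5.58, 265.4), (4.21, 67.6), (3.81, 45.3), (3.68, 39.6), (3.65, 38.4)]
for loss, ppl in curve:
    # reported PPL agrees with exp(loss) to within ~1% (loss is rounded to 2 dp)
    assert abs(math.exp(loss) - ppl) / ppl < 0.01
```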
+
+ ### Key Findings
+
+ - Monarch Mixer achieves **89% of the baseline Transformer quality** (Val PPL 38.4 vs 34.5) at the same parameter budget
+ - The 4x parameter reduction in sequence mixing (67K vs 262K per block) enables 2 extra layers
+ - The model learns coherent language generation using only fixed learned mixing patterns — no dynamic attention
+ - Throughput is 27% lower than the Transformer's due to Monarch matrix realization overhead
+ - Both models generate coherent English with dialogue, grammar, and philosophical content
+
+ ## Relationship to Symbiogenesis
+
+ MonarchSLM's Monarch matrices serve as one of three "organelles" in the [Symbiogenesis](https://huggingface.co/LisaMegaWatts/SymbioSLM) architecture. In Symbiogenesis, Monarch provides the global sub-quadratic mixing component alongside CausalConv (local patterns) and LongConv (dense causal filtering), all fused via a learned per-channel OrganelleGate.
+
+ The biological metaphor: MonarchSLM is the prokaryote, a single-organelle organism; SymbioSLM is the eukaryote, multiple organelles fused into one cell.
+
+ ## Implementation
+
+ Built entirely in Julia:
+
+ - **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
+ - **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
+ - **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
+ - **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — batched_mul for Monarch realization, softmax, activations
+
+ Monarch matrix realization uses `NNlib.batched_mul` for the block-diagonal matrix multiplications, making it fully differentiable through Zygote.
+
+ Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
+
+ ## Usage
+
+ ### OpenAI-Compatible API
+
+ Served via the [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM):
+
  ```bash
  curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
    -H "Content-Type: application/json" \
+   -d '{
+         "messages": [{"role": "user", "content": "the nature of"}],
+         "max_tokens": 200,
+         "temperature": 0.8,
+         "top_k": 40
+       }'
+ ```
+
+ ### Load in Julia
+
+ ```julia
+ using Pkg; Pkg.activate("julia-slm")
+ include("src/JuliaGPT.jl")
+ using .JuliaGPT; using .JuliaGPT: Lux
+
+ tok = BPETokenizer("vocab.json", "merges.txt")
+ ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())
+
+ model = create_model(ModelConfig(;
+     arch="monarch", vocab_size=vocab_size(tok),
+     embed_dim=256, n_layers=8, n_heads=4, head_dim=64,
+     n_monarch_heads=8, conv_kernel_size=4,
+     ffn_mult=4, context_length=256, weight_tying=true,
+ ))
+
+ text = generate(model, ps, st, tok, "the nature of ";
+     max_new_tokens=200, temperature=0.8, top_k=40)
  ```
+
+ ## Files
+
+ | File | Description |
+ |---|---|
+ | `final.jld2` | Trained model parameters (JLD2 format, 74MB) |
+ | `config.toml` | Model architecture configuration |
+ | `vocab.json` | BPE vocabulary (2000 tokens) |
+ | `merges.txt` | BPE merge rules |
+
+ ## Provenance
+
+ - **Author**: LisaMegaWatts
+ - **Training code**: [buildwithbooks/julia-slm](https://github.com/buildwithbooks/julia-slm)
+ - **Data pipeline**: [buildwithbooks/text-pipeline](https://github.com/buildwithbooks/text-pipeline)
+ - **Training date**: February 2026
+ - **Architecture reference**: Monarch Mixer (Fu et al., 2023), adapted for Julia/Lux.jl
+ - **First Julia implementation** of Monarch Mixer sequence mixing
+
  ## References
+
+ - Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
+ - Karpathy, A. (2023). nanoGPT. GitHub repository.
+
+ ## Citation
+
+ ```bibtex
+ @misc{monarchslm2026,
+   title={MonarchSLM: A Monarch Mixer Language Model in Pure Julia},
+   author={LisaMegaWatts},
+   year={2026},
+   url={https://huggingface.co/LisaMegaWatts/MonarchSLM}
+ }
+ ```
+
+ ## License
+
+ MIT