---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- monarch-mixer
- sub-quadratic
- structured-matrix
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: MonarchSLM
results:
- task:
type: text-generation
name: Text Generation
dataset:
type: LisaMegaWatts/philosophy-corpus
name: philosophy-corpus
metrics:
- type: perplexity
value: 38.4
name: Val PPL
- type: loss
value: 3.65
name: Val Loss
---
# MonarchSLM
A 4.98M-parameter decoder-only Monarch Mixer model trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. To our knowledge, this is the **first Monarch Mixer implementation in Julia**.
Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence mixing architectures.
## Model Family
MonarchSLM is the **Monarch Mixer variant** in a family of three architectures trained on the same data with matched parameter budgets:
| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| **MonarchSLM** | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |
## Architecture
```
JuliaGPTModel (monarch)
+-- tok_emb: Embedding(2000 -> 256) [weight-tied with output head]
+-- blocks x 8:
| +-- ln1: RMSNorm(256)
| +-- seq_mixer: MonarchSequenceMixer
| | +-- conv: CausalDepthwiseConv1d(256, kernel=4)
| | +-- monarchs: 8 x MonarchMatrix(T=256, p=16)
| | | +-- L1: (16, 16, 16) # block-diagonal factor 1
| | | +-- L2: (16, 16, 16) # block-diagonal factor 2
| | +-- gate: LearnedGate(256)
| +-- ln2: RMSNorm(256)
| +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```
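The per-block components above (RMSNorm and a SwiGLU FFN) are standard; for readers unfamiliar with them, here is a minimal, language-agnostic NumPy sketch (the actual model is Julia/Lux.jl, and the bias-free projections here are an assumption that matches the parameter counts below):

```python
import numpy as np

def rmsnorm(x, g, eps=1e-6):
    # RMSNorm: rescale by the reciprocal root-mean-square; no mean subtraction
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * g

def swiglu(x, W_gate, W_up, W_down):
    # SwiGLU FFN: SiLU-gated expansion (256 -> 640), then projection back (640 -> 256)
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(z) = z * sigmoid(z)
    return (silu * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
D, H = 256, 640                      # embed dim and FFN hidden dim from the table
x = rng.standard_normal((4, D))      # 4 token positions
g = np.ones(D)
W_gate = rng.standard_normal((D, H)) * 0.02
W_up   = rng.standard_normal((D, H)) * 0.02
W_down = rng.standard_normal((H, D)) * 0.02
y = swiglu(rmsnorm(x, g), W_gate, W_up, W_down)
print(y.shape)  # (4, 256)
```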
### How Monarch Sequence Mixing Works
Monarch matrices (Dao et al., 2022) factorize a T x T mixing matrix as:
```
M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
```
where T = p^2 (here T = 256, p = 16), P is the reshape-transpose permutation, and L1, L2 are (p, p, p) tensors, each stacking the p learned p x p blocks of one block-diagonal factor.
**Per-head forward pass:**
1. Realize the T x T mixing matrix M from learned factors L1, L2
2. Apply a multiplicative 0/1 causal mask (lower triangular)
3. Multiply: each head's channel slice (32 channels) is mixed across the sequence dimension
4. A short causal convolution (kernel=4) provides complementary local n-gram context
5. Conv and Monarch outputs are combined via a learned sigmoid gate
**No positional encoding needed** — the Monarch matrices learn position-dependent mixing patterns directly.
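The per-head forward pass above can be sketched in a few lines of NumPy (a toy p = 4 rather than the model's p = 16, dense loops instead of `batched_mul`, and Python rather than Julia, purely for illustration):

```python
import numpy as np

p = 4
T = p * p  # toy size; the model uses p = 16, T = 256

def realize_monarch(L1, L2):
    # Step 1: realize M = P^T @ BlockDiag(L1) @ P @ BlockDiag(L2)
    def block_diag(L):
        B = np.zeros((T, T))
        for i in range(p):
            B[i*p:(i+1)*p, i*p:(i+1)*p] = L[i]
        return B
    # P is the reshape-transpose permutation: index i*p + j -> j*p + i
    P = np.zeros((T, T))
    for i in range(p):
        for j in range(p):
            P[j*p + i, i*p + j] = 1.0
    return P.T @ block_diag(L1) @ P @ block_diag(L2)

rng = np.random.default_rng(0)
L1 = rng.standard_normal((p, p, p))
L2 = rng.standard_normal((p, p, p))
M = realize_monarch(L1, L2)
M_causal = np.tril(M)            # Step 2: multiplicative 0/1 causal mask
x = rng.standard_normal((T, 8))  # one head's channel slice (T positions x channels)
y = M_causal @ x                 # Step 3: mix across the sequence dimension
print(M.shape, y.shape)  # (16, 16) (16, 8)
```

Steps 4–5 (the kernel-4 causal convolution and the learned sigmoid gate) then blend this output with local n-gram context.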
### Key Differences from Transformer
| Property | Transformer | Monarch Mixer |
|---|---|---|
| Sequence mixing | Dynamic (input-dependent attention) | Fixed (learned mixing matrices) |
| Position encoding | RoPE (separate) | None (implicit in Monarch matrices) |
| Complexity | O(T^2 * D) | O(T^(3/2)) in factored form; O(T^2) realize + apply here (for causal masking) |
| Seq mixer params/block | 262K | **67K** (74% reduction) |
| Layers (same param budget) | 6 | **8** (extra layers from param savings) |
### Parameter Efficiency
The 74% reduction in sequence mixing parameters (67K vs 262K per block) enables 2 extra layers at the same total parameter budget:
| Component | Params per block |
|---|---|
| CausalDepthwiseConv1d (K=4) | 1,024 |
| 8 x MonarchMatrix (2 x 16^3 each) | 65,536 |
| LearnedGate | 256 |
| **Total sequence mixing** | **66,816** |
| SwiGLU FFN | 491,520 |
| RMSNorm x 2 | 512 |
| **Block total** | 558,848 |
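The table's counts can be reproduced with a few lines of arithmetic (assuming bias-free layers, which is what the totals imply):

```python
# Reproduce the per-block parameter counts from the table above.
D, K, p, heads = 256, 4, 16, 8
conv = D * K                      # depthwise conv: one K-tap filter per channel -> 1,024
monarch = heads * 2 * p**3        # 8 heads x two (p, p, p) factors -> 65,536
gate = D                          # learned per-channel gate -> 256
seq_mix = conv + monarch + gate   # 66,816
H = 640
ffn = 3 * D * H                   # SwiGLU: gate, up, down projections -> 491,520
norms = 2 * D                     # two RMSNorms -> 512
block = seq_mix + ffn + norms     # 558,848
total = 8 * block + 2000 * D + D  # 8 blocks + tied embedding + final norm
print(seq_mix, block, total)  # 66816 558848 4983040
```

The grand total matches the 4,983,040 parameters reported under Model Details, confirming the weight-tied output head adds no extra parameters.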
## Model Details
| Parameter | Value |
|---|---|
| Total parameters | 4,983,040 |
| Embedding dim | 256 |
| Layers | 8 |
| Monarch heads | 8 |
| Channels per head | 32 |
| Block size (p) | 16 (T = p^2 = 256) |
| Conv kernel size | 4 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | None (learned in Monarch matrices) |
| Weight tying | Yes |
## Training
| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 89 minutes |
| Throughput | ~19K tok/s |
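The schedule implementation is not shown in this card; a plausible sketch of the stated settings (linear warmup for 500 steps, then cosine decay from 6e-4 to 6e-5 over 12,305 steps) is:

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup=500, max_steps=12305):
    # Linear warmup, then cosine decay from max_lr down to min_lr
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / (max_steps - warmup)  # progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))

# Peak at the end of warmup, near min_lr at the final step
print(lr_at(500), lr_at(12304))
```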
### Training Curves
| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 7.28 | 5.58 | 265.4 |
| 2,000 | 4.29 | 4.21 | 67.6 |
| 6,000 | 3.83 | 3.81 | 45.3 |
| 10,000 | 3.69 | 3.68 | 39.6 |
| 12,305 | 3.66 | **3.65** | **38.4** |
### Key Findings
- Monarch Mixer achieves **89% of the baseline Transformer quality** at the same parameter budget
- The 4x parameter reduction in sequence mixing (67K vs 262K per block) enables 2 extra layers
- The model learns coherent language generation using only fixed learned mixing patterns — no dynamic attention
- Throughput is 27% lower than Transformer due to Monarch matrix realization overhead
- Both models generate coherent English with dialogue, grammar, and philosophical content
## Relationship to Symbiogenesis
MonarchSLM's Monarch matrices serve as one of three "organelles" in the [Symbiogenesis](https://huggingface.co/LisaMegaWatts/SymbioSLM) architecture. In Symbiogenesis, Monarch provides the global sub-quadratic mixing component alongside CausalConv (local patterns) and LongConv (dense causal filtering), all fused via a learned per-channel OrganelleGate.
The biological metaphor: MonarchSLM is like a prokaryote — a single-organelle organism. SymbioSLM is the eukaryote — multiple organelles fused into one cell.
## Implementation
Built entirely in Julia:
- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — batched_mul for Monarch realization, softmax, activations
Monarch matrix realization uses `NNlib.batched_mul` for the block-diagonal matrix multiplications, making it fully differentiable through Zygote.
Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
## Usage
### OpenAI-Compatible API
Served via [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM):
```bash
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "the nature of"}],
"max_tokens": 200,
"temperature": 0.8,
"top_k": 40
}'
```
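The same request can be issued from Python's standard library (no extra dependencies; the endpoint URL and body fields are taken verbatim from the curl example):

```python
import json
from urllib import request

API_URL = "https://lisamegawatts-monarchslm.hf.space/v1/chat/completions"

def build_payload(prompt, max_tokens=200, temperature=0.8, top_k=40):
    # Same request body as the curl example above
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_k": top_k,
    }

def complete(prompt, url=API_URL):
    # POST the JSON payload and return the decoded JSON response
    req = request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(complete("the nature of"))
```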
### Load in Julia
```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux
tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())
model = create_model(ModelConfig(;
arch="monarch", vocab_size=vocab_size(tok),
embed_dim=256, n_layers=8, n_heads=4, head_dim=64,
n_monarch_heads=8, conv_kernel_size=4,
ffn_mult=4, context_length=256, weight_tying=true,
))
text = generate(model, ps, st, tok, "the nature of ";
max_new_tokens=200, temperature=0.8, top_k=40)
```
## Files
| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format, 74MB) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |
## Provenance
- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: Monarch Mixer (Fu et al., 2023), adapted for Julia/Lux.jl
- **First Julia implementation** of Monarch Mixer sequence mixing
## References
- Fu, D. Y., Epstein, S., Nguyen, E., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
- Dao, T., et al. (2022). Monarch: Expressive Structured Matrices for Efficient and Accurate Training. *ICML 2022*.
- Karpathy, A. (2023). nanoGPT. GitHub repository.
## Citation
```bibtex
@misc{monarchslm2026,
title={MonarchSLM: A Monarch Mixer Language Model in Pure Julia},
author={LisaMegaWatts},
year={2026},
url={https://huggingface.co/LisaMegaWatts/MonarchSLM}
}
```
## License
MIT