---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- monarch-mixer
- sub-quadratic
- structured-matrix
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: MonarchSLM
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: LisaMegaWatts/philosophy-corpus
      name: philosophy-corpus
    metrics:
    - type: perplexity
      value: 38.4
      name: Val PPL
    - type: loss
      value: 3.65
      name: Val Loss
---

# MonarchSLM

A 4.98M-parameter decoder-only Monarch Mixer model trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. To our knowledge, this is the **first Monarch Mixer implementation in Julia**.

Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence mixing architectures.

## Model Family

MonarchSLM is the **Monarch Mixer variant** in a family of three architectures trained on the same data with matched parameter budgets:

| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| **MonarchSLM** | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

## Architecture

```
JuliaGPTModel (monarch)
+-- tok_emb: Embedding(2000 -> 256)        [weight-tied with output head]
+-- blocks x 8:
|   +-- ln1: RMSNorm(256)
|   +-- seq_mixer: MonarchSequenceMixer
|   |   +-- conv: CausalDepthwiseConv1d(256, kernel=4)
|   |   +-- monarchs: 8 x MonarchMatrix(T=256, p=16)
|   |   |   +-- L1: (16, 16, 16)   # block-diagonal factor 1
|   |   |   +-- L2: (16, 16, 16)   # block-diagonal factor 2
|   |   +-- gate: LearnedGate(256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```

### How Monarch Sequence Mixing Works

Monarch matrices (Dao et al., 2022) factorize a T x T mixing matrix as:

```
M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
```

where T = p^2 (here T = 256, p = 16), P is a reshape-transpose permutation, and L1, L2 are (p, p, p) tensors each holding p block-diagonal p x p matrices.

**Per-head forward pass:**

1. Realize the T x T mixing matrix M from the learned factors L1, L2
2. Apply a multiplicative 0/1 causal mask (lower triangular)
3. Multiply: each head's channel slice (32 channels) is mixed across the sequence dimension
4. A short causal convolution (kernel=4) provides complementary local n-gram context
5. Conv and Monarch outputs are combined via a learned sigmoid gate

**No positional encoding is needed** — the Monarch matrices learn position-dependent mixing patterns directly.

### Key Differences from Transformer

| Property | Transformer | Monarch Mixer |
|---|---|---|
| Sequence mixing | Dynamic (input-dependent attention) | Fixed (learned mixing matrices) |
| Position encoding | RoPE (separate) | None (implicit in Monarch matrices) |
| Complexity | O(T^2 * D) | O(T^(3/2)) realize + O(T^2) apply |
| Seq mixer params/block | 262K | **67K** (74% reduction) |
| Layers (same param budget) | 6 | **8** (extra layers from param savings) |

### Parameter Efficiency

The 74% reduction in sequence-mixing parameters (67K vs 262K per block) frees enough of the budget for 2 extra layers at the same total parameter count:

| Component | Params per block |
|---|---|
| CausalDepthwiseConv1d (K=4) | 1,024 |
| 8 x MonarchMatrix (2 x 16^3 each) | 65,536 |
| LearnedGate | 256 |
| **Total sequence mixing** | **66,816** |
| SwiGLU FFN | 491,520 |
| RMSNorm x 2 | 512 |
| **Block total** | 558,848 |

## Model Details

| Parameter | Value |
|---|---|
| Total parameters | 4,983,040 |
| Embedding dim | 256 |
| Layers | 8 |
| Monarch heads | 8 |
| Channels per head | 32 |
| Block size (p) | 16 (T = p^2 = 256) |
| Conv kernel size | 4 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | None (learned in Monarch matrices) |
| Weight tying | Yes |

## Training

| Setting | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 89 minutes |
| Throughput | ~19K tok/s |

### Training Curves

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 7.28 | 5.58 | 265.4 |
| 2,000 | 4.29 | 4.21 | 67.6 |
| 6,000 | 3.83 | 3.81 | 45.3 |
| 10,000 | 3.69 | 3.68 | 39.6 |
| 12,305 | 3.66 | **3.65** | **38.4** |

### Key Findings

- Monarch Mixer reaches **89% of the baseline Transformer's quality** (val PPL 38.4 vs 34.5) at the same parameter budget
- The ~4x parameter reduction in sequence mixing (67K vs 262K per block) pays for 2 extra layers
- The model learns coherent language generation using only fixed learned mixing patterns — no dynamic attention
- Throughput is 27% lower than the Transformer's due to Monarch matrix realization overhead
- Both models generate coherent English with dialogue, grammar, and philosophical content

## Relationship to Symbiogenesis

MonarchSLM's Monarch matrices serve as one of three "organelles" in the [Symbiogenesis](https://huggingface.co/LisaMegaWatts/SymbioSLM) architecture. In Symbiogenesis, Monarch provides the global sub-quadratic mixing component alongside CausalConv (local patterns) and LongConv (dense causal filtering), all fused via a learned per-channel OrganelleGate.

The biological metaphor: MonarchSLM is like a prokaryote — a single-organelle organism.
SymbioSLM is the eukaryote — multiple organelles fused into one cell.

## Implementation

Built entirely in Julia:

- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — `batched_mul` for Monarch realization, softmax, activations

Monarch matrix realization uses `NNlib.batched_mul` for the block-diagonal matrix multiplications, making it fully differentiable through Zygote. Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).

## Usage

### OpenAI-Compatible API

Served via the [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM):

```bash
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT
using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="monarch",
    vocab_size=vocab_size(tok),
    embed_dim=256,
    n_layers=8,
    n_heads=4,
    head_dim=64,
    n_monarch_heads=8,
    conv_kernel_size=4,
    ffn_mult=4,
    context_length=256,
    weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```

## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format, 74MB) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |

## Provenance

- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: Monarch Mixer (Fu et al., 2023), adapted for Julia/Lux.jl
- **First Julia implementation** of Monarch Mixer sequence mixing

## References

- Dao, T., et al. (2022). Monarch: Expressive Structured Matrices for Efficient and Accurate Training. *ICML 2022*.
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
- Karpathy, A. (2023). nanoGPT. GitHub repository.

## Citation

```bibtex
@misc{monarchslm2026,
  title={MonarchSLM: A Monarch Mixer Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MonarchSLM}
}
```

## License

MIT
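## Appendix: Monarch Realization Sketch

The factorization M = P^T * BlockDiag(L1) * P * BlockDiag(L2) described above can be made concrete in a few lines of base Julia. This is an illustrative sketch under stated assumptions, not the repository's implementation: `blockdiag`, `perm_matrix`, and `realize_monarch` are hypothetical names, a toy p = 4 is used (the model uses p = 16, T = 256), and the real code batches these products with `NNlib.batched_mul` rather than materializing dense block-diagonal matrices.

```julia
# Illustrative sketch of Monarch matrix realization (hypothetical helper names).

# Materialize a T x T block-diagonal matrix from p blocks of size p x p.
function blockdiag(L::Array{Float64,3})
    p = size(L, 1)
    B = zeros(p * p, p * p)
    for b in 1:p
        r = (b - 1) * p + 1 : b * p
        B[r, r] = L[:, :, b]
    end
    return B
end

# The reshape-transpose permutation P: index q*p + r maps to r*p + q.
function perm_matrix(p::Int)
    T = p * p
    P = zeros(T, T)
    for i in 0:T-1
        q, r = divrem(i, p)
        P[r * p + q + 1, i + 1] = 1.0
    end
    return P
end

# M = P^T * BlockDiag(L1) * P * BlockDiag(L2), followed by a multiplicative
# 0/1 causal mask that keeps only the lower triangle.
function realize_monarch(L1, L2)
    p = size(L1, 1)
    T = p * p
    P = perm_matrix(p)
    M = P' * blockdiag(L1) * P * blockdiag(L2)
    return [i >= j ? M[i, j] : 0.0 for i in 1:T, j in 1:T]
end

p = 4                         # toy block size; the model uses p = 16
L1 = randn(p, p, p)
L2 = randn(p, p, p)
M = realize_monarch(L1, L2)   # (16, 16) causal mixing matrix
x = randn(p * p)              # one channel's values across the sequence
y = M * x                     # mixed across the sequence dimension
```

Mixing a whole head is then `M * X` for its 32-channel slice `X`; since the strictly upper triangle of `M` is zero, position t never sees positions after it.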
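## Appendix: Conv-and-Gate Sketch

Steps 4 and 5 of the per-head forward pass (the kernel-4 causal depthwise convolution and the learned sigmoid gate) can be sketched the same way. This is again an assumption-laden illustration, not the repo's API: `causal_depthwise_conv`, the (channels, time) layout, and the per-channel gate-logit shape are guesses for exposition.

```julia
# Illustrative sketch: causal depthwise conv + learned sigmoid gate
# (hypothetical names and shapes, not the repository's code).
sigmoid(z) = 1 / (1 + exp(-z))

# Each channel has its own length-K filter and sees only the current step
# and the K-1 previous steps (implicit left zero-padding).
function causal_depthwise_conv(x::Matrix{Float64}, w::Matrix{Float64})
    D, T = size(x)            # (channels, time)
    K = size(w, 2)
    y = zeros(D, T)
    for t in 1:T, k in 1:K
        s = t - (K - k)       # tap k = K is the current step
        if s >= 1
            y[:, t] .+= w[:, k] .* x[:, s]
        end
    end
    return y
end

D, T, K = 8, 16, 4
x = randn(D, T)               # sequence of D-channel activations
w = randn(D, K)               # learned depthwise filters
g = randn(D)                  # learned per-channel gate logits (assumed shape)
conv_out = causal_depthwise_conv(x, w)
monarch_out = randn(D, T)     # stand-in for the Monarch path's output
y = sigmoid.(g) .* conv_out .+ (1 .- sigmoid.(g)) .* monarch_out
```

The gate lets each channel interpolate between local n-gram context from the conv path and global mixing from the Monarch path.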