---
language:
  - en
license: mit
library_name: lux
tags:
  - julia
  - lux
  - slm
  - philosophy
  - monarch-mixer
  - sub-quadratic
  - structured-matrix
  - rmsnorm
  - swiglu
  - bpe
  - text-generation
pipeline_tag: text-generation
datasets:
  - LisaMegaWatts/philosophy-corpus
model-index:
  - name: MonarchSLM
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: LisaMegaWatts/philosophy-corpus
          name: philosophy-corpus
        metrics:
          - type: perplexity
            value: 38.4
            name: Val PPL
          - type: loss
            value: 3.65
            name: Val Loss
---

# MonarchSLM

A 4.98M parameter decoder-only Monarch Mixer model trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. To our knowledge, this is the **first Monarch Mixer implementation in Julia**.

Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence mixing architectures.

## Model Family

MonarchSLM is the **Monarch Mixer variant** in a family of three architectures trained on the same data with matched parameter budgets:

| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| **MonarchSLM** | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

## Architecture

```
JuliaGPTModel (monarch)
+-- tok_emb: Embedding(2000 -> 256)     [weight-tied with output head]
+-- blocks x 8:
|   +-- ln1: RMSNorm(256)
|   +-- seq_mixer: MonarchSequenceMixer
|   |   +-- conv: CausalDepthwiseConv1d(256, kernel=4)
|   |   +-- monarchs: 8 x MonarchMatrix(T=256, p=16)
|   |   |   +-- L1: (16, 16, 16)  # block-diagonal factor 1
|   |   |   +-- L2: (16, 16, 16)  # block-diagonal factor 2
|   |   +-- gate: LearnedGate(256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```

### How Monarch Sequence Mixing Works

Monarch matrices (Dao et al., 2022) factorize a T x T mixing matrix as:

```
M = P^T * BlockDiag(L1) * P * BlockDiag(L2)
```

where T = p^2 (T=256, p=16), P is the reshape-transpose permutation, and L1, L2 are (p, p, p) tensors, each stacking the p dense p x p blocks of one block-diagonal factor.

**Per-head forward pass:**

1. Realize the T x T mixing matrix M from learned factors L1, L2
2. Apply a multiplicative 0/1 causal mask (lower triangular)
3. Multiply: each head's channel slice (32 channels) is mixed across the sequence dimension
4. A short causal convolution (kernel=4) provides complementary local n-gram context
5. Conv and Monarch outputs are combined via a learned sigmoid gate

**No positional encoding needed** — the Monarch matrices learn position-dependent mixing patterns directly.
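
As a toy illustration of the factorization above, here is a pure-Python sketch that realizes a dense mixing matrix from the two block-diagonal factors, using T = 16, p = 4. All names and values here are illustrative, not the repository's API; the actual model uses T = 256, p = 16 and `NNlib.batched_mul`.

```python
# Toy realization of M = P^T * BlockDiag(L1) * P * BlockDiag(L2) in pure Python.
p = 4
T = p * p

def block_diag(blocks):
    """Expand p blocks of size p x p into a T x T block-diagonal matrix."""
    M = [[0.0] * T for _ in range(T)]
    for b in range(p):
        for i in range(p):
            for j in range(p):
                M[b * p + i][b * p + j] = blocks[b][i][j]
    return M

def reshape_transpose_perm():
    """Permutation matrix sending flat index i*p + j to j*p + i."""
    P = [[0.0] * T for _ in range(T)]
    for i in range(p):
        for j in range(p):
            P[j * p + i][i * p + j] = 1.0
    return P

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(T)) for j in range(T)]
            for i in range(T)]

def transpose(A):
    return [list(row) for row in zip(*A)]

# "Learned" factors: p blocks of p x p each (arbitrary values for the sketch)
L1 = [[[(b + i * j) % 3 - 1.0 for j in range(p)] for i in range(p)] for b in range(p)]
L2 = [[[(b * i + j) % 3 - 1.0 for j in range(p)] for i in range(p)] for b in range(p)]

P = reshape_transpose_perm()
M = matmul(matmul(transpose(P), block_diag(L1)), matmul(P, block_diag(L2)))
# M is a dense T x T mixing matrix parameterized by only 2 * p^3 = 128
# values, versus T^2 = 256 for an unstructured matrix.
```

In the model, the causal mask is applied to the realized M before it mixes each head's channel slice along the sequence dimension.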

### Key Differences from Transformer

| Property | Transformer | Monarch Mixer |
|---|---|---|
| Sequence mixing | Dynamic (input-dependent attention) | Fixed (learned mixing matrices) |
| Position encoding | RoPE (separate) | None (implicit in Monarch matrices) |
| Complexity | O(T^2 * D) | O(T^(3/2)) realize + O(T^2) apply |
| Seq mixer params/block | 262K | **67K** (74% reduction) |
| Layers (same param budget) | 6 | **8** (extra layers from param savings) |

### Parameter Efficiency

The 74% reduction in sequence mixing parameters (67K vs 262K per block) enables 2 extra layers at the same total parameter budget:

| Component | Params per block |
|---|---|
| CausalDepthwiseConv1d (K=4) | 1,024 |
| 8 x MonarchMatrix (2 x 16^3 each) | 65,536 |
| LearnedGate | 256 |
| **Total sequence mixing** | **66,816** |
| SwiGLU FFN | 491,520 |
| RMSNorm x 2 | 512 |
| **Block total** | 558,848 |
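
The table can be reproduced with back-of-the-envelope arithmetic. This sketch assumes no bias terms (consistent with the totals above) and the dimensions listed in the Model Details section:

```python
# embed dim, conv kernel, Monarch block size, Monarch heads, FFN hidden dim
D, K, p, heads, H = 256, 4, 16, 8, 640

conv = D * K                      # depthwise conv: one length-K filter per channel -> 1,024
monarch = heads * 2 * p**3        # 8 heads x (L1 + L2), each a (p, p, p) tensor -> 65,536
gate = D                          # one learned gate value per channel -> 256
seq_mix = conv + monarch + gate   # -> 66,816

swiglu = 3 * D * H                # gate, up, and down projections -> 491,520
norms = 2 * D                     # two RMSNorm scale vectors -> 512
block = seq_mix + swiglu + norms  # -> 558,848

# 8 blocks + tied embedding (2000 x 256) + final RMSNorm -> 4,983,040
total = 8 * block + 2000 * D + D
```

The `total` line matches the 4,983,040 parameters reported in Model Details, which is a useful cross-check that the per-block breakdown is complete.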

## Model Details

| Parameter | Value |
|---|---|
| Total parameters | 4,983,040 |
| Embedding dim | 256 |
| Layers | 8 |
| Monarch heads | 8 |
| Channels per head | 32 |
| Block size (p) | 16 (T = p^2 = 256) |
| Conv kernel size | 4 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | None (learned in Monarch matrices) |
| Weight tying | Yes |

## Training

| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 89 minutes |
| Throughput | ~19K tok/s |
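
The ~100M token figure follows directly from the schedule. A quick sanity check, assuming every step processes a full batch of full-length sequences:

```python
steps, batch, ctx = 12_305, 32, 256
tokens_per_step = batch * ctx            # 8,192 tokens per optimizer step
total_tokens = steps * tokens_per_step   # 100,802,560 (~100M)

# Chinchilla-style target of 20 tokens per parameter:
chinchilla_target = 20 * 4_983_040       # 99,660,800
```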

### Training Curves

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 7.28 | 5.58 | 265.4 |
| 2,000 | 4.29 | 4.21 | 67.6 |
| 6,000 | 3.83 | 3.81 | 45.3 |
| 10,000 | 3.69 | 3.68 | 39.6 |
| 12,305 | 3.66 | **3.65** | **38.4** |
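
The Val PPL column is simply the exponential of the validation cross-entropy loss, so the two columns can be checked against each other:

```python
import math

# Perplexity = exp(mean next-token cross-entropy loss).
# exp(3.65) is roughly 38.5, matching the reported 38.4 up to rounding.
for loss in (5.58, 4.21, 3.65):
    ppl = math.exp(loss)
```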

### Key Findings

- Monarch Mixer achieves **89% of the baseline Transformer quality** at the same parameter budget
- The 4x parameter reduction in sequence mixing (67K vs 262K per block) enables 2 extra layers
- The model learns coherent language generation using only fixed learned mixing patterns — no dynamic attention
- Throughput is 27% lower than Transformer due to Monarch matrix realization overhead
- Both models generate coherent English with dialogue, grammar, and philosophical content

## Relationship to Symbiogenesis

MonarchSLM's Monarch matrices serve as one of three "organelles" in the [Symbiogenesis](https://huggingface.co/LisaMegaWatts/SymbioSLM) architecture. In Symbiogenesis, Monarch provides the global sub-quadratic mixing component alongside CausalConv (local patterns) and LongConv (dense causal filtering), all fused via a learned per-channel OrganelleGate.

The biological metaphor: MonarchSLM is like a prokaryote — a single-organelle organism. SymbioSLM is the eukaryote — multiple organelles fused into one cell.

## Implementation

Built entirely in Julia:

- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — batched_mul for Monarch realization, softmax, activations

Monarch matrix realization uses `NNlib.batched_mul` for the block-diagonal matrix multiplications, making it fully differentiable through Zygote.

Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).

## Usage

### OpenAI-Compatible API

Served via [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM):

```bash
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="monarch", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=8, n_heads=4, head_dim=64,
    n_monarch_heads=8, conv_kernel_size=4,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```

## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format, 74MB) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |

## Provenance

- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: Monarch Mixer (Fu et al., 2023), adapted for Julia/Lux.jl
- **First Julia implementation** of Monarch Mixer sequence mixing

## References

- Dao, T., et al. (2022). Monarch: Expressive Structured Matrices for Efficient and Accurate Training. *ICML 2022*.
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
- Karpathy, A. (2023). nanoGPT. GitHub repository.

## Citation

```bibtex
@misc{monarchslm2026,
  title={MonarchSLM: A Monarch Mixer Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MonarchSLM}
}
```

## License

MIT