---
language:
- en
library_name: julia
license: mit
pipeline_tag: text-generation
tags:
- philosophy
- classical-texts
- julia
- lux
- bpe
- monarch-mixer
- rmsnorm
- swiglu
- small-language-model
- openai-compatible
- chinchilla
- sub-quadratic
datasets:
- LisaMegaWatts/philosophy-corpus
---

# MonarchSLM – Inference Server Artifacts

Serving-ready artifacts for the [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM), an OpenAI-compatible inference endpoint for the 5M-parameter Monarch Mixer model.

For full training details, loss curves, architecture comparison, and code, see the canonical model repo: **[LisaMegaWatts/julia-slm](https://huggingface.co/LisaMegaWatts/julia-slm)**.

## Model Summary

A 4,983,040-parameter decoder-only model using **Monarch Mixer** sequence mixing (Fu et al., 2023), trained to a Chinchilla-optimal budget (~100M tokens, i.e. 20 tokens per parameter) on classical philosophy and liberal arts texts. To our knowledge, this is the first Julia implementation of Monarch Mixer for language modeling.

### Architecture

```
JuliaGPTModel
├── tok_emb: Embedding(2000 → 256)                # weight-tied with output head
├── blocks × 8:
│   ├── ln1: RMSNorm(256)
│   ├── seq_mixer: MonarchSequenceMixer
│   │   ├── conv: CausalDepthwiseConv1d(256, kernel=4)
│   │   ├── monarchs × 8: MonarchMatrix(256, L1/L2 ∈ ℝ^{16×16×16})
│   │   └── gate: LearnedGate(256)
│   ├── ln2: RMSNorm(256)
│   └── ffn: SwiGLU(256 → 640 → 256)
├── ln_f: RMSNorm(256)
└── head: TiedEmbeddingHead → (2000,)             # shares tok_emb weights
```

### Monarch Matrix

A Monarch matrix of size T×T (T = p² = 256, p = 16) factorizes as:

```
M = Pᵀ · BlockDiag(L1) · P · BlockDiag(L2)
```

- L1, L2: block-diagonal matrices, each with p blocks of size p×p
- P: reshape-transpose permutation
- **Parameters per head**: 2p³ = 8,192 (vs. 65,536 for a dense T×T matrix)
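
To make the factorization concrete, here is a minimal Julia sketch of the math above (illustrative only, not the repository's implementation, which never materializes the dense T×T matrix):

```julia
# Build M = Pᵀ · BlockDiag(L1) · P · BlockDiag(L2) explicitly for p = 16.
using LinearAlgebra

p = 16; T = p^2                               # T = 256
L1 = [randn(p, p) for _ in 1:p]               # p dense p×p blocks
L2 = [randn(p, p) for _ in 1:p]

blockdiag(Ls) = cat(Ls...; dims=(1, 2))       # assemble a T×T block-diagonal

perm = vec(permutedims(reshape(1:T, p, p)))   # reshape-transpose permutation
P = Matrix{Float64}(I, T, T)[perm, :]

M = P' * blockdiag(L1) * P * blockdiag(L2)    # dense T×T Monarch matrix

@show 2p^3   # 8,192 learned parameters per head
@show T^2    # 65,536 for an unstructured dense mixer
```

Since P is a fixed permutation and each block-diagonal multiply reduces to batched p×p GEMMs, applying M costs O(T·p) per channel rather than O(T²), which is the source of the sub-quadratic scaling.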

| Component | Detail |
|---|---|
| Parameters | 4,983,040 |
| Embedding dim | 256 |
| Layers | 8 |
| Monarch heads | 8 (each mixing 32 channels over 256 positions) |
| Conv kernel | 4 (causal depthwise) |
| FFN multiplier | 4x (SwiGLU, hidden 640) |
| Context length | 256 tokens |
| Normalization | RMSNorm (pre-norm) |
| Weight tying | Yes |
| Bias | None |
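
For readers unfamiliar with SwiGLU, a minimal sketch of the bias-free 256 → 640 → 256 feed-forward (weight names `W_gate`, `W_up`, `W_down` are illustrative, not the repository's API):

```julia
# Gated feed-forward: down-project the swish-gated up-projection.
using NNlib: swish

struct SwiGLU{M<:AbstractMatrix}
    W_gate::M   # 640×256
    W_up::M     # 640×256
    W_down::M   # 256×640
end

(f::SwiGLU)(x) = f.W_down * (swish.(f.W_gate * x) .* (f.W_up * x))

ffn = SwiGLU(randn(Float32, 640, 256),
             randn(Float32, 640, 256),
             randn(Float32, 256, 640))
y = ffn(randn(Float32, 256))   # 256-vector in, 256-vector out
```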

### Training

| Metric | Value |
|---|---|
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, wd=0.1) |
| Schedule | Cosine decay with 500-step warmup |
| Precision | Mixed Float16/Float32 |
| Batch size | 32 |
| Training steps | 12,305 |
| Tokens processed | ~100M |
| Training time | 89 min on an RTX 3060 12GB |
| Throughput | ~19K tok/s |
| Final val loss | 3.65 |
| Final val PPL | 38.4 |
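
The schedule row corresponds to the standard warmup-plus-cosine rule; a small sketch of that rule with the table's numbers (not the actual training script):

```julia
# Linear warmup for 500 steps, then cosine decay from lr to min_lr.
function lr_at(step; lr=6e-4, min_lr=6e-5, warmup=500, total=12_305)
    step <= warmup && return lr * step / warmup           # linear warmup
    t = (step - warmup) / (total - warmup)                # 0 → 1 after warmup
    return min_lr + 0.5 * (lr - min_lr) * (1 + cospi(t))  # cosine decay
end

lr_at(500)     # ≈ 6e-4 (peak, end of warmup)
lr_at(12_305)  # ≈ 6e-5 (floor, final step)
```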

### Loss Curve

| Step | Train Loss | Val Loss | Val PPL |
|------|-----------|----------|---------|
| 500 | 6.31 | 5.26 | 192.4 |
| 2,000 | 4.15 | 4.15 | 63.4 |
| 6,000 | 3.77 | 3.79 | 44.3 |
| 10,000 | 3.62 | 3.67 | 39.3 |
| 12,305 | 3.62 | 3.65 | 38.4 |

### Comparison with Transformer Baseline

| Metric | Transformer | Monarch Mixer |
|---|---|---|
| Parameters | 5,037,312 | 4,983,040 |
| Blocks | 6 | 8 |
| Val Loss | **3.54** | 3.65 |
| Val PPL | **34.5** | 38.4 |
| Training time | 66 min | 89 min |
| Seq mixing params/block | 262K | 67K (4x fewer) |

Monarch reaches **94% of baseline quality** while using **4x fewer sequence-mixing parameters per block**; those savings are what make room for 8 blocks instead of 6 at roughly the same total parameter count.

### Tokenizer

Byte-level BPE with a 2,000-token subword vocabulary, trained on the philosophy corpus. The tokenizer files (`vocab.json`, `merges.txt`) are sourced from the [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) dataset.
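
A quick way to inspect these files from Julia (a sketch assuming JSON3.jl; the Space's own loader may differ):

```julia
# vocab.json maps token string → id; merges.txt lists BPE merges by priority.
using JSON3

vocab  = JSON3.read(read("vocab.json", String))
merges = filter(l -> !startswith(l, "#"), readlines("merges.txt"))  # skip header

length(vocab)      # 2000 token entries
first(merges, 5)   # highest-priority merge rules
```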

### Training Data

[LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) – 981 source texts (BookCorpus, WikiText-103, PG-19, classical philosophy) processed through a custom text pipeline with deduplication and quality scoring.

- **Train tokens**: 794.9M (pre-encoded as `train.bin`)
- **Val tokens**: 88.2M (pre-encoded as `val.bin`)
- **Sources**: Aristotle, Plato, Cicero, Seneca, Marcus Aurelius, Epictetus, Euclid, Kant, Spinoza, Nietzsche, and more
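
The `.bin` files can be memory-mapped for cheap batch sampling; a sketch assuming token ids are stored as UInt16 (the 2,000-token vocab fits in 16 bits — confirm the on-disk format against the canonical repo):

```julia
# Lazily map the whole pre-encoded token stream without reading it into RAM.
using Mmap

n = div(filesize("train.bin"), sizeof(UInt16))
tokens = open(io -> Mmap.mmap(io, Vector{UInt16}, (n,)), "train.bin")

length(tokens)          # ≈ 794.9M ids
window = tokens[1:256]  # one context-length slice
```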

## Files

| File | Description |
|---|---|
| `final.jld2` | Model parameters (JLD2 format, 74MB) |
| `config.toml` | Architecture config (`5m-monarch`) |
| `vocab.json` | BPE vocabulary (2,000 tokens, dict format) |
| `merges.txt` | BPE merge rules |
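
Loading the artifacts from Julia might look like this (the key layout inside `final.jld2` isn't documented here, so list the stored keys first and adapt):

```julia
# Read the architecture config and open the JLD2 checkpoint.
using JLD2, TOML

config = TOML.parsefile("config.toml")

jldopen("final.jld2", "r") do f
    @show keys(f)            # discover the stored parameter tree
end

artifacts = load("final.jld2")   # Dict of everything stored in the file
```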

## Inference API

The [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM) serves this model through an OpenAI-compatible API with SSE streaming and temperature, top-k, and top-p sampling. Inference is CPU-only and uses pure NNlib (no Lux dependency at runtime).

```bash
# Streaming
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "the nature of"}], "stream": true, "temperature": 0.8, "top_k": 40}'

# Non-streaming
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "the nature of"}], "max_tokens": 200}'
```
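
The same non-streaming request from Julia, as a sketch using HTTP.jl and JSON3.jl (any OpenAI-compatible client works):

```julia
# POST a chat completion and print the generated text.
using HTTP, JSON3

resp = HTTP.post(
    "https://lisamegawatts-monarchslm.hf.space/v1/chat/completions",
    ["Content-Type" => "application/json"],
    JSON3.write((messages = [(role = "user", content = "the nature of")],
                 max_tokens = 200)),
)
body = JSON3.read(resp.body)
println(body.choices[1].message.content)
```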

### Endpoints

- `GET /` – Health check and model info
- `GET /v1/models` – List available models
- `POST /v1/chat/completions` – Generate text (streaming and non-streaming)

## Framework

Built with:

- [Lux.jl](https://github.com/LuxDL/Lux.jl) – explicit-parameter neural networks (training)
- [NNlib.jl](https://github.com/FluxML/NNlib.jl) – `batched_mul`, `softmax`, and activations (inference)
- [Zygote.jl](https://github.com/FluxML/Zygote.jl) – automatic differentiation (training)
- [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) – GPU acceleration (training)

## References

- [Monarch Mixer (Fu et al., 2023)](https://arxiv.org/abs/2310.12109) – sub-quadratic, GEMM-based architecture
- [Chinchilla (Hoffmann et al., 2022)](https://arxiv.org/abs/2203.15556) – compute-optimal training scaling

## Related

- **[LisaMegaWatts/julia-slm](https://huggingface.co/LisaMegaWatts/julia-slm)** – canonical model repo (both transformer and monarch variants)
- **[LisaMegaWatts/JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM)** – transformer-variant inference artifacts
- **[JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM)** – transformer inference endpoint
- **[MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM)** – this model's inference endpoint
- **[LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus)** – training dataset and tokenizer