---
language:
- en
library_name: julia
license: mit
pipeline_tag: text-generation
tags:
- philosophy
- classical-texts
- julia
- lux
- bpe
- monarch-mixer
- rmsnorm
- swiglu
- small-language-model
- openai-compatible
- chinchilla
- sub-quadratic
datasets:
- LisaMegaWatts/philosophy-corpus
---

# MonarchSLM — Inference Server Artifacts

Serving-ready artifacts for the [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM), an OpenAI-compatible inference endpoint for the 5M-parameter Monarch Mixer model.

For full training details, loss curves, the architecture comparison, and code, see the canonical model repo: **[LisaMegaWatts/julia-slm](https://huggingface.co/LisaMegaWatts/julia-slm)**.

## Model Summary

A 4,983,040-parameter decoder-only model using **Monarch Mixer** sequence mixing (Dao et al., 2023), trained to the Chinchilla-optimal budget (100M tokens at 20 tokens/param) on classical philosophy and liberal arts texts. To our knowledge, this is the first Julia implementation of Monarch Mixer for language modeling.

34
+ ### Architecture
35
+
36
+ ```
37
+ JuliaGPTModel
38
+ β”œβ”€β”€ tok_emb: Embedding(2000 β†’ 256) # weight-tied with output head
39
+ β”œβ”€β”€ blocks Γ— 8:
40
+ β”‚ β”œβ”€β”€ ln1: RMSNorm(256)
41
+ β”‚ β”œβ”€β”€ seq_mixer: MonarchSequenceMixer
42
+ β”‚ β”‚ β”œβ”€β”€ conv: CausalDepthwiseConv1d(256, kernel=4)
43
+ β”‚ β”‚ β”œβ”€β”€ monarchs Γ— 8: MonarchMatrix(256, L1/L2 ∈ ℝ^{16Γ—16Γ—16})
44
+ β”‚ β”‚ └── gate: LearnedGate(256)
45
+ β”‚ β”œβ”€β”€ ln2: RMSNorm(256)
46
+ β”‚ └── ffn: SwiGLU(256 β†’ 640 β†’ 256)
47
+ β”œβ”€β”€ ln_f: RMSNorm(256)
48
+ └── head: TiedEmbeddingHead β†’ (2000,) # shares tok_emb weights
49
+ ```
50
+
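As a sanity check, the shapes in the tree above reproduce the stated parameter total. A back-of-envelope sketch in Python (one assumption that is ours, not from the repo: `LearnedGate(256)` holds one learnable scalar per channel):

```python
# Back-of-envelope parameter count for the architecture tree above.
# Assumption (not stated in the card): LearnedGate(256) = 1 scalar per channel.
V, d, L = 2000, 256, 8           # vocab, embedding dim, blocks
p, heads = 16, 8                 # Monarch block size (T = p^2 = 256), heads
ffn_hidden = 640

tok_emb = V * d                  # 512,000; weight-tied with the output head
per_block = (
    d                            # ln1 (RMSNorm scale)
    + d * 4                      # causal depthwise conv, kernel 4, no bias
    + heads * 2 * p**3           # 8 Monarch heads: L1 + L2, p blocks of p x p each
    + d                          # learned gate (assumed 1 param per channel)
    + d                          # ln2
    + 3 * d * ffn_hidden         # SwiGLU: gate, up, down projections, no bias
)
total = tok_emb + L * per_block + d   # + final RMSNorm; tied head adds nothing
print(total)                     # -> 4983040
```

The sequence-mixing slice of each block (conv + Monarch heads + gate) comes to ~67K parameters, matching the comparison table below the training results.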
### Monarch Matrix

A Monarch matrix of size T×T (T = p² = 256, p = 16) factorizes as:

```
M = Pᵀ · BlockDiag(L1) · P · BlockDiag(L2)
```

- L1, L2: block-diagonal, each holding p dense p×p blocks
- P: the reshape-transpose permutation
- **Parameters per head**: 2p³ = 8,192 (vs 65,536 for a dense T² matrix)

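The factorization can be illustrated numerically. A NumPy sketch (illustrative only, not the repo's Julia code) using a small p = 4, T = 16 for readability; at the model's p = 16 the same arithmetic gives 2p³ = 8,192 vs T² = 65,536:

```python
import numpy as np

# Sketch of the Monarch factorization M = P^T * BD(L1) * P * BD(L2),
# with small sizes for readability: p = 4, T = p^2 = 16.
p = 4
T = p * p
rng = np.random.default_rng(0)
L1 = rng.standard_normal((p, p, p))   # p dense blocks of size p x p
L2 = rng.standard_normal((p, p, p))

def block_diag(blocks):
    """Assemble a (T, T) block-diagonal matrix from p blocks of size p x p."""
    M = np.zeros((T, T))
    for i, B in enumerate(blocks):
        M[i*p:(i+1)*p, i*p:(i+1)*p] = B
    return M

# P is the reshape-transpose ("stride") permutation on T = p^2 indices.
perm = np.arange(T).reshape(p, p).T.reshape(-1)
P = np.eye(T)[perm]

M = P.T @ block_diag(L1) @ P @ block_diag(L2)

# Applying M through the two block-diagonal factors stores 2p^3 parameters
# instead of T^2 for an unstructured dense matrix.
print(2 * p**3, T * T)   # -> 128 256
```

Applying `M @ x` through the factors rather than the materialized matrix is what makes the mixing sub-quadratic in T.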
| Component | Detail |
|---|---|
| Parameters | 4,983,040 |
| Embedding dim | 256 |
| Layers | 8 |
| Monarch heads | 8 (each mixing 32 channels over 256 positions) |
| Conv kernel | 4 (causal depthwise) |
| FFN multiplier | 4x (SwiGLU, hidden 640) |
| Context length | 256 tokens |
| Normalization | RMSNorm (pre-norm) |
| Weight tying | Yes |
| Bias | None |

### Training

| Setting | Value |
|---|---|
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, wd=0.1) |
| Schedule | Cosine decay with 500-step warmup |
| Precision | Mixed F16/F32 |
| Batch size | 32 |
| Training steps | 12,305 |
| Tokens processed | ~100M |
| Training time | 89 min on an RTX 3060 12GB |
| Throughput | ~19K tok/s |
| Final val loss | 3.65 |
| Final val PPL | 38.4 |

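The schedule row above can be sketched as follows; the linear warmup shape and the decay horizon are assumptions on our part, and the repo's exact schedule may differ in detail:

```python
import math

# Sketch of the stated schedule: 500-step warmup to lr=6e-4, then cosine
# decay to min_lr=6e-5 over the remaining steps. The linear warmup shape
# and decay horizon are assumptions, not taken from the repo.
max_lr, min_lr = 6e-4, 6e-5
warmup, total_steps = 500, 12_305

def lr_at(step):
    if step < warmup:
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / (total_steps - warmup)      # 0 -> 1 after warmup
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

print(lr_at(0), lr_at(499), lr_at(total_steps - 1))
```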
### Loss Curve

| Step | Train Loss | Val Loss | Val PPL |
|------|-----------|----------|---------|
| 500 | 6.31 | 5.26 | 192.4 |
| 2,000 | 4.15 | 4.15 | 63.4 |
| 6,000 | 3.77 | 3.79 | 44.3 |
| 10,000 | 3.62 | 3.67 | 39.3 |
| 12,305 | 3.62 | 3.65 | 38.4 |

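The PPL column is simply the exponential of the validation loss (in nats); small mismatches against the table come from the losses being rounded to two decimals:

```python
import math

# Perplexity is exp(cross-entropy loss in nats), e.g. loss 3.65 -> PPL ~38.4.
for loss in (5.26, 3.79, 3.65):
    print(round(math.exp(loss), 1))
```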
### Comparison with Transformer Baseline

| Metric | Transformer | Monarch Mixer |
|---|---|---|
| Parameters | 5,037,312 | 4,983,040 |
| Blocks | 6 | 8 |
| Val Loss | **3.54** | 3.65 |
| Val PPL | **34.5** | 38.4 |
| Training time | 66 min | 89 min |
| Seq mixing params/block | 262K | 67K (4x fewer) |

The Monarch model lands within ~3% of the baseline's validation loss (3.65 vs 3.54) while using ~4x fewer sequence-mixing parameters per block, which is what funds 8 blocks instead of 6 at roughly the same total parameter count.

### Tokenizer

ByteLevel BPE with 2,000 subword tokens, trained on the philosophy corpus. The tokenizer files (`vocab.json`, `merges.txt`) are sourced from the [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) dataset.

### Training Data

[LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) — 981 source texts (BookCorpus, WikiText-103, PG-19, classical philosophy) processed through a custom text pipeline with deduplication and quality scoring.

- **Train tokens**: 794.9M (pre-encoded as `train.bin`)
- **Val tokens**: 88.2M (pre-encoded as `val.bin`)
- **Sources**: Aristotle, Plato, Cicero, Seneca, Marcus Aurelius, Epictetus, Euclid, Kant, Spinoza, Nietzsche, and more

## Files

| File | Description |
|---|---|
| `final.jld2` | Model parameters (JLD2 format, 74MB) |
| `config.toml` | Architecture config (5m-monarch) |
| `vocab.json` | BPE vocabulary (2,000 tokens, dict format) |
| `merges.txt` | BPE merge rules |

## Inference API

The [MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM) serves this model via an OpenAI-compatible API with SSE streaming and temperature, top-k, and top-p sampling. Inference is CPU-only and uses pure NNlib (no Lux dependency at runtime).

```bash
# Streaming
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "the nature of"}], "stream": true, "temperature": 0.8, "top_k": 40}'

# Non-streaming
curl -X POST https://lisamegawatts-monarchslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "the nature of"}], "max_tokens": 200}'
```

### Endpoints

- `GET /` — Health check and model info
- `GET /v1/models` — List available models
- `POST /v1/chat/completions` — Generate text (streaming and non-streaming)

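For clients without curl, a non-streaming request can be built with the Python standard library alone; the endpoint and payload mirror the curl example above (the `build_request` helper is illustrative, not part of any SDK):

```python
import json
import urllib.request

# Minimal non-streaming client for the OpenAI-compatible endpoint,
# mirroring the curl example above. Stdlib only; no `openai` package.
BASE = "https://lisamegawatts-monarchslm.hf.space"

def build_request(prompt, max_tokens=200, temperature=0.8, top_k=40):
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_k": top_k,
    }).encode()
    return urllib.request.Request(
        BASE + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("the nature of")
# To actually call the Space (network required):
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

The streaming variant additionally requires setting `"stream": true` and parsing SSE `data:` lines from the response body.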
## Framework

Built with:
- [Lux.jl](https://github.com/LuxDL/Lux.jl) — Explicit-parameter neural networks (training)
- [NNlib.jl](https://github.com/FluxML/NNlib.jl) — batched_mul, softmax, activations (inference)
- [Zygote.jl](https://github.com/FluxML/Zygote.jl) — Automatic differentiation (training)
- [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) — GPU acceleration (training)

## References

- [Monarch Mixer (Dao et al., 2023)](https://arxiv.org/abs/2310.12109) — Sub-quadratic GEMM-based architecture
- [Chinchilla (Hoffmann et al., 2022)](https://arxiv.org/abs/2203.15556) — Compute-optimal training scaling

## Related

- **[LisaMegaWatts/julia-slm](https://huggingface.co/LisaMegaWatts/julia-slm)** — Canonical model repo (both transformer and monarch variants)
- **[LisaMegaWatts/JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM)** — Transformer-variant inference artifacts
- **[JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM)** — Transformer inference endpoint
- **[MonarchSLM Space](https://huggingface.co/spaces/LisaMegaWatts/MonarchSLM)** — This model's inference endpoint
- **[LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus)** — Training dataset and tokenizer