LisaMegaWatts committed on
Commit d05796a · verified · 1 Parent(s): a90b172

Add model card with architecture details, provenance, and training metrics

Files changed (1):
  1. README.md +157 -96
README.md CHANGED
@@ -1,148 +1,209 @@
  ---
  language:
- - en
- library_name: julia
  license: mit
- pipeline_tag: text-generation
  tags:
- - philosophy
- - classical-texts
- - julia
- - lux
- - bpe
- - rope
- - rmsnorm
- - swiglu
- - small-language-model
- - openai-compatible
- - chinchilla
  datasets:
- - LisaMegaWatts/philosophy-corpus
  ---

- # JuliaSLM — Inference Server Artifacts

- Serving-ready artifacts for the [JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM), an OpenAI-compatible inference endpoint for the 5M-parameter JuliaSLM transformer.

- For full training details, loss curves, architecture diagrams, and code examples, see the canonical model repo: **[LisaMegaWatts/julia-slm](https://huggingface.co/LisaMegaWatts/julia-slm)**.

- ## Model Summary

- A 5,037,312-parameter decoder-only transformer trained to Chinchilla-optimal (100M tokens at 20 tokens/param) on classical philosophy and liberal arts texts.

- ### Architecture

  ```
- JuliaGPTModel
- ├── tok_emb: Embedding(2000 → 256)              # weight-tied with output head
- ├── rope: RotaryPositionalEncoding(64)
- ├── blocks × 6:
- │   ├── ln1: RMSNorm(256)
- │   ├── attn: MultiHeadAttention(4 heads, 64 dim each)
- │   │   ├── wq, wk, wv: Dense(256 → 256)
- │   │   └── wo: Dense(256 → 256)
- │   ├── ln2: RMSNorm(256)
- │   └── ffn: SwiGLU(256 → 1024 → 256)
- │       ├── w1: Dense(256 → 1024)               # gate
- │       ├── v:  Dense(256 → 1024)               # value
- │       └── w2: Dense(1024 → 256)               # down-project
- ├── ln_f: RMSNorm(256)
- └── head: TiedEmbeddingHead → (2000,)           # shares tok_emb weights
  ```

- | Component | Detail |
  |---|---|
- | Parameters | 5,037,312 |
  | Embedding dim | 256 |
  | Layers | 6 |
- | Attention heads | 4 (head dim 64) |
- | FFN multiplier | 4x (SwiGLU, hidden 1024) |
  | Context length | 256 tokens |
- | Positional encoding | Rotary (RoPE) |
- | Normalization | RMSNorm (pre-norm) |
  | Weight tying | Yes |
- | Bias | None |

- ### Training

- | Metric | Value |
  |---|---|
- | Optimizer | AdamW (lr=6e-4, min_lr=6e-5, wd=0.1) |
- | Schedule | Cosine decay with 500-step warmup |
- | Precision | Mixed F16/F32 |
  | Batch size | 32 |
- | Training steps | 12,305 |
- | Tokens processed | ~100M |
- | Training time | 66 min on RTX 3060 12GB |
  | Throughput | ~26K tok/s |
- | Final val loss | 3.54 |
- | Final val PPL | 34.5 |

- ### Loss Curve

  | Step | Train Loss | Val Loss | Val PPL |
- |------|-----------|----------|---------|
  | 500 | 6.69 | 5.01 | 149.6 |
  | 2,000 | 4.09 | 4.02 | 56.0 |
  | 6,000 | 3.72 | 3.70 | 40.4 |
  | 10,000 | 3.58 | 3.57 | 35.4 |
- | 12,305 | 3.55 | 3.54 | 34.5 |

- ### Tokenizer

- ByteLevel BPE with 2,000 subword tokens, trained on the philosophy corpus. Tokenizer files (`vocab.json`, `merges.txt`) are sourced from the [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) dataset.

- ### Training Data

- [LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus): 981 source texts (BookCorpus, WikiText-103, PG-19, classical philosophy) processed through a custom text pipeline with deduplication and quality scoring.

- - **Train tokens**: 794.9M (pre-encoded as `train.bin`)
- - **Val tokens**: 88.2M (pre-encoded as `val.bin`)
- - **Sources**: Aristotle, Plato, Cicero, Seneca, Marcus Aurelius, Epictetus, Euclid, Kant, Spinoza, Nietzsche, and more

- ## Files

- | File | Description |
- |---|---|
- | `final.jld2` | Model parameters (JLD2 format, 58MB) |
- | `config.toml` | Architecture config (5m-chinchilla) |
- | `vocab.json` | BPE vocabulary (2000 tokens, dict format) |
- | `merges.txt` | BPE merge rules |
-
- ## Inference API
-
- The [JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM) serves this model via an OpenAI-compatible API with SSE streaming, temperature, top-k, and top-p sampling. CPU-only inference runs on pure NNlib (no Lux dependency at runtime).

  ```bash
- # Streaming
  curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
    -H "Content-Type: application/json" \
-   -d '{"messages": [{"role": "user", "content": "the nature of"}], "stream": true, "temperature": 0.8, "top_k": 40}'

- # Non-streaming
- curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{"messages": [{"role": "user", "content": "the nature of"}], "max_tokens": 200}'
  ```

- ### Endpoints

- - `GET /` — Health check and model info
- - `GET /v1/models` — List available models
- - `POST /v1/chat/completions` — Generate text (streaming + non-streaming)

- ## Framework

- Built with:
- - [Lux.jl](https://github.com/LuxDL/Lux.jl) — Explicit-parameter neural networks (training)
- - [NNlib.jl](https://github.com/FluxML/NNlib.jl) — Softmax, activations (inference)
- - [Zygote.jl](https://github.com/FluxML/Zygote.jl) — Automatic differentiation (training)
- - [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) — GPU acceleration (training)

- ## Related

- - **[LisaMegaWatts/julia-slm](https://huggingface.co/LisaMegaWatts/julia-slm)** — Canonical model repo (versioned checkpoints, full docs)
- - **[JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM)** — Live inference endpoint
- - **[LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus)** — Training dataset + tokenizer
- - **[LisaMegaWatts/JuliaGPT](https://huggingface.co/LisaMegaWatts/JuliaGPT)** — Predecessor (~5K params, character-level, scalar autograd)
- - **[Source code](https://github.com/DavinciDreams/JuliaGPT)** — GitHub repository

  ---
  language:
+ - en
  license: mit
+ library_name: lux
  tags:
+ - julia
+ - lux
+ - slm
+ - philosophy
+ - transformer
+ - rope
+ - rmsnorm
+ - swiglu
+ - bpe
+ - text-generation
+ pipeline_tag: text-generation
  datasets:
+ - LisaMegaWatts/philosophy-corpus
+ model-index:
+ - name: JuliaSLM
+   results:
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       type: LisaMegaWatts/philosophy-corpus
+       name: philosophy-corpus
+     metrics:
+     - type: perplexity
+       value: 34.5
+       name: Val PPL
+     - type: loss
+       value: 3.54
+       name: Val Loss
  ---

+ # JuliaSLM

+ A 5.04M-parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the [Julia SLM](https://github.com/buildwithbooks/julia-slm) family of models exploring alternative sequence-mixing architectures.

+ ## Model Family

+ JuliaSLM is the **baseline Transformer** in a family of three architectures trained on the same data with matched parameter budgets:

+ | Model | Architecture | Sequence Mixing | Val PPL | Params |
+ |---|---|---|---|---|
+ | **JuliaSLM** | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
+ | [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
+ | [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

+ ## Architecture

  ```
+ JuliaGPTModel (transformer)
+ +-- tok_emb: Embedding(2000 -> 256)          [weight-tied with output head]
+ +-- rope: RotaryPositionalEncoding(64, 256)
+ +-- blocks x 6:
+ |   +-- ln1: RMSNorm(256)
+ |   +-- attn: CausalSelfAttention(4 heads, 64 dim each)
+ |   |   +-- wq, wk, wv: Dense(256 -> 256)
+ |   |   +-- wo: Dense(256 -> 256)
+ |   +-- ln2: RMSNorm(256)
+ |   +-- ffn: SwiGLU(256 -> 640 -> 256)
+ +-- ln_f: RMSNorm(256)
+ +-- head: TiedEmbeddingHead -> (2000,)
  ```
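The `rope` stage above rotates pairs of Q/K features by a position-dependent angle before attention scores are taken. A minimal NumPy sketch of that mechanism, for illustration only (the repo's implementation is in Julia; `rope` here is a hypothetical helper, not its API):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Split the head vector into two halves and rotate each (x1[i], x2[i])
    # pair by the angle pos * base**(-i / (d/2)), as in rotary embeddings.
    half = x.shape[-1] // 2
    theta = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)   # head dim 64, as above

# Position 0 is the identity, and rotation never changes the vector's norm
assert np.allclose(rope(q, 0), q)
assert np.isclose(np.linalg.norm(rope(q, 5)), np.linalg.norm(q))
# Scores depend only on relative offset: shifting both positions is a no-op
assert np.isclose(rope(q, 3) @ rope(k, 7), rope(q, 13) @ rope(k, 17))
```

The half-split pairing shown is one common convention; interleaved pairing is equally valid and differs only by a fixed permutation of features.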

+ ### Key Design Choices
+
+ - **RoPE** (Rotary Position Embeddings): Relative position encoding applied to Q and K in each attention head, enabling length generalization
+ - **RMSNorm** (pre-norm): Root Mean Square normalization without learnable bias, applied before each sublayer
+ - **SwiGLU** FFN: Gated linear unit with Swish activation; hidden dim scaled by a 2/3 factor and rounded down to a multiple of 64 (1024 -> 640)
+ - **Weight tying**: Input embedding and output projection share the same weight matrix, saving 512K parameters
+ - **No bias**: All linear layers use bias=false for parameter efficiency
+ - **No dropout**: Following Karpathy's recommendation for small models
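The SwiGLU sizing rule can be checked in a few lines; this is a sketch of the rule as stated above (`swiglu_hidden` is a hypothetical name, not the repo's API), and it reproduces the 640 hidden dim:

```python
def swiglu_hidden(embed_dim, ffn_mult=4, multiple_of=64):
    # Start from the usual ffn_mult * d FFN width, take 2/3 of it to offset
    # SwiGLU's extra value matrix, then round down to a multiple of 64.
    h = int(2 * ffn_mult * embed_dim / 3)     # 2/3 of 1024 -> 682
    return (h // multiple_of) * multiple_of

print(swiglu_hidden(256))  # -> 640
```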
+
+ ## Model Details
+
+ | Parameter | Value |
  |---|---|
+ | Total parameters | 5,037,312 |
  | Embedding dim | 256 |
  | Layers | 6 |
+ | Attention heads | 4 |
+ | Head dim | 64 |
+ | FFN hidden dim | 640 |
  | Context length | 256 tokens |
+ | Vocabulary | 2,000 (ByteLevel BPE) |
+ | Position encoding | RoPE |
  | Weight tying | Yes |

+ ### Parameter Breakdown
+
+ | Component | Params | % |
+ |---|---|---|
+ | Token embedding (tied) | 512K | 10.2% |
+ | Attention (Q,K,V,O) x 6 | 1.57M | 31.2% |
+ | SwiGLU FFN x 6 | 2.95M | 58.5% |
+ | RMSNorm x 13 | 3.3K | <0.1% |
+ | **Total** | **5.04M** | |
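The breakdown sums to the exact figure quoted in Model Details; a sanity-check sketch assuming bias-free Dense layers and one tied embedding, per the design choices above:

```python
V, d, L, h = 2000, 256, 6, 640   # vocab, embed dim, layers, FFN hidden dim

embed = V * d                    # tied token embedding / output head
attn  = L * 4 * d * d            # wq, wk, wv, wo per block, no bias
ffn   = L * 3 * d * h            # w1 (gate), v (value), w2 (down) per block
norms = (2 * L + 1) * d          # 13 RMSNorm scale vectors: 2 per block + final
total = embed + attn + ffn + norms

print(total)       # -> 5037312
print(20 * total)  # Chinchilla budget at 20 tok/param: ~100M tokens
```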
+
+ ## Training
+
+ | Setting | Value |
  |---|---|
+ | Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
+ | Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
+ | Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
+ | Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
+ | Warmup | 500 steps (linear) |
+ | Max steps | 12,305 |
  | Batch size | 32 |
+ | Gradient clipping | 1.0 (global norm) |
+ | Precision | Float16 AMP (Float32 master weights) |
+ | Hardware | NVIDIA RTX 3060 12GB |
+ | Training time | 66 minutes |
  | Throughput | ~26K tok/s |
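The optimizer and warmup rows describe a standard linear-warmup, cosine-decay schedule; a sketch under that assumption (the exact formula used in training is not shown in this card), plus a token-budget check against the batch and context sizes:

```python
import math

def lr_at(step, max_steps=12305, warmup=500, peak=6e-4, floor=6e-5):
    # Linear warmup to the peak rate, then cosine decay down to min_lr.
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / (max_steps - warmup)   # 0 -> 1 over the decay phase
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))

print(lr_at(250))        # mid-warmup: 3e-4
print(lr_at(500))        # peak: 6e-4
print(lr_at(12305))      # final step: 6e-5
print(12305 * 32 * 256)  # steps x batch x context = 100,802,560 tokens (~100M)
```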
 
 
+ ### Training Curves

  | Step | Train Loss | Val Loss | Val PPL |
+ |---|---|---|---|
  | 500 | 6.69 | 5.01 | 149.6 |
  | 2,000 | 4.09 | 4.02 | 56.0 |
  | 6,000 | 3.72 | 3.70 | 40.4 |
  | 10,000 | 3.58 | 3.57 | 35.4 |
+ | 12,305 | 3.55 | **3.54** | **34.5** |
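Val PPL is just exp of the per-token cross-entropy loss, so the PPL column can be regenerated from the loss column (values land within rounding of the quoted figures, since the losses shown are themselves rounded):

```python
import math

# Perplexity = exp(cross-entropy loss), e.g. exp(3.54) ~= 34.5
for loss in (5.01, 4.02, 3.70, 3.57, 3.54):
    print(f"val loss {loss:.2f} -> ppl {math.exp(loss):.1f}")
```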

+ ## Implementation

+ Built entirely in Julia:

+ - **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
+ - **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
+ - **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
+ - **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — Softmax, activations, batched_mul
+ - **[Optimisers.jl](https://github.com/FluxML/Optimisers.jl)** — AdamW with cosine LR

+ Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).

+ ## Usage

+ ### OpenAI-Compatible API

+ Served via the [JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM):

  ```bash
  curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
    -H "Content-Type: application/json" \
+   -d '{
+     "messages": [{"role": "user", "content": "the nature of"}],
+     "max_tokens": 200,
+     "temperature": 0.8,
+     "top_k": 40
+   }'
+ ```
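The endpoint also supports `"stream": true`, in which case it returns OpenAI-style server-sent events. A minimal Python sketch of consuming such a stream (the chunk payloads below are illustrative stand-ins, not captured server output):

```python
import json

def deltas(sse_lines):
    # Yield text fragments from OpenAI-style "data: {...}" SSE lines.
    for line in sse_lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        piece = chunk["choices"][0]["delta"].get("content")
        if piece:
            yield piece

stream = [  # field layout follows the OpenAI chat-completions streaming spec
    'data: {"choices": [{"delta": {"content": "the nature"}}]}',
    'data: {"choices": [{"delta": {"content": " of things"}}]}',
    "data: [DONE]",
]
print("".join(deltas(stream)))  # -> the nature of things
```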

+ ### Load in Julia
+
+ ```julia
+ using Pkg; Pkg.activate("julia-slm")
+ include("src/JuliaGPT.jl")
+ using .JuliaGPT; using .JuliaGPT: Lux
+
+ tok = BPETokenizer("vocab.json", "merges.txt")
+ ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())
+
+ model = create_model(ModelConfig(;
+     arch="transformer", vocab_size=vocab_size(tok),
+     embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
+     ffn_mult=4, context_length=256, weight_tying=true,
+ ))
+
+ text = generate(model, ps, st, tok, "the nature of ";
+     max_new_tokens=200, temperature=0.8, top_k=40)
  ```

+ ## Files

+ | File | Description |
+ |---|---|
+ | `final.jld2` | Trained model parameters (JLD2 format) |
+ | `config.toml` | Model architecture configuration |
+ | `vocab.json` | BPE vocabulary (2000 tokens) |
+ | `merges.txt` | BPE merge rules |

+ ## Provenance

+ - **Author**: LisaMegaWatts
+ - **Training code**: [buildwithbooks/julia-slm](https://github.com/buildwithbooks/julia-slm)
+ - **Data pipeline**: [buildwithbooks/text-pipeline](https://github.com/buildwithbooks/text-pipeline)
+ - **Training date**: February 2026
+ - **Architecture reference**: nanoGPT (Karpathy, 2023) adapted for Julia/Lux.jl
+
+ ## Citation
+
+ ```bibtex
+ @misc{juliaslm2026,
+   title={JuliaSLM: A Small Language Model in Pure Julia},
+   author={LisaMegaWatts},
+   year={2026},
+   url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
+ }
+ ```

+ ## License

+ MIT