---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- transformer
- rope
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: JuliaSLM
results:
- task:
type: text-generation
name: Text Generation
dataset:
type: LisaMegaWatts/philosophy-corpus
name: philosophy-corpus
metrics:
- type: perplexity
value: 34.5
name: Val PPL
- type: loss
value: 3.54
name: Val Loss
---
# JuliaSLM
A 5.04M parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence mixing architectures.
## Model Family
JuliaSLM is the **baseline Transformer** in a family of three architectures trained on the same data with matched parameter budgets:
| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| **JuliaSLM** | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |
## Architecture
```
JuliaGPTModel (transformer)
+-- tok_emb: Embedding(2000 -> 256) [weight-tied with output head]
+-- rope: RotaryPositionalEncoding(64, 256)
+-- blocks x 6:
| +-- ln1: RMSNorm(256)
| +-- attn: CausalSelfAttention(4 heads, 64 dim each)
| | +-- wq, wk, wv: Dense(256 -> 256)
| | +-- wo: Dense(256 -> 256)
| +-- ln2: RMSNorm(256)
| +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```
### Key Design Choices
- **RoPE** (Rotary Position Embeddings): Relative position encoding applied to Q and K in each attention head, enabling length generalization
- **RMSNorm** (pre-norm): Root Mean Square normalization without learnable bias, applied before each sublayer
- **SwiGLU** FFN: Gated linear unit with Swish (SiLU) activation; hidden dim scaled by a 2/3 factor and rounded down to a multiple of 64, giving 640
- **Weight tying**: Input embedding and output projection share the same weight matrix, saving 512K parameters
- **No bias**: All linear layers use bias=false for parameter efficiency
- **No dropout**: Following Karpathy's recommendation for small models
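These three building blocks are small enough to sketch in plain Julia. The snippet below is an illustrative, self-contained re-statement of the ideas (not the repo's Lux.jl layers; all names here are hypothetical):

```julia
using Statistics  # mean

# RMSNorm: normalize by the root-mean-square over features, with a
# learnable gain g and no bias (matching the "no bias" design choice).
rmsnorm(x::AbstractVector, g; eps=1f-6) = g .* x ./ sqrt(mean(abs2, x) + eps)

# SwiGLU FFN: gate and up projections, Swish (SiLU) on the gate,
# elementwise product, then a down projection.
swish(x) = x .* inv.(1 .+ exp.(-x))
swiglu(x, Wg, Wu, Wd) = Wd * (swish(Wg * x) .* (Wu * x))

# RoPE: rotate consecutive feature pairs of a Q or K vector by a
# position-dependent angle, which encodes relative positions in the
# attention dot product. Rotation preserves the vector's norm.
function rope!(x::AbstractVector, pos; base=10_000.0)
    d = length(x)
    for i in 1:2:d-1
        theta = pos / base^((i - 1) / d)
        c, s = cos(theta), sin(theta)
        x[i], x[i+1] = c * x[i] - s * x[i+1], s * x[i] + c * x[i+1]
    end
    return x
end
```

Because RoPE is a pure rotation, applying it leaves vector norms unchanged, so it does not perturb the scale of attention logits.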
## Model Details
| Parameter | Value |
|---|---|
| Total parameters | 5,037,312 |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE |
| Weight tying | Yes |
### Parameter Breakdown
| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 10.2% |
| Attention (Q,K,V,O) x 6 | 1.57M | 31.2% |
| SwiGLU FFN x 6 | 2.95M | 58.5% |
| RMSNorm x 13 | 3.3K | <0.1% |
| **Total** | **5.04M** | |
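The breakdown can be re-derived from the configuration as a quick sanity check (arithmetic only; this is not the repo's counting code):

```julia
vocab, d, n_layers, ffn_mult = 2_000, 256, 6, 4

# SwiGLU hidden dim: 2/3 of the 4x width, rounded down to a multiple of 64.
hidden = 64 * fld(2 * ffn_mult * d ÷ 3, 64)   # 640

emb   = vocab * d                    # token embedding, tied with the output head
attn  = n_layers * 4 * d * d         # wq, wk, wv, wo per block, bias-free
ffn   = n_layers * 3 * d * hidden    # gate, up, and down projections
norms = (2 * n_layers + 1) * d       # ln1 + ln2 per block, plus ln_f

total = emb + attn + ffn + norms     # 5_037_312
```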
## Training
| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 66 minutes |
| Throughput | ~26K tok/s |
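The learning-rate schedule from the table (linear warmup into cosine decay) has a simple closed form. A sketch of the schedule's shape, using the hyperparameters above (not the training loop itself):

```julia
# Linear warmup to max_lr over `warmup` steps, then cosine decay to min_lr.
function lr_at(step; max_lr=6e-4, min_lr=6e-5, warmup=500, max_steps=12_305)
    step < warmup && return max_lr * step / warmup            # linear warmup
    t = (step - warmup) / (max_steps - warmup)                # decay progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + cospi(t))  # cosine decay
end
```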
### Training Curves
| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | **3.54** | **34.5** |
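The Val PPL column is just the exponential of the validation cross-entropy loss:

```julia
# Perplexity is exp of the (natural-log) cross-entropy loss.
perplexity(loss) = exp(loss)
round(perplexity(3.54); digits=1)  # the final row: val loss 3.54 → PPL 34.5
```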
## Implementation
Built entirely in Julia:
- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — Softmax, activations, batched_mul
- **[Optimisers.jl](https://github.com/FluxML/Optimisers.jl)** — AdamW with cosine LR
Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
## Usage
### OpenAI-Compatible API
Served via [JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM):
```bash
curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "the nature of"}],
"max_tokens": 200,
"temperature": 0.8,
"top_k": 40
}'
```
### Load in Julia
```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux
tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())
model = create_model(ModelConfig(;
arch="transformer", vocab_size=vocab_size(tok),
embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
ffn_mult=4, context_length=256, weight_tying=true,
))
text = generate(model, ps, st, tok, "the nature of ";
max_new_tokens=200, temperature=0.8, top_k=40)
```
## Files
| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |
## Provenance
- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: nanoGPT (Karpathy, 2023) adapted for Julia/Lux.jl
## Citation
```bibtex
@misc{juliaslm2026,
title={JuliaSLM: A Small Language Model in Pure Julia},
author={LisaMegaWatts},
year={2026},
url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
}
```
## License
MIT