# JuliaSLM
A 5.04M parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the Julia SLM family of models exploring alternative sequence mixing architectures.
## Model Family
JuliaSLM is the baseline Transformer in a family of three architectures trained on the same data with matched parameter budgets:
| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| JuliaSLM | Transformer | 4-head causal attention + RoPE | 34.5 | 5.04M |
| MonarchSLM | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| SymbioSLM | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |
## Architecture

```
JuliaGPTModel (transformer)
+-- tok_emb: Embedding(2000 -> 256) [weight-tied with output head]
+-- rope: RotaryPositionalEncoding(64, 256)
+-- blocks x 6:
|   +-- ln1: RMSNorm(256)
|   +-- attn: CausalSelfAttention(4 heads, 64 dim each)
|   |   +-- wq, wk, wv: Dense(256 -> 256)
|   |   +-- wo: Dense(256 -> 256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```
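The pre-norm block structure above can be sketched as a forward pass. This is a minimal illustration with stand-in sublayers (`attn`, `ffn`) and the RMSNorm scale vector omitted; it is not the Lux.jl layer API:

```julia
# Pre-norm residual block: normalize, mix, add back (illustrative stand-ins only).
# x is an (embed_dim, seq_len) matrix; `attn` and `ffn` stand in for the real sublayers.
rmsnorm(x; eps=1e-6) = x ./ sqrt.(sum(abs2, x; dims=1) ./ size(x, 1) .+ eps)

function block_forward(x, attn, ffn)
    x = x .+ attn(rmsnorm(x))   # ln1 -> causal self-attention -> residual
    x = x .+ ffn(rmsnorm(x))    # ln2 -> SwiGLU FFN -> residual
    return x
end
```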
### Key Design Choices
- RoPE (Rotary Position Embeddings): Relative position encoding applied to Q and K in each attention head, enabling length generalization
- RMSNorm (pre-norm): Root Mean Square normalization without learnable bias, applied before each sublayer
- SwiGLU FFN: Gated linear unit with Swish activation; hidden dim scaled by a 2/3 factor from the usual 4x expansion and rounded down to a multiple of 64, giving 640
- Weight tying: Input embedding and output projection share the same weight matrix, saving 512K parameters
- No bias: All linear layers use bias=false for parameter efficiency
- No dropout: Following Karpathy's recommendation for small models
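The SwiGLU sizing and gating described above can be sketched as follows. This is a reconstruction from the bullet points, not the repo's exact code, with `swish` defined inline for self-containment:

```julia
swish(x) = x / (1 + exp(-x))  # aka SiLU: x * sigmoid(x)

# Hidden size: 2/3 of the usual 4x expansion, rounded down to a multiple of 64.
swiglu_hidden(d; mult=4, multiple=64) = (floor(Int, 2 * mult * d / 3) ÷ multiple) * multiple

# SwiGLU: swish-gated up-projection followed by a down-projection (all bias-free).
swiglu(x, Wg, Wu, Wd) = Wd * (swish.(Wg * x) .* (Wu * x))
```

With `d = 256` this gives `swiglu_hidden(256) == 640`, matching the FFN hidden dim in the tables below.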
## Model Details
| Parameter | Value |
|---|---|
| Total parameters | 5,037,312 |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE |
| Weight tying | Yes |
### Parameter Breakdown
| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 10.2% |
| Attention (Q,K,V,O) x 6 | 1.57M | 31.2% |
| SwiGLU FFN x 6 | 2.95M | 58.5% |
| RMSNorm x 13 | 3.3K | <0.1% |
| Total | 5.04M | 100% |
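The table can be cross-checked with back-of-envelope arithmetic, reconstructed from the architecture section (bias-free layers, tied embeddings):

```julia
d, v, h, L = 256, 2000, 640, 6
emb   = v * d                     # tied token embedding / output head
attn  = 4 * d * d * L             # Wq, Wk, Wv, Wo per layer, no bias
ffn   = 3 * d * h * L             # SwiGLU gate, up, and down projections
norm  = (2L + 1) * d              # 13 RMSNorm scale vectors (2 per block + final)
total = emb + attn + ffn + norm   # 5_037_312, matching the reported count
```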
## Training

| Setting | Value |
|---|---|
| Dataset | philosophy-corpus |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 66 minutes |
| Throughput | ~26K tok/s |
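The schedule in the table (linear warmup to 6e-4, cosine decay to 6e-5) can be sketched as follows; this is a reconstruction of the described schedule, not the repo's exact code:

```julia
# Linear warmup for `warmup` steps, then cosine decay from max_lr down to min_lr.
function lr_at(step; max_lr=6e-4, min_lr=6e-5, warmup=500, max_steps=12_305)
    step <= warmup && return max_lr * step / warmup
    t = (step - warmup) / (max_steps - warmup)            # decay progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(pi * t))
end
```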
### Training Curves
| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | 3.54 | 34.5 |
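Val PPL in this table is simply the exponential of the validation cross-entropy loss (in nats):

```julia
ppl(loss) = exp(loss)         # perplexity from cross-entropy loss in nats
bpt(loss) = loss / log(2)     # the same loss expressed in bits per token
ppl(3.54)                     # ≈ 34.5, the final-row Val PPL
```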
## Implementation

Built entirely in Julia:
- Lux.jl – Explicit-parameter neural network framework
- Zygote.jl – Automatic differentiation
- CUDA.jl – GPU acceleration
- NNlib.jl – Softmax, activations, `batched_mul`
- Optimisers.jl – AdamW with cosine LR
Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
## Usage

### OpenAI-Compatible API

Served via the JuliaSLM Space:

```bash
curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```
### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())
model = create_model(ModelConfig(;
    arch="transformer", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    ffn_mult=4, context_length=256, weight_tying=true,
))
text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```
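The `temperature`/`top_k` sampling that `generate` exposes can be sketched as follows; this is a minimal reconstruction, not the repo's implementation:

```julia
using Random

# Sample a token id from a logit vector, restricted to the top_k highest logits.
function sample_topk(logits::AbstractVector; temperature=0.8, top_k=40,
                     rng=Random.default_rng())
    k = min(top_k, length(logits))
    idx = partialsortperm(logits, 1:k; rev=true)   # indices of the k largest logits
    z = logits[idx] ./ temperature
    p = exp.(z .- maximum(z)); p ./= sum(p)        # numerically stable softmax
    r, c = rand(rng), 0.0                          # inverse-CDF sampling
    for (i, pi) in zip(idx, p)
        c += pi
        c >= r && return i
    end
    return idx[end]
end
```

Lower temperatures sharpen the top-k distribution; `top_k=40` over a 2,000-token vocabulary already prunes most of the tail.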
## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |
## Provenance
- Author: LisaMegaWatts
- Training code: DavinciDreams/julia-slm
- Data pipeline: DavinciDreams/text-pipeline
- Training date: February 2026
- Architecture reference: nanoGPT (Karpathy, 2023) adapted for Julia/Lux.jl
## Citation

```bibtex
@misc{juliaslm2026,
  title={JuliaSLM: A Small Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
}
```
## License
MIT
## Evaluation results

- Val PPL on philosophy-corpus (self-reported): 34.500
- Val Loss on philosophy-corpus (self-reported): 3.540