# JuliaSLM
A 5.04M parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the Julia SLM family of models exploring alternative sequence mixing architectures.
## Model Family
JuliaSLM is the baseline Transformer in a family of three architectures trained on the same data with matched parameter budgets:
| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| JuliaSLM | Transformer | 4-head causal attention + RoPE | 34.5 | 5.04M |
| MonarchSLM | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| SymbioSLM | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |
## Architecture

```
JuliaGPTModel (transformer)
+-- tok_emb: Embedding(2000 -> 256) [weight-tied with output head]
+-- rope: RotaryPositionalEncoding(64, 256)
+-- blocks x 6:
|   +-- ln1: RMSNorm(256)
|   +-- attn: CausalSelfAttention(4 heads, 64 dim each)
|   |   +-- wq, wk, wv: Dense(256 -> 256)
|   |   +-- wo: Dense(256 -> 256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```
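The pre-norm block structure above can be sketched as a forward pass. This is a minimal illustration with stand-in sublayers (`attn`, `ffn`) and the RMSNorm scale vector omitted; it is not the Lux.jl layer API:

```julia
# Pre-norm residual block: normalize, mix, add back (illustrative stand-ins only).
# x is an (embed_dim, seq_len) matrix; `attn` and `ffn` stand in for the real sublayers.
rmsnorm(x; eps=1e-6) = x ./ sqrt.(sum(abs2, x; dims=1) ./ size(x, 1) .+ eps)

function block_forward(x, attn, ffn)
    x = x .+ attn(rmsnorm(x))   # ln1 -> causal self-attention -> residual
    x = x .+ ffn(rmsnorm(x))    # ln2 -> SwiGLU FFN -> residual
    return x
end
```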
### Key Design Choices
- RoPE (Rotary Position Embeddings): Relative position encoding applied to Q and K in each attention head, enabling length generalization
- RMSNorm (pre-norm): Root Mean Square normalization without learnable bias, applied before each sublayer
- SwiGLU FFN: Gated linear unit with Swish activation; hidden dim scaled by a 2/3 factor from the usual 4x expansion and rounded down to a multiple of 64, giving 640
- Weight tying: Input embedding and output projection share the same weight matrix, saving 512K parameters
- No bias: All linear layers use bias=false for parameter efficiency
- No dropout: Following Karpathy's recommendation for small models
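The SwiGLU sizing and gating described above can be sketched as follows. This is a reconstruction from the bullet points, not the repo's exact code, with `swish` defined inline for self-containment:

```julia
swish(x) = x / (1 + exp(-x))  # aka SiLU: x * sigmoid(x)

# Hidden size: 2/3 of the usual 4x expansion, rounded down to a multiple of 64.
swiglu_hidden(d; mult=4, multiple=64) = (floor(Int, 2 * mult * d / 3) ÷ multiple) * multiple

# SwiGLU: swish-gated up-projection followed by a down-projection (all bias-free).
swiglu(x, Wg, Wu, Wd) = Wd * (swish.(Wg * x) .* (Wu * x))
```

With `d = 256` this gives `swiglu_hidden(256) == 640`, matching the FFN hidden dim in the tables below.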
## Model Details
| Parameter | Value |
|---|---|
| Total parameters | 5,037,312 |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE |
| Weight tying | Yes |
### Parameter Breakdown
| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 10.2% |
| Attention (Q,K,V,O) x 6 | 1.57M | 31.2% |
| SwiGLU FFN x 6 | 2.95M | 58.5% |
| RMSNorm x 13 | 3.3K | <0.1% |
| Total | 5.04M | 100% |
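The table can be cross-checked with back-of-envelope arithmetic, reconstructed from the architecture section (bias-free layers, tied embeddings):

```julia
d, v, h, L = 256, 2000, 640, 6
emb   = v * d                     # tied token embedding / output head
attn  = 4 * d * d * L             # Wq, Wk, Wv, Wo per layer, no bias
ffn   = 3 * d * h * L             # SwiGLU gate, up, and down projections
norm  = (2L + 1) * d              # 13 RMSNorm scale vectors (2 per block + final)
total = emb + attn + ffn + norm   # 5_037_312, matching the reported count
```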
## Training

| Setting | Value |
|---|---|
| Dataset | philosophy-corpus |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 66 minutes |
| Throughput | ~26K tok/s |
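The schedule in the table (linear warmup to 6e-4, cosine decay to 6e-5) can be sketched as follows; this is a reconstruction of the described schedule, not the repo's exact code:

```julia
# Linear warmup for `warmup` steps, then cosine decay from max_lr down to min_lr.
function lr_at(step; max_lr=6e-4, min_lr=6e-5, warmup=500, max_steps=12_305)
    step <= warmup && return max_lr * step / warmup
    t = (step - warmup) / (max_steps - warmup)            # decay progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(pi * t))
end
```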
### Training Curves
| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | 3.54 | 34.5 |
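Val PPL in this table is simply the exponential of the validation cross-entropy loss (in nats):

```julia
ppl(loss) = exp(loss)         # perplexity from cross-entropy loss in nats
bpt(loss) = loss / log(2)     # the same loss expressed in bits per token
ppl(3.54)                     # ≈ 34.5, the final-row Val PPL
```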
## Implementation

Built entirely in Julia:
- Lux.jl – Explicit-parameter neural network framework
- Zygote.jl – Automatic differentiation
- CUDA.jl – GPU acceleration
- NNlib.jl – Softmax, activations, `batched_mul`
- Optimisers.jl – AdamW with cosine LR
Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
## Usage

### OpenAI-Compatible API

Served via the JuliaSLM Space:

```bash
curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```
### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())
model = create_model(ModelConfig(;
    arch="transformer", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    ffn_mult=4, context_length=256, weight_tying=true,
))
text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```
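The `temperature`/`top_k` sampling that `generate` exposes can be sketched as follows; this is a minimal reconstruction, not the repo's implementation:

```julia
using Random

# Sample a token id from a logit vector, restricted to the top_k highest logits.
function sample_topk(logits::AbstractVector; temperature=0.8, top_k=40,
                     rng=Random.default_rng())
    k = min(top_k, length(logits))
    idx = partialsortperm(logits, 1:k; rev=true)   # indices of the k largest logits
    z = logits[idx] ./ temperature
    p = exp.(z .- maximum(z)); p ./= sum(p)        # numerically stable softmax
    r, c = rand(rng), 0.0                          # inverse-CDF sampling
    for (i, pi) in zip(idx, p)
        c += pi
        c >= r && return i
    end
    return idx[end]
end
```

Lower temperatures sharpen the top-k distribution; `top_k=40` over a 2,000-token vocabulary already prunes most of the tail.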
## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |
## Provenance
- Author: LisaMegaWatts
- Training code: DavinciDreams/julia-slm
- Data pipeline: DavinciDreams/text-pipeline
- Training date: February 2026
- Architecture reference: nanoGPT (Karpathy, 2023) adapted for Julia/Lux.jl
## Citation

```bibtex
@misc{juliaslm2026,
  title={JuliaSLM: A Small Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
}
```
## License
MIT
## Evaluation results

- Val PPL on philosophy-corpus (self-reported): 34.500
- Val Loss on philosophy-corpus (self-reported): 3.540