JuliaSLM

A 5.04M parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the Julia SLM family of models exploring alternative sequence mixing architectures.

Model Family

JuliaSLM is the baseline Transformer in a family of three architectures trained on the same data with matched parameter budgets:

| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| JuliaSLM | Transformer | 4-head causal attention + RoPE | 34.5 | 5.04M |
| MonarchSLM | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| SymbioSLM | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

Architecture

```
JuliaGPTModel (transformer)
+-- tok_emb: Embedding(2000 -> 256)     [weight-tied with output head]
+-- rope: RotaryPositionalEncoding(64, 256)
+-- blocks x 6:
|   +-- ln1: RMSNorm(256)
|   +-- attn: CausalSelfAttention(4 heads, 64 dim each)
|   |   +-- wq, wk, wv: Dense(256 -> 256)
|   |   +-- wo: Dense(256 -> 256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```
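The `rope` layer above rotates each query/key dimension pair by a position-dependent angle. A minimal sketch in plain Julia, assuming the standard split-half pairing and base frequency of 10,000 (neither detail is confirmed by this card):

```julia
# Rotate a single head-dim query/key vector `x` for position `p`.
# Pairs dimension i with i + d/2 and rotates each pair by
# p * base^(-2(i-1)/d), the standard RoPE frequency schedule.
function rope(x::AbstractVector, p::Integer; base=10_000.0)
    d = length(x); half = d ÷ 2
    out = similar(x, float(eltype(x)))
    for i in 1:half
        θ = p * base^(-2(i - 1) / d)   # angle grows with position, shrinks with i
        c, s = cos(θ), sin(θ)
        out[i]        = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    end
    return out
end
```

Because each pair is rotated by an angle proportional to its position, the dot product between a rotated query and key depends only on their relative distance, which is what gives RoPE its length-generalization property.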

Key Design Choices

  • RoPE (Rotary Position Embeddings): Relative position encoding applied to Q and K in each attention head, enabling length generalization
  • RMSNorm (pre-norm): Root Mean Square normalization without learnable bias, applied before each sublayer
  • SwiGLU FFN: Gated linear unit with Swish activation; the nominal 4x hidden dim is scaled by 2/3 and rounded down to a multiple of 64 (1024 * 2/3 ≈ 683 → 640)
  • Weight tying: Input embedding and output projection share the same weight matrix, saving 512K parameters
  • No bias: All linear layers use bias=false for parameter efficiency
  • No dropout: Following Karpathy's recommendation for small models
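The SwiGLU sizing rule can be checked arithmetically: with `ffn_mult=4`, the nominal hidden width 4 * 256 = 1024 is scaled by 2/3 (≈ 683) and floored to a multiple of 64, giving 640. A sketch of that rule (the exact rounding helper in the codebase is an assumption):

```julia
# Hidden width for the SwiGLU FFN: scale the nominal `mult`-times-wider
# FFN by 2/3 (SwiGLU has three weight matrices, so this keeps parameters
# comparable to a plain 4x MLP), then floor to a multiple of `multiple`.
swiglu_hidden(embed_dim; mult=4, multiple=64) =
    (floor(Int, 2 * mult * embed_dim / 3) ÷ multiple) * multiple

swiglu_hidden(256)  # 640, matching the FFN hidden dim reported below
```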

Model Details

| Parameter | Value |
|---|---|
| Total parameters | 5,037,312 |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE |
| Weight tying | Yes |

Parameter Breakdown

| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 10.2% |
| Attention (Q,K,V,O) x 6 | 1.57M | 31.2% |
| SwiGLU FFN x 6 | 2.95M | 58.5% |
| RMSNorm x 13 | 3.3K | <0.1% |
| Total | 5.04M | 100% |
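The breakdown above can be reproduced from the dimensions in the tables (d=256, 6 layers, vocab 2,000, FFN hidden 640); all projections are bias-free, and the tied embedding is counted once:

```julia
d, L, V, h = 256, 6, 2000, 640

emb   = V * d            # tied token embedding + output head:  512_000
attn  = L * 4 * d * d    # wq, wk, wv, wo per block:          1_572_864
ffn   = L * 3 * d * h    # gate, up, down projections:        2_949_120
norms = (2L + 1) * d     # two RMSNorms per block + ln_f:         3_328

total = emb + attn + ffn + norms   # 5_037_312, matching the total above
```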

Training

| Setting | Value |
|---|---|
| Dataset | philosophy-corpus |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 66 minutes |
| Throughput | ~26K tok/s |
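The schedule described above (linear warmup for 500 steps, then cosine decay from 6e-4 down to 6e-5) corresponds to the usual warmup-cosine formula; the exact expression in the training script is an assumption:

```julia
# Learning rate at a given step: linear warmup for `warmup` steps,
# then cosine decay from `lr` down to `min_lr` over the remaining steps.
function lr_at(step; lr=6e-4, min_lr=6e-5, warmup=500, max_steps=12_305)
    step < warmup && return lr * step / warmup
    t = (step - warmup) / (max_steps - warmup)        # decay progress in [0, 1]
    return min_lr + 0.5 * (lr - min_lr) * (1 + cos(π * t))
end

lr_at(500)     # 6e-4 at the end of warmup
lr_at(12_305)  # 6e-5 at the final step
```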

Training Curves

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | 3.54 | 34.5 |
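Validation perplexity in the table is simply the exponential of the validation cross-entropy loss (in nats):

```julia
# Perplexity from mean cross-entropy loss.
ppl(val_loss) = exp(val_loss)

ppl(3.54)  # ≈ 34.5, the final row above
```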

Implementation

Built entirely in Julia:

  • Lux.jl: explicit-parameter neural network framework
  • Zygote.jl: automatic differentiation
  • CUDA.jl: GPU acceleration
  • NNlib.jl: softmax, activations, batched_mul
  • Optimisers.jl: AdamW with cosine LR schedule

Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
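Generation (see Usage below) samples with temperature and top-k. A hedged sketch of that standard recipe in plain Julia; the actual `generate` implementation may differ:

```julia
# Draw one token id from `logits` using temperature scaling followed by
# top-k filtering and an inverse-CDF draw over the renormalized top k.
function sample_token(logits::AbstractVector; temperature=0.8, top_k=40)
    scaled = logits ./ temperature
    k = min(top_k, length(scaled))
    idx = partialsortperm(scaled, 1:k; rev=true)   # indices of the k largest logits
    p = exp.(scaled[idx] .- maximum(scaled[idx]))  # numerically stable softmax
    p ./= sum(p)
    j = something(findfirst(>=(rand()), cumsum(p)), k)
    return idx[j]
end
```

With `top_k=1` this reduces to greedy decoding; raising the temperature flattens the distribution over the retained top-k tokens.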

Usage

OpenAI-Compatible API

Served via the JuliaSLM Space:

```shell
curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="transformer", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```

Files

| File | Description |
|---|---|
| final.jld2 | Trained model parameters (JLD2 format) |
| config.toml | Model architecture configuration |
| vocab.json | BPE vocabulary (2,000 tokens) |
| merges.txt | BPE merge rules |

Provenance

Citation

```bibtex
@misc{juliaslm2026,
  title={JuliaSLM: A Small Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
}
```

License

MIT
