---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- transformer
- rope
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: JuliaSLM
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: LisaMegaWatts/philosophy-corpus
      name: philosophy-corpus
    metrics:
    - type: perplexity
      value: 34.5
      name: Val PPL
    - type: loss
      value: 3.54
      name: Val Loss
---

# JuliaSLM

A 5.04M-parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence-mixing architectures.

## Model Family

JuliaSLM is the **baseline Transformer** in a family of three architectures trained on the same data with matched parameter budgets:

| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| **JuliaSLM** | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

## Architecture

```
JuliaGPTModel (transformer)
+-- tok_emb: Embedding(2000 -> 256)            [weight-tied with output head]
+-- rope: RotaryPositionalEncoding(64, 256)
+-- blocks x 6:
|   +-- ln1: RMSNorm(256)
|   +-- attn: CausalSelfAttention(4 heads, 64 dim each)
|   |   +-- wq, wk, wv: Dense(256 -> 256)
|   |   +-- wo: Dense(256 -> 256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```

### Key Design Choices

- **RoPE** (Rotary Position Embeddings): relative position encoding applied to Q and K in each attention head, enabling length generalization
- **RMSNorm** (pre-norm): Root Mean Square normalization without learnable bias, applied before each sublayer
- **SwiGLU** FFN: gated linear unit with Swish activation; hidden dim scaled by a 2/3 factor and rounded down to a multiple of 64 (2/3 × 4 × 256 → 640)
- **Weight tying**: input embedding and output projection share the same weight matrix, saving 512K parameters
- **No bias**: all linear layers use `bias=false` for parameter efficiency
- **No dropout**: following Karpathy's recommendation for small models

## Model Details

| Parameter | Value |
|---|---|
| Total parameters | 5,037,312 |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE |
| Weight tying | Yes |

### Parameter Breakdown

| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 10.2% |
| Attention (Q,K,V,O) x 6 | 1.57M | 31.2% |
| SwiGLU FFN x 6 | 2.95M | 58.5% |
| RMSNorm x 13 | 3.3K | <0.1% |
| **Total** | **5.04M** | |

## Training

| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 66 minutes |
| Throughput | ~26K tok/s |

### Training Curves

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | **3.54** | **34.5** |

## Implementation

Built entirely in Julia:

- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — Softmax, activations, batched_mul
- **[Optimisers.jl](https://github.com/FluxML/Optimisers.jl)** — AdamW with cosine LR

Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
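The parameter breakdown and reported perplexity above can be verified with a few lines of arithmetic. The sketch below is plain Python (illustrative only, not part of the Julia codebase) recomputing each component from the stated configuration:

```python
import math

# Recompute JuliaSLM's parameter counts from its config
# (pure arithmetic; variable names are illustrative, not the project's).
vocab, d, n_layers, ffn_hidden = 2000, 256, 6, 640

tok_emb = vocab * d                      # tied with the output head
attn    = n_layers * 4 * d * d           # wq, wk, wv, wo per block, bias-free
ffn     = n_layers * 3 * d * ffn_hidden  # SwiGLU: two input projections + output
norms   = (2 * n_layers + 1) * d         # ln1, ln2 per block + final ln_f = 13 norms

total = tok_emb + attn + ffn + norms
print(total)                             # 5037312, i.e. the reported 5.04M

# The reported Val PPL follows directly from the Val Loss: PPL = exp(loss)
print(round(math.exp(3.54), 1))          # 34.5
```

Note that the SwiGLU FFN dominates (≈58.5% of parameters), while weight tying keeps the embedding's share at ≈10% despite the 2,000-token vocabulary.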
## Usage

### OpenAI-Compatible API

Served via the [JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM):

```bash
curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT
using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="transformer",
    vocab_size=vocab_size(tok),
    embed_dim=256,
    n_layers=6,
    n_heads=4,
    head_dim=64,
    ffn_mult=4,
    context_length=256,
    weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
                max_new_tokens=200, temperature=0.8, top_k=40)
```

## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |

## Provenance

- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: nanoGPT (Karpathy, 2023), adapted for Julia/Lux.jl

## Citation

```bibtex
@misc{juliaslm2026,
  title={JuliaSLM: A Small Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
}
```

## License

MIT