---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- transformer
- rope
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: JuliaSLM
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: LisaMegaWatts/philosophy-corpus
      name: philosophy-corpus
    metrics:
    - type: perplexity
      value: 34.5
      name: Val PPL
    - type: loss
      value: 3.54
      name: Val Loss
---

# JuliaSLM

A 5.04M parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence mixing architectures.

## Model Family

JuliaSLM is the **baseline Transformer** in a family of three architectures trained on the same data with matched parameter budgets:

| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| **JuliaSLM** | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

## Architecture

```
JuliaGPTModel (transformer)
+-- tok_emb: Embedding(2000 -> 256)   [weight-tied with output head]
+-- rope: RotaryPositionalEncoding(64, 256)
+-- blocks x 6:
|   +-- ln1: RMSNorm(256)
|   +-- attn: CausalSelfAttention(4 heads, 64 dim each)
|   |   +-- wq, wk, wv: Dense(256 -> 256)
|   |   +-- wo: Dense(256 -> 256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```

### Key Design Choices

- **RoPE** (Rotary Position Embeddings): relative position encoding applied to Q and K in each attention head, enabling length generalization
- **RMSNorm** (pre-norm): Root Mean Square normalization without learnable bias, applied before each sublayer
- **SwiGLU** FFN: gated linear unit with Swish activation; the hidden dim is scaled by 2/3 (to offset SwiGLU's third weight matrix) and rounded down to a multiple of 64, giving 640
- **Weight tying**: input embedding and output projection share the same weight matrix, saving 512K parameters
- **No bias**: all linear layers use `bias=false` for parameter efficiency
- **No dropout**: following Karpathy's recommendation for small models

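
The first three choices can be sketched in a few lines of plain Julia. These are illustrative sketches only, not the model's actual Lux.jl layer implementations:

```julia
# RoPE (sketch): rotate each (odd, even) feature pair of one head's
# query/key vector by a position-dependent angle.
function rope(x::Vector{Float64}, pos::Integer; base=10_000.0)
    d = length(x)                        # head dim, assumed even
    out = similar(x)
    for i in 1:2:d
        theta = pos / base^((i - 1) / d)
        c, s = cos(theta), sin(theta)
        out[i]   = x[i] * c - x[i+1] * s
        out[i+1] = x[i] * s + x[i+1] * c
    end
    return out
end

# RMSNorm (sketch): scale by the root-mean-square of the features --
# no mean subtraction, no bias, just a learnable gain g.
rmsnorm(x, g; eps=1e-6) = x .* g ./ sqrt(sum(abs2, x) / length(x) + eps)

# SwiGLU FFN (sketch): swish-gated up-projection, then project back down.
swish(v) = v ./ (1 .+ exp.(-v))                  # v .* sigmoid(v)
swiglu(x, W1, W2, W3) = W2 * (swish(W1 * x) .* (W3 * x))
```

Note that RoPE is a pure rotation, so it preserves the norms of queries and keys and changes only their directions, which is what makes the attention scores depend on relative offsets.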
## Model Details

| Parameter | Value |
|---|---|
| Total parameters | 5,037,312 |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE |
| Weight tying | Yes |

### Parameter Breakdown

| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 10.2% |
| Attention (Q,K,V,O) x 6 | 1.57M | 31.2% |
| SwiGLU FFN x 6 | 2.95M | 58.5% |
| RMSNorm x 13 | 3.3K | <0.1% |
| **Total** | **5.04M** | |

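
As a sanity check, the breakdown above can be re-derived from the architecture constants (bias-free linear layers throughout):

```julia
# Recompute the parameter counts from the config values in the tables above.
d, v, nlayers, h = 256, 2000, 6, 640   # embed dim, vocab, layers, FFN hidden

emb   = v * d                  # 512_000   tied token embedding / output head
attn  = nlayers * 4 * d * d    # 1_572_864 wq, wk, wv, wo per block
ffn   = nlayers * 3 * d * h    # 2_949_120 three SwiGLU matrices per block
norms = (2nlayers + 1) * d     # 3_328     two RMSNorms per block + final ln_f

total = emb + attn + ffn + norms   # 5_037_312, matching the reported total
```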
## Training

| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 66 minutes |
| Throughput | ~26K tok/s |

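
The schedule in the table (linear warmup, then cosine decay from lr to min_lr) can be written out explicitly. A sketch with an illustrative function name, not the repo's code:

```julia
# Learning rate at a given step: linear warmup to lr_max over `warmup` steps,
# then cosine decay from lr_max down to lr_min at `max_steps`.
function lr_at(step; warmup=500, max_steps=12_305, lr_max=6e-4, lr_min=6e-5)
    step <= warmup && return lr_max * step / warmup
    t = (step - warmup) / (max_steps - warmup)    # decay progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t))
end
```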
### Training Curves

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | **3.54** | **34.5** |

## Implementation

Built entirely in Julia:

- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — Softmax, activations, batched_mul
- **[Optimisers.jl](https://github.com/FluxML/Optimisers.jl)** — AdamW with cosine LR

Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).

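
As an illustration of what the inference path computes, here is single-head causal attention in plain Julia, using the column-major (features, sequence) layout. This is a sketch only; the actual code uses NNlib's `softmax` and `batched_mul`:

```julia
# Single-head scaled dot-product attention with a causal mask.
# Q, K, V are (head_dim, seq_len); column t is the vector at position t.
function causal_attention(Q::Matrix{Float64}, K::Matrix{Float64}, V::Matrix{Float64})
    d, T = size(Q)
    scores = (K' * Q) ./ sqrt(d)        # scores[j, t] = <k_j, q_t> / sqrt(d)
    for t in 1:T, j in 1:T
        j > t && (scores[j, t] = -Inf)  # position t may not attend to j > t
    end
    A = exp.(scores .- maximum(scores; dims=1))   # column-wise softmax
    A ./= sum(A; dims=1)
    return V * A                        # (d, T): weighted mix of value vectors
end
```

Because of the mask, the first position can only attend to itself, so its output is exactly its own value vector.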
## Usage

### OpenAI-Compatible API

Served via [JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM):

```bash
curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="transformer", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```

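
The `temperature` and `top_k` arguments behave as in standard top-k sampling: rescale the logits, keep the k largest, and sample from the renormalized softmax. A standalone sketch, not necessarily the repo's exact implementation:

```julia
using Random

# Sample a token index from `logits` using temperature + top-k filtering.
function sample_top_k(logits::Vector{Float64}; temperature=0.8, k=40,
                      rng=Random.default_rng())
    scaled = logits ./ temperature
    keep = partialsortperm(scaled, 1:min(k, length(scaled)); rev=true)
    probs = exp.(scaled[keep] .- maximum(scaled[keep]))   # stable softmax
    probs ./= sum(probs)
    r, acc = rand(rng), 0.0
    for (i, p) in zip(keep, probs)     # inverse-CDF sampling over kept tokens
        acc += p
        acc >= r && return i
    end
    return keep[end]
end
```

Lower temperatures sharpen the distribution toward the argmax; `top_k` truncates the long tail of unlikely tokens before sampling.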
## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |

## Provenance

- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: nanoGPT (Karpathy, 2023) adapted for Julia/Lux.jl

## Citation

```bibtex
@misc{juliaslm2026,
  title={JuliaSLM: A Small Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
}
```

## License

MIT