---
language:
  - en
license: mit
library_name: lux
tags:
  - julia
  - lux
  - slm
  - philosophy
  - transformer
  - rope
  - rmsnorm
  - swiglu
  - bpe
  - text-generation
pipeline_tag: text-generation
datasets:
  - LisaMegaWatts/philosophy-corpus
model-index:
  - name: JuliaSLM
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: LisaMegaWatts/philosophy-corpus
          name: philosophy-corpus
        metrics:
          - type: perplexity
            value: 34.5
            name: Val PPL
          - type: loss
            value: 3.54
            name: Val Loss
---

# JuliaSLM

A 5.04M parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the Julia SLM family of models exploring alternative sequence mixing architectures.

## Model Family

JuliaSLM is the baseline Transformer in a family of three architectures trained on the same data with matched parameter budgets:

| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| JuliaSLM | Transformer | 4-head causal attention + RoPE | 34.5 | 5.04M |
| MonarchSLM | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| SymbioSLM | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

## Architecture

```
JuliaGPTModel (transformer)
+-- tok_emb: Embedding(2000 -> 256)     [weight-tied with output head]
+-- rope: RotaryPositionalEncoding(64, 256)
+-- blocks x 6:
|   +-- ln1: RMSNorm(256)
|   +-- attn: CausalSelfAttention(4 heads, 64 dim each)
|   |   +-- wq, wk, wv: Dense(256 -> 256)
|   |   +-- wo: Dense(256 -> 256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```
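As a rough illustration, the RMSNorm and SwiGLU sublayers in the diagram above can be sketched in plain Julia. The names `rmsnorm`, `swish`, and `swiglu` are illustrative, not the repo's API:

```julia
# RMSNorm: scale by the root-mean-square of the features; no mean
# subtraction and no bias. `g` is the learnable per-feature gain.
rmsnorm(x::AbstractVector, g::AbstractVector; eps=1f-6) =
    g .* x ./ sqrt(sum(abs2, x) / length(x) + eps)

# Swish (SiLU) activation: x * sigmoid(x)
swish(x) = x .* (1 ./ (1 .+ exp.(-x)))

# SwiGLU FFN: the Swish-gated branch multiplies the linear "up" branch
# elementwise, then W_down projects back to the model dimension.
# Shapes: W_gate, W_up are (hidden, d); W_down is (d, hidden); no biases.
swiglu(x, W_gate, W_up, W_down) = W_down * (swish(W_gate * x) .* (W_up * x))
```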

### Key Design Choices

- **RoPE (Rotary Position Embeddings):** relative position encoding applied to Q and K in each attention head, enabling length generalization
- **RMSNorm (pre-norm):** root-mean-square normalization without learnable bias, applied before each sublayer
- **SwiGLU FFN:** gated linear unit with Swish activation; hidden dim scaled by a 2/3 factor and rounded down to a multiple of 64 (4 × 256 × 2/3 ≈ 683 → 640)
- **Weight tying:** input embedding and output projection share the same weight matrix, saving 512K parameters
- **No bias:** all linear layers use `bias=false` for parameter efficiency
- **No dropout:** following Karpathy's recommendation for small models
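A minimal sketch of how RoPE rotates a single head's query or key vector, assuming the standard base of 10,000 (the helper `rope` is illustrative, not the repo's implementation):

```julia
# Rotate consecutive pairs of a head-dim vector `x` by position-dependent
# angles. Position 0 leaves the vector unchanged, and each rotation
# preserves norms, so attention scores depend only on relative offsets.
function rope(x::AbstractVector, pos::Integer; base=10_000)
    d = length(x)                        # head_dim, assumed even (64 here)
    y = similar(x)
    for i in 1:2:d
        θ = pos * base^(-(i - 1) / d)    # per-pair rotation angle
        c, s = cos(θ), sin(θ)
        y[i]   = c * x[i] - s * x[i+1]
        y[i+1] = s * x[i] + c * x[i+1]
    end
    y
end
```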

## Model Details

| Parameter | Value |
|---|---|
| Total parameters | 5,037,312 |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE |
| Weight tying | Yes |

### Parameter Breakdown

| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 10.2% |
| Attention (Q, K, V, O) × 6 | 1.57M | 31.2% |
| SwiGLU FFN × 6 | 2.95M | 58.5% |
| RMSNorm × 13 | 3.3K | <0.1% |
| **Total** | **5.04M** | |
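The breakdown can be reproduced with a quick back-of-the-envelope count in Julia (the helper `param_count` is illustrative):

```julia
# Count parameters from the architecture numbers above: tied embedding,
# four d×d attention projections per layer, three SwiGLU matrices per
# layer, and a gain vector for each of the 13 RMSNorms.
function param_count(; vocab=2000, d=256, layers=6, hidden=640)
    emb   = vocab * d                  # 512K, shared with the output head
    attn  = layers * 4 * d * d         # Wq, Wk, Wv, Wo (no biases)
    ffn   = layers * 3 * d * hidden    # gate, up, down projections
    norms = (2layers + 1) * d          # 2 per block + final ln_f
    emb + attn + ffn + norms
end

param_count()   # 5_037_312, matching the total above
```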

## Training

| Parameter | Value |
|---|---|
| Dataset | philosophy-corpus |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 66 minutes |
| Throughput | ~26K tok/s |
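The warmup-plus-cosine schedule described above can be sketched as follows; the function `lr_at` and its exact endpoint handling are assumptions, not the training script's code:

```julia
# Linear warmup to lr_max over `warmup` steps, then cosine decay down
# to lr_min by `max_steps`.
function lr_at(step; lr_max=6e-4, lr_min=6e-5, warmup=500, max_steps=12_305)
    step < warmup && return lr_max * step / warmup
    t = (step - warmup) / (max_steps - warmup)          # 0 -> 1 over decay
    lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(π * t))
end
```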

### Training Curves

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | 3.54 | 34.5 |
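The Val PPL column is simply the exponential of the validation cross-entropy loss (in nats):

```julia
# Perplexity from mean cross-entropy loss in nats.
ppl(loss) = exp(loss)

ppl(3.54)   # ≈ 34.5, the final validation perplexity
```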

## Implementation

Built entirely in Julia. Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).

## Usage

### OpenAI-Compatible API

Served via the JuliaSLM Space:

```shell
curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="transformer", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```

## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2,000 tokens) |
| `merges.txt` | BPE merge rules |

## Provenance

## Citation

```bibtex
@misc{juliaslm2026,
  title={JuliaSLM: A Small Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
}
```

## License

MIT