---
license: mit
language:
- en
tags:
- julia
- lux
- transformer
- language-model
- chinchilla
- bpe
datasets:
- LisaMegaWatts/philosophy-corpus
pipeline_tag: text-generation
---

# Julia SLM - Small Language Models in Pure Julia

Transformer language models built entirely in Julia using [Lux.jl](https://github.com/LuxDL/Lux.jl), trained on the [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) dataset.

## Models

### 5M Chinchilla (`5m-chinchilla/`)

A 5.04M-parameter transformer trained to the Chinchilla-optimal budget of ~100M tokens (20 tokens per parameter).

| Param | Value |
|-------|-------|
| Parameters | 5,037,312 |
| Architecture | Decoder-only Transformer |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN multiplier | 4x (SwiGLU) |
| Context length | 256 |
| Vocab size | 2,000 (BPE) |
| Weight tying | Yes |
| Normalization | RMSNorm (pre-norm) |
| Positional encoding | RoPE |
| Bias | None |

**Training details:**

| Metric | Value |
|--------|-------|
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, wd=0.1) |
| Schedule | Cosine decay with 500-step warmup |
| Batch size | 32 |
| Training steps | 12,305 |
| Tokens processed | ~100M |
| Training time | 66 min on RTX 3060 12GB |
| Throughput | ~26K tok/s |
| Final val loss | 3.54 |
| Final val PPL | 34.5 |
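
The schedule row above describes a linear warmup for 500 steps followed by cosine decay from 6e-4 down to 6e-5. The training script's exact code is not shown in this card, so the following is a sketch assuming the standard warmup-plus-cosine form with the hyperparameters from the table:

```julia
# Warmup + cosine LR schedule (standard form; a sketch, not the repo's exact code)
function lr_schedule(step; lr=6e-4, min_lr=6e-5, warmup=500, total=12_305)
    step <= warmup && return lr * step / warmup               # linear warmup to peak lr
    t = (step - warmup) / (total - warmup)                    # decay progress in [0, 1]
    return min_lr + 0.5 * (lr - min_lr) * (1 + cos(pi * t))   # cosine decay to min_lr
end
```

With these values, `lr_schedule(500)` returns the peak 6e-4 and `lr_schedule(12_305)` the floor 6e-5.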

**Loss curve:**

| Step | Train Loss | Val Loss | Val PPL |
|------|-----------|----------|---------|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | 3.54 | 34.5 |
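
The headline numbers are mutually consistent and easy to check: tokens per step is batch size times context length (32 × 256 = 8,192), so 12,305 steps covers about 100.8M tokens, which is roughly 20 tokens per parameter for a 5.04M-parameter model, and the PPL column is just `exp` of the loss column:

```julia
tokens_per_step  = 32 * 256                     # batch size × context length = 8,192
total_tokens     = tokens_per_step * 12_305     # ≈ 100.8M, the ~100M Chinchilla budget
tokens_per_param = total_tokens / 5_037_312     # ≈ 20 tokens per parameter
val_ppl          = exp(3.54)                    # perplexity = exp(cross-entropy) ≈ 34.5
```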

## Architecture

```
JuliaGPTModel
├── tok_emb: Embedding(2000 → 256)          # weight-tied with output head
├── rope: RotaryPositionalEncoding(256)
├── blocks × 6:
│   ├── ln1: RMSNorm(256)
│   ├── attn: MultiHeadAttention(4 heads, 64 dim each)
│   │   ├── wq, wk, wv: Dense(256 → 256)
│   │   └── wo: Dense(256 → 256)
│   ├── ln2: RMSNorm(256)
│   └── ffn: SwiGLU(256 → 1024 → 256)
│       ├── w1: Dense(256 → 1024)           # gate
│       ├── v: Dense(256 → 1024)            # value
│       └── w2: Dense(1024 → 256)           # down-project
├── ln_f: RMSNorm(256)
└── head: TiedEmbeddingHead → (2000,)       # shares tok_emb weights
```
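
The two less-common blocks in the tree, RMSNorm and the SwiGLU feed-forward, are compact enough to sketch in plain Julia. Random weights stand in for trained parameters here; the actual layers live in the repo's Lux model, so this is illustrative only:

```julia
using Random, Statistics

# RMSNorm: rescale by the root-mean-square of the features (no mean subtraction, no bias)
rmsnorm(x, g; eps=1f-5) = g .* x ./ sqrt.(mean(abs2, x) + eps)

# SwiGLU FFN: swish-gated up-projection to 1024, elementwise gate, down-projection to 256
swish(x) = x .* (1 ./ (1 .+ exp.(-x)))
swiglu(x, w1, v, w2) = w2 * (swish(w1 * x) .* (v * x))   # gate .* value, then down-project

rng = MersenneTwister(0)
x  = randn(rng, Float32, 256)                   # one token's activations
g  = ones(Float32, 256)                         # RMSNorm gain
w1 = randn(rng, Float32, 1024, 256) .* 0.02f0   # gate projection
v  = randn(rng, Float32, 1024, 256) .* 0.02f0   # value projection
w2 = randn(rng, Float32, 256, 1024) .* 0.02f0   # down-projection
y  = swiglu(rmsnorm(x, g), w1, v, w2)           # 256-dim output, same shape as x
```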

## Usage

### Load and generate

```julia
using Pkg; Pkg.activate("julia-slm")

include("src/JuliaGPT.jl")
using .JuliaGPT
using .JuliaGPT: Lux, CUDA, LuxCUDA

# Load tokenizer
tok = BPETokenizer("path/to/vocab.json", "path/to/merges.txt")

# Load checkpoint
device = Lux.gpu_device()  # or Lux.cpu_device()
ps, st, _, step, val_loss = load_checkpoint("5m-chinchilla/final.jld2"; device)

# Create model (must match checkpoint architecture)
model = create_model(ModelConfig(;
    vocab_size=vocab_size(tok), embed_dim=256, n_layers=6,
    n_heads=4, head_dim=64, ffn_mult=4, context_length=256,
    weight_tying=true,
))

# Generate
text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
println(text)
```
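
`generate` above draws each token with temperature 0.8 and top-k 40. The decoding rule itself is standard and can be sketched independently of the model; this is an illustrative re-implementation, not the repo's `generate`:

```julia
using Random

# Sample one token id from a logits vector with temperature scaling and top-k filtering
function sample_token(rng, logits; temperature=0.8, top_k=40)
    order  = partialsortperm(logits, 1:min(top_k, length(logits)); rev=true)
    scaled = logits[order] ./ temperature          # sharpen (<1) or flatten (>1) the distribution
    probs  = exp.(scaled .- maximum(scaled))       # numerically stable softmax over top-k only
    probs ./= sum(probs)
    r, c = rand(rng), 0.0
    for (i, p) in enumerate(probs)                 # inverse-CDF draw from the truncated distribution
        c += p
        c >= r && return order[i]
    end
    return order[end]
end
```

Lower temperature concentrates mass on the highest logits; `top_k=1` reduces to greedy decoding.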

### Resume training

```bash
julia --project scripts/train.jl --config config/5m.toml --resume 5m-chinchilla/final.jld2
```

## Dataset

Trained on [LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus), a curated collection of 981 source texts (BookCorpus, WikiText-103, PG-19, classical philosophy) processed through a custom text pipeline with deduplication and quality scoring.

- **Train tokens**: 794.9M (pre-encoded as `train.bin`)
- **Val tokens**: 88.2M (pre-encoded as `val.bin`)
- **Tokenizer**: ByteLevel BPE, 2,000 vocab (also available: 4K variant)
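
A byte-level BPE tokenizer splits text into bytes and repeatedly merges the adjacent pair with the lowest rank in a learned merge list. A minimal sketch of that merge loop, using a toy merge table (the real ranks come from the `merges.txt` shipped with the model):

```julia
# Apply ranked BPE merges greedily: always merge the lowest-ranked adjacent pair first
function bpe_merge(tokens::Vector{String}, ranks::Dict{Tuple{String,String},Int})
    tokens = copy(tokens)
    while length(tokens) > 1
        pairs = [(tokens[i], tokens[i+1]) for i in 1:length(tokens)-1]
        best  = argmin([get(ranks, p, typemax(Int)) for p in pairs])
        get(ranks, pairs[best], typemax(Int)) == typemax(Int) && break   # no mergeable pair left
        tokens = vcat(tokens[1:best-1], [tokens[best] * tokens[best+1]], tokens[best+2:end])
    end
    return tokens
end
```

For example, with ranks `("t","h") => 1` and `("th","e") => 2`, the byte sequence `["t","h","e"]` collapses to the single token `"the"`.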

## Framework

Built with:
- [Lux.jl](https://github.com/LuxDL/Lux.jl) - Explicit-parameter neural networks
- [Zygote.jl](https://github.com/FluxML/Zygote.jl) - Automatic differentiation
- [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) - GPU acceleration
- [Optimisers.jl](https://github.com/FluxML/Optimisers.jl) - AdamW with cosine LR
- [NNlib.jl](https://github.com/FluxML/NNlib.jl) - Softmax, activations
- [OneHotArrays.jl](https://github.com/FluxML/OneHotArrays.jl) - GPU-compatible cross-entropy

## Files

```
5m-chinchilla/
├── config.toml       # Training config (TOML)
├── final.jld2        # Final checkpoint (step 12305)
└── step_12000.jld2   # Intermediate checkpoint
```

Checkpoints are saved in JLD2 format and contain: model parameters (`ps`), model state (`st`), optimizer state, step number, and best validation loss.

## License

MIT