---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- transformer
- rope
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: JuliaSLM
results:
- task:
type: text-generation
name: Text Generation
dataset:
type: LisaMegaWatts/philosophy-corpus
name: philosophy-corpus
metrics:
- type: perplexity
value: 34.5
name: Val PPL
- type: loss
value: 3.54
name: Val Loss
---
# JuliaSLM
A 5.04M-parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence mixing architectures.
## Model Family
JuliaSLM is the **baseline Transformer** in a family of three architectures trained on the same data with matched parameter budgets:
| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| **JuliaSLM** | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |
## Architecture
```
JuliaGPTModel (transformer)
+-- tok_emb: Embedding(2000 -> 256) [weight-tied with output head]
+-- rope: RotaryPositionalEncoding(64, 256)
+-- blocks x 6:
| +-- ln1: RMSNorm(256)
| +-- attn: CausalSelfAttention(4 heads, 64 dim each)
| | +-- wq, wk, wv: Dense(256 -> 256)
| | +-- wo: Dense(256 -> 256)
| +-- ln2: RMSNorm(256)
| +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```
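The `ln1`/`ln2`/`ln_f` layers above are gain-only RMSNorms. As an illustrative NumPy sketch (not the Lux.jl implementation), the normalization skips mean-centering and has no bias term:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Gain-only RMSNorm: scale by the root-mean-square of the features.
    No mean subtraction, no bias -- just a learnable per-channel gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

x = np.random.randn(4, 256).astype(np.float32)  # (seq, embed_dim)
g = np.ones(256, dtype=np.float32)              # gain, initialized to 1
y = rms_norm(x, g)
print(np.mean(y * y, axis=-1))  # each row has mean-square ~1.0
```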
### Key Design Choices
- **RoPE** (Rotary Position Embeddings): Relative position encoding applied to Q and K in each attention head, enabling length generalization
- **RMSNorm** (pre-norm): Root Mean Square normalization without learnable bias, applied before each sublayer
- **SwiGLU** FFN: Gated linear unit with Swish activation; hidden dim scaled by 2/3 (4 x 256 = 1024 -> ~683) and rounded down to a multiple of 64, giving 640
- **Weight tying**: Input embedding and output projection share the same weight matrix, saving 512K parameters
- **No bias**: All linear layers use bias=false for parameter efficiency
- **No dropout**: Following Karpathy's recommendation for small models
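The RoPE choice above can be made concrete with a minimal sketch. This is an illustrative NumPy version of the rotation applied to each query/key head (not the Lux.jl `RotaryPositionalEncoding`): consecutive channel pairs are rotated by a position-dependent angle, so Q·K dot products depend only on relative offsets:

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate (even, odd) channel pairs of a (seq, head_dim) array
    by position-dependent angles -- a minimal RoPE sketch."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)          # per-pair frequency
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(256, 64)  # (context_length, head_dim)
q_rot = rope(q)
# Rotations preserve vector norms, so attention magnitudes are unchanged:
print(np.allclose(np.linalg.norm(q, axis=-1), np.linalg.norm(q_rot, axis=-1)))
```

Because positions are encoded multiplicatively inside attention rather than added to the embeddings, the same weights can be evaluated at sequence lengths beyond those seen in training.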
## Model Details
| Parameter | Value |
|---|---|
| Total parameters | 5,037,312 |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE |
| Weight tying | Yes |
### Parameter Breakdown
| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 10.2% |
| Attention (Q,K,V,O) x 6 | 1.57M | 31.2% |
| SwiGLU FFN x 6 | 2.95M | 58.5% |
| RMSNorm x 13 | 3.3K | <0.1% |
| **Total** | **5.04M** | |
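The breakdown above can be re-derived from the architecture numbers (bias-free linear layers, tied embedding). A small sketch; the floor-to-64 rounding for the SwiGLU width is inferred from the stated 640:

```python
# Re-derive the parameter count from the Model Details table.
V, d, L = 2000, 256, 6            # vocab, embed_dim, layers
ffn = int((2 / 3) * 4 * d) // 64 * 64  # 2/3 * 1024 ~ 683, floored to 640

emb    = V * d           # token embedding, shared with the output head
attn   = L * 4 * d * d   # wq, wk, wv, wo per block
swiglu = L * 3 * d * ffn # gate, up, down projections per block
norms  = (2 * L + 1) * d # 13 RMSNorm gain vectors (ln1, ln2 x 6, ln_f)

total = emb + attn + swiglu + norms
print(ffn)    # 640
print(total)  # 5037312, matching the reported 5,037,312
```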
## Training
| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 66 minutes |
| Throughput | ~26K tok/s |
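The learning-rate schedule in the table (linear warmup, then cosine decay from `lr` to `min_lr`) can be sketched as follows; this is an illustration of the schedule described above, not the actual Optimisers.jl training code:

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup=500, max_steps=12305):
    """Linear warmup to max_lr over `warmup` steps, then cosine
    decay to min_lr by `max_steps`."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / (max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(0))      # tiny LR at the first warmup step
print(lr_at(500))    # 6e-4 once warmup completes
print(lr_at(12304))  # decays toward 6e-5 by the final step
```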
### Training Curves
| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | **3.54** | **34.5** |
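The two validation columns are consistent: perplexity is the exponential of the mean cross-entropy loss, so each pair should agree to within the rounding of the two-decimal loss values:

```python
import math

# Check reported (val loss, val PPL) pairs against PPL = exp(loss).
for loss, ppl in [(5.01, 149.6), (4.02, 56.0), (3.70, 40.4),
                  (3.57, 35.4), (3.54, 34.5)]:
    print(f"exp({loss}) = {math.exp(loss):.1f}  (reported {ppl})")
```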
## Implementation
Built entirely in Julia:
- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — Softmax, activations, batched_mul
- **[Optimisers.jl](https://github.com/FluxML/Optimisers.jl)** — AdamW with cosine LR
Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
## Usage
### OpenAI-Compatible API
Served via [JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM):
```bash
curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "the nature of"}],
"max_tokens": 200,
"temperature": 0.8,
"top_k": 40
}'
```
### Load in Julia
```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux
tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())
model = create_model(ModelConfig(;
arch="transformer", vocab_size=vocab_size(tok),
embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
ffn_mult=4, context_length=256, weight_tying=true,
))
text = generate(model, ps, st, tok, "the nature of ";
max_new_tokens=200, temperature=0.8, top_k=40)
```
## Files
| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |
## Provenance
- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: nanoGPT (Karpathy, 2023) adapted for Julia/Lux.jl
## Citation
```bibtex
@misc{juliaslm2026,
title={JuliaSLM: A Small Language Model in Pure Julia},
author={LisaMegaWatts},
year={2026},
url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
}
```
## License
MIT