---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- transformer
- rope
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: JuliaSLM
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: LisaMegaWatts/philosophy-corpus
      name: philosophy-corpus
    metrics:
    - type: perplexity
      value: 34.5
      name: Val PPL
    - type: loss
      value: 3.54
      name: Val Loss
---

# JuliaSLM

A 5.04M parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence mixing architectures.

## Model Family

JuliaSLM is the **baseline Transformer** in a family of three architectures trained on the same data with matched parameter budgets:

| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| **JuliaSLM** | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

## Architecture

```
JuliaGPTModel (transformer)
+-- tok_emb: Embedding(2000 -> 256)   [weight-tied with output head]
+-- rope: RotaryPositionalEncoding(64, 256)
+-- blocks x 6:
|   +-- ln1: RMSNorm(256)
|   +-- attn: CausalSelfAttention(4 heads, 64 dim each)
|   |   +-- wq, wk, wv: Dense(256 -> 256)
|   |   +-- wo: Dense(256 -> 256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```

### Key Design Choices

- **RoPE** (Rotary Position Embeddings): relative position encoding applied to Q and K in each attention head, enabling length generalization
- **RMSNorm** (pre-norm): Root Mean Square normalization without learnable bias, applied before each sublayer
- **SwiGLU** FFN: gated linear unit with Swish activation; the hidden dim is scaled by 2/3 (to offset SwiGLU's third weight matrix) and rounded down to a multiple of 64, giving 640
- **Weight tying**: input embedding and output projection share the same weight matrix, saving 512K parameters
- **No bias**: all linear layers use `bias=false` for parameter efficiency
- **No dropout**: following Karpathy's recommendation for small models

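
The first three choices can be sketched in a few lines of plain Julia. These are illustrative sketches only, not the model's actual Lux.jl layer implementations:

```julia
# RoPE (sketch): rotate each (odd, even) feature pair of one head's
# query/key vector by a position-dependent angle.
function rope(x::Vector{Float64}, pos::Integer; base=10_000.0)
    d = length(x)                        # head dim, assumed even
    out = similar(x)
    for i in 1:2:d
        theta = pos / base^((i - 1) / d)
        c, s = cos(theta), sin(theta)
        out[i]   = x[i] * c - x[i+1] * s
        out[i+1] = x[i] * s + x[i+1] * c
    end
    return out
end

# RMSNorm (sketch): scale by the root-mean-square of the features --
# no mean subtraction, no bias, just a learnable gain g.
rmsnorm(x, g; eps=1e-6) = x .* g ./ sqrt(sum(abs2, x) / length(x) + eps)

# SwiGLU FFN (sketch): swish-gated up-projection, then project back down.
swish(v) = v ./ (1 .+ exp.(-v))                  # v .* sigmoid(v)
swiglu(x, W1, W2, W3) = W2 * (swish(W1 * x) .* (W3 * x))
```

Note that RoPE is a pure rotation, so it preserves the norms of queries and keys and changes only their directions, which is what makes the attention scores depend on relative offsets.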
## Model Details

| Parameter | Value |
|---|---|
| Total parameters | 5,037,312 |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE |
| Weight tying | Yes |

### Parameter Breakdown

| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 10.2% |
| Attention (Q,K,V,O) x 6 | 1.57M | 31.2% |
| SwiGLU FFN x 6 | 2.95M | 58.5% |
| RMSNorm x 13 | 3.3K | <0.1% |
| **Total** | **5.04M** | |

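
As a sanity check, the breakdown above can be re-derived from the architecture constants (bias-free linear layers throughout):

```julia
# Recompute the parameter counts from the config values in the tables above.
d, v, nlayers, h = 256, 2000, 6, 640   # embed dim, vocab, layers, FFN hidden

emb   = v * d                  # 512_000   tied token embedding / output head
attn  = nlayers * 4 * d * d    # 1_572_864 wq, wk, wv, wo per block
ffn   = nlayers * 3 * d * h    # 2_949_120 three SwiGLU matrices per block
norms = (2nlayers + 1) * d     # 3_328     two RMSNorms per block + final ln_f

total = emb + attn + ffn + norms   # 5_037_312, matching the reported total
```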
## Training

| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 66 minutes |
| Throughput | ~26K tok/s |

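
The schedule in the table (linear warmup, then cosine decay from lr to min_lr) can be written out explicitly. A sketch with an illustrative function name, not the repo's code:

```julia
# Learning rate at a given step: linear warmup to lr_max over `warmup` steps,
# then cosine decay from lr_max down to lr_min at `max_steps`.
function lr_at(step; warmup=500, max_steps=12_305, lr_max=6e-4, lr_min=6e-5)
    step <= warmup && return lr_max * step / warmup
    t = (step - warmup) / (max_steps - warmup)    # decay progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t))
end
```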
### Training Curves

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | **3.54** | **34.5** |

## Implementation

Built entirely in Julia:

- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — Softmax, activations, batched_mul
- **[Optimisers.jl](https://github.com/FluxML/Optimisers.jl)** — AdamW with cosine LR

Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).

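
As an illustration of what the inference path computes, here is single-head causal attention in plain Julia, using the column-major (features, sequence) layout. This is a sketch only; the actual code uses NNlib's `softmax` and `batched_mul`:

```julia
# Single-head scaled dot-product attention with a causal mask.
# Q, K, V are (head_dim, seq_len); column t is the vector at position t.
function causal_attention(Q::Matrix{Float64}, K::Matrix{Float64}, V::Matrix{Float64})
    d, T = size(Q)
    scores = (K' * Q) ./ sqrt(d)        # scores[j, t] = <k_j, q_t> / sqrt(d)
    for t in 1:T, j in 1:T
        j > t && (scores[j, t] = -Inf)  # position t may not attend to j > t
    end
    A = exp.(scores .- maximum(scores; dims=1))   # column-wise softmax
    A ./= sum(A; dims=1)
    return V * A                        # (d, T): weighted mix of value vectors
end
```

Because of the mask, the first position can only attend to itself, so its output is exactly its own value vector.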
## Usage

### OpenAI-Compatible API

Served via [JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM):

```bash
curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="transformer", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```

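
The `temperature` and `top_k` arguments behave as in standard top-k sampling: rescale the logits, keep the k largest, and sample from the renormalized softmax. A standalone sketch, not necessarily the repo's exact implementation:

```julia
using Random

# Sample a token index from `logits` using temperature + top-k filtering.
function sample_top_k(logits::Vector{Float64}; temperature=0.8, k=40,
                      rng=Random.default_rng())
    scaled = logits ./ temperature
    keep = partialsortperm(scaled, 1:min(k, length(scaled)); rev=true)
    probs = exp.(scaled[keep] .- maximum(scaled[keep]))   # stable softmax
    probs ./= sum(probs)
    r, acc = rand(rng), 0.0
    for (i, p) in zip(keep, probs)     # inverse-CDF sampling over kept tokens
        acc += p
        acc >= r && return i
    end
    return keep[end]
end
```

Lower temperatures sharpen the distribution toward the argmax; `top_k` truncates the long tail of unlikely tokens before sampling.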
## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |

## Provenance

- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: nanoGPT (Karpathy, 2023) adapted for Julia/Lux.jl

## Citation

```bibtex
@misc{juliaslm2026,
  title={JuliaSLM: A Small Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
}
```

## License

MIT