LisaMegaWatts committed on
Commit d05796a · verified · 1 Parent(s): a90b172

Add model card with architecture details, provenance, and training metrics

Files changed (1):
  1. README.md +157 -96
README.md CHANGED
@@ -1,148 +1,209 @@
  ---
  language:
- - en
- library_name: julia
  license: mit
- pipeline_tag: text-generation
  tags:
- - philosophy
- - classical-texts
- - julia
- - lux
- - bpe
- - rope
- - rmsnorm
- - swiglu
- - small-language-model
- - openai-compatible
- - chinchilla
  datasets:
- - LisaMegaWatts/philosophy-corpus
  ---

- # JuliaSLM — Inference Server Artifacts

- Serving-ready artifacts for the [JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM), an OpenAI-compatible inference endpoint for the 5M-parameter JuliaSLM transformer.

- For full training details, loss curves, architecture diagrams, and code examples, see the canonical model repo: **[LisaMegaWatts/julia-slm](https://huggingface.co/LisaMegaWatts/julia-slm)**.

- ## Model Summary

- A 5,037,312-parameter decoder-only transformer trained to Chinchilla-optimal (100M tokens at 20 tokens/param) on classical philosophy and liberal arts texts.

- ### Architecture

  ```
- JuliaGPTModel
- ├── tok_emb: Embedding(2000 → 256)              # weight-tied with output head
- ├── rope: RotaryPositionalEncoding(64)
- ├── blocks × 6:
- │   ├── ln1: RMSNorm(256)
- │   ├── attn: MultiHeadAttention(4 heads, 64 dim each)
- │   │   ├── wq, wk, wv: Dense(256 → 256)
- │   │   └── wo: Dense(256 → 256)
- │   ├── ln2: RMSNorm(256)
- │   └── ffn: SwiGLU(256 → 1024 → 256)
- │       ├── w1: Dense(256 → 1024)               # gate
- │       ├── v:  Dense(256 → 1024)               # value
- │       └── w2: Dense(1024 → 256)               # down-project
- ├── ln_f: RMSNorm(256)
- └── head: TiedEmbeddingHead → (2000,)           # shares tok_emb weights
  ```

- | Component | Detail |
  |---|---|
- | Parameters | 5,037,312 |
  | Embedding dim | 256 |
  | Layers | 6 |
- | Attention heads | 4 (head dim 64) |
- | FFN multiplier | 4x (SwiGLU, hidden 1024) |
  | Context length | 256 tokens |
- | Positional encoding | Rotary (RoPE) |
- | Normalization | RMSNorm (pre-norm) |
  | Weight tying | Yes |
- | Bias | None |

- ### Training

- | Metric | Value |
  |---|---|
- | Optimizer | AdamW (lr=6e-4, min_lr=6e-5, wd=0.1) |
- | Schedule | Cosine decay with 500-step warmup |
- | Precision | Mixed F16/F32 |
  | Batch size | 32 |
- | Training steps | 12,305 |
- | Tokens processed | ~100M |
- | Training time | 66 min on RTX 3060 12GB |
  | Throughput | ~26K tok/s |
- | Final val loss | 3.54 |
- | Final val PPL | 34.5 |

- ### Loss Curve

  | Step | Train Loss | Val Loss | Val PPL |
- |------|-----------|----------|---------|
  | 500 | 6.69 | 5.01 | 149.6 |
  | 2,000 | 4.09 | 4.02 | 56.0 |
  | 6,000 | 3.72 | 3.70 | 40.4 |
  | 10,000 | 3.58 | 3.57 | 35.4 |
- | 12,305 | 3.55 | 3.54 | 34.5 |

- ### Tokenizer

- ByteLevel BPE with 2,000 subword tokens, trained on the philosophy corpus. Tokenizer files (`vocab.json`, `merges.txt`) are sourced from the [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) dataset.

- ### Training Data

- [LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus): 981 source texts (BookCorpus, WikiText-103, PG-19, classical philosophy) processed through a custom text pipeline with deduplication and quality scoring.

- - **Train tokens**: 794.9M (pre-encoded as `train.bin`)
- - **Val tokens**: 88.2M (pre-encoded as `val.bin`)
- - **Sources**: Aristotle, Plato, Cicero, Seneca, Marcus Aurelius, Epictetus, Euclid, Kant, Spinoza, Nietzsche, and more

- ## Files

- | File | Description |
- |---|---|
- | `final.jld2` | Model parameters (JLD2 format, 58MB) |
- | `config.toml` | Architecture config (5m-chinchilla) |
- | `vocab.json` | BPE vocabulary (2000 tokens, dict format) |
- | `merges.txt` | BPE merge rules |
-
- ## Inference API
-
- The [JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM) serves this model via an OpenAI-compatible API with SSE streaming, temperature, top-k, and top-p sampling. CPU-only inference runs on pure NNlib (no Lux dependency at runtime).

  ```bash
- # Streaming
  curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
    -H "Content-Type: application/json" \
-   -d '{"messages": [{"role": "user", "content": "the nature of"}], "stream": true, "temperature": 0.8, "top_k": 40}'

- # Non-streaming
- curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{"messages": [{"role": "user", "content": "the nature of"}], "max_tokens": 200}'
  ```

- ### Endpoints

- - `GET /` — Health check and model info
- - `GET /v1/models` — List available models
- - `POST /v1/chat/completions` — Generate text (streaming + non-streaming)

- ## Framework

- Built with:
- - [Lux.jl](https://github.com/LuxDL/Lux.jl) — Explicit-parameter neural networks (training)
- - [NNlib.jl](https://github.com/FluxML/NNlib.jl) — Softmax, activations (inference)
- - [Zygote.jl](https://github.com/FluxML/Zygote.jl) — Automatic differentiation (training)
- - [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) — GPU acceleration (training)

- ## Related

- - **[LisaMegaWatts/julia-slm](https://huggingface.co/LisaMegaWatts/julia-slm)** — Canonical model repo (versioned checkpoints, full docs)
- - **[JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM)** — Live inference endpoint
- - **[LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus)** — Training dataset + tokenizer
- - **[LisaMegaWatts/JuliaGPT](https://huggingface.co/LisaMegaWatts/JuliaGPT)** — Predecessor (~5K params, character-level, scalar autograd)
- - **[Source code](https://github.com/DavinciDreams/JuliaGPT)** — GitHub repository

  ---
  language:
+ - en
  license: mit
+ library_name: lux
  tags:
+ - julia
+ - lux
+ - slm
+ - philosophy
+ - transformer
+ - rope
+ - rmsnorm
+ - swiglu
+ - bpe
+ - text-generation
+ pipeline_tag: text-generation
  datasets:
+ - LisaMegaWatts/philosophy-corpus
+ model-index:
+ - name: JuliaSLM
+   results:
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       type: LisaMegaWatts/philosophy-corpus
+       name: philosophy-corpus
+     metrics:
+     - type: perplexity
+       value: 34.5
+       name: Val PPL
+     - type: loss
+       value: 3.54
+       name: Val Loss
  ---

+ # JuliaSLM

+ A 5.04M-parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the [Julia SLM](https://github.com/buildwithbooks/julia-slm) family of models exploring alternative sequence-mixing architectures.

+ ## Model Family

+ JuliaSLM is the **baseline Transformer** in a family of three architectures trained on the same data with matched parameter budgets:

+ | Model | Architecture | Sequence Mixing | Val PPL | Params |
+ |---|---|---|---|---|
+ | **JuliaSLM** | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
+ | [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
+ | [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

+ ## Architecture

  ```
+ JuliaGPTModel (transformer)
+ +-- tok_emb: Embedding(2000 -> 256)          [weight-tied with output head]
+ +-- rope: RotaryPositionalEncoding(64, 256)
+ +-- blocks x 6:
+ |   +-- ln1: RMSNorm(256)
+ |   +-- attn: CausalSelfAttention(4 heads, 64 dim each)
+ |   |   +-- wq, wk, wv: Dense(256 -> 256)
+ |   |   +-- wo: Dense(256 -> 256)
+ |   +-- ln2: RMSNorm(256)
+ |   +-- ffn: SwiGLU(256 -> 640 -> 256)
+ +-- ln_f: RMSNorm(256)
+ +-- head: TiedEmbeddingHead -> (2000,)
  ```
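The `rope` stage above rotates pairs of Q/K features by a position-dependent angle before attention scores are taken. A minimal NumPy sketch of that mechanism, for illustration only (the repo's implementation is in Julia; `rope` here is a hypothetical helper, not its API):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Split the head vector into two halves and rotate each (x1[i], x2[i])
    # pair by the angle pos * base**(-i / (d/2)), as in rotary embeddings.
    half = x.shape[-1] // 2
    theta = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)   # head dim 64, as above

# Position 0 is the identity, and rotation never changes the vector's norm
assert np.allclose(rope(q, 0), q)
assert np.isclose(np.linalg.norm(rope(q, 5)), np.linalg.norm(q))
# Scores depend only on relative offset: shifting both positions is a no-op
assert np.isclose(rope(q, 3) @ rope(k, 7), rope(q, 13) @ rope(k, 17))
```

The half-split pairing shown is one common convention; interleaved pairing is equally valid and differs only by a fixed permutation of features.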

+ ### Key Design Choices
+
+ - **RoPE** (Rotary Position Embeddings): Relative position encoding applied to Q and K in each attention head, enabling length generalization
+ - **RMSNorm** (pre-norm): Root Mean Square normalization without learnable bias, applied before each sublayer
+ - **SwiGLU** FFN: Gated linear unit with Swish activation; hidden dim scaled by a 2/3 factor and rounded down to a multiple of 64 (1024 -> 640)
+ - **Weight tying**: Input embedding and output projection share the same weight matrix, saving 512K parameters
+ - **No bias**: All linear layers use bias=false for parameter efficiency
+ - **No dropout**: Following Karpathy's recommendation for small models
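The SwiGLU sizing rule can be checked in a few lines; this is a sketch of the rule as stated above (`swiglu_hidden` is a hypothetical name, not the repo's API), and it reproduces the 640 hidden dim:

```python
def swiglu_hidden(embed_dim, ffn_mult=4, multiple_of=64):
    # Start from the usual ffn_mult * d FFN width, take 2/3 of it to offset
    # SwiGLU's extra value matrix, then round down to a multiple of 64.
    h = int(2 * ffn_mult * embed_dim / 3)     # 2/3 of 1024 -> 682
    return (h // multiple_of) * multiple_of

print(swiglu_hidden(256))  # -> 640
```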
+
+ ## Model Details
+
+ | Parameter | Value |
  |---|---|
+ | Total parameters | 5,037,312 |
  | Embedding dim | 256 |
  | Layers | 6 |
+ | Attention heads | 4 |
+ | Head dim | 64 |
+ | FFN hidden dim | 640 |
  | Context length | 256 tokens |
+ | Vocabulary | 2,000 (ByteLevel BPE) |
+ | Position encoding | RoPE |
  | Weight tying | Yes |

+ ### Parameter Breakdown
+
+ | Component | Params | % |
+ |---|---|---|
+ | Token embedding (tied) | 512K | 10.2% |
+ | Attention (Q,K,V,O) x 6 | 1.57M | 31.2% |
+ | SwiGLU FFN x 6 | 2.95M | 58.5% |
+ | RMSNorm x 13 | 3.3K | <0.1% |
+ | **Total** | **5.04M** | |
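The breakdown sums to the exact figure quoted in Model Details; a sanity-check sketch assuming bias-free Dense layers and one tied embedding, per the design choices above:

```python
V, d, L, h = 2000, 256, 6, 640   # vocab, embed dim, layers, FFN hidden dim

embed = V * d                    # tied token embedding / output head
attn  = L * 4 * d * d            # wq, wk, wv, wo per block, no bias
ffn   = L * 3 * d * h            # w1 (gate), v (value), w2 (down) per block
norms = (2 * L + 1) * d          # 13 RMSNorm scale vectors: 2 per block + final
total = embed + attn + ffn + norms

print(total)       # -> 5037312
print(20 * total)  # Chinchilla budget at 20 tok/param: ~100M tokens
```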
+
+ ## Training
+
+ | Setting | Value |
  |---|---|
+ | Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
+ | Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
+ | Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
+ | Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
+ | Warmup | 500 steps (linear) |
+ | Max steps | 12,305 |
  | Batch size | 32 |
+ | Gradient clipping | 1.0 (global norm) |
+ | Precision | Float16 AMP (Float32 master weights) |
+ | Hardware | NVIDIA RTX 3060 12GB |
+ | Training time | 66 minutes |
  | Throughput | ~26K tok/s |
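The optimizer and warmup rows describe a standard linear-warmup, cosine-decay schedule; a sketch under that assumption (the exact formula used in training is not shown in this card), plus a token-budget check against the batch and context sizes:

```python
import math

def lr_at(step, max_steps=12305, warmup=500, peak=6e-4, floor=6e-5):
    # Linear warmup to the peak rate, then cosine decay down to min_lr.
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / (max_steps - warmup)   # 0 -> 1 over the decay phase
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))

print(lr_at(250))        # mid-warmup: 3e-4
print(lr_at(500))        # peak: 6e-4
print(lr_at(12305))      # final step: 6e-5
print(12305 * 32 * 256)  # steps x batch x context = 100,802,560 tokens (~100M)
```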
 
 
+ ### Training Curves

  | Step | Train Loss | Val Loss | Val PPL |
+ |---|---|---|---|
  | 500 | 6.69 | 5.01 | 149.6 |
  | 2,000 | 4.09 | 4.02 | 56.0 |
  | 6,000 | 3.72 | 3.70 | 40.4 |
  | 10,000 | 3.58 | 3.57 | 35.4 |
+ | 12,305 | 3.55 | **3.54** | **34.5** |
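Val PPL is just exp of the per-token cross-entropy loss, so the PPL column can be regenerated from the loss column (values land within rounding of the quoted figures, since the losses shown are themselves rounded):

```python
import math

# Perplexity = exp(cross-entropy loss), e.g. exp(3.54) ~= 34.5
for loss in (5.01, 4.02, 3.70, 3.57, 3.54):
    print(f"val loss {loss:.2f} -> ppl {math.exp(loss):.1f}")
```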

+ ## Implementation

+ Built entirely in Julia:

+ - **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
+ - **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
+ - **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
+ - **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — Softmax, activations, batched_mul
+ - **[Optimisers.jl](https://github.com/FluxML/Optimisers.jl)** — AdamW with cosine LR

+ Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).

+ ## Usage

+ ### OpenAI-Compatible API

+ Served via the [JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM):

  ```bash
  curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
    -H "Content-Type: application/json" \
+   -d '{
+     "messages": [{"role": "user", "content": "the nature of"}],
+     "max_tokens": 200,
+     "temperature": 0.8,
+     "top_k": 40
+   }'
+ ```
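The endpoint also supports `"stream": true`, in which case it returns OpenAI-style server-sent events. A minimal Python sketch of consuming such a stream (the chunk payloads below are illustrative stand-ins, not captured server output):

```python
import json

def deltas(sse_lines):
    # Yield text fragments from OpenAI-style "data: {...}" SSE lines.
    for line in sse_lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        piece = chunk["choices"][0]["delta"].get("content")
        if piece:
            yield piece

stream = [  # field layout follows the OpenAI chat-completions streaming spec
    'data: {"choices": [{"delta": {"content": "the nature"}}]}',
    'data: {"choices": [{"delta": {"content": " of things"}}]}',
    "data: [DONE]",
]
print("".join(deltas(stream)))  # -> the nature of things
```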

+ ### Load in Julia
+
+ ```julia
+ using Pkg; Pkg.activate("julia-slm")
+ include("src/JuliaGPT.jl")
+ using .JuliaGPT; using .JuliaGPT: Lux
+
+ tok = BPETokenizer("vocab.json", "merges.txt")
+ ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())
+
+ model = create_model(ModelConfig(;
+     arch="transformer", vocab_size=vocab_size(tok),
+     embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
+     ffn_mult=4, context_length=256, weight_tying=true,
+ ))
+
+ text = generate(model, ps, st, tok, "the nature of ";
+     max_new_tokens=200, temperature=0.8, top_k=40)
  ```

+ ## Files

+ | File | Description |
+ |---|---|
+ | `final.jld2` | Trained model parameters (JLD2 format) |
+ | `config.toml` | Model architecture configuration |
+ | `vocab.json` | BPE vocabulary (2000 tokens) |
+ | `merges.txt` | BPE merge rules |

+ ## Provenance

+ - **Author**: LisaMegaWatts
+ - **Training code**: [buildwithbooks/julia-slm](https://github.com/buildwithbooks/julia-slm)
+ - **Data pipeline**: [buildwithbooks/text-pipeline](https://github.com/buildwithbooks/text-pipeline)
+ - **Training date**: February 2026
+ - **Architecture reference**: nanoGPT (Karpathy, 2023) adapted for Julia/Lux.jl
+
+ ## Citation
+
+ ```bibtex
+ @misc{juliaslm2026,
+   title={JuliaSLM: A Small Language Model in Pure Julia},
+   author={LisaMegaWatts},
+   year={2026},
+   url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
+ }
+ ```

+ ## License

+ MIT