---
language:
  - en
license: mit
library_name: lux
tags:
  - julia
  - lux
  - slm
  - philosophy
  - transformer
  - rope
  - rmsnorm
  - swiglu
  - bpe
  - text-generation
pipeline_tag: text-generation
datasets:
  - LisaMegaWatts/philosophy-corpus
model-index:
  - name: JuliaSLM
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: LisaMegaWatts/philosophy-corpus
          name: philosophy-corpus
        metrics:
          - type: perplexity
            value: 34.5
            name: Val PPL
          - type: loss
            value: 3.54
            name: Val Loss
---

# JuliaSLM

A 5.04M parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence mixing architectures.

## Model Family

JuliaSLM is the **baseline Transformer** in a family of three architectures trained on the same data with matched parameter budgets:

| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| **JuliaSLM** | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |

## Architecture

```
JuliaGPTModel (transformer)
+-- tok_emb: Embedding(2000 -> 256)     [weight-tied with output head]
+-- rope: RotaryPositionalEncoding(64, 256)
+-- blocks x 6:
|   +-- ln1: RMSNorm(256)
|   +-- attn: CausalSelfAttention(4 heads, 64 dim each)
|   |   +-- wq, wk, wv: Dense(256 -> 256)
|   |   +-- wo: Dense(256 -> 256)
|   +-- ln2: RMSNorm(256)
|   +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```
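The `rope` layer rotates each consecutive pair of query/key dimensions by a position-dependent angle. A minimal sketch of the standard RoPE formulation (the repo's `RotaryPositionalEncoding` is assumed to be equivalent; the exact frequency layout may differ):

```julia
# Minimal RoPE sketch (standard formulation; assumed equivalent to the
# repo's RotaryPositionalEncoding layer). Each consecutive pair of
# dimensions is rotated by a position-dependent angle.
function rope(x::AbstractVector, pos::Integer; base=10_000.0)
    d = length(x)                        # head dim (64 here), must be even
    out = similar(x, Float64)
    for i in 1:2:d
        θ = pos / base^((i - 1) / d)     # lower dims rotate faster
        c, s = cos(θ), sin(θ)
        out[i]   = c * x[i] - s * x[i+1]
        out[i+1] = s * x[i] + c * x[i+1]
    end
    return out
end

q = randn(64)
rope(q, 0) ≈ q   # position 0 is the identity rotation
```

Because each step is a pure rotation, vector norms (and hence attention-score magnitudes) are preserved; only the relative angle between Q and K changes with their position offset.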

### Key Design Choices

- **RoPE** (Rotary Position Embeddings): Relative position encoding applied to Q and K in each attention head, enabling length generalization
- **RMSNorm** (pre-norm): Root Mean Square normalization without learnable bias, applied before each sublayer
- **SwiGLU** FFN: Gated linear unit with Swish activation; hidden dim scaled by 2/3 (to offset the third weight matrix) and rounded down to a multiple of 64
- **Weight tying**: Input embedding and output projection share the same weight matrix, saving 512K parameters
- **No bias**: All linear layers use bias=false for parameter efficiency
- **No dropout**: Following Karpathy's recommendation for small models
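The SwiGLU sizing rule can be reproduced directly (a sketch of the arithmetic implied by the table below; the repo's actual rounding helper is assumed):

```julia
# SwiGLU uses three weight matrices where a plain MLP uses two, so the
# hidden width is scaled by 2/3 to keep parameter counts comparable,
# then rounded down to a hardware-friendly multiple of 64.
function swiglu_hidden(embed_dim; ffn_mult=4, multiple=64)
    h = 2 * ffn_mult * embed_dim ÷ 3     # 2/3 of the usual 4x width (682)
    return multiple * (h ÷ multiple)     # round down to a multiple of 64
end

swiglu_hidden(256)  # -> 640
```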

## Model Details

| Parameter | Value |
|---|---|
| Total parameters | 5,037,312 |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE |
| Weight tying | Yes |

### Parameter Breakdown

| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 10.2% |
| Attention (Q,K,V,O) x 6 | 1.57M | 31.2% |
| SwiGLU FFN x 6 | 2.95M | 58.5% |
| RMSNorm x 13 | 3.3K | <0.1% |
| **Total** | **5.04M** | |
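The breakdown follows directly from the dimensions above (plain arithmetic as a sanity check, not repo code):

```julia
# Parameter count from the model dimensions (bias-free linear layers).
vocab, d, layers, ffn = 2000, 256, 6, 640

emb   = vocab * d              # tied embedding: 512,000
attn  = layers * 4 * d * d     # wq, wk, wv, wo per block: 1,572,864
swig  = layers * 3 * d * ffn   # three SwiGLU matrices per block: 2,949,120
norms = (2layers + 1) * d      # 13 RMSNorm scale vectors: 3,328

total = emb + attn + swig + norms  # -> 5,037,312
```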

## Training

| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 66 minutes |
| Throughput | ~26K tok/s |
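The ~100M train tokens follow from the schedule in the table (a back-of-envelope check):

```julia
# Tokens seen = steps × batch size × context length.
steps, batch, ctx = 12_305, 32, 256
tokens = steps * batch * ctx          # -> 100,802,560 (~100.8M)
tok_per_param = tokens / 5_037_312    # ≈ 20, the Chinchilla-optimal ratio
```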

### Training Curves

| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | **3.54** | **34.5** |
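Val PPL is simply the exponential of the validation cross-entropy loss (small discrepancies in the table come from rounding the reported losses):

```julia
# Perplexity from cross-entropy loss (natural log base).
ppl(loss) = exp(loss)

round(ppl(3.54); digits=1)  # -> 34.5
```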

## Implementation

Built entirely in Julia:

- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — Softmax, activations, batched_mul
- **[Optimisers.jl](https://github.com/FluxML/Optimisers.jl)** — AdamW with cosine LR

Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).

## Usage

### OpenAI-Compatible API

Served via [JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM):

```bash
curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

### Load in Julia

```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux

tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())

model = create_model(ModelConfig(;
    arch="transformer", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    ffn_mult=4, context_length=256, weight_tying=true,
))

text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
```

## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |

## Provenance

- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: nanoGPT (Karpathy, 2023) adapted for Julia/Lux.jl

## Citation

```bibtex
@misc{juliaslm2026,
  title={JuliaSLM: A Small Language Model in Pure Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
}
```

## License

MIT