---
language:
- en
license: mit
library_name: lux
tags:
- julia
- lux
- slm
- philosophy
- transformer
- rope
- rmsnorm
- swiglu
- bpe
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: JuliaSLM
results:
- task:
type: text-generation
name: Text Generation
dataset:
type: LisaMegaWatts/philosophy-corpus
name: philosophy-corpus
metrics:
- type: perplexity
value: 34.5
name: Val PPL
- type: loss
value: 3.54
name: Val Loss
---
# JuliaSLM
A 5.04M parameter decoder-only Transformer trained on classical philosophy texts, implemented entirely in Julia using Lux.jl. Part of the [Julia SLM](https://github.com/DavinciDreams/julia-slm) family of models exploring alternative sequence mixing architectures.
## Model Family
JuliaSLM is the **baseline Transformer** in a family of three architectures trained on the same data with matched parameter budgets:
| Model | Architecture | Sequence Mixing | Val PPL | Params |
|---|---|---|---|---|
| **JuliaSLM** | Transformer | 4-head causal attention + RoPE | **34.5** | 5.04M |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Monarch Mixer | 8-head Monarch matrix + conv + gate | 38.4 | 4.98M |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Symbiogenesis | 3 organelles (CausalConv + Monarch + LongConv) + gate | TBD | ~4.1M |
## Architecture
```
JuliaGPTModel (transformer)
+-- tok_emb: Embedding(2000 -> 256) [weight-tied with output head]
+-- rope: RotaryPositionalEncoding(64, 256)
+-- blocks x 6:
| +-- ln1: RMSNorm(256)
| +-- attn: CausalSelfAttention(4 heads, 64 dim each)
| | +-- wq, wk, wv: Dense(256 -> 256)
| | +-- wo: Dense(256 -> 256)
| +-- ln2: RMSNorm(256)
| +-- ffn: SwiGLU(256 -> 640 -> 256)
+-- ln_f: RMSNorm(256)
+-- head: TiedEmbeddingHead -> (2000,)
```
### Key Design Choices
- **RoPE** (Rotary Position Embeddings): Relative position encoding applied to Q and K in each attention head, enabling length generalization
- **RMSNorm** (pre-norm): Root Mean Square normalization without learnable bias, applied before each sublayer
- **SwiGLU** FFN: Gated linear unit with Swish (SiLU) activation; hidden dim scaled by a 2/3 factor and rounded down to a multiple of 64, giving 640
- **Weight tying**: Input embedding and output projection share the same weight matrix, saving 512K parameters
- **No bias**: All linear layers use bias=false for parameter efficiency
- **No dropout**: Following Karpathy's recommendation for small models
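These three building blocks are small enough to sketch in plain Julia. The snippet below is an illustrative, self-contained re-statement of the ideas (not the repo's Lux.jl layers; all names here are hypothetical):

```julia
using Statistics  # mean

# RMSNorm: normalize by the root-mean-square over features, with a
# learnable gain g and no bias (matching the "no bias" design choice).
rmsnorm(x::AbstractVector, g; eps=1f-6) = g .* x ./ sqrt(mean(abs2, x) + eps)

# SwiGLU FFN: gate and up projections, Swish (SiLU) on the gate,
# elementwise product, then a down projection.
swish(x) = x .* inv.(1 .+ exp.(-x))
swiglu(x, Wg, Wu, Wd) = Wd * (swish(Wg * x) .* (Wu * x))

# RoPE: rotate consecutive feature pairs of a Q or K vector by a
# position-dependent angle, which encodes relative positions in the
# attention dot product. Rotation preserves the vector's norm.
function rope!(x::AbstractVector, pos; base=10_000.0)
    d = length(x)
    for i in 1:2:d-1
        theta = pos / base^((i - 1) / d)
        c, s = cos(theta), sin(theta)
        x[i], x[i+1] = c * x[i] - s * x[i+1], s * x[i] + c * x[i+1]
    end
    return x
end
```

Because RoPE is a pure rotation, applying it leaves vector norms unchanged, so it does not perturb the scale of attention logits.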
## Model Details
| Parameter | Value |
|---|---|
| Total parameters | 5,037,312 |
| Embedding dim | 256 |
| Layers | 6 |
| Attention heads | 4 |
| Head dim | 64 |
| FFN hidden dim | 640 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE |
| Weight tying | Yes |
### Parameter Breakdown
| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 10.2% |
| Attention (Q,K,V,O) x 6 | 1.57M | 31.2% |
| SwiGLU FFN x 6 | 2.95M | 58.5% |
| RMSNorm x 13 | 3.3K | <0.1% |
| **Total** | **5.04M** | |
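The breakdown can be re-derived from the configuration as a quick sanity check (arithmetic only; this is not the repo's counting code):

```julia
vocab, d, n_layers, ffn_mult = 2_000, 256, 6, 4

# SwiGLU hidden dim: 2/3 of the 4x width, rounded down to a multiple of 64.
hidden = 64 * fld(2 * ffn_mult * d ÷ 3, 64)   # 640

emb   = vocab * d                    # token embedding, tied with the output head
attn  = n_layers * 4 * d * d         # wq, wk, wv, wo per block, bias-free
ffn   = n_layers * 3 * d * hidden    # gate, up, and down projections
norms = (2 * n_layers + 1) * d       # ln1 + ln2 per block, plus ln_f

total = emb + attn + ffn + norms     # 5_037_312
```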
## Training
| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (Chinchilla-optimal: 20 tok/param) |
| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, cosine decay) |
| Warmup | 500 steps (linear) |
| Max steps | 12,305 |
| Batch size | 32 |
| Gradient clipping | 1.0 (global norm) |
| Precision | Float16 AMP (Float32 master weights) |
| Hardware | NVIDIA RTX 3060 12GB |
| Training time | 66 minutes |
| Throughput | ~26K tok/s |
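The learning-rate schedule from the table (linear warmup into cosine decay) has a simple closed form. A sketch of the schedule's shape, using the hyperparameters above (not the training loop itself):

```julia
# Linear warmup to max_lr over `warmup` steps, then cosine decay to min_lr.
function lr_at(step; max_lr=6e-4, min_lr=6e-5, warmup=500, max_steps=12_305)
    step < warmup && return max_lr * step / warmup            # linear warmup
    t = (step - warmup) / (max_steps - warmup)                # decay progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + cospi(t))  # cosine decay
end
```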
### Training Curves
| Step | Train Loss | Val Loss | Val PPL |
|---|---|---|---|
| 500 | 6.69 | 5.01 | 149.6 |
| 2,000 | 4.09 | 4.02 | 56.0 |
| 6,000 | 3.72 | 3.70 | 40.4 |
| 10,000 | 3.58 | 3.57 | 35.4 |
| 12,305 | 3.55 | **3.54** | **34.5** |
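The Val PPL column is just the exponential of the validation cross-entropy loss:

```julia
# Perplexity is exp of the (natural-log) cross-entropy loss.
perplexity(loss) = exp(loss)
round(perplexity(3.54); digits=1)  # the final row: val loss 3.54 → PPL 34.5
```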
## Implementation
Built entirely in Julia:
- **[Lux.jl](https://github.com/LuxDL/Lux.jl)** — Explicit-parameter neural network framework
- **[Zygote.jl](https://github.com/FluxML/Zygote.jl)** — Automatic differentiation
- **[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)** — GPU acceleration
- **[NNlib.jl](https://github.com/FluxML/NNlib.jl)** — Softmax, activations, batched_mul
- **[Optimisers.jl](https://github.com/FluxML/Optimisers.jl)** — AdamW with cosine LR
Inference runs on CPU using pure NNlib operations (no Lux dependency at runtime).
## Usage
### OpenAI-Compatible API
Served via [JuliaSLM Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaSLM):
```bash
curl -X POST https://lisamegawatts-juliaslm.hf.space/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "the nature of"}],
"max_tokens": 200,
"temperature": 0.8,
"top_k": 40
}'
```
### Load in Julia
```julia
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT; using .JuliaGPT: Lux
tok = BPETokenizer("vocab.json", "merges.txt")
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device=Lux.cpu_device())
model = create_model(ModelConfig(;
arch="transformer", vocab_size=vocab_size(tok),
embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
ffn_mult=4, context_length=256, weight_tying=true,
))
text = generate(model, ps, st, tok, "the nature of ";
max_new_tokens=200, temperature=0.8, top_k=40)
```
## Files
| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |
## Provenance
- **Author**: LisaMegaWatts
- **Training code**: [DavinciDreams/julia-slm](https://github.com/DavinciDreams/julia-slm)
- **Data pipeline**: [DavinciDreams/text-pipeline](https://github.com/DavinciDreams/text-pipeline)
- **Training date**: February 2026
- **Architecture reference**: nanoGPT (Karpathy, 2023) adapted for Julia/Lux.jl
## Citation
```bibtex
@misc{juliaslm2026,
title={JuliaSLM: A Small Language Model in Pure Julia},
author={LisaMegaWatts},
year={2026},
url={https://huggingface.co/LisaMegaWatts/JuliaSLM}
}
```
## License
MIT