# JuliaFluxGPT-1M

A 1.01M-parameter LLaMA-style decoder-only transformer trained on the curated philosophy corpus. It serves as a scaling-law data point for the JuliaFluxGPT model family.
## Model Details

| Parameter | Value |
|---|---|
| Parameters | 1,010,816 (1.01M) |
| Architecture | LLaMA-style: RMSNorm, SwiGLU, RoPE, weight-tied output |
| d_model | 128 |
| Layers | 4 |
| Attention | 4Q/4KV (full MHA), head_dim=32 |
| FFN | SwiGLU (inner_dim=192) |
| Vocab | 2,000 BPE tokens |
| Context | 256 tokens |
| Precision | bf16 |
## Training

| Detail | Value |
|---|---|
| Dataset | Curated philosophy corpus (266M tokens) |
| Tokenizer | BPE, vocab=2000 |
| Steps | 4,932 (4× Chinchilla budget) |
| Tokens seen | 80M (80:1 tokens/param) |
| Batch size | 64 |
| Optimizer | AdamW (β1=0.9, β2=0.95, wd=0.1) |
| LR | 1e-3 → 1e-4 (cosine, 200 warmup steps) |
| Hardware | NVIDIA T4 16GB, Colab |
| Throughput | ~25K tok/s |
| W&B run | p7yt1too |
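The token budget above follows directly from the batch geometry; a quick sanity check (all values taken from the tables in this card):

```python
# Sanity-check the token budget from the training geometry.
steps = 4932
batch_size = 64
context_length = 256
params = 1_010_816

tokens_seen = steps * batch_size * context_length
print(f"tokens seen: {tokens_seen:,}")                # ~80.8M
print(f"tokens/param: {tokens_seen / params:.0f}")    # ~80:1

# Chinchilla-optimal (~20 tokens/param) would need ~20.2M tokens,
# i.e. roughly step 1234 -- matching the "Chinchilla boundary"
# marked near step 1250 in the training curve below.
chinchilla_step = 20 * params / (batch_size * context_length)
print(f"Chinchilla boundary: step {chinchilla_step:.0f}")
```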
## Training Curve

| Step | Val Loss | Val PPL |
|---|---|---|
| 250 | 6.539 | 691.5 |
| 500 | 5.445 | 231.6 |
| 1000 | 4.957 | 142.2 |
| 1250 | 4.851 | 127.8 (Chinchilla boundary) |
| 2000 | 4.777 | 118.7 |
| 3000 | 4.607 | 100.2 |
| 4000 | 4.497 | 89.8 |
| 4932 | 4.446 | 85.3 |
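The PPL column is simply the exponential of the validation loss; the final checkpoint can be checked in one line:

```python
import math

# Val PPL = exp(val loss); verify the final row of the training curve.
val_loss = 4.446
ppl = math.exp(val_loss)
print(f"PPL at step 4932: {ppl:.1f}")  # ~85.3
```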
## Results: SLM Scaling Laws

All models were trained on the same curated philosophy corpus (BPE vocab=2000, ctx=256):

| Model | Params | Val Loss | PPL | Architecture |
|---|---|---|---|---|
| JuliaFluxGPT-1M | 1.01M | 4.45 | 85.3 | LLaMA MHA |
| SymbioSLM | 4.07M | 3.62 | 37.3 | 3-organelle (Lux.jl) |
| MonarchSLM | 4.98M | 3.65 | 38.4 | Monarch Mixer (Lux.jl) |
| JuliaSLM | 5.04M | 3.54 | 34.5 | Transformer MHA (Lux.jl) |
| SymbioGPT-10M | 11.05M | 3.56 | 35.2 | 4-organelle (PyTorch) |
### Key Findings

- 1M → 5M (5× params): loss drops 4.45 → 3.54 (−0.91); PPL falls 85 → 35 (−59%)
- 5M → 11M (2.2× params): loss 3.54 → 3.56, a plateau
- Architecture matters less than data quality at this scale: JuliaSLM, MonarchSLM, and SymbioSLM all converge to ~3.6 loss despite very different mixing mechanisms
- The curated 266M-token corpus appears to be the binding constraint above ~5M params
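The 1M → 5M deltas quoted above can be re-derived directly from the results table:

```python
# Re-derive the 1M -> 5M deltas from the results table.
loss_1m, ppl_1m = 4.45, 85.3   # JuliaFluxGPT-1M
loss_5m, ppl_5m = 3.54, 34.5   # JuliaSLM

print(f"loss drop: {loss_1m - loss_5m:.2f}")         # 0.91
print(f"PPL reduction: {1 - ppl_5m / ppl_1m:.1%}")   # ~59.6%
```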
## Usage

```python
import torch

from juliaflux_model import JuliaFluxConfig, JuliaFluxGPT

# Config must match the trained checkpoint (see Model Details above).
config = JuliaFluxConfig(
    d_model=128, n_layers=4, n_heads=4, n_kv_heads=4,
    head_dim=32, context_length=256, vocab_size=2000,
    weight_tying=True,
)
model = JuliaFluxGPT(config)

# Load the best checkpoint on CPU; move to GPU afterwards if available.
ckpt = torch.load("juliaflux_1m_best.pt", map_location="cpu")
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
```
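Once the checkpoint is loaded, text can be generated with a simple greedy loop. The sketch below is not the repo's API: it assumes the model's forward pass yields next-token logits, and it demonstrates the loop with a toy logits function so it runs standalone.

```python
# Hedged sketch of greedy decoding (not from the repo).
# `logits_fn` maps a token-id list to last-position logits (len == vocab).
def greedy_decode(logits_fn, prompt_ids, max_new_tokens=32, eos_id=None):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

# Toy stand-in for the model: always prefers token (last_id + 1) mod 5.
toy = lambda ids: [1.0 if t == (ids[-1] + 1) % 5 else 0.0 for t in range(5)]
print(greedy_decode(toy, [0], max_new_tokens=4))  # [0, 1, 2, 3, 4]
```

With the real model, `toy` would be replaced by something like `lambda ids: model(torch.tensor([ids]))[0, -1].tolist()`, assuming the forward returns `(batch, seq, vocab)` logits; that shape is an assumption, not confirmed by this card.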
## Links