# JuliaFluxGPT-1M

A 1.01M-parameter LLaMA-style decoder-only transformer trained on the curated philosophy corpus. It serves as a scaling-law data point for the JuliaFluxGPT model family.
## Model Details

| Parameter | Value |
|---|---|
| Parameters | 1,010,816 (1.01M) |
| Architecture | LLaMA-style: RMSNorm, SwiGLU, RoPE, weight-tied output |
| d_model | 128 |
| Layers | 4 |
| Attention | 4Q/4KV (full MHA), head_dim=32 |
| FFN | SwiGLU (inner_dim=192) |
| Vocab | 2,000 BPE tokens |
| Context | 256 tokens |
| Precision | bf16 |
## Training

| Detail | Value |
|---|---|
| Dataset | Curated philosophy corpus (266M tokens) |
| Tokenizer | BPE, vocab=2000 |
| Steps | 4,932 (4× Chinchilla budget) |
| Tokens seen | 80M (80:1 tokens/param) |
| Batch size | 64 |
| Optimizer | AdamW (β1=0.9, β2=0.95, wd=0.1) |
| LR | 1e-3 → 1e-4 (cosine, 200 warmup steps) |
| Hardware | NVIDIA T4 16GB, Colab |
| Throughput | ~25K tok/s |
| W&B run | p7yt1too |
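The token budget above follows directly from the batch geometry; a quick sanity check (all values taken from the tables in this card):

```python
# Sanity-check the token budget from the training geometry.
steps = 4932
batch_size = 64
context_length = 256
params = 1_010_816

tokens_seen = steps * batch_size * context_length
print(f"tokens seen: {tokens_seen:,}")                # ~80.8M
print(f"tokens/param: {tokens_seen / params:.0f}")    # ~80:1

# Chinchilla-optimal (~20 tokens/param) would need ~20.2M tokens,
# i.e. roughly step 1234 -- matching the "Chinchilla boundary"
# marked near step 1250 in the training curve below.
chinchilla_step = 20 * params / (batch_size * context_length)
print(f"Chinchilla boundary: step {chinchilla_step:.0f}")
```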
## Training Curve

| Step | Val Loss | Val PPL |
|---|---|---|
| 250 | 6.539 | 691.5 |
| 500 | 5.445 | 231.6 |
| 1000 | 4.957 | 142.2 |
| 1250 | 4.851 | 127.8 (Chinchilla boundary) |
| 2000 | 4.777 | 118.7 |
| 3000 | 4.607 | 100.2 |
| 4000 | 4.497 | 89.8 |
| 4932 | 4.446 | 85.3 |
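The PPL column is simply the exponential of the validation loss; the final checkpoint can be checked in one line:

```python
import math

# Val PPL = exp(val loss); verify the final row of the training curve.
val_loss = 4.446
ppl = math.exp(val_loss)
print(f"PPL at step 4932: {ppl:.1f}")  # ~85.3
```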
## Results: SLM Scaling Laws

All models were trained on the same curated philosophy corpus (BPE vocab=2000, ctx=256):

| Model | Params | Val Loss | PPL | Architecture |
|---|---|---|---|---|
| JuliaFluxGPT-1M | 1.01M | 4.45 | 85.3 | LLaMA MHA |
| SymbioSLM | 4.07M | 3.62 | 37.3 | 3-organelle (Lux.jl) |
| MonarchSLM | 4.98M | 3.65 | 38.4 | Monarch Mixer (Lux.jl) |
| JuliaSLM | 5.04M | 3.54 | 34.5 | Transformer MHA (Lux.jl) |
| SymbioGPT-10M | 11.05M | 3.56 | 35.2 | 4-organelle (PyTorch) |
### Key Findings

- 1M → 5M (5× params): loss drops 4.45 → 3.54 (−0.91); PPL falls 85 → 35 (−59%)
- 5M → 11M (2.2× params): loss 3.54 → 3.56, a plateau
- Architecture matters less than data quality at this scale: JuliaSLM, MonarchSLM, and SymbioSLM all converge to ~3.6 loss despite very different mixing mechanisms
- The curated 266M-token corpus appears to be the binding constraint above ~5M params
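The 1M → 5M deltas quoted above can be re-derived directly from the results table:

```python
# Re-derive the 1M -> 5M deltas from the results table.
loss_1m, ppl_1m = 4.45, 85.3   # JuliaFluxGPT-1M
loss_5m, ppl_5m = 3.54, 34.5   # JuliaSLM

print(f"loss drop: {loss_1m - loss_5m:.2f}")         # 0.91
print(f"PPL reduction: {1 - ppl_5m / ppl_1m:.1%}")   # ~59.6%
```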
## Usage

```python
import torch

from juliaflux_model import JuliaFluxConfig, JuliaFluxGPT

# Config must match the trained checkpoint (see Model Details above).
config = JuliaFluxConfig(
    d_model=128, n_layers=4, n_heads=4, n_kv_heads=4,
    head_dim=32, context_length=256, vocab_size=2000,
    weight_tying=True,
)
model = JuliaFluxGPT(config)

# Load the best checkpoint on CPU; move to GPU afterwards if available.
ckpt = torch.load("juliaflux_1m_best.pt", map_location="cpu")
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
```
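Once the checkpoint is loaded, text can be generated with a simple greedy loop. The sketch below is not the repo's API: it assumes the model's forward pass yields next-token logits, and it demonstrates the loop with a toy logits function so it runs standalone.

```python
# Hedged sketch of greedy decoding (not from the repo).
# `logits_fn` maps a token-id list to last-position logits (len == vocab).
def greedy_decode(logits_fn, prompt_ids, max_new_tokens=32, eos_id=None):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

# Toy stand-in for the model: always prefers token (last_id + 1) mod 5.
toy = lambda ids: [1.0 if t == (ids[-1] + 1) % 5 else 0.0 for t in range(5)]
print(greedy_decode(toy, [0], max_new_tokens=4))  # [0, 1, 2, 3, 4]
```

With the real model, `toy` would be replaced by something like `lambda ids: model(torch.tensor([ids]))[0, -1].tolist()`, assuming the forward returns `(batch, seq, vocab)` logits; that shape is an assumption, not confirmed by this card.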
## Links