# JuliaFluxGPT-1M

A 1.01M-parameter LLaMA-style decoder-only transformer trained on the curated philosophy corpus, serving as a scaling-law data point for the JuliaFluxGPT model family.

## Model Details

| Parameter | Value |
|---|---|
| Parameters | 1,010,816 (1.01M) |
| Architecture | LLaMA-style: RMSNorm, SwiGLU, RoPE, weight-tied output |
| d_model | 128 |
| Layers | 4 |
| Attention | 4Q/4KV (full MHA), head_dim=32 |
| FFN | SwiGLU (inner_dim=192) |
| Vocab | 2,000 BPE tokens |
| Context | 256 tokens |
| Precision | bf16 |
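The table names RMSNorm and SwiGLU as the normalization and FFN components. As a hedged sketch (class names and implementation details are illustrative, not taken from the repo), these are commonly implemented as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm, LLaMA-style: no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale each position to unit RMS, then apply a learned per-channel gain.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated FFN: down(silu(gate(x)) * up(x)); inner_dim=192 per the table above."""
    def __init__(self, dim: int, inner: int):
        super().__init__()
        self.gate = nn.Linear(dim, inner, bias=False)
        self.up = nn.Linear(dim, inner, bias=False)
        self.down = nn.Linear(inner, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))
```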

## Training

| Detail | Value |
|---|---|
| Dataset | Curated philosophy corpus (266M tokens) |
| Tokenizer | BPE, vocab=2000 |
| Steps | 4,932 (4× Chinchilla budget) |
| Tokens seen | 80M (80:1 tokens/param) |
| Batch size | 64 |
| Optimizer | AdamW (β1=0.9, β2=0.95, wd=0.1) |
| LR | 1e-3 → 1e-4 (cosine, 200 warmup) |
| Hardware | NVIDIA T4 16GB, Colab |
| Throughput | ~25K tok/s |
| W&B run | p7yt1too |
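The LR row (1e-3 → 1e-4 cosine with 200 warmup steps over 4,932 total steps) can be sketched as a schedule function. This is the common linear-warmup-plus-cosine-decay formulation; the exact schedule used in training is an assumption:

```python
import math

def lr_at(step, max_lr=1e-3, min_lr=1e-4, warmup=200, total=4932):
    """Linear warmup to max_lr, then cosine decay to min_lr (a common sketch)."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)  # 0 at end of warmup, 1 at final step
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```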

### Training Curve

| Step | Val Loss | Val PPL | Note |
|---|---|---|---|
| 250 | 6.539 | 691.5 | |
| 500 | 5.445 | 231.6 | |
| 1000 | 4.957 | 142.2 | |
| 1250 | 4.851 | 127.8 | Chinchilla boundary |
| 2000 | 4.777 | 118.7 | |
| 3000 | 4.607 | 100.2 | |
| 4000 | 4.497 | 89.8 | |
| 4932 | 4.446 | 85.3 | |
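The reported figures are internally consistent and can be sanity-checked in a few lines: PPL is exp of the validation loss, and the token budget follows from steps × batch × context:

```python
import math

# Perplexity = exp(cross-entropy loss); final row: loss 4.446 -> PPL ~85.3.
final_ppl = math.exp(4.446)

# Tokens seen = steps * batch * context = 4932 * 64 * 256 ~= 80.8M,
# i.e. ~80 tokens per parameter, roughly 4x the Chinchilla-optimal ~20:1.
tokens = 4932 * 64 * 256
ratio = tokens / 1_010_816
```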

## Results: SLM Scaling Laws

All models trained on the same curated philosophy corpus (BPE vocab=2000, ctx=256):

| Model | Params | Val Loss | PPL | Architecture |
|---|---|---|---|---|
| JuliaFluxGPT-1M | 1.01M | 4.45 | 85.3 | LLaMA MHA |
| SymbioSLM | 4.07M | 3.62 | 37.3 | 3-organelle (Lux.jl) |
| MonarchSLM | 4.98M | 3.65 | 38.4 | Monarch Mixer (Lux.jl) |
| JuliaSLM | 5.04M | 3.54 | 34.5 | Transformer MHA (Lux.jl) |
| SymbioGPT-10M | 11.05M | 3.56 | 35.2 | 4-organelle (PyTorch) |

### Key Findings

  • 1M β†’ 5M (5Γ— params): loss drops 4.45 β†’ 3.54 (βˆ’0.91), PPL 85 β†’ 35 (βˆ’59%)
  • 5M β†’ 11M (2.2Γ— params): loss 3.54 β†’ 3.56 β€” plateau reached
  • Architecture matters less than data quality at this scale β€” JuliaSLM, MonarchSLM, and SymbioSLM all converge to ~3.6 despite very different mixing mechanisms
  • The curated 266M-token corpus appears to be the binding constraint above ~5M params
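The 1M → 5M finding implies a rough scaling slope, which a two-point estimate makes concrete (this is an estimate from the table above, not a fitted scaling law):

```python
import math

# Loss improvement per decade of parameters between the 1.01M and 5.04M points:
# (3.54 - 4.45) / log10(5.04e6 / 1.01e6) ~= -1.3 loss per 10x params.
slope = (3.54 - 4.45) / (math.log10(5.04e6) - math.log10(1.01e6))
```

A slope of roughly −1.3 per decade would predict a sizable gain at 11M params, which the flat 3.54 → 3.56 result contradicts, reinforcing the data-constrained interpretation.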

## Usage

```python
import torch
from juliaflux_model import JuliaFluxConfig, JuliaFluxGPT

config = JuliaFluxConfig(
    d_model=128, n_layers=4, n_heads=4, n_kv_heads=4,
    head_dim=32, context_length=256, vocab_size=2000,
    weight_tying=True,
)
model = JuliaFluxGPT(config)
ckpt = torch.load("juliaflux_1m_best.pt", map_location="cpu")
model.load_state_dict(ckpt["model_state_dict"])
model.eval()  # inference mode
```
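For generation, a minimal greedy-decoding loop can be layered on top. This sketch assumes the model is a callable returning logits of shape `[batch, seq, vocab]`; JuliaFluxGPT's actual forward signature and any sampling utilities in the repo are not confirmed here:

```python
import torch

@torch.no_grad()
def greedy_generate(model, ids, max_new_tokens=50, context_length=256):
    """Append argmax tokens one at a time, truncating input to the context window.

    Assumes model(ids) -> logits of shape [batch, seq, vocab]; this signature
    is an assumption, not taken from juliaflux_model.
    """
    out = ids
    for _ in range(max_new_tokens):
        logits = model(out[:, -context_length:])
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        out = torch.cat([out, next_id], dim=1)
    return out
```

With a trained model in eval mode and a BPE-encoded prompt tensor, `greedy_generate(model, prompt_ids)` returns the prompt with up to 50 greedily chosen continuation tokens appended.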
