JuliaFluxGPT

A ~23M parameter LLaMA-style decoder-only model with Grouped Query Attention (GQA), trained on classical philosophy and mathematics texts, implemented in Julia with Flux.jl.

Model Family Context

JuliaFluxGPT is the largest model in the Julia SLM collection, using a different framework (Flux.jl vs Lux.jl) and a more modern attention design (GQA):

Model          Framework  Architecture           Params  Attention
JuliaFluxGPT   Flux.jl    LLaMA-style GQA        ~23M    8Q/2KV GQA
SymbioGPT-10M  PyTorch    4-organelle SymbioGPT  11.6M   OrganelleGate
JuliaSLM       Lux.jl     Transformer            5.04M   4-head MHA
MonarchSLM     Lux.jl     Monarch Mixer          4.98M   8-head Monarch
SymbioSLM     Lux.jl     Symbiogenesis          ~4.1M   3 organelles
MicroJulia     Flux.jl    GPT-2 style            ~1M     Standard MHA

Architecture

GPT (LLaMA-style)
+-- wte: Embedding(2000 -> 512)         [weight-tied with output projection]
+-- blocks x 8:
|   +-- ln1: RMSNorm(512)
|   +-- attn: CausalSelfAttention
|   |   +-- wq: Dense(512 -> 512)       [8 query heads, 64 dim each]
|   |   +-- wkv: Dense(512 -> 256)      [2 KV heads, 64 dim each, fused K+V]
|   |   +-- proj: Dense(512 -> 512)
|   +-- ln2: RMSNorm(512)
|   +-- ffwd: SwiGLUFFN
|       +-- w_gate: Dense(512 -> 1344)  [gate path]
|       +-- w_up: Dense(512 -> 1344)    [value path]
|       +-- w_down: Dense(1344 -> 512)
+-- ln_f: RMSNorm(512)
+-- [output: weight-tied with wte]

Grouped Query Attention (GQA)

GQA (Ainslie et al., 2023) uses fewer key-value heads than query heads, reducing KV-cache memory during inference while maintaining quality:

  • 8 query heads (64 dim each) = full expressiveness in queries
  • 2 KV heads (64 dim each) = 4x KV memory reduction
  • 4 query heads per KV group = each KV head is shared by 4 query heads
  • KV heads are repeated (expanded) to match query head count before attention computation
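The expansion step can be sketched with plain arrays (a hypothetical `repeat_kv` helper for illustration; the model's own code may differ):

```julia
# Sketch of GQA KV-head expansion. K has shape (head_dim, T, n_kv_head, B);
# each KV head is repeated n_head ÷ n_kv_head times to line up with the query heads.
function repeat_kv(k::AbstractArray{<:Real,4}, n_rep::Int)
    hd, T, n_kv, B = size(k)
    k5 = reshape(k, hd, T, 1, n_kv, B)        # add a repeat axis
    k5 = repeat(k5, 1, 1, n_rep, 1, 1)        # copy each KV head n_rep times
    reshape(k5, hd, T, n_kv * n_rep, B)       # fold repeats into the head axis
end

k = rand(64, 16, 2, 1)         # (head_dim, seq_len, n_kv_head, batch)
k_expanded = repeat_kv(k, 4)   # 8 query heads / 2 KV heads = 4 repeats
size(k_expanded)               # (64, 16, 8, 1)
```

With this layout, query heads 1–4 attend against copies of KV head 1 and query heads 5–8 against copies of KV head 2.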

Attention parameter savings:

  • Standard MHA: Q(512x512) + K(512x512) + V(512x512) + O(512x512) = 1,048,576
  • GQA 8Q/2KV: Q(512x512) + KV(512x256) + O(512x512) = 655,360 (37.5% reduction)
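The arithmetic above checks out directly (assuming no bias terms, as the model details table states):

```julia
# Verify the attention parameter counts for standard MHA vs 8Q/2KV GQA.
d = 512
mha = 4 * d * d              # separate Q, K, V, O projections
gqa = d*d + d*256 + d*d      # Q, fused KV (2 heads × 64 dims × (K and V) = 256), O
savings = 1 - gqa / mha      # 0.375
```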

RoPE (Rotary Position Embeddings)

Applied to Q and K after projection, before attention scores:

cos_cache, sin_cache = precompute_rope_freqs(head_dim=64, max_seq_len=256)
q = apply_rope(q, cos_cache, sin_cache, T)   # T = current sequence length
k = apply_rope(k, cos_cache, sin_cache, T)
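A dependency-free sketch of one common RoPE formulation, the "rotate-half" variant, is below. The function names mirror the pseudocode above, but this is an illustration under stated assumptions, not the repository's code:

```julia
# Precompute per-position rotation angles: freq_i = base^(-2i/head_dim).
function rope_caches(head_dim::Int, max_len::Int; base = 10_000.0)
    half = head_dim ÷ 2
    freqs = base .^ (-(0:half-1) .* 2 ./ head_dim)   # (half,)
    angles = freqs .* (0:max_len-1)'                 # (half, max_len)
    cos.(angles), sin.(angles)
end

# Rotate each (x1_i, x2_i) pair of a head's dims by its position-dependent angle.
function apply_rope(x::AbstractMatrix, cosc, sinc)   # x: (head_dim, T)
    half = size(x, 1) ÷ 2; T = size(x, 2)
    x1, x2 = x[1:half, :], x[half+1:end, :]
    c, s = cosc[:, 1:T], sinc[:, 1:T]
    vcat(x1 .* c .- x2 .* s, x1 .* s .+ x2 .* c)
end

cosc, sinc = rope_caches(64, 256)
q = rand(64, 16)
q_rot = apply_rope(q, cosc, sinc)   # same shape; position 1 is unrotated
```

Because each step is a plane rotation, vector norms are preserved, so attention magnitudes are unchanged by the position encoding.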

SwiGLU FFN

hidden = max(64, round_to_64(4 * 512 * 2/3)) = 1344   # 1365.33 rounded to the nearest multiple of 64
gate = swish(w_gate(x))
value = w_up(x)
output = w_down(gate * value)
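The three lines above can be run end to end with plain arrays (random stand-in weights, not the trained model's):

```julia
# SwiGLU FFN sketch: gated feed-forward with a swish (SiLU) gate.
d = 512
h = max(64, 64 * round(Int, 4d * 2 / 3 / 64))   # 1365.33 → nearest multiple of 64 = 1344

swish(x) = x ./ (1 .+ exp.(-x))                 # SiLU / swish activation

w_gate, w_up, w_down = randn(h, d), randn(h, d), randn(d, h)
swiglu(x) = w_down * (swish(w_gate * x) .* (w_up * x))

x = randn(d, 16)            # (n_embd, seq_len)
y = swiglu(x)               # (512, 16)
```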

Model Details

Parameter Value
Total parameters ~23M (22,790,656)
Embedding dim 512
Layers 8
Query heads 8
KV heads 2 (GQA ratio = 4:1)
Head dim 64
FFN hidden dim 1344
Context length 256 tokens
Vocabulary 2,000 (ByteLevel BPE)
Position encoding RoPE (base=10000)
Weight tying Yes (forward pass uses wte.weight directly)
Bias false (all layers)
Dropout 0.1 (training), 0.0 (inference)
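The stated total can be reproduced from the table above (assuming no biases and a single scale vector per RMSNorm, consistent with the architecture diagram):

```julia
# Reconstruct the stated total of 22,790,656 parameters.
d, vocab, layers, ffn = 512, 2000, 8, 1344

emb   = vocab * d               # wte, tied with the output head, counted once
attn  = d*d + d*256 + d*d       # wq + fused wkv (Dense(512 -> 256)) + proj
ffwd  = 2 * d * ffn + ffn * d   # w_gate, w_up, w_down
norms = 2 * d                   # ln1 + ln2

total = emb + layers * (attn + ffwd + norms) + d   # + final RMSNorm ln_f
# total == 22_790_656
```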

Training

Setting Value
Dataset philosophy-corpus
Corpus Classical philosophy and mathematics texts
Tokenizer BPE (HuggingFace tokenizer.json format, 2000 tokens)
Framework Julia + Flux.jl
Hardware NVIDIA RTX 3060 12GB
Precision Float32
Best val loss 6.622 (step 28998)
Dropout 0.1

Implementation Notes

Flux.jl vs Lux.jl

JuliaFluxGPT uses Flux.jl (implicit parameters, @layer macro) rather than Lux.jl (explicit parameters). Key differences:

                   Flux.jl (this model)               Lux.jl (JuliaSLM family)
Parameter style    Implicit (stored in model struct)  Explicit (separate ps NamedTuple)
State management   Flux.testmode!()                   Explicit state st
Serialization      Flux.loadmodel!()                  JLD2 direct load
AD backend         Zygote                             Zygote

Weight Tying Implementation

Weight tying is implemented in the forward pass rather than through a separate tied layer:

function (m::GPT)(idx)
    # ... embedding lookup, then forward through the 8 blocks; x is (C, T, B) ...
    x = m.ln_f(x)
    C, T, B = size(x)
    W = m.wte.weight                   # reuse embedding weights, shape (C, vocab_size)
    out = W' * reshape(x, C, T * B)    # transposed matmul onto the vocabulary
    reshape(out, size(W, 2), T, B)     # logits: (vocab_size, T, B)
end

This avoids complications with Flux.loadmodel! when loading checkpoints.
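At toy scale, the tied projection reduces to a single transposed matmul. The snippet below uses random stand-in weights; Flux's Embedding stores its weight as (out, in), i.e. (n_embd, vocab_size), which is what makes the transpose line up:

```julia
# Toy-scale check of the tied output projection.
C, T, B, vocab = 4, 3, 2, 5
W = randn(C, vocab)             # stand-in for wte.weight: (n_embd, vocab_size)
x = randn(C, T, B)              # hidden states after the final RMSNorm

out = reshape(W' * reshape(x, C, T * B), vocab, T, B)
size(out)                       # (5, 3, 2) — logits over the vocabulary
```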

Usage

OpenAI-Compatible API

Served via JuliaFluxGPT Space:

curl -X POST https://lisamegawatts-juliafluxgpt.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'

Streaming supported with "stream": true.

Files

File Description
best_model.jld2 Best checkpoint (step 28998, val_loss=6.622)
final_model.jld2 Final checkpoint
checkpoint_latest.jld2 Latest training checkpoint
tokenizer.json BPE tokenizer (HuggingFace format, 2000 tokens)

Checkpoint contains:

  • model_state — Flux model weights
  • hyperparams — Dict with vocab_size, n_embd, block_size, n_layer, n_head, n_kv_head
  • step — Training step at checkpoint
  • best_val_loss — Best validation loss achieved

Provenance

  • Author: LisaMegaWatts
  • Source: DavinciDreams/symbiogenesis
  • Training notebook: juliaflux_v2.ipynb
  • Training date: February 2026
  • Architecture reference: LLaMA (Touvron et al., 2023) with GQA (Ainslie et al., 2023)

References

  • Touvron, H., et al. (2023). LLaMA: Open and Efficient Foundation Language Models.
  • Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.
  • Karpathy, A. (2023). nanoGPT. GitHub repository.

Citation

@misc{juliafluxgpt2026,
  title={JuliaFluxGPT: A LLaMA-style GQA Model in Julia/Flux.jl},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/JuliaFluxGPT}
}

License

MIT
