---
language:
- en
license: mit
library_name: flux
tags:
- julia
- flux-jl
- llama-style
- gqa
- grouped-query-attention
- rope
- rmsnorm
- swiglu
- bpe
- philosophy
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
---

# JuliaFluxGPT

A ~23M-parameter LLaMA-style decoder-only model with Grouped Query Attention (GQA), trained on classical philosophy and mathematics texts, implemented in Julia with Flux.jl.

## Model Family Context

JuliaFluxGPT is the **largest model** in the Julia SLM collection, using a different framework (Flux.jl vs Lux.jl) and a more modern attention design (GQA):

| Model | Framework | Architecture | Params | Attention |
|---|---|---|---|---|
| **JuliaFluxGPT** | **Flux.jl** | **LLaMA-style GQA** | **~23M** | **8Q/2KV GQA** |
| [SymbioGPT-10M](https://huggingface.co/LisaMegaWatts/SymbioGPT-10M) | PyTorch | 4-organelle SymbioGPT | 11.6M | OrganelleGate |
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Lux.jl | Transformer | 5.04M | 4-head MHA |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Lux.jl | Monarch Mixer | 4.98M | 8-head Monarch |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Lux.jl | Symbiogenesis | ~4.1M | 3 organelles |
| [MicroJulia](https://huggingface.co/LisaMegaWatts/MicroJulia) | Flux.jl | GPT-2 style | ~1M | Standard MHA |

## Architecture

```
GPT (LLaMA-style)
+-- wte: Embedding(2000 -> 512)        [weight-tied with output projection]
+-- blocks x 8:
|   +-- ln1: RMSNorm(512)
|   +-- attn: CausalSelfAttention
|   |   +-- wq: Dense(512 -> 512)      [8 query heads, 64 dim each]
|   |   +-- wkv: Dense(512 -> 256)     [2 KV heads, 64 dim each, fused K+V]
|   |   +-- proj: Dense(512 -> 512)
|   +-- ln2: RMSNorm(512)
|   +-- ffwd: SwiGLUFFN
|       +-- w_gate: Dense(512 -> 1344) [gate path]
|       +-- w_up: Dense(512 -> 1344)   [value path]
|       +-- w_down: Dense(1344 -> 512)
+-- ln_f: RMSNorm(512)
+-- [output: weight-tied with wte]
```

### Grouped Query Attention (GQA)

GQA (Ainslie et al.,
2023) uses fewer key-value heads than query heads, reducing KV-cache memory during inference while maintaining quality:

- **8 query heads** (64 dim each) = full expressiveness in queries
- **2 KV heads** (64 dim each) = 4x KV-cache memory reduction
- **4 query heads per KV group** = each KV head is shared by 4 query heads
- KV heads are repeated (expanded) to match the query head count before attention computation

**Attention parameter savings:**

- Standard MHA: Q(512x512) + K(512x512) + V(512x512) + O(512x512) = 1,048,576
- GQA 8Q/2KV: Q(512x512) + KV(512x256) + O(512x512) = 655,360 (37.5% reduction)

### RoPE (Rotary Position Embeddings)

Applied to Q and K after projection, before the attention scores are computed:

```
cos_cache, sin_cache = precompute_rope_freqs(head_dim=64, max_seq_len=256)
q_rotated = apply_rope(q, cos, sin, T)
k_rotated = apply_rope(k, cos, sin, T)
```

### SwiGLU FFN

```
hidden = max(64, round_to_64(4 * 512 * 2/3)) = 1344
gate   = swish(w_gate(x))
value  = w_up(x)
output = w_down(gate * value)
```

## Model Details

| Parameter | Value |
|---|---|
| Total parameters | ~23M (22,790,656) |
| Embedding dim | 512 |
| Layers | 8 |
| Query heads | 8 |
| KV heads | 2 (GQA ratio = 4:1) |
| Head dim | 64 |
| FFN hidden dim | 1344 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE (base=10000) |
| Weight tying | Yes (forward pass uses wte.weight directly) |
| Bias | false (all layers) |
| Dropout | 0.1 (training), 0.0 (inference) |

## Training

| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | Classical philosophy and mathematics texts |
| Tokenizer | BPE (HuggingFace tokenizer.json format, 2,000 tokens) |
| Framework | Julia + Flux.jl |
| Hardware | NVIDIA RTX 3060 12GB |
| Precision | Float32 |
| Best val loss | 6.622 (step 28998) |
| Dropout | 0.1 |

## Implementation Notes

### Flux.jl vs Lux.jl

JuliaFluxGPT uses **Flux.jl** (implicit parameters, `@layer`
macro) rather than Lux.jl (explicit parameters). Key differences:

| | Flux.jl (this model) | Lux.jl (JuliaSLM family) |
|---|---|---|
| Parameter style | Implicit (stored in the model struct) | Explicit (separate `ps` NamedTuple) |
| State management | `Flux.testmode!()` | Explicit state `st` |
| Serialization | `Flux.loadmodel!()` | JLD2 direct load |
| AD backend | Zygote | Zygote |

### Weight Tying Implementation

Weight tying is implemented in the forward pass rather than through a separate tied layer:

```julia
function (m::GPT)(idx)
    # ... forward through blocks ...
    x = m.ln_f(x)                    # x has shape (C, T, B): embed dim, seq len, batch
    W = m.wte.weight                 # reuse embedding weights for the output projection
    out = W' * reshape(x, C, T * B)  # transposed matmul against the embedding table
    reshape(out, vocab_size, T, B)
end
```

This avoids complications with `Flux.loadmodel!` when loading checkpoints.

## Usage

### OpenAI-Compatible API

Served via the [JuliaFluxGPT Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaFluxGPT):

```bash
curl -X POST https://lisamegawatts-juliafluxgpt.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

Streaming is supported with `"stream": true`.
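For readers implementing the GQA mechanism described above, the KV-head expansion (each of the 2 KV heads repeated to serve 4 query heads) can be sketched in plain Julia. This is an illustrative, dependency-free sketch, not the model's actual code; the `repeat_kv` helper and the (head_dim, T, heads) array layout are assumptions for demonstration:

```julia
# Sketch of GQA KV-head expansion. Shapes follow the model card:
# 8 query heads and 2 KV heads of dimension 64, so each KV head
# is shared by 8 ÷ 2 = 4 query heads.
const N_HEAD    = 8                    # query heads
const N_KV_HEAD = 2                    # key/value heads
const HEAD_DIM  = 64
const GROUP     = N_HEAD ÷ N_KV_HEAD   # query heads per KV head (4)

# k has shape (HEAD_DIM, T, N_KV_HEAD); repeat each KV head GROUP
# times along the head axis so it lines up with the query heads.
function repeat_kv(k::AbstractArray{<:Real,3}, group::Int)
    d, t, n_kv = size(k)
    out = similar(k, d, t, n_kv * group)
    for h in 1:n_kv, g in 1:group
        out[:, :, (h - 1) * group + g] = k[:, :, h]
    end
    return out
end

T = 16                                        # sequence length
k = randn(Float32, HEAD_DIM, T, N_KV_HEAD)    # toy key tensor
k_expanded = repeat_kv(k, GROUP)
@assert size(k_expanded) == (HEAD_DIM, T, N_HEAD)
@assert k_expanded[:, :, 1] == k[:, :, 1]     # query heads 1-4 share KV head 1
@assert k_expanded[:, :, 5] == k[:, :, 2]     # query heads 5-8 share KV head 2
```

Only K and V are expanded; the 8 query heads are left untouched, which is where the 37.5% parameter saving and 4x KV-cache reduction come from.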
## Files

| File | Description |
|---|---|
| `best_model.jld2` | Best checkpoint (step 28998, val_loss=6.622) |
| `final_model.jld2` | Final checkpoint |
| `checkpoint_latest.jld2` | Latest training checkpoint |
| `tokenizer.json` | BPE tokenizer (HuggingFace format, 2,000 tokens) |

Each checkpoint contains:

- `model_state` — Flux model weights
- `hyperparams` — Dict with vocab_size, n_embd, block_size, n_layer, n_head, n_kv_head
- `step` — training step at checkpoint
- `best_val_loss` — best validation loss achieved

## Provenance

- **Author**: LisaMegaWatts
- **Source**: [DavinciDreams/symbiogenesis](https://github.com/DavinciDreams/symbiogenesis)
- **Training notebook**: `juliaflux_v2.ipynb`
- **Training date**: February 2026
- **Architecture reference**: LLaMA (Touvron et al., 2023) with GQA (Ainslie et al., 2023)

## References

- Touvron, H., et al. (2023). LLaMA: Open and Efficient Foundation Language Models.
- Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.
- Karpathy, A. (2023). nanoGPT. GitHub repository.

## Citation

```bibtex
@misc{juliafluxgpt2026,
  title={JuliaFluxGPT: A LLaMA-style GQA Model in Julia/Flux.jl},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/JuliaFluxGPT}
}
```

## License

MIT