---
language:
- en
license: mit
library_name: flux
tags:
- julia
- flux-jl
- llama-style
- gqa
- grouped-query-attention
- rope
- rmsnorm
- swiglu
- bpe
- philosophy
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
---
# JuliaFluxGPT

A ~23M parameter LLaMA-style decoder-only model with Grouped Query Attention (GQA), trained on classical philosophy and mathematics texts, implemented in Julia with Flux.jl.
## Model Family Context

JuliaFluxGPT is the **largest model** in the Julia SLM collection. It is built on Flux.jl rather than the Lux.jl used by most of the family, and it adopts a more modern attention design (GQA):
| Model | Framework | Architecture | Params | Attention |
|---|---|---|---|---|
| **JuliaFluxGPT** | **Flux.jl** | **LLaMA-style GQA** | **~23M** | **8Q/2KV GQA** |
| [SymbioGPT-10M](https://huggingface.co/LisaMegaWatts/SymbioGPT-10M) | PyTorch | 4-organelle SymbioGPT | 11.6M | OrganelleGate |
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Lux.jl | Transformer | 5.04M | 4-head MHA |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Lux.jl | Monarch Mixer | 4.98M | 8-head Monarch |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Lux.jl | Symbiogenesis | ~4.1M | 3 organelles |
| [MicroJulia](https://huggingface.co/LisaMegaWatts/MicroJulia) | Flux.jl | GPT-2 style | ~1M | Standard MHA |
## Architecture

```
GPT (LLaMA-style)
+-- wte: Embedding(2000 -> 512) [weight-tied with output projection]
+-- blocks x 8:
|   +-- ln1: RMSNorm(512)
|   +-- attn: CausalSelfAttention
|   |   +-- wq: Dense(512 -> 512) [8 query heads, 64 dim each]
|   |   +-- wkv: Dense(512 -> 256) [2 KV heads, 64 dim each, fused K+V]
|   |   +-- proj: Dense(512 -> 512)
|   +-- ln2: RMSNorm(512)
|   +-- ffwd: SwiGLUFFN
|       +-- w_gate: Dense(512 -> 1344) [gate path]
|       +-- w_up: Dense(512 -> 1344) [value path]
|       +-- w_down: Dense(1344 -> 512)
+-- ln_f: RMSNorm(512)
+-- [output: weight-tied with wte]
```
### Grouped Query Attention (GQA)

GQA (Ainslie et al., 2023) uses fewer key-value heads than query heads, reducing KV-cache memory during inference while maintaining quality:

- **8 query heads** (64 dim each) = full expressiveness in queries
- **2 KV heads** (64 dim each) = 4x KV-cache memory reduction
- **4 query heads per KV group** = each KV head is shared by 4 query heads
- KV heads are repeated (expanded) to match the query head count before attention scores are computed
**Attention parameter savings:**

- Standard MHA: Q(512x512) + K(512x512) + V(512x512) + O(512x512) = 1,048,576
- GQA 8Q/2KV: Q(512x512) + KV(512x256) + O(512x512) = 655,360 (37.5% reduction)
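In plain Julia (no Flux), the KV-head expansion and the savings arithmetic can be sketched as follows; shapes and variable names are illustrative, not the notebook's exact code:

```julia
# Illustrative sketch (plain Julia, no Flux): 2 KV heads serving 8 query heads.
n_head, n_kv_head, head_dim, T = 8, 2, 64, 16
group = n_head ÷ n_kv_head        # 4 query heads share each KV head

k = randn(Float32, head_dim, T, n_kv_head)          # keys for the 2 KV heads
# Expand along the head axis: query head h reads KV head (h - 1) ÷ group + 1
k_exp = cat((k[:, :, (h - 1) ÷ group + 1] for h in 1:n_head)...; dims=3)
@assert size(k_exp) == (head_dim, T, n_head)
@assert k_exp[:, :, 1] == k_exp[:, :, 4]            # heads 1-4 share KV head 1

# Parameter arithmetic from the bullet list above
mha = 4 * 512 * 512                  # Q, K, V, O projections
gqa = 512*512 + 512*256 + 512*512    # Q, fused KV, O
println(1 - gqa / mha)               # 0.375, i.e. 37.5% fewer attention params
```

The same repetition trick works for values; only the projections shrink, while the attention computation itself still runs with 8 heads.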
### RoPE (Rotary Position Embeddings)

Applied to Q and K after projection, before attention scores:

```
cos_cache, sin_cache = precompute_rope_freqs(head_dim=64, max_seq_len=256)
q_rotated = apply_rope(q, cos_cache, sin_cache, T)
k_rotated = apply_rope(k, cos_cache, sin_cache, T)
```
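A minimal stand-alone version of the two functions above in plain Julia. The split-half pairing and the exact signatures are assumptions for illustration; the notebook's implementation may pair dimensions differently (e.g. interleaved):

```julia
# Hypothetical minimal RoPE (split-half pairing assumed; base = 10000 as in the table)
function precompute_rope_freqs(head_dim, max_seq_len; base = 10_000f0)
    inv_freq = base .^ (-(0:2:head_dim-2) ./ Float32(head_dim))   # head_dim/2 frequencies
    angles   = Float32.(0:max_seq_len-1)' .* inv_freq             # (head_dim/2, max_seq_len)
    return cos.(angles), sin.(angles)
end

function apply_rope(x, cosc, sinc, T)              # x: (head_dim, T) for one head
    d2 = size(x, 1) ÷ 2
    x1, x2 = x[1:d2, :], x[d2+1:end, :]
    vcat(x1 .* cosc[:, 1:T] .- x2 .* sinc[:, 1:T], # rotate each (x1, x2) pair
         x1 .* sinc[:, 1:T] .+ x2 .* cosc[:, 1:T])
end

cosc, sinc = precompute_rope_freqs(64, 256)
q = randn(Float32, 64, 8)
q_rot = apply_rope(q, cosc, sinc, 8)
@assert q_rot[:, 1] ≈ q[:, 1]                      # position 0: zero rotation
@assert sum(abs2, q_rot) ≈ sum(abs2, q)            # rotations preserve norms
```

Because each pair is rotated by a position-dependent angle, the Q·K dot product ends up depending only on the *relative* offset between positions, which is what makes RoPE cache-friendly.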
### SwiGLU FFN

```
hidden = max(64, round_to_64(4 * 512 * 2/3)) = 1344
gate   = swish(w_gate(x))
value  = w_up(x)
output = w_down(gate * value)
```
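The same computation as a runnable sketch in plain Julia. The weights are random stand-ins for the bias-free Dense layers, and `round_to_64` (round to the nearest multiple of 64) is inferred from the formula above:

```julia
swish(x) = x ./ (1f0 .+ exp.(-x))            # SiLU: x * sigmoid(x)
round_to_64(n) = 64 * round(Int, n / 64)     # assumed: nearest multiple of 64

d = 512
hidden = max(64, round_to_64(4 * d * 2 / 3)) # 1365.33… rounds down to 1344
@assert hidden == 1344

# Random stand-ins for the three bias-free projections
w_gate = randn(Float32, hidden, d) * 0.02f0
w_up   = randn(Float32, hidden, d) * 0.02f0
w_down = randn(Float32, d, hidden) * 0.02f0

x = randn(Float32, d)
y = w_down * (swish(w_gate * x) .* (w_up * x))   # gate ⊙ value, projected back
@assert length(y) == d
```

Note that SwiGLU uses three projections instead of MLP's two, which is why the hidden dim is scaled by 2/3 to keep the FFN parameter count comparable.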
## Model Details

| Parameter | Value |
|---|---|
| Total parameters | ~23M (22,790,656) |
| Embedding dim | 512 |
| Layers | 8 |
| Query heads | 8 |
| KV heads | 2 (GQA ratio = 4:1) |
| Head dim | 64 |
| FFN hidden dim | 1344 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE (base=10000) |
| Weight tying | Yes (forward pass uses wte.weight directly) |
| Bias | false (all layers) |
| Dropout | 0.1 (training), 0.0 (inference) |
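The exact total in the table can be reproduced from the architecture with pure arithmetic (assuming one scale vector per RMSNorm, no biases anywhere, and the tied embedding counted once, as the table states):

```julia
vocab, d, n_layer, ffn_hidden = 2000, 512, 8, 1344

emb   = vocab * d                 # wte, tied with the output head (counted once)
attn  = d*d + d*(d ÷ 2) + d*d     # wq + fused wkv (2 KV heads) + proj = 655,360
ffwd  = 3 * d * ffn_hidden        # w_gate + w_up + w_down = 2,064,384
norms = 2d                        # ln1 + ln2 scale vectors
total = emb + n_layer * (attn + ffwd + norms) + d   # + ln_f scale vector
@assert total == 22_790_656       # matches "Total parameters" above
```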
## Training

| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | Classical philosophy and mathematics texts |
| Tokenizer | BPE (HuggingFace tokenizer.json format, 2000 tokens) |
| Framework | Julia + Flux.jl |
| Hardware | NVIDIA RTX 3060 12GB |
| Precision | Float32 |
| Best val loss | 6.622 (step 28998) |
| Dropout | 0.1 |
## Implementation Notes

### Flux.jl vs Lux.jl

JuliaFluxGPT uses **Flux.jl** (implicit parameters, `@layer` macro) rather than Lux.jl (explicit parameters). Key differences:

| | Flux.jl (this model) | Lux.jl (JuliaSLM family) |
|---|---|---|
| Parameter style | Implicit (stored in the model struct) | Explicit (separate `ps` NamedTuple) |
| State management | `Flux.testmode!()` | Explicit state `st` |
| Serialization | `Flux.loadmodel!()` | JLD2 direct load |
| AD backend | Zygote | Zygote |
### Weight Tying Implementation

Weight tying is implemented in the forward pass rather than through a separate tied layer:

```julia
function (m::GPT)(idx)
    # ... forward through blocks ...
    x = m.ln_f(x)                      # (C, T, B)
    C, T, B = size(x)
    W = m.wte.weight                   # (C, vocab) — reuse embedding weights
    out = W' * reshape(x, C, T * B)    # transpose matmul -> (vocab, T*B)
    reshape(out, size(W, 2), T, B)
end
```

This avoids complications with `Flux.loadmodel!` when loading checkpoints.
## Usage

### OpenAI-Compatible API

Served via the [JuliaFluxGPT Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaFluxGPT):

```bash
curl -X POST https://lisamegawatts-juliafluxgpt.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

Streaming is supported with `"stream": true`.
## Files

| File | Description |
|---|---|
| `best_model.jld2` | Best checkpoint (step 28998, val_loss=6.622) |
| `final_model.jld2` | Final checkpoint |
| `checkpoint_latest.jld2` | Latest training checkpoint |
| `tokenizer.json` | BPE tokenizer (HuggingFace format, 2000 tokens) |

Each checkpoint contains:

- `model_state` — Flux model weights
- `hyperparams` — Dict with vocab_size, n_embd, block_size, n_layer, n_head, n_kv_head
- `step` — training step at checkpoint
- `best_val_loss` — best validation loss achieved
## Provenance

- **Author**: LisaMegaWatts
- **Source**: [DavinciDreams/symbiogenesis](https://github.com/DavinciDreams/symbiogenesis)
- **Training notebook**: `juliaflux_v2.ipynb`
- **Training date**: February 2026
- **Architecture reference**: LLaMA (Touvron et al., 2023) with GQA (Ainslie et al., 2023)

## References

- Touvron, H., et al. (2023). LLaMA: Open and Efficient Foundation Language Models.
- Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.
- Karpathy, A. (2023). nanoGPT. GitHub repository.
## Citation

```bibtex
@misc{juliafluxgpt2026,
  title={JuliaFluxGPT: A LLaMA-style GQA Model in Julia/Flux.jl},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/JuliaFluxGPT}
}
```

## License

MIT