---
language:
- en
license: mit
library_name: flux
tags:
- julia
- flux-jl
- llama-style
- gqa
- grouped-query-attention
- rope
- rmsnorm
- swiglu
- bpe
- philosophy
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
---
# JuliaFluxGPT
A ~23M parameter LLaMA-style decoder-only model with Grouped Query Attention (GQA), trained on classical philosophy and mathematics texts, implemented in Julia with Flux.jl.
## Model Family Context
JuliaFluxGPT is the **largest model** in the Julia SLM collection, using a different framework (Flux.jl vs Lux.jl) and a more modern attention design (GQA):
| Model | Framework | Architecture | Params | Attention |
|---|---|---|---|---|
| **JuliaFluxGPT** | **Flux.jl** | **LLaMA-style GQA** | **~23M** | **8Q/2KV GQA** |
| [SymbioGPT-10M](https://huggingface.co/LisaMegaWatts/SymbioGPT-10M) | PyTorch | 4-organelle SymbioGPT | 11.6M | OrganelleGate |
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Lux.jl | Transformer | 5.04M | 4-head MHA |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Lux.jl | Monarch Mixer | 4.98M | 8-head Monarch |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Lux.jl | Symbiogenesis | ~4.1M | 3 organelles |
| [MicroJulia](https://huggingface.co/LisaMegaWatts/MicroJulia) | Flux.jl | GPT-2 style | ~1M | Standard MHA |
## Architecture
```
GPT (LLaMA-style)
+-- wte: Embedding(2000 -> 512) [weight-tied with output projection]
+-- blocks x 8:
| +-- ln1: RMSNorm(512)
| +-- attn: CausalSelfAttention
| | +-- wq: Dense(512 -> 512) [8 query heads, 64 dim each]
| | +-- wkv: Dense(512 -> 256) [2 KV heads, 64 dim each, fused K+V]
| | +-- proj: Dense(512 -> 512)
| +-- ln2: RMSNorm(512)
| +-- ffwd: SwiGLUFFN
| +-- w_gate: Dense(512 -> 1344) [gate path]
| +-- w_up: Dense(512 -> 1344) [value path]
| +-- w_down: Dense(1344 -> 512)
+-- ln_f: RMSNorm(512)
+-- [output: weight-tied with wte]
```
### Grouped Query Attention (GQA)
GQA (Ainslie et al., 2023) uses fewer key-value heads than query heads, reducing KV-cache memory during inference while maintaining quality:
- **8 query heads** (64 dim each) = full expressiveness in queries
- **2 KV heads** (64 dim each) = 4x KV memory reduction
- **4 query heads per KV group** = each KV head is shared by 4 query heads
- KV heads are repeated (expanded) to match query head count before attention computation
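The KV-expansion step can be sketched in NumPy (an illustrative Python snippet, not the model's Flux.jl source). Shapes follow the card: 8 query heads, 2 KV heads, head dim 64, so each KV head is repeated 4 times along the head axis before the usual attention computation:

```python
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq_len = 8, 2, 64, 16
group = n_q_heads // n_kv_heads  # 4 query heads share each KV head

q = np.random.randn(n_q_heads, seq_len, head_dim)
k = np.random.randn(n_kv_heads, seq_len, head_dim)
v = np.random.randn(n_kv_heads, seq_len, head_dim)

# Repeat each KV head `group` times so attention proceeds as in MHA.
k_exp = np.repeat(k, group, axis=0)  # (8, seq_len, head_dim)
v_exp = np.repeat(v, group, axis=0)  # (8, seq_len, head_dim)

scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)
# (causal mask and softmax omitted for brevity)
```

Only the 2 real KV heads need to be stored in the inference-time KV cache; the expansion is recomputed on the fly, which is where the 4x memory saving comes from.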
**Attention parameter savings** (bias-free projections):
- Standard MHA: Q(512x512) + K(512x512) + V(512x512) + O(512x512) = 1,048,576
- GQA 8Q/2KV: Q(512x512) + fused KV(512x256) + O(512x512) = 655,360 (37.5% reduction)
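The counts above can be checked with a few lines of arithmetic (plain Python, using the card's dimensions):

```python
# Attention parameter counts: standard MHA vs GQA 8Q/2KV, all bias-free.
d = 512          # model width
head_dim = 64
n_kv = 2         # KV heads

mha = 4 * d * d                          # Q, K, V, O projections, each d x d
kv_dim = n_kv * head_dim                 # 128 per K or V; fused KV = 256
gqa = d * d + d * (2 * kv_dim) + d * d   # Q + fused KV + O

print(mha, gqa, 1 - gqa / mha)  # 1048576 655360 0.375
```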
### RoPE (Rotary Position Embeddings)
Applied to Q and K after projection, before attention scores:
```
cos_cache, sin_cache = precompute_rope_freqs(head_dim=64, max_seq_len=256)
q_rotated = apply_rope(q, cos_cache, sin_cache, T)
k_rotated = apply_rope(k, cos_cache, sin_cache, T)
```
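A minimal NumPy sketch of the two RoPE functions above (a Python illustration using the common even/odd pairing convention; the model's Julia implementation may pair dimensions differently):

```python
import numpy as np

def precompute_rope_freqs(head_dim, max_seq_len, base=10000.0):
    # One frequency per dimension pair, as in the LLaMA formulation.
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)  # (head_dim/2,)
    angles = np.outer(np.arange(max_seq_len), inv_freq)         # (T, head_dim/2)
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos_cache, sin_cache, T):
    # x: (T, head_dim); rotate consecutive (even, odd) dimension pairs.
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos_cache[:T] - x2 * sin_cache[:T]
    out[:, 1::2] = x1 * sin_cache[:T] + x2 * cos_cache[:T]
    return out

cos_cache, sin_cache = precompute_rope_freqs(head_dim=64, max_seq_len=256)
q = np.random.randn(8, 64)        # 8 positions of one query head
q_rot = apply_rope(q, cos_cache, sin_cache, 8)
```

Because each pair is rotated by a pure rotation, RoPE preserves vector norms; only the relative angle between positions is encoded.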
### SwiGLU FFN
```
hidden = max(64, round_to_64(4 * 512 * 2/3)) = 1344   # round_to_64: nearest multiple of 64
gate = swish(w_gate(x))
value = w_up(x)
output = w_down(gate * value)
```
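The same FFN, sketched in NumPy (illustrative Python, not the Flux.jl source; weight initialization here is arbitrary):

```python
import numpy as np

d = 512
# 4*512*2/3 = 1365.33, rounded to the nearest multiple of 64 -> 1344
hidden = max(64, 64 * round(4 * d * (2 / 3) / 64))

rng = np.random.default_rng(0)
w_gate = rng.standard_normal((hidden, d)) * 0.02
w_up   = rng.standard_normal((hidden, d)) * 0.02
w_down = rng.standard_normal((d, hidden)) * 0.02

def swish(x):
    # x * sigmoid(x), a.k.a. SiLU
    return x / (1 + np.exp(-x))

def swiglu_ffn(x):
    # Gate path is passed through swish, then multiplied elementwise
    # with the linear value path before projecting back down.
    return w_down @ (swish(w_gate @ x) * (w_up @ x))

y = swiglu_ffn(rng.standard_normal(d))
```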
## Model Details
| Parameter | Value |
|---|---|
| Total parameters | ~23M (22,790,656) |
| Embedding dim | 512 |
| Layers | 8 |
| Query heads | 8 |
| KV heads | 2 (GQA ratio = 4:1) |
| Head dim | 64 |
| FFN hidden dim | 1344 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| Position encoding | RoPE (base=10000) |
| Weight tying | Yes (forward pass uses wte.weight directly) |
| Bias | false (all layers) |
| Dropout | 0.1 (training), 0.0 (inference) |
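The exact parameter count in the table can be reproduced from the architecture dimensions (plain Python; all layers bias-free, output head tied to the embedding):

```python
vocab, d, layers, ffn = 2000, 512, 8, 1344
kv_dim = 2 * 64                            # 2 KV heads x 64 dims

emb = vocab * d                            # wte, tied with output head
attn = d * d + d * (2 * kv_dim) + d * d    # wq + fused wkv + proj
ffn_p = 2 * d * ffn + ffn * d              # w_gate + w_up + w_down
norms = 2 * d                              # ln1 + ln2 RMSNorm scales
per_block = attn + ffn_p + norms

total = emb + layers * per_block + d       # + final ln_f scale
print(total)  # 22790656
```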
## Training
| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | Classical philosophy and mathematics texts |
| Tokenizer | BPE (HuggingFace tokenizer.json format, 2000 tokens) |
| Framework | Julia + Flux.jl |
| Hardware | NVIDIA RTX 3060 12GB |
| Precision | Float32 |
| Best val loss | 6.622 (step 28998) |
| Dropout | 0.1 |
## Implementation Notes
### Flux.jl vs Lux.jl
JuliaFluxGPT uses **Flux.jl** (implicit parameters, `@layer` macro) rather than Lux.jl (explicit parameters). Key differences:
| | Flux.jl (this model) | Lux.jl (JuliaSLM family) |
|---|---|---|
| Parameter style | Implicit (stored in model struct) | Explicit (separate `ps` NamedTuple) |
| State management | `Flux.testmode!()` | Explicit state `st` |
| Serialization | `Flux.loadmodel!()` | JLD2 direct load |
| AD backend | Zygote | Zygote |
### Weight Tying Implementation
Weight tying is implemented in the forward pass rather than through a separate tied layer:
```julia
function (m::GPT)(idx)
    # ... embed tokens and run the 8 transformer blocks; x :: (C, T, B) ...
    x = m.ln_f(x)
    C, T, B = size(x)
    W = m.wte.weight                  # embedding matrix, reused for the output head
    out = W' * reshape(x, C, T * B)   # transpose matmul -> (vocab, T*B)
    reshape(out, size(W, 2), T, B)    # logits: (vocab, T, B)
end
```
This avoids complications with `Flux.loadmodel!` when loading checkpoints.
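For readers less familiar with Julia's column-major shapes, the same tied-projection step in NumPy (an illustrative sketch; dimension names follow the card):

```python
import numpy as np

d, vocab, T, B = 512, 2000, 4, 2
W = np.random.randn(d, vocab) * 0.02   # embedding weights, reused as output head
x = np.random.randn(d, T * B)          # final hidden states, flattened over (T, B)

# Transposing the embedding matrix maps hidden states to vocab logits,
# so no separate output-projection parameters are stored.
logits = (W.T @ x).reshape(vocab, T, B)
```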
## Usage
### OpenAI-Compatible API
Served via [JuliaFluxGPT Space](https://huggingface.co/spaces/LisaMegaWatts/JuliaFluxGPT):
```bash
curl -X POST https://lisamegawatts-juliafluxgpt.hf.space/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "the nature of"}],
"max_tokens": 200,
"temperature": 0.8,
"top_k": 40
}'
```
Streaming supported with `"stream": true`.
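A hypothetical Python client for the same endpoint (the URL and payload fields come from the `curl` example above; using the `requests` library is an assumption, and the call itself is commented out since it needs network access):

```python
import json

payload = {
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40,
}

# import requests
# r = requests.post(
#     "https://lisamegawatts-juliafluxgpt.hf.space/v1/chat/completions",
#     headers={"Content-Type": "application/json"},
#     data=json.dumps(payload),
# )
# print(r.json()["choices"][0]["message"]["content"])
print(json.dumps(payload))
```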
## Files
| File | Description |
|---|---|
| `best_model.jld2` | Best checkpoint (step 28998, val_loss=6.622) |
| `final_model.jld2` | Final checkpoint |
| `checkpoint_latest.jld2` | Latest training checkpoint |
| `tokenizer.json` | BPE tokenizer (HuggingFace format, 2000 tokens) |
Checkpoint contains:
- `model_state` — Flux model weights
- `hyperparams` — Dict with vocab_size, n_embd, block_size, n_layer, n_head, n_kv_head
- `step` — Training step at checkpoint
- `best_val_loss` — Best validation loss achieved
## Provenance
- **Author**: LisaMegaWatts
- **Source**: [DavinciDreams/symbiogenesis](https://github.com/DavinciDreams/symbiogenesis)
- **Training notebook**: `juliaflux_v2.ipynb`
- **Training date**: February 2026
- **Architecture reference**: LLaMA (Touvron et al., 2023) with GQA (Ainslie et al., 2023)
## References
- Touvron, H., et al. (2023). LLaMA: Open and Efficient Foundation Language Models.
- Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.
- Karpathy, A. (2023). nanoGPT. GitHub repository.
## Citation
```bibtex
@misc{juliafluxgpt2026,
title={JuliaFluxGPT: A LLaMA-style GQA Model in Julia/Flux.jl},
author={LisaMegaWatts},
year={2026},
url={https://huggingface.co/LisaMegaWatts/JuliaFluxGPT}
}
```
## License
MIT