LisaMegaWatts committed · Commit afa692e (verified) · 1 parent: 287076b

Fix model card: match actual HF checkpoint (d=512, 8L, 8Q/2KV, ~23M params, ctx=256, FFN=1344)

Files changed (1): README.md (+33, −34)
README.md CHANGED
@@ -22,38 +22,38 @@ datasets:
 
 # JuliaFluxGPT
 
-A ~4M parameter LLaMA-style decoder-only model with Grouped Query Attention (GQA), trained on classical philosophy and mathematics texts, implemented in Julia with Flux.jl.
+A ~23M parameter LLaMA-style decoder-only model with Grouped Query Attention (GQA), trained on classical philosophy and mathematics texts, implemented in Julia with Flux.jl.
 
 ## Model Family Context
 
-JuliaFluxGPT uses a different framework (Flux.jl vs Lux.jl) and a more modern attention design (GQA) than the other Julia SLM models:
+JuliaFluxGPT is the **largest model** in the Julia SLM collection, using a different framework (Flux.jl vs Lux.jl) and a more modern attention design (GQA):
 
 | Model | Framework | Architecture | Params | Attention |
 |---|---|---|---|---|
+| **JuliaFluxGPT** | **Flux.jl** | **LLaMA-style GQA** | **~23M** | **8Q/2KV GQA** |
 | [SymbioGPT-10M](https://huggingface.co/LisaMegaWatts/SymbioGPT-10M) | PyTorch | 4-organelle SymbioGPT | 11.6M | OrganelleGate |
 | [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | Lux.jl | Transformer | 5.04M | 4-head MHA |
 | [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | Lux.jl | Monarch Mixer | 4.98M | 8-head Monarch |
 | [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | Lux.jl | Symbiogenesis | ~4.1M | 3 organelles |
-| **JuliaFluxGPT** | **Flux.jl** | **LLaMA-style GQA** | **~4M** | **4Q/2KV GQA** |
 | [MicroJulia](https://huggingface.co/LisaMegaWatts/MicroJulia) | Flux.jl | GPT-2 style | ~1M | Standard MHA |
 
 ## Architecture
 
 ```
 GPT (LLaMA-style)
-+-- wte: Embedding(4000 -> 256) [weight-tied with output projection]
-+-- blocks x 4:
-|   +-- ln1: RMSNorm(256)
++-- wte: Embedding(2000 -> 512) [weight-tied with output projection]
++-- blocks x 8:
+|   +-- ln1: RMSNorm(512)
 |   +-- attn: CausalSelfAttention
-|   |   +-- wq: Dense(256 -> 256) [4 query heads, 64 dim each]
-|   |   +-- wkv: Dense(256 -> 256) [2 KV heads, 64 dim each, fused K+V]
-|   |   +-- proj: Dense(256 -> 256)
-|   +-- ln2: RMSNorm(256)
+|   |   +-- wq: Dense(512 -> 512) [8 query heads, 64 dim each]
+|   |   +-- wkv: Dense(512 -> 256) [2 KV heads, 64 dim each, fused K+V]
+|   |   +-- proj: Dense(512 -> 512)
+|   +-- ln2: RMSNorm(512)
 |   +-- ffwd: SwiGLUFFN
-|       +-- w_gate: Dense(256 -> 704) [gate path]
-|       +-- w_up: Dense(256 -> 704) [value path]
-|       +-- w_down: Dense(704 -> 256)
-+-- ln_f: RMSNorm(256)
+|       +-- w_gate: Dense(512 -> 1344) [gate path]
+|       +-- w_up: Dense(512 -> 1344) [value path]
+|       +-- w_down: Dense(1344 -> 512)
++-- ln_f: RMSNorm(512)
 +-- [output: weight-tied with wte]
 ```
 
@@ -61,14 +61,14 @@ GPT (LLaMA-style)
 
 GQA (Ainslie et al., 2023) uses fewer key-value heads than query heads, reducing KV-cache memory during inference while maintaining quality:
 
-- **4 query heads** (64 dim each) = full expressiveness in queries
-- **2 KV heads** (64 dim each) = 2x KV memory reduction
-- **2 query heads per KV group** = each KV head is shared by 2 query heads
+- **8 query heads** (64 dim each) = full expressiveness in queries
+- **2 KV heads** (64 dim each) = 4x KV memory reduction
+- **4 query heads per KV group** = each KV head is shared by 4 query heads
 - KV heads are repeated (expanded) to match query head count before attention computation
 
 **Attention parameter savings:**
-- Standard MHA: Q(256x256) + K(256x256) + V(256x256) + O(256x256) = 262,144
-- GQA 4Q/2KV: Q(256x256) + KV(256x256) + O(256x256) = 196,608 (25% reduction)
+- Standard MHA: Q(512x512) + K(512x512) + V(512x512) + O(512x512) = 1,048,576
+- GQA 8Q/2KV: Q(512x512) + KV(512x256) + O(512x512) = 655,360 (37.5% reduction)
 
 ### RoPE (Rotary Position Embeddings)
 
@@ -82,7 +82,7 @@ k_rotated = apply_rope(k, cos, sin, T)
 ### SwiGLU FFN
 
 ```
-hidden = max(64, round_to_64(4 * 256 * 2/3)) = 704
+hidden = max(64, round_to_64(4 * 512 * 2/3)) = 1344
 gate = swish(w_gate(x))
 value = w_up(x)
 output = w_down(gate * value)
@@ -92,15 +92,15 @@ output = w_down(gate * value)
 
 | Parameter | Value |
 |---|---|
-| Total parameters | ~4M (3,975,424) |
-| Embedding dim | 256 |
-| Layers | 4 |
-| Query heads | 4 |
-| KV heads | 2 (GQA ratio = 2:1) |
+| Total parameters | ~23M (22,790,656) |
+| Embedding dim | 512 |
+| Layers | 8 |
+| Query heads | 8 |
+| KV heads | 2 (GQA ratio = 4:1) |
 | Head dim | 64 |
-| FFN hidden dim | 704 |
+| FFN hidden dim | 1344 |
 | Context length | 256 tokens |
-| Vocabulary | 4,000 (ByteLevel BPE) |
+| Vocabulary | 2,000 (ByteLevel BPE) |
 | Position encoding | RoPE (base=10000) |
 | Weight tying | Yes (forward pass uses wte.weight directly) |
 | Bias | false (all layers) |
@@ -112,13 +112,12 @@ output = w_down(gate * value)
 |---|---|
 | Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
 | Corpus | Classical philosophy and mathematics texts |
-| Tokenizer | BPE (HuggingFace tokenizer.json format, 4000 tokens) |
+| Tokenizer | BPE (HuggingFace tokenizer.json format, 2000 tokens) |
 | Framework | Julia + Flux.jl |
 | Hardware | NVIDIA RTX 3060 12GB |
 | Precision | Float32 |
-| Best val loss | 6.795 (step 8730) |
-| Final step | 19,411 |
-| Distillation | KD alpha=0.5, temperature=4.0 |
+| Best val loss | 6.622 (step 28998) |
+| Dropout | 0.1 |
 
 ## Implementation Notes
 
@@ -172,10 +171,10 @@ Streaming supported with `"stream": true`.
 
 | File | Description |
 |---|---|
-| `best_model.jld2` | Best checkpoint (step 8730, val_loss=6.795) |
-| `final_model.jld2` | Final checkpoint (step 19411) |
+| `best_model.jld2` | Best checkpoint (step 28998, val_loss=6.622) |
+| `final_model.jld2` | Final checkpoint |
 | `checkpoint_latest.jld2` | Latest training checkpoint |
-| `tokenizer.json` | BPE tokenizer (HuggingFace format, 4000 tokens) |
+| `tokenizer.json` | BPE tokenizer (HuggingFace format, 2000 tokens) |
 
 Checkpoint contains:
 - `model_state` — Flux model weights
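The GQA bullets changed by this commit (8 query heads, 2 KV heads, groups of 4, KV heads repeated before attention) can be illustrated with a minimal Python sketch. This is an editor's illustration of the sharing rule, not the model's Julia/Flux.jl code; all names are hypothetical:

```python
# Illustrative sketch of the GQA head-sharing rule from the card:
# 8 query heads, 2 KV heads => each KV head serves a group of 4 query heads.
Q_HEADS, KV_HEADS = 8, 2
group = Q_HEADS // KV_HEADS  # 4 query heads per KV group

# Which KV head does each query head attend with?
kv_for_q = [q // group for q in range((Q_HEADS))]
assert kv_for_q == [0, 0, 0, 0, 1, 1, 1, 1]

# "KV heads are repeated (expanded) to match query head count":
kv_heads = ["kv0", "kv1"]
expanded = [h for h in kv_heads for _ in range(group)]
assert expanded == ["kv0"] * 4 + ["kv1"] * 4  # now 8 heads, aligned with queries
```

Repeating the KV tensor this way lets a standard multi-head attention kernel run unchanged, while the KV cache only ever stores the 2 original heads.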
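As a sanity check on the numbers this commit introduces (FFN=1344, attention savings 655,360 vs 1,048,576, total 22,790,656), the following Python sketch recomputes them from the architecture constants in the card. Variable names are the editor's, not identifiers from the checkpoint:

```python
# Recompute the parameter counts claimed in this commit from the
# architecture constants (d=512, 8 layers, 8Q/2KV, head dim 64, vocab 2000).
D, LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM, VOCAB = 512, 8, 8, 2, 64, 2000

# SwiGLU hidden size: 4*d scaled by 2/3, rounded to a multiple of 64.
ffn_hidden = max(64, round(4 * D * 2 / 3 / 64) * 64)
assert ffn_hidden == 1344

# Attention projections (bias-free); K and V are fused into one wkv matrix.
wq = D * (Q_HEADS * HEAD_DIM)        # 512 -> 512
wkv = D * (KV_HEADS * HEAD_DIM * 2)  # 512 -> 256 (fused K+V)
proj = D * D                         # 512 -> 512
gqa_attn = wq + wkv + proj
mha_attn = 4 * D * D                 # Q, K, V, O all at full width
assert (gqa_attn, mha_attn) == (655_360, 1_048_576)

# Full model: tied embedding, 8 blocks (attn + SwiGLU FFN + 2 RMSNorms),
# and one final RMSNorm; the output head is weight-tied, so no extra matrix.
ffn = 3 * D * ffn_hidden             # w_gate, w_up, w_down
block = gqa_attn + ffn + 2 * D
total = VOCAB * D + LAYERS * block + D
assert total == 22_790_656           # the "~23M" in the card
```

The check also confirms why the old card's "~4M" row had to change: every width in the table doubled and the depth doubled with it.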