Add model card with architecture details, provenance, and training metrics
README.md CHANGED

@@ -1,38 +1,185 @@
 ---
 language:
-  - en
 tags:
 ---

 # MicroJulia

 ## Architecture
-- 1 transformer layer, 4 attention heads
-- n_embd=16, block_size=64
-- RMSNorm, ReLU, KV cache for causal masking
-- Adam optimizer with linear LR decay
-- ~5K parameters

 ## Training
-- **Dataset:** Aristotle's Rhetoric + Euclid's Elements (8,487 chunks)
-- **Current checkpoint:** step 150, val_loss=2.4315

---
language:
- en
license: mit
library_name: flux
tags:
- julia
- flux-jl
- gpt-2
- character-level
- philosophy
- transformer
- text-generation
- layernorm
- gelu
- learned-position-embeddings
pipeline_tag: text-generation
---

# MicroJulia

A GPT-2 style character-level transformer trained on classical philosophy texts, implemented in Julia with Flux.jl. The **first model** in the Julia SLM lineage — a minimal proof-of-concept that established the training and serving infrastructure.

## Model Family Context

MicroJulia is the starting point of an architectural progression:

| Model | Generation | Architecture | Tokenizer | Framework |
|---|---|---|---|---|
| **MicroJulia** | **1st** | **GPT-2 (LayerNorm, GELU, learned pos)** | **Character-level** | **Flux.jl** |
| [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) | 2nd | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) | BPE 2000 | Flux.jl |
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | 3rd | Modern Transformer (RMSNorm, SwiGLU, RoPE) | BPE 2000 | Lux.jl |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | 3rd | Monarch Mixer (sub-quadratic) | BPE 2000 | Lux.jl |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | 3rd | Symbiogenesis (3 organelles) | BPE 2000 | Lux.jl |

## Architecture

Classic GPT-2 design — deliberately minimal:

```
GPT (GPT-2 style)
+-- wte: Embedding(vocab_size -> n_embd)      [token embeddings]
+-- wpe: Embedding(block_size -> n_embd)      [learned position embeddings]
+-- drop: Dropout
+-- blocks x N:
|   +-- ln1: LayerNorm(n_embd)
|   +-- attn: CausalSelfAttention
|   |   +-- qkv: Dense(n_embd -> 3*n_embd)    [fused Q/K/V projection]
|   |   +-- proj: Dense(n_embd -> n_embd)
|   +-- ln2: LayerNorm(n_embd)
|   +-- ffwd: FeedForward
|       +-- Dense(n_embd -> 4*n_embd)
|       +-- GELU
|       +-- Dense(4*n_embd -> n_embd)
+-- ln_f: LayerNorm(n_embd)
+-- lm_head: Dense(n_embd -> vocab_size)
```
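
For orientation, here is a minimal sketch of how the pieces in this diagram map onto Flux.jl layers. The names and dimensions below are illustrative placeholders, not the repository's actual code; the real values come from the checkpoint's `hyperparams` dict.

```julia
using Flux

# Example dimensions only; the trained model loads its own from the checkpoint.
vocab_size, n_embd, block_size = 28, 64, 64

wte  = Flux.Embedding(vocab_size => n_embd)   # token embeddings
wpe  = Flux.Embedding(block_size => n_embd)   # learned position embeddings
drop = Dropout(0.1)

# Sublayers of one pre-norm transformer block:
ln1  = LayerNorm(n_embd)
qkv  = Dense(n_embd => 3n_embd)               # fused Q/K/V projection
proj = Dense(n_embd => n_embd)
ln2  = LayerNorm(n_embd)
ffwd = Chain(Dense(n_embd => 4n_embd, gelu),  # 4x expansion with GELU
             Dense(4n_embd => n_embd))

ln_f    = LayerNorm(n_embd)
lm_head = Dense(n_embd => vocab_size)         # separate head, not weight-tied
```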

### Key Design Choices (GPT-2 era)

| Component | MicroJulia (GPT-2) | Later Models (LLaMA-style) |
|---|---|---|
| Normalization | LayerNorm (with bias) | RMSNorm (no bias) |
| Activation | GELU | SwiGLU |
| Position encoding | Learned embeddings | RoPE |
| QKV projection | Fused single Dense | Separate Q, K, V |
| FFN | Standard 4x expansion | SwiGLU (2/3-adjusted width) |
| Output head | Separate lm_head | Weight-tied with embedding |
| Tokenizer | Character-level (~28 chars) | BPE (2000 tokens) |

### Character-Level Tokenization

Uses a minimal character vocabulary:

```
a-z, space, period (28 characters)
```

Each character maps directly to a token ID. No subword segmentation — the model must learn word boundaries, morphology, and syntax from individual characters.
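
As a concrete illustration of that mapping, here is a small sketch of a character vocabulary with an encode/decode pair. The exact character set and ID ordering live in `vocab.json`; the ordering below is an assumption for demonstration only.

```julia
# Hypothetical character vocabulary: 26 letters + space + period = 28 symbols.
chars = vcat('a':'z', [' ', '.'])
stoi  = Dict(c => i for (i, c) in enumerate(chars))   # char -> token id
itos  = Dict(i => c for (i, c) in enumerate(chars))   # token id -> char

encode(s::AbstractString) = [stoi[c] for c in lowercase(s)]
decode(ids) = String([itos[i] for i in ids])

ids = encode("the forms")      # one token per character
decode(ids) == "the forms"     # round-trips exactly, no subword merges
```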

**Trade-offs:**
- Simpler tokenizer implementation
- No OOV (out-of-vocabulary) issues
- Model must spend capacity on character-level patterns
- Less efficient than BPE for the same context window

## Model Details

| Parameter | Value |
|---|---|
| Architecture | GPT-2 style (pre-norm Transformer) |
| Tokenizer | Character-level (~28 characters) |
| Position encoding | Learned position embeddings |
| Normalization | LayerNorm |
| Activation | GELU |
| Output projection | Separate Dense (not weight-tied) |
| Framework | Julia + Flux.jl |

Exact dimensions (vocab_size, n_embd, n_layer, n_head, block_size) are stored in the checkpoint `hyperparams` dict and loaded dynamically.

## Training

| Setting | Value |
|---|---|
| Dataset | Classical philosophy texts |
| Tokenizer | Character-level mapping |
| Framework | Julia + Flux.jl |
| Hardware | Google Colab / NVIDIA GPU |
| Precision | Float32 |

## Implementation Notes

### Causal Masking

Uses a pre-computed additive upper-triangular mask (global constant):

```julia
CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)
```

Applied to attention scores before softmax.
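
A rough sketch of that step, assuming attention scores are laid out as a (query, key) matrix and normalized over the key dimension (the repository's exact tensor layout may differ):

```julia
using LinearAlgebra, Flux   # Flux re-exports NNlib's softmax

block_size = 8
CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)

T = 5
scores = randn(Float32, T, T)             # scores[q, k]: query q attending to key k
masked = scores .+ CAUSAL_MASK[1:T, 1:T]  # future keys (k > q) become -Inf
att = softmax(masked; dims=2)             # rows sum to 1; zero weight on the future
```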

### Position Embeddings

Learned absolute position embeddings (not RoPE):

```julia
tok = wte(token_ids)   # (C, T, B)
pos = wpe(1:T)         # (C, T, 1) broadcast to batch
x   = tok .+ pos
```

Limited to the trained block_size — no length extrapolation.
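
In practice this means generation has to crop its context. A sketch of that loop, with an assumed model call signature and greedy sampling purely for illustration:

```julia
# Assumes `model(ids)` returns a (vocab_size, T) logits matrix; not the repo's API.
function generate(model, ids::Vector{Int}, n_new::Int, block_size::Int)
    ids = copy(ids)
    for _ in 1:n_new
        ctx    = ids[max(1, end - block_size + 1):end]  # keep only the last block_size tokens
        logits = model(ctx)
        push!(ids, argmax(logits[:, end]))              # greedy pick at the final position
    end
    return ids
end
```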

## Usage

### OpenAI-Compatible API

Served via [MicroJulia Space](https://huggingface.co/spaces/LisaMegaWatts/MicroJulia):

```bash
curl -X POST https://lisamegawatts-microjulia.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "hello"}],
    "stream": true
  }'
```
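
The same request can be made from Julia. This is a sketch using HTTP.jl and JSON3.jl (assumed to be installed), with streaming turned off for simplicity; that the Space accepts non-streaming requests is an assumption based on the API being OpenAI-compatible.

```julia
using HTTP, JSON3

body = JSON3.write(Dict(
    "messages" => [Dict("role" => "user", "content" => "hello")],
    "stream"   => false,
))

resp = HTTP.post(
    "https://lisamegawatts-microjulia.hf.space/v1/chat/completions",
    ["Content-Type" => "application/json"],
    body,
)

println(String(resp.body))   # raw JSON response from the endpoint
```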

## Files

| File | Description |
|---|---|
| `checkpoint.jld2` | Trained model weights + hyperparams (JLD2 format) |
| `vocab.json` | Character vocabulary mapping |

Checkpoint contains:
- `model_state` — Flux model weights
- `hyperparams` — Dict with vocab_size, n_embd, block_size, n_layer, n_head
- `step` — Training step
- `best_val_loss` — Best validation loss

## Provenance

- **Author**: LisaMegaWatts
- **Repository**: [DavinciDreams/micro-julia](https://github.com/DavinciDreams/micro-julia)
- **Training date**: February 2026
- **Architecture reference**: GPT-2 (Radford et al., 2019), nanoGPT (Karpathy, 2023)
- **Lineage**: Evolved into [JuliaGPT](https://huggingface.co/LisaMegaWatts/JuliaGPT) (custom autograd) and the Lux.jl model family

## References

- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2).
- Karpathy, A. (2023). nanoGPT. GitHub repository.

## Citation

```bibtex
@misc{microjulia2026,
  title={MicroJulia: A Minimal Character-Level GPT in Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MicroJulia}
}
```

## License

MIT