---
language:
- en
license: mit
library_name: flux
tags:
- julia
- flux-jl
- gpt-2
- character-level
- philosophy
- transformer
- text-generation
- layernorm
- gelu
- learned-position-embeddings
pipeline_tag: text-generation
---

# MicroJulia

A GPT-2 style character-level transformer trained on classical philosophy texts, implemented in Julia with Flux.jl. The **first model** in the Julia SLM lineage — a minimal proof-of-concept that established the training and serving infrastructure.

## Model Family Context

MicroJulia is the starting point of an architectural progression:

| Model | Generation | Architecture | Tokenizer | Framework |
|---|---|---|---|---|
| **MicroJulia** | **1st** | **GPT-2 (LayerNorm, GELU, learned pos)** | **Character-level** | **Flux.jl** |
| [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) | 2nd | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) | BPE 2000 | Flux.jl |
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | 3rd | Modern Transformer (RMSNorm, SwiGLU, RoPE) | BPE 2000 | Lux.jl |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | 3rd | Monarch Mixer (sub-quadratic) | BPE 2000 | Lux.jl |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | 3rd | Symbiogenesis (3 organelles) | BPE 2000 | Lux.jl |

## Architecture

Classic GPT-2 design — deliberately minimal:

```
GPT (GPT-2 style)
+-- wte: Embedding(vocab_size -> n_embd)     [token embeddings]
+-- wpe: Embedding(block_size -> n_embd)     [learned position embeddings]
+-- drop: Dropout
+-- blocks x N:
|   +-- ln1: LayerNorm(n_embd)
|   +-- attn: CausalSelfAttention
|   |   +-- qkv: Dense(n_embd -> 3*n_embd)   [fused Q/K/V projection]
|   |   +-- proj: Dense(n_embd -> n_embd)
|   +-- ln2: LayerNorm(n_embd)
|   +-- ffwd: FeedForward
|       +-- Dense(n_embd -> 4*n_embd)
|       +-- GELU
|       +-- Dense(4*n_embd -> n_embd)
+-- ln_f: LayerNorm(n_embd)
+-- lm_head: Dense(n_embd -> vocab_size)
```

### Key Design Choices (GPT-2 era)

| Component | MicroJulia (GPT-2) | Later Models (LLaMA-style) |
|---|---|---|
| Normalization | LayerNorm (with bias) | RMSNorm (no bias) |
| Activation | GELU | SwiGLU |
| Position encoding | Learned embeddings | RoPE |
| QKV projection | Fused single Dense | Separate Q, K, V |
| FFN | Standard 4x expansion | SwiGLU 2/3 adjusted |
| Output head | Separate lm_head | Weight-tied with embedding |
| Tokenizer | Character-level (~28 chars) | BPE (2000 tokens) |

### Character-Level Tokenization

Uses a minimal character vocabulary:

```
a-z, space, period (28 characters)
```

Each character maps directly to a token ID. No subword segmentation — the model must learn word boundaries, morphology, and syntax from individual characters.

**Trade-offs:**

- Simpler tokenizer implementation
- No OOV (out-of-vocabulary) issues
- Model must spend capacity on character-level patterns
- Less efficient than BPE for the same context window

## Model Details

| Parameter | Value |
|---|---|
| Architecture | GPT-2 style (pre-norm Transformer) |
| Tokenizer | Character-level (~28 characters) |
| Position encoding | Learned position embeddings |
| Normalization | LayerNorm |
| Activation | GELU |
| Output projection | Separate Dense (not weight-tied) |
| Framework | Julia + Flux.jl |

Exact dimensions (vocab_size, n_embd, n_layer, n_head, block_size) are stored in the checkpoint `hyperparams` dict and loaded dynamically.

## Training

| | Value |
|---|---|
| Dataset | Classical philosophy texts |
| Tokenizer | Character-level mapping |
| Framework | Julia + Flux.jl |
| Hardware | Google Colab / NVIDIA GPU |
| Precision | Float32 |

## Implementation Notes

### Causal Masking

Uses a pre-computed additive upper-triangular mask (global constant):

```julia
CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)
```

Applied to attention scores before softmax.
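As an illustration, the mask's effect can be traced on a toy score matrix in plain Julia. The `block_size` of 4 and the uniform scores here are illustrative assumptions, not values from the checkpoint:

```julia
using LinearAlgebra

# Toy block_size for illustration (the real value lives in the checkpoint hyperparams)
block_size = 4

# Additive causal mask: -Inf32 above the diagonal, 0 on and below it
CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)

# Toy attention logits: scores[i, j] = logit for query i attending to key j
scores = ones(Float32, block_size, block_size)
masked = scores .+ CAUSAL_MASK   # future positions (j > i) become -Inf32

# Row-wise softmax; exp(-Inf) = 0, so each query attends only to keys j <= i
function rowsoftmax(x)
    m = maximum(x; dims=2)       # subtract row max for numerical stability
    e = exp.(x .- m)
    e ./ sum(e; dims=2)
end

probs = rowsoftmax(masked)
# Row 1 attends only to position 1; row 4 attends uniformly to all four positions.
```

Because the mask is additive, it composes with any scores before softmax and costs one broadcast add per attention call.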
### Position Embeddings

Learned absolute position embeddings (not RoPE):

```julia
tok = wte(token_ids)   # (C, T, B)
pos = wpe(1:T)         # (C, T, 1) broadcast to batch
x = tok .+ pos
```

Limited to the trained block_size — no length extrapolation.

## Usage

### OpenAI-Compatible API

Served via [MicroJulia Space](https://huggingface.co/spaces/LisaMegaWatts/MicroJulia):

```bash
curl -X POST https://lisamegawatts-microjulia.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "hello"}],
    "stream": true
  }'
```

## Files

| File | Description |
|---|---|
| `checkpoint.jld2` | Trained model weights + hyperparams (JLD2 format) |
| `vocab.json` | Character vocabulary mapping |

Checkpoint contains:

- `model_state` — Flux model weights
- `hyperparams` — Dict with vocab_size, n_embd, block_size, n_layer, n_head
- `step` — Training step
- `best_val_loss` — Best validation loss

## Provenance

- **Author**: LisaMegaWatts
- **Repository**: [DavinciDreams/micro-julia](https://github.com/DavinciDreams/micro-julia)
- **Training date**: February 2026
- **Architecture reference**: GPT-2 (Radford et al., 2019), nanoGPT (Karpathy, 2023)
- **Lineage**: Evolved into [JuliaGPT](https://huggingface.co/LisaMegaWatts/JuliaGPT) (custom autograd) and the Lux.jl model family

## References

- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2).
- Karpathy, A. (2023). nanoGPT. GitHub repository.

## Citation

```bibtex
@misc{microjulia2026,
  title={MicroJulia: A Minimal Character-Level GPT in Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MicroJulia}
}
```

## License

MIT
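The character-level tokenizer described earlier amounts to a direct char-to-id lookup; a minimal round-trip sketch follows. The vocabulary ordering and 1-based ids here are assumptions for illustration — the shipped mapping is in `vocab.json`:

```julia
# Sketch of the 28-character vocabulary: a-z, space, period.
# Ordering and 1-based ids are assumptions; the real mapping is in vocab.json.
chars = collect("abcdefghijklmnopqrstuvwxyz .")
stoi  = Dict(c => i for (i, c) in enumerate(chars))   # char -> token id
itos  = Dict(i => c for (i, c) in enumerate(chars))   # token id -> char

encode(s)   = [stoi[c] for c in s]        # string -> vector of token ids
decode(ids) = join(itos[i] for i in ids)  # token ids -> string

ids = encode("know thyself.")
# decode(ids) recovers "know thyself." exactly — no OOV, no subword merges.
```

Every input character must already be in this set, which is why the tokenizer has no OOV handling: anything outside a-z, space, and period is simply not representable.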