---
language:
- en
license: mit
library_name: flux
tags:
- julia
- flux-jl
- gpt-2
- character-level
- philosophy
- transformer
- text-generation
- layernorm
- gelu
- learned-position-embeddings
pipeline_tag: text-generation
---

# MicroJulia

A GPT-2 style character-level transformer trained on classical philosophy texts, implemented in Julia with Flux.jl. The **first model** in the Julia SLM lineage: a minimal proof of concept that established the training and serving infrastructure.

## Model Family Context

MicroJulia is the starting point of an architectural progression:
| Model | Generation | Architecture | Tokenizer | Framework |
|---|---|---|---|---|
| **MicroJulia** | **1st** | **GPT-2 (LayerNorm, GELU, learned pos)** | **Character-level** | **Flux.jl** |
| [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) | 2nd | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) | BPE 2000 | Flux.jl |
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | 3rd | Modern Transformer (RMSNorm, SwiGLU, RoPE) | BPE 2000 | Lux.jl |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | 3rd | Monarch Mixer (sub-quadratic) | BPE 2000 | Lux.jl |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | 3rd | Symbiogenesis (3 organelles) | BPE 2000 | Lux.jl |

## Architecture

Classic GPT-2 design, deliberately minimal:
```
GPT (GPT-2 style)
+-- wte: Embedding(vocab_size -> n_embd)    [token embeddings]
+-- wpe: Embedding(block_size -> n_embd)    [learned position embeddings]
+-- drop: Dropout
+-- blocks x N:
|   +-- ln1: LayerNorm(n_embd)
|   +-- attn: CausalSelfAttention
|   |   +-- qkv: Dense(n_embd -> 3*n_embd)  [fused Q/K/V projection]
|   |   +-- proj: Dense(n_embd -> n_embd)
|   +-- ln2: LayerNorm(n_embd)
|   +-- ffwd: FeedForward
|       +-- Dense(n_embd -> 4*n_embd)
|       +-- GELU
|       +-- Dense(4*n_embd -> n_embd)
+-- ln_f: LayerNorm(n_embd)
+-- lm_head: Dense(n_embd -> vocab_size)
```
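
In Flux.jl, the diagram above translates roughly as follows. This is an illustrative sketch, not the checkpoint's actual source: the dimensions, the single-head attention (the real model splits across `n_head` heads), and the helper names (`causal_attention`, `Block`, `gpt`) are all assumptions made for exposition.

```julia
using Flux
using LinearAlgebra: triu

const MASK = triu(fill(-Inf32, 32, 32), 1)   # block_size = 32 in this sketch

# Single-head causal self-attention over x :: (C, T, B).
# Forward-pass sketch only; a differentiable version would use NNlib.batched_mul
# instead of mutating a preallocated array in a loop.
function causal_attention(qkv::Dense, proj::Dense, x)
    C, T, B = size(x)
    h = qkv(x)                                                 # (3C, T, B), fused Q/K/V
    q, k, v = h[1:C, :, :], h[C+1:2C, :, :], h[2C+1:3C, :, :]
    y = similar(q)
    for b in 1:B
        s = (q[:, :, b]' * k[:, :, b]) ./ sqrt(Float32(C))     # (T, T): [query, key]
        a = softmax(s .+ MASK[1:T, 1:T]; dims=2)               # mask future keys
        y[:, :, b] = v[:, :, b] * a'
    end
    return proj(y)
end

# One pre-norm transformer block.
struct Block; ln1; qkv; proj; ln2; ffwd; end
Flux.@functor Block

Block(d) = Block(LayerNorm(d), Dense(d => 3d), Dense(d => d),
                 LayerNorm(d), Chain(Dense(d => 4d, gelu), Dense(4d => d)))

function (blk::Block)(x)
    x = x .+ causal_attention(blk.qkv, blk.proj, blk.ln1(x))   # residual 1
    x = x .+ blk.ffwd(blk.ln2(x))                              # residual 2
    return x
end

# Model skeleton: token + position embeddings, N blocks, final norm, lm_head.
vocab, d, N = 28, 64, 4
wte, wpe = Flux.Embedding(vocab => d), Flux.Embedding(32 => d)
blocks = Chain([Block(d) for _ in 1:N]...)
ln_f, lm_head = LayerNorm(d), Dense(d => vocab)

function gpt(ids)                        # ids :: (T, B) matrix of token IDs
    x = wte(ids) .+ wpe(1:size(ids, 1))  # (C, T, B) .+ (C, T)
    return lm_head(ln_f(blocks(x)))      # logits :: (vocab, T, B)
end
```

Note the fused `Dense(d => 3d)` Q/K/V projection and the separate `lm_head`, both GPT-2 traits that the later LLaMA-style models in the family replace.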

### Key Design Choices (GPT-2 era)

| Component | MicroJulia (GPT-2) | Later models (LLaMA-style) |
|---|---|---|
| Normalization | LayerNorm (with bias) | RMSNorm (no bias) |
| Activation | GELU | SwiGLU |
| Position encoding | Learned embeddings | RoPE |
| QKV projection | Fused single Dense | Separate Q, K, V |
| FFN | Standard 4x expansion | SwiGLU with 2/3-scaled hidden width |
| Output head | Separate lm_head | Weight-tied with embedding |
| Tokenizer | Character-level (~28 chars) | BPE (2000 tokens) |

### Character-Level Tokenization

Uses a minimal character vocabulary:

```
a-z, space, period (28 characters)
```

Each character maps directly to a token ID. No subword segmentation: the model must learn word boundaries, morphology, and syntax from individual characters.

**Trade-offs:**
- Simpler tokenizer implementation
- No OOV (out-of-vocabulary) issues
- Model must spend capacity on character-level patterns
- Less efficient than BPE for the same context window
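
A tokenizer of this kind takes only a few lines of Julia. The sketch below is hypothetical; the model's actual mapping ships in `vocab.json`:

```julia
# Minimal character-level tokenizer sketch (hypothetical; the shipped
# vocab.json defines the model's real mapping and ID order).
const CHARS = ['a':'z'..., ' ', '.']                       # 28 characters
const STOI = Dict(c => i for (i, c) in enumerate(CHARS))   # char -> 1-based token ID
const ITOS = Dict(i => c for (i, c) in enumerate(CHARS))   # token ID -> char

encode(s::AbstractString) = [STOI[c] for c in s]
decode(ids::AbstractVector{<:Integer}) = String([ITOS[i] for i in ids])

encode("hello world.")   # 12 token IDs, one per character
```

Round-tripping is exact by construction: `decode(encode(s)) == s` for any string over the 28-character alphabet.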

## Model Details

| Parameter | Value |
|---|---|
| Architecture | GPT-2 style (pre-norm Transformer) |
| Tokenizer | Character-level (~28 characters) |
| Position encoding | Learned position embeddings |
| Normalization | LayerNorm |
| Activation | GELU |
| Output projection | Separate Dense (not weight-tied) |
| Framework | Julia + Flux.jl |

Exact dimensions (vocab_size, n_embd, n_layer, n_head, block_size) are stored in the checkpoint `hyperparams` dict and loaded dynamically.
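
Reading those dimensions back can be sketched with JLD2.jl (the key names follow the checkpoint layout documented in the Files section; rebuilding `model` is left as a comment because it depends on the model code):

```julia
using JLD2   # assumes JLD2.jl is in the active environment

# Sketch: read the checkpoint's hyperparams and metadata.
ckpt = JLD2.load("checkpoint.jld2")
hp = ckpt["hyperparams"]   # Dict with vocab_size, n_embd, block_size, n_layer, n_head
println("step = ", ckpt["step"], ", best_val_loss = ", ckpt["best_val_loss"])

# Rebuild the model from hp["vocab_size"], hp["n_embd"], hp["n_layer"],
# hp["n_head"], hp["block_size"], then restore the weights:
# Flux.loadmodel!(model, ckpt["model_state"])
```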

## Training

| Setting | Value |
|---|---|
| Dataset | Classical philosophy texts |
| Tokenizer | Character-level mapping |
| Framework | Julia + Flux.jl |
| Hardware | Google Colab / NVIDIA GPU |
| Precision | Float32 |

## Implementation Notes

### Causal Masking

Uses a precomputed additive upper-triangular mask (global constant):

```julia
using LinearAlgebra: triu   # triu lives in the LinearAlgebra stdlib

const CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)
```

Applied to the attention scores before the softmax.
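
The effect is easy to see on a toy score matrix. The snippet below is a self-contained demonstration (softmax written inline so only the standard library is needed); it assumes the scores are laid out `[query, key]`, matching the `triu` mask:

```julia
using LinearAlgebra: triu

T = 4
mask = triu(fill(-Inf32, T, T), 1)   # -Inf strictly above the diagonal

scores = zeros(Float32, T, T)        # uniform raw scores, layout [query, key]
masked = scores .+ mask              # future keys become -Inf for each query

# Row-wise softmax over keys; exp(-Inf) = 0, so masked keys get zero weight.
rowsoftmax(m) = mapslices(r -> exp.(r .- maximum(r)) ./ sum(exp.(r .- maximum(r))),
                          m; dims=2)
att = rowsoftmax(masked)
# Query t now attends uniformly over keys 1..t and not at all over keys t+1..T.
```

Row `t` of `att` is `1/t` on positions `1..t` and exactly zero beyond, which is the causal property the mask enforces.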

### Position Embeddings

Learned absolute position embeddings (not RoPE):

```julia
tok = wte(token_ids)   # (C, T, B)
pos = wpe(1:T)         # (C, T); broadcasts across the batch dimension
x = tok .+ pos
```

Limited to the trained block_size; no length extrapolation.
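
One practical consequence: at generation time the context must be cropped to the last `block_size` tokens before every forward pass, since `wpe` has no rows beyond that. A minimal sampling loop sketch, assuming a `model` callable that maps a `(T, B)` matrix of token IDs to `(vocab, T, B)` logits:

```julia
# Autoregressive sampling sketch with context cropping. `model` and the
# greedy-free multinomial sampling here are illustrative assumptions.
function generate(model, ids::Vector{Int}, n_new::Int, block_size::Int)
    for _ in 1:n_new
        ctx = ids[max(1, end - block_size + 1):end]    # crop: learned positions only
        logits = model(reshape(ctx, :, 1))[:, end, 1]  # logits at the last position
        probs = exp.(logits .- maximum(logits))
        probs ./= sum(probs)
        # Multinomial draw from probs:
        r, c, next = rand(), 0.0, 1
        for (i, p) in enumerate(probs)
            c += p
            if r <= c; next = i; break; end
        end
        push!(ids, next)
    end
    return ids
end
```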

## Usage

### OpenAI-Compatible API

Served via the [MicroJulia Space](https://huggingface.co/spaces/LisaMegaWatts/MicroJulia):

```bash
curl -X POST https://lisamegawatts-microjulia.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "hello"}],
    "stream": true
  }'
```

## Files

| File | Description |
|---|---|
| `checkpoint.jld2` | Trained model weights + hyperparams (JLD2 format) |
| `vocab.json` | Character vocabulary mapping |

The checkpoint contains:
- `model_state`: Flux model weights
- `hyperparams`: Dict with vocab_size, n_embd, block_size, n_layer, n_head
- `step`: training step
- `best_val_loss`: best validation loss

## Provenance

- **Author**: LisaMegaWatts
- **Repository**: [DavinciDreams/micro-julia](https://github.com/DavinciDreams/micro-julia)
- **Training date**: February 2026
- **Architecture reference**: GPT-2 (Radford et al., 2019), nanoGPT (Karpathy, 2023)
- **Lineage**: Evolved into [JuliaGPT](https://huggingface.co/LisaMegaWatts/JuliaGPT) (custom autograd) and the Lux.jl model family

## References

- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2).
- Karpathy, A. (2023). nanoGPT. GitHub repository.

## Citation

```bibtex
@misc{microjulia2026,
  title={MicroJulia: A Minimal Character-Level GPT in Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MicroJulia}
}
```

## License

MIT