---
language:
  - en
license: mit
library_name: flux
tags:
  - julia
  - flux-jl
  - gpt-2
  - character-level
  - philosophy
  - transformer
  - text-generation
  - layernorm
  - gelu
  - learned-position-embeddings
pipeline_tag: text-generation
---

# MicroJulia

A GPT-2 style character-level transformer trained on classical philosophy texts, implemented in Julia with Flux.jl. The **first model** in the Julia SLM lineage — a minimal proof-of-concept that established the training and serving infrastructure.

## Model Family Context

MicroJulia is the starting point of an architectural progression:

| Model | Generation | Architecture | Tokenizer | Framework |
|---|---|---|---|---|
| **MicroJulia** | **1st** | **GPT-2 (LayerNorm, GELU, learned pos)** | **Character-level** | **Flux.jl** |
| [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) | 2nd | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) | BPE 2000 | Flux.jl |
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | 3rd | Modern Transformer (RMSNorm, SwiGLU, RoPE) | BPE 2000 | Lux.jl |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | 3rd | Monarch Mixer (sub-quadratic) | BPE 2000 | Lux.jl |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | 3rd | Symbiogenesis (3 organelles) | BPE 2000 | Lux.jl |

## Architecture

Classic GPT-2 design — deliberately minimal:

```
GPT (GPT-2 style)
+-- wte: Embedding(vocab_size -> n_embd)      [token embeddings]
+-- wpe: Embedding(block_size -> n_embd)      [learned position embeddings]
+-- drop: Dropout
+-- blocks x N:
|   +-- ln1: LayerNorm(n_embd)
|   +-- attn: CausalSelfAttention
|   |   +-- qkv: Dense(n_embd -> 3*n_embd)   [fused Q/K/V projection]
|   |   +-- proj: Dense(n_embd -> n_embd)
|   +-- ln2: LayerNorm(n_embd)
|   +-- ffwd: FeedForward
|       +-- Dense(n_embd -> 4*n_embd)
|       +-- GELU
|       +-- Dense(4*n_embd -> n_embd)
+-- ln_f: LayerNorm(n_embd)
+-- lm_head: Dense(n_embd -> vocab_size)
```
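The pre-norm residual wiring above can be sketched in Flux.jl. This is an illustrative reimplementation, not the repository's code: it substitutes Flux's built-in `MultiHeadAttention` for the fused-QKV `CausalSelfAttention`, omits dropout and the causal mask, and all names and dimensions are made up for the example.

```julia
using Flux

# Illustrative sketch of one pre-norm GPT-2 block (not the repo's code).
# Flux's built-in MultiHeadAttention stands in for the fused-QKV
# CausalSelfAttention; dropout and the causal mask are omitted.
struct Block
    ln1; attn; ln2; ffwd
end
Flux.@layer Block   # registers fields as trainable (Flux >= 0.14; older versions use @functor)

function Block(n_embd::Int, n_head::Int)
    Block(
        LayerNorm(n_embd),
        MultiHeadAttention(n_embd; nheads = n_head),
        LayerNorm(n_embd),
        Chain(Dense(n_embd => 4n_embd, gelu),   # 4x expansion + GELU
              Dense(4n_embd => n_embd)),
    )
end

function (b::Block)(x)           # x :: (n_embd, T, B)
    h, _ = b.attn(b.ln1(x))      # MultiHeadAttention returns (output, scores)
    x = x .+ h                   # residual connection 1
    x .+ b.ffwd(b.ln2(x))        # residual connection 2
end
```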

### Key Design Choices (GPT-2 era)

| Component | MicroJulia (GPT-2) | Later Models (LLaMA-style) |
|---|---|---|
| Normalization | LayerNorm (with bias) | RMSNorm (no bias) |
| Activation | GELU | SwiGLU |
| Position encoding | Learned embeddings | RoPE |
| QKV projection | Fused single Dense | Separate Q, K, V |
| FFN | Standard 4x expansion | SwiGLU with hidden width scaled to ~2/3 of 4x |
| Output head | Separate lm_head | Weight-tied with embedding |
| Tokenizer | Character-level (~28 chars) | BPE (2000 tokens) |

### Character-Level Tokenization

Uses a minimal character vocabulary:
```
a-z, space, period (28 characters)
```

Each character maps directly to a token ID. No subword segmentation — the model must learn word boundaries, morphology, and syntax from individual characters.

**Trade-offs:**
- (+) Simpler tokenizer implementation
- (+) No out-of-vocabulary (OOV) issues
- (−) The model must spend capacity learning character-level patterns
- (−) Less token-efficient than BPE for the same context window
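As an illustration, the 28-symbol mapping can be written in a few lines of Julia. The actual mapping ships in `vocab.json`; the 1-based IDs below are an assumption for the sketch, not the checkpoint's real assignment.

```julia
# Illustrative character-level tokenizer for the 28-symbol vocabulary
# described above (a-z, space, period). IDs here are 1-based by
# assumption; the real mapping lives in vocab.json.
const CHARS = collect("abcdefghijklmnopqrstuvwxyz .")
const STOI  = Dict(c => i for (i, c) in enumerate(CHARS))
const ITOS  = Dict(i => c for (i, c) in enumerate(CHARS))

encode(s::AbstractString) = [STOI[c] for c in s]
decode(ids::AbstractVector{<:Integer}) = String([ITOS[i] for i in ids])

encode("hello")           # [8, 5, 12, 12, 15] under this 1-based mapping
decode(encode("hello"))   # "hello"
```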

## Model Details

| Parameter | Value |
|---|---|
| Architecture | GPT-2 style (pre-norm Transformer) |
| Tokenizer | Character-level (~28 characters) |
| Position encoding | Learned position embeddings |
| Normalization | LayerNorm |
| Activation | GELU |
| Output projection | Separate Dense (not weight-tied) |
| Framework | Julia + Flux.jl |

Exact dimensions (vocab_size, n_embd, n_layer, n_head, block_size) are stored in the checkpoint `hyperparams` dict and loaded dynamically.

## Training

| | Value |
|---|---|
| Dataset | Classical philosophy texts |
| Tokenizer | Character-level mapping |
| Framework | Julia + Flux.jl |
| Hardware | Google Colab / NVIDIA GPU |
| Precision | Float32 |

## Implementation Notes

### Causal Masking

Uses a pre-computed additive upper-triangular mask (global constant):
```julia
using LinearAlgebra: triu  # triu is in the LinearAlgebra stdlib

CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)
```
Applied to attention scores before softmax.
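For illustration, the application step might look like the following. The `(query, key)` score layout and the softmax dimension are assumptions chosen to match the `triu` mask above, not the repository's actual code.

```julia
using Flux: softmax
using LinearAlgebra: triu

block_size  = 8
CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)

# Toy attention logits for a length-T sequence (T <= block_size),
# assumed laid out (query, key) so the triu mask hides future keys.
T      = 4
scores = randn(Float32, T, T)
att    = softmax(scores .+ view(CAUSAL_MASK, 1:T, 1:T); dims = 2)
# Row q of `att` now puts zero weight on key positions > q.
```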

### Position Embeddings

Learned absolute position embeddings (not RoPE):
```julia
tok = wte(token_ids)    # (C, T, B)
pos = wpe(1:T)          # (C, T), broadcast across the batch dim
x = tok .+ pos
```

Limited to the trained block_size — no length extrapolation.

## Usage

### OpenAI-Compatible API

Served via [MicroJulia Space](https://huggingface.co/spaces/LisaMegaWatts/MicroJulia):

```bash
curl -X POST https://lisamegawatts-microjulia.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "hello"}],
    "stream": true
  }'
```
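The same endpoint can be called from Julia. This is a minimal non-streaming sketch using HTTP.jl and JSON3.jl; it assumes the response follows the usual OpenAI chat-completions shape (`choices[1].message.content`), which this card does not spell out.

```julia
using HTTP, JSON3

# Minimal non-streaming client for the OpenAI-compatible endpoint above.
# (The curl example streams; "stream" => false keeps this sketch short.)
resp = HTTP.post(
    "https://lisamegawatts-microjulia.hf.space/v1/chat/completions",
    ["Content-Type" => "application/json"],
    JSON3.write(Dict(
        "messages" => [Dict("role" => "user", "content" => "hello")],
        "stream"   => false,
    )),
)

body = JSON3.read(resp.body)
# Assumes the standard OpenAI response shape:
print(body.choices[1].message.content)
```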

## Files

| File | Description |
|---|---|
| `checkpoint.jld2` | Trained model weights + hyperparams (JLD2 format) |
| `vocab.json` | Character vocabulary mapping |

Checkpoint contains:
- `model_state` — Flux model weights
- `hyperparams` — Dict with vocab_size, n_embd, block_size, n_layer, n_head
- `step` — Training step
- `best_val_loss` — Best validation loss
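A loading sketch using the key names listed above. Here `build_gpt` is a hypothetical constructor standing in for the repository's actual model builder; substitute the real one from the repo.

```julia
using JLD2, Flux

# Sketch of restoring the checkpoint. `build_gpt` is hypothetical;
# replace it with the repo's actual model constructor.
ckpt = JLD2.load("checkpoint.jld2")
hp   = ckpt["hyperparams"]                   # vocab_size, n_embd, n_layer, ...

model = build_gpt(hp)                        # construct with the stored dims
Flux.loadmodel!(model, ckpt["model_state"])  # copy weights into the model

@info "restored" step = ckpt["step"] best_val_loss = ckpt["best_val_loss"]
```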

## Provenance

- **Author**: LisaMegaWatts
- **Repository**: [DavinciDreams/micro-julia](https://github.com/DavinciDreams/micro-julia)
- **Training date**: February 2026
- **Architecture reference**: GPT-2 (Radford et al., 2019), nanoGPT (Karpathy, 2023)
- **Lineage**: Evolved into [JuliaGPT](https://huggingface.co/LisaMegaWatts/JuliaGPT) (custom autograd) and the Lux.jl model family

## References

- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2).
- Karpathy, A. (2023). nanoGPT. GitHub repository.

## Citation

```bibtex
@misc{microjulia2026,
  title={MicroJulia: A Minimal Character-Level GPT in Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MicroJulia}
}
```

## License

MIT