---
language:
- en
license: mit
library_name: flux
tags:
- julia
- flux-jl
- gpt-2
- character-level
- philosophy
- transformer
- text-generation
- layernorm
- gelu
- learned-position-embeddings
pipeline_tag: text-generation
---
# MicroJulia
A GPT-2 style character-level transformer trained on classical philosophy texts, implemented in Julia with Flux.jl. The **first model** in the Julia SLM lineage: a minimal proof of concept that established the training and serving infrastructure.
## Model Family Context
MicroJulia is the starting point of an architectural progression:
| Model | Generation | Architecture | Tokenizer | Framework |
|---|---|---|---|---|
| **MicroJulia** | **1st** | **GPT-2 (LayerNorm, GELU, learned pos)** | **Character-level** | **Flux.jl** |
| [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) | 2nd | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) | BPE 2000 | Flux.jl |
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | 3rd | Modern Transformer (RMSNorm, SwiGLU, RoPE) | BPE 2000 | Lux.jl |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | 3rd | Monarch Mixer (sub-quadratic) | BPE 2000 | Lux.jl |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | 3rd | Symbiogenesis (3 organelles) | BPE 2000 | Lux.jl |
## Architecture
Classic GPT-2 design, deliberately minimal:
```
GPT (GPT-2 style)
+-- wte: Embedding(vocab_size -> n_embd) [token embeddings]
+-- wpe: Embedding(block_size -> n_embd) [learned position embeddings]
+-- drop: Dropout
+-- blocks x N:
| +-- ln1: LayerNorm(n_embd)
| +-- attn: CausalSelfAttention
| | +-- qkv: Dense(n_embd -> 3*n_embd) [fused Q/K/V projection]
| | +-- proj: Dense(n_embd -> n_embd)
| +-- ln2: LayerNorm(n_embd)
| +-- ffwd: FeedForward
| +-- Dense(n_embd -> 4*n_embd)
| +-- GELU
| +-- Dense(4*n_embd -> n_embd)
+-- ln_f: LayerNorm(n_embd)
+-- lm_head: Dense(n_embd -> vocab_size)
```
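For intuition, the sketch below shows how the pre-norm residual wiring of one block can be expressed in Flux.jl. It is an illustration under assumed names and dimensions, not the checkpoint's actual code; the `attn` argument stands in for the fused-QKV `CausalSelfAttention` layer in the diagram.
```julia
using Flux

# Illustrative pre-norm block wiring (assumed names, not the repository's exact code).
# `attn` is any layer mapping (n_embd, T, B) -> (n_embd, T, B).
function make_block(n_embd::Int; attn = identity, dropout = 0.1)
    ln1  = LayerNorm(n_embd)
    ln2  = LayerNorm(n_embd)
    ffwd = Chain(
        Dense(n_embd => 4n_embd, gelu),   # 4x expansion with GELU
        Dense(4n_embd => n_embd),
        Dropout(dropout),
    )
    # Pre-norm residuals: x + Attn(LN(x)), then x + FFN(LN(x))
    return x -> begin
        x = x .+ attn(ln1(x))
        x = x .+ ffwd(ln2(x))
        x
    end
end

block = make_block(64)                # attn = identity, just to exercise the wiring
x = randn(Float32, 64, 16, 2)         # (n_embd, T, B)
@assert size(block(x)) == size(x)
```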
### Key Design Choices (GPT-2 era)
| Component | MicroJulia (GPT-2) | Later Models (LLaMA-style) |
|---|---|---|
| Normalization | LayerNorm (with bias) | RMSNorm (no bias) |
| Activation | GELU | SwiGLU |
| Position encoding | Learned embeddings | RoPE |
| QKV projection | Fused single Dense | Separate Q, K, V |
| FFN | Standard 4x expansion | SwiGLU with 2/3-scaled hidden size |
| Output head | Separate lm_head | Weight-tied with embedding |
| Tokenizer | Character-level (28 characters) | BPE (2000 tokens) |
### Character-Level Tokenization
Uses a minimal character vocabulary:
```
a-z, space, period (28 characters)
```
Each character maps directly to a token ID. There is no subword segmentation: the model must learn word boundaries, morphology, and syntax from individual characters. A sketch of this mapping follows the trade-off list below.
**Trade-offs:**
- Simpler tokenizer implementation
- No OOV (out-of-vocabulary) issues
- Model must spend capacity on character-level patterns
- Less efficient than BPE for the same context window
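As a rough sketch of the character-to-ID mapping (the repository's `vocab.json` is authoritative; the index base and ordering here are assumptions), encoding and decoding reduce to two dictionary lookups:
```julia
# Illustrative character-level tokenizer for the assumed 28-character vocabulary.
const CHARS = vcat(collect('a':'z'), [' ', '.'])            # 28 characters
const STOI  = Dict(c => i for (i, c) in enumerate(CHARS))   # char -> token id (1-based here)
const ITOS  = Dict(i => c for (i, c) in enumerate(CHARS))   # token id -> char

encode(s::AbstractString) = [STOI[c] for c in lowercase(s)]
decode(ids::AbstractVector{<:Integer}) = String([ITOS[i] for i in ids])

ids = encode("know thyself.")
@assert decode(ids) == "know thyself."
```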
## Model Details
| Parameter | Value |
|---|---|
| Architecture | GPT-2 style (pre-norm Transformer) |
| Tokenizer | Character-level (28 characters) |
| Position encoding | Learned position embeddings |
| Normalization | LayerNorm |
| Activation | GELU |
| Output projection | Separate Dense (not weight-tied) |
| Framework | Julia + Flux.jl |
Exact dimensions (`vocab_size`, `n_embd`, `n_layer`, `n_head`, `block_size`) are stored in the checkpoint's `hyperparams` dict and loaded dynamically.
## Training
| | Value |
|---|---|
| Dataset | Classical philosophy texts |
| Tokenizer | Character-level mapping |
| Framework | Julia + Flux.jl |
| Hardware | Google Colab / NVIDIA GPU |
| Precision | Float32 |
## Implementation Notes
### Causal Masking
Uses a pre-computed additive upper-triangular mask (global constant):
```julia
using LinearAlgebra  # for triu
const CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)
```
Applied to attention scores before softmax.
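A small, self-contained illustration of the masking step (the dimension ordering here is an assumption; the repository may lay out the score tensor differently):
```julia
using Flux, LinearAlgebra    # softmax (via Flux/NNlib) and triu

T = 4
CAUSAL_MASK = triu(fill(-Inf32, T, T), 1)     # -Inf strictly above the diagonal

# Assumed layout for this sketch: scores[q, k] = dot(q_q, k_k) / sqrt(d),
# rows index queries, columns index keys.
scores = randn(Float32, T, T)
masked = scores .+ CAUSAL_MASK                # future keys (k > q) become -Inf
att    = softmax(masked; dims = 2)            # normalize over the key dimension
@assert all(att[1, 2:end] .== 0)              # query 1 attends only to key 1
```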
### Position Embeddings
Learned absolute position embeddings (not RoPE):
```julia
tok = wte(token_ids) # (C, T, B)
pos = wpe(1:T)       # (C, T), broadcast over the batch dimension
x = tok .+ pos
```
Limited to the trained `block_size`; there is no length extrapolation.
## Usage
### OpenAI-Compatible API
Served via [MicroJulia Space](https://huggingface.co/spaces/LisaMegaWatts/MicroJulia):
```bash
curl -X POST https://lisamegawatts-microjulia.hf.space/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "hello"}],
"stream": true
}'
```
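For completeness, a hypothetical Julia client using HTTP.jl and JSON3.jl (neither is part of this repository) that sends the same streaming request and prints the raw SSE chunks:
```julia
using HTTP, JSON3

# Hypothetical client for the Space's streaming endpoint (not part of this repo).
url  = "https://lisamegawatts-microjulia.hf.space/v1/chat/completions"
body = JSON3.write(Dict(
    "messages" => [Dict("role" => "user", "content" => "hello")],
    "stream"   => true,
))

HTTP.open("POST", url, ["Content-Type" => "application/json"]) do io
    write(io, body)
    HTTP.closewrite(io)                        # done sending the request body
    HTTP.startread(io)                         # begin reading the response
    while !eof(io)
        print(String(readavailable(io)))       # raw SSE chunks, as with curl
    end
    HTTP.closeread(io)
end
```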
## Files
| File | Description |
|---|---|
| `checkpoint.jld2` | Trained model weights + hyperparams (JLD2 format) |
| `vocab.json` | Character vocabulary mapping |
Checkpoint contains:
- `model_state`: Flux model weights
- `hyperparams`: Dict with `vocab_size`, `n_embd`, `block_size`, `n_layer`, `n_head`
- `step`: training step
- `best_val_loss`: best validation loss
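A sketch of inspecting the checkpoint with JLD2.jl, assuming the keys listed above (rebuilding the Flux model itself requires the architecture definition from the repository):
```julia
using JLD2

ckpt = JLD2.load("checkpoint.jld2")    # Dict of the top-level keys listed above

# Key type (String vs Symbol) inside hyperparams depends on how it was saved.
hp = ckpt["hyperparams"]
println("hyperparams: ", hp)
println("step = ", ckpt["step"], ", best_val_loss = ", ckpt["best_val_loss"])

# model_state holds the Flux weights; restoring them needs the GPT definition
# from the repository, e.g. (hypothetical constructor name):
#   model = build_gpt(hp)
#   Flux.loadmodel!(model, ckpt["model_state"])
```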
## Provenance
- **Author**: LisaMegaWatts
- **Repository**: [DavinciDreams/micro-julia](https://github.com/DavinciDreams/micro-julia)
- **Training date**: February 2026
- **Architecture reference**: GPT-2 (Radford et al., 2019), nanoGPT (Karpathy, 2023)
- **Lineage**: Evolved into [JuliaGPT](https://huggingface.co/LisaMegaWatts/JuliaGPT) (custom autograd) and the Lux.jl model family
## References
- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2).
- Karpathy, A. (2023). nanoGPT. GitHub repository.
## Citation
```bibtex
@misc{microjulia2026,
title={MicroJulia: A Minimal Character-Level GPT in Julia},
author={LisaMegaWatts},
year={2026},
url={https://huggingface.co/LisaMegaWatts/MicroJulia}
}
```
## License
MIT