---
language:
  - en
license: mit
library_name: flux
tags:
  - julia
  - flux-jl
  - gpt-2
  - character-level
  - philosophy
  - transformer
  - text-generation
  - layernorm
  - gelu
  - learned-position-embeddings
pipeline_tag: text-generation
---

# MicroJulia

A GPT-2 style character-level transformer trained on classical philosophy texts, implemented in Julia with Flux.jl. The **first model** in the Julia SLM lineage — a minimal proof-of-concept that established the training and serving infrastructure.

## Model Family Context

MicroJulia is the starting point of an architectural progression:

| Model | Generation | Architecture | Tokenizer | Framework |
|---|---|---|---|---|
| **MicroJulia** | **1st** | **GPT-2 (LayerNorm, GELU, learned pos)** | **Character-level** | **Flux.jl** |
| [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) | 2nd | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) | BPE 2000 | Flux.jl |
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | 3rd | Modern Transformer (RMSNorm, SwiGLU, RoPE) | BPE 2000 | Lux.jl |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | 3rd | Monarch Mixer (sub-quadratic) | BPE 2000 | Lux.jl |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | 3rd | Symbiogenesis (3 organelles) | BPE 2000 | Lux.jl |

## Architecture

Classic GPT-2 design — deliberately minimal:

```
GPT (GPT-2 style)
+-- wte: Embedding(vocab_size -> n_embd)      [token embeddings]
+-- wpe: Embedding(block_size -> n_embd)      [learned position embeddings]
+-- drop: Dropout
+-- blocks x N:
|   +-- ln1: LayerNorm(n_embd)
|   +-- attn: CausalSelfAttention
|   |   +-- qkv: Dense(n_embd -> 3*n_embd)   [fused Q/K/V projection]
|   |   +-- proj: Dense(n_embd -> n_embd)
|   +-- ln2: LayerNorm(n_embd)
|   +-- ffwd: FeedForward
|       +-- Dense(n_embd -> 4*n_embd)
|       +-- GELU
|       +-- Dense(4*n_embd -> n_embd)
+-- ln_f: LayerNorm(n_embd)
+-- lm_head: Dense(n_embd -> vocab_size)
```
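The pre-norm residual wiring above can be sketched in Flux.jl. This is an illustrative reimplementation, not the repository's code: it substitutes Flux's built-in `MultiHeadAttention` for the fused-QKV `CausalSelfAttention`, omits dropout and the causal mask, and all names and dimensions are made up for the example.

```julia
using Flux

# Illustrative sketch of one pre-norm GPT-2 block (not the repo's code).
# Flux's built-in MultiHeadAttention stands in for the fused-QKV
# CausalSelfAttention; dropout and the causal mask are omitted.
struct Block
    ln1; attn; ln2; ffwd
end
Flux.@layer Block   # registers fields as trainable (Flux >= 0.14; older versions use @functor)

function Block(n_embd::Int, n_head::Int)
    Block(
        LayerNorm(n_embd),
        MultiHeadAttention(n_embd; nheads = n_head),
        LayerNorm(n_embd),
        Chain(Dense(n_embd => 4n_embd, gelu),   # 4x expansion + GELU
              Dense(4n_embd => n_embd)),
    )
end

function (b::Block)(x)           # x :: (n_embd, T, B)
    h, _ = b.attn(b.ln1(x))      # MultiHeadAttention returns (output, scores)
    x = x .+ h                   # residual connection 1
    x .+ b.ffwd(b.ln2(x))        # residual connection 2
end
```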

### Key Design Choices (GPT-2 era)

| Component | MicroJulia (GPT-2) | Later Models (LLaMA-style) |
|---|---|---|
| Normalization | LayerNorm (with bias) | RMSNorm (no bias) |
| Activation | GELU | SwiGLU |
| Position encoding | Learned embeddings | RoPE |
| QKV projection | Fused single Dense | Separate Q, K, V |
| FFN | Standard 4x expansion | SwiGLU with hidden width scaled to ~2/3 of 4x |
| Output head | Separate lm_head | Weight-tied with embedding |
| Tokenizer | Character-level (~28 chars) | BPE (2000 tokens) |

### Character-Level Tokenization

Uses a minimal character vocabulary:
```
a-z, space, period (28 characters)
```

Each character maps directly to a token ID. No subword segmentation — the model must learn word boundaries, morphology, and syntax from individual characters.

**Trade-offs:**
- (+) Simpler tokenizer implementation
- (+) No out-of-vocabulary (OOV) issues
- (−) The model must spend capacity learning character-level patterns
- (−) Less token-efficient than BPE for the same context window
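As an illustration, the 28-symbol mapping can be written in a few lines of Julia. The actual mapping ships in `vocab.json`; the 1-based IDs below are an assumption for the sketch, not the checkpoint's real assignment.

```julia
# Illustrative character-level tokenizer for the 28-symbol vocabulary
# described above (a-z, space, period). IDs here are 1-based by
# assumption; the real mapping lives in vocab.json.
const CHARS = collect("abcdefghijklmnopqrstuvwxyz .")
const STOI  = Dict(c => i for (i, c) in enumerate(CHARS))
const ITOS  = Dict(i => c for (i, c) in enumerate(CHARS))

encode(s::AbstractString) = [STOI[c] for c in s]
decode(ids::AbstractVector{<:Integer}) = String([ITOS[i] for i in ids])

encode("hello")           # [8, 5, 12, 12, 15] under this 1-based mapping
decode(encode("hello"))   # "hello"
```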

## Model Details

| Parameter | Value |
|---|---|
| Architecture | GPT-2 style (pre-norm Transformer) |
| Tokenizer | Character-level (~28 characters) |
| Position encoding | Learned position embeddings |
| Normalization | LayerNorm |
| Activation | GELU |
| Output projection | Separate Dense (not weight-tied) |
| Framework | Julia + Flux.jl |

Exact dimensions (vocab_size, n_embd, n_layer, n_head, block_size) are stored in the checkpoint `hyperparams` dict and loaded dynamically.

## Training

| | Value |
|---|---|
| Dataset | Classical philosophy texts |
| Tokenizer | Character-level mapping |
| Framework | Julia + Flux.jl |
| Hardware | Google Colab / NVIDIA GPU |
| Precision | Float32 |

## Implementation Notes

### Causal Masking

Uses a pre-computed additive upper-triangular mask (global constant):
```julia
using LinearAlgebra: triu  # triu is in the LinearAlgebra stdlib

CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)
```
Applied to attention scores before softmax.
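For illustration, the application step might look like the following. The `(query, key)` score layout and the softmax dimension are assumptions chosen to match the `triu` mask above, not the repository's actual code.

```julia
using Flux: softmax
using LinearAlgebra: triu

block_size  = 8
CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)

# Toy attention logits for a length-T sequence (T <= block_size),
# assumed laid out (query, key) so the triu mask hides future keys.
T      = 4
scores = randn(Float32, T, T)
att    = softmax(scores .+ view(CAUSAL_MASK, 1:T, 1:T); dims = 2)
# Row q of `att` now puts zero weight on key positions > q.
```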

### Position Embeddings

Learned absolute position embeddings (not RoPE):
```julia
tok = wte(token_ids)    # (C, T, B)
pos = wpe(1:T)          # (C, T), broadcast across the batch dim
x = tok .+ pos
```

Limited to the trained block_size — no length extrapolation.

## Usage

### OpenAI-Compatible API

Served via [MicroJulia Space](https://huggingface.co/spaces/LisaMegaWatts/MicroJulia):

```bash
curl -X POST https://lisamegawatts-microjulia.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "hello"}],
    "stream": true
  }'
```
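The same endpoint can be called from Julia. This is a minimal non-streaming sketch using HTTP.jl and JSON3.jl; it assumes the response follows the usual OpenAI chat-completions shape (`choices[1].message.content`), which this card does not spell out.

```julia
using HTTP, JSON3

# Minimal non-streaming client for the OpenAI-compatible endpoint above.
# (The curl example streams; "stream" => false keeps this sketch short.)
resp = HTTP.post(
    "https://lisamegawatts-microjulia.hf.space/v1/chat/completions",
    ["Content-Type" => "application/json"],
    JSON3.write(Dict(
        "messages" => [Dict("role" => "user", "content" => "hello")],
        "stream"   => false,
    )),
)

body = JSON3.read(resp.body)
# Assumes the standard OpenAI response shape:
print(body.choices[1].message.content)
```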

## Files

| File | Description |
|---|---|
| `checkpoint.jld2` | Trained model weights + hyperparams (JLD2 format) |
| `vocab.json` | Character vocabulary mapping |

Checkpoint contains:
- `model_state` — Flux model weights
- `hyperparams` — Dict with vocab_size, n_embd, block_size, n_layer, n_head
- `step` — Training step
- `best_val_loss` — Best validation loss
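A loading sketch using the key names listed above. Here `build_gpt` is a hypothetical constructor standing in for the repository's actual model builder; substitute the real one from the repo.

```julia
using JLD2, Flux

# Sketch of restoring the checkpoint. `build_gpt` is hypothetical;
# replace it with the repo's actual model constructor.
ckpt = JLD2.load("checkpoint.jld2")
hp   = ckpt["hyperparams"]                   # vocab_size, n_embd, n_layer, ...

model = build_gpt(hp)                        # construct with the stored dims
Flux.loadmodel!(model, ckpt["model_state"])  # copy weights into the model

@info "restored" step = ckpt["step"] best_val_loss = ckpt["best_val_loss"]
```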

## Provenance

- **Author**: LisaMegaWatts
- **Repository**: [DavinciDreams/micro-julia](https://github.com/DavinciDreams/micro-julia)
- **Training date**: February 2026
- **Architecture reference**: GPT-2 (Radford et al., 2019), nanoGPT (Karpathy, 2023)
- **Lineage**: Evolved into [JuliaGPT](https://huggingface.co/LisaMegaWatts/JuliaGPT) (custom autograd) and the Lux.jl model family

## References

- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2).
- Karpathy, A. (2023). nanoGPT. GitHub repository.

## Citation

```bibtex
@misc{microjulia2026,
  title={MicroJulia: A Minimal Character-Level GPT in Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MicroJulia}
}
```

## License

MIT