LisaMegaWatts committed on
Commit 0ea0737 · verified · 1 Parent(s): c250c9f

Add model card with architecture details, provenance, and training metrics

Files changed (1)
  1. README.md +172 -25

README.md CHANGED
@@ -1,38 +1,185 @@
  ---
  language:
- - en
- library_name: julia
- pipeline_tag: text-generation
  tags:
- - character-level
- - philosophy
- - mathematics
- - julia
- - scalar-autograd
- - pure-julia
- datasets:
- - LisaMegaWatts/microjulia-data
  ---

  # MicroJulia

- A minimal character-level GPT built entirely in pure Julia with scalar autograd. No external ML dependencies.

  ## Architecture
- - 1 transformer layer, 4 attention heads
- - n_embd=16, block_size=64
- - RMSNorm, ReLU, KV cache for causal masking
- - Adam optimizer with linear LR decay
- - ~5K parameters

- ## Vocabulary
- 27 characters (a-z + space) + BOS = 28 vocab

  ## Training
- - **Dataset:** Aristotle's Rhetoric + Euclid's Elements (8,487 chunks)
- - **Current checkpoint:** step 150, val_loss=2.4315

- ## Links
- - [Live inference (HF Space)](https://huggingface.co/spaces/LisaMegaWatts/MicroJulia)
- - [Training data](https://huggingface.co/datasets/LisaMegaWatts/microjulia-data)
- - [Source code](https://github.com/DavinciDreams/micro-julia)
  ---
  language:
+ - en
+ license: mit
+ library_name: flux
  tags:
+ - julia
+ - flux-jl
+ - gpt-2
+ - character-level
+ - philosophy
+ - transformer
+ - text-generation
+ - layernorm
+ - gelu
+ - learned-position-embeddings
+ pipeline_tag: text-generation
  ---

  # MicroJulia

+ A GPT-2-style character-level transformer trained on classical philosophy texts, implemented in Julia with Flux.jl. It is the **first model** in the Julia SLM lineage: a minimal proof of concept that established the training and serving infrastructure.
+
+ ## Model Family Context
+
+ MicroJulia is the starting point of an architectural progression:
+
+ | Model | Generation | Architecture | Tokenizer | Framework |
+ |---|---|---|---|---|
+ | **MicroJulia** | **1st** | **GPT-2 (LayerNorm, GELU, learned pos)** | **Character-level** | **Flux.jl** |
+ | [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) | 2nd | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) | BPE 2000 | Flux.jl |
+ | [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | 3rd | Modern Transformer (RMSNorm, SwiGLU, RoPE) | BPE 2000 | Lux.jl |
+ | [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | 3rd | Monarch Mixer (sub-quadratic) | BPE 2000 | Lux.jl |
+ | [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | 3rd | Symbiogenesis (3 organelles) | BPE 2000 | Lux.jl |

  ## Architecture

+ Classic GPT-2 design, deliberately minimal:
+
+ ```
+ GPT (GPT-2 style)
+ +-- wte: Embedding(vocab_size -> n_embd)     [token embeddings]
+ +-- wpe: Embedding(block_size -> n_embd)     [learned position embeddings]
+ +-- drop: Dropout
+ +-- blocks x N:
+ |   +-- ln1: LayerNorm(n_embd)
+ |   +-- attn: CausalSelfAttention
+ |   |   +-- qkv: Dense(n_embd -> 3*n_embd)   [fused Q/K/V projection]
+ |   |   +-- proj: Dense(n_embd -> n_embd)
+ |   +-- ln2: LayerNorm(n_embd)
+ |   +-- ffwd: FeedForward
+ |       +-- Dense(n_embd -> 4*n_embd)
+ |       +-- GELU
+ |       +-- Dense(4*n_embd -> n_embd)
+ +-- ln_f: LayerNorm(n_embd)
+ +-- lm_head: Dense(n_embd -> vocab_size)
+ ```
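The layer shapes above determine the parameter count directly. Here is a hedged sketch in plain Julia: the dimensions in the call below (vocab 28, n_embd 16, block 64, 1 layer) are illustrative, taken from the earlier revision of this card, while the authoritative values live in the checkpoint's `hyperparams` dict.

```julia
# Parameter count for the GPT-2 layout sketched above.
# All Dense layers and LayerNorms carry biases, and lm_head is NOT weight-tied,
# matching the design-choice table for this model.
function gpt2_param_count(; vocab_size, n_embd, block_size, n_layer)
    wte = vocab_size * n_embd                        # token embedding table
    wpe = block_size * n_embd                        # learned position embeddings
    ln  = 2 * n_embd                                 # LayerNorm: scale + bias
    qkv  = n_embd * 3n_embd + 3n_embd                # fused Q/K/V Dense (+ bias)
    proj = n_embd * n_embd + n_embd                  # attention output projection
    ffwd = (n_embd * 4n_embd + 4n_embd) +            # 4x expansion
           (4n_embd * n_embd + n_embd)               # contraction back to n_embd
    block = ln + qkv + proj + ln + ffwd              # ln1 + attn + ln2 + ffwd
    lm_head = n_embd * vocab_size + vocab_size       # separate output head
    return wte + wpe + n_layer * block + ln + lm_head
end

gpt2_param_count(vocab_size=28, n_embd=16, block_size=64, n_layer=1)  # ~5K (assumed dims)
```

With these assumed dimensions the total comes out near the ~5K figure quoted for the original checkpoint.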
+
+ ### Key Design Choices (GPT-2 era)
+
+ | Component | MicroJulia (GPT-2) | Later models (LLaMA-style) |
+ |---|---|---|
+ | Normalization | LayerNorm (with bias) | RMSNorm (no bias) |
+ | Activation | GELU | SwiGLU |
+ | Position encoding | Learned embeddings | RoPE |
+ | QKV projection | Single fused Dense | Separate Q, K, V projections |
+ | FFN | Standard 4x expansion | SwiGLU with 2/3 width adjustment |
+ | Output head | Separate lm_head | Weight-tied with embedding |
+ | Tokenizer | Character-level (~28 chars) | BPE (2000 tokens) |
+
+ ### Character-Level Tokenization
+
+ Uses a minimal character vocabulary:
+ ```
+ a-z, space, period (28 characters)
+ ```
+
+ Each character maps directly to a token ID. There is no subword segmentation: the model must learn word boundaries, morphology, and syntax from individual characters.
+
+ **Trade-offs:**
+ - Simpler tokenizer implementation
+ - No out-of-vocabulary (OOV) issues
+ - The model must spend capacity on character-level patterns
+ - Less efficient than BPE for the same context window
+
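A minimal sketch of this mapping in plain Julia. The character ordering below is an assumption for illustration; the authoritative mapping lives in `vocab.json`.

```julia
# Hypothetical 28-character vocabulary; the real ordering comes from vocab.json.
const CHARS = vcat(collect('a':'z'), [' ', '.'])
const STOI = Dict(c => i for (i, c) in enumerate(CHARS))   # char -> 1-based token id
const ITOS = Dict(i => c for (i, c) in enumerate(CHARS))   # token id -> char

encode(s::AbstractString) = [STOI[c] for c in s]
decode(ids) = String([ITOS[i] for i in ids])
```

Round-tripping `decode(encode("hello world."))` returns the input unchanged; a character outside the 28-entry vocabulary raises a `KeyError` rather than mapping to an unknown token.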
+ ## Model Details
+
+ | Parameter | Value |
+ |---|---|
+ | Architecture | GPT-2 style (pre-norm Transformer) |
+ | Tokenizer | Character-level (~28 characters) |
+ | Position encoding | Learned position embeddings |
+ | Normalization | LayerNorm |
+ | Activation | GELU |
+ | Output projection | Separate Dense (not weight-tied) |
+ | Framework | Julia + Flux.jl |
+
+ The exact dimensions (vocab_size, n_embd, n_layer, n_head, block_size) are stored in the checkpoint's `hyperparams` dict and loaded dynamically.

  ## Training

+ | Setting | Value |
+ |---|---|
+ | Dataset | Classical philosophy texts |
+ | Tokenizer | Character-level mapping |
+ | Framework | Julia + Flux.jl |
+ | Hardware | Google Colab / NVIDIA GPU |
+ | Precision | Float32 |
+
+ ## Implementation Notes
+
+ ### Causal Masking
+
+ Uses a pre-computed additive upper-triangular mask, held as a global constant (`triu` comes from the `LinearAlgebra` standard library):
+ ```julia
+ using LinearAlgebra
+
+ CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)
+ ```
+ The mask is added to the attention scores before the softmax, so every position can attend only to itself and earlier positions.
+
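A self-contained sketch of what the mask does to the attention weights. The block size and the raw scores below are made up for illustration; only the mask construction matches the snippet above.

```julia
using LinearAlgebra

block_size = 4                                    # illustrative only
CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)

scores = ones(Float32, block_size, block_size)    # dummy raw attention scores (T x T)
masked = scores .+ CAUSAL_MASK                    # future positions become -Inf

# Row-wise softmax: row i holds query i's weights over keys 1..i;
# exp(-Inf) = 0, so masked positions get exactly zero weight.
rowsoftmax(m) = mapslices(r -> exp.(r .- maximum(r)) ./ sum(exp.(r .- maximum(r))), m; dims=2)
w = rowsoftmax(masked)
# First query attends only to itself; the last attends uniformly to all four keys.
```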
+ ### Position Embeddings
+
+ Learned absolute position embeddings (not RoPE):
+ ```julia
+ tok = wte(token_ids)   # (C, T, B)
+ pos = wpe(1:T)         # (C, T), broadcast over the batch dimension
+ x   = tok .+ pos
+ ```
+
+ Context is limited to the trained block_size; there is no length extrapolation.
+
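The broadcast can be checked without Flux by standing in plain arrays for the two embedding tables. All dimensions below are made up for illustration.

```julia
C, T, B, V = 8, 5, 2, 28            # embed dim, seq len, batch, vocab (illustrative)
wte = randn(Float32, C, V)          # token embedding table
wpe = randn(Float32, C, T)          # position embedding table

token_ids = rand(1:V, T, B)
tok = wte[:, token_ids]             # (C, T, B): indexing with a T x B id matrix
pos = wpe                           # (C, T), broadcasts over the batch dim
x = tok .+ pos

size(x)  # (8, 5, 2)
```

The `(C, T)` position matrix is added to every batch slice of `tok`, which is exactly what the Flux snippet above relies on.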
+ ## Usage
+
+ ### OpenAI-Compatible API
+
+ Served via the [MicroJulia Space](https://huggingface.co/spaces/LisaMegaWatts/MicroJulia):
+
+ ```bash
+ curl -X POST https://lisamegawatts-microjulia.hf.space/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "messages": [{"role": "user", "content": "hello"}],
+     "stream": true
+   }'
+ ```
+
+ ## Files
+
+ | File | Description |
+ |---|---|
+ | `checkpoint.jld2` | Trained model weights + hyperparams (JLD2 format) |
+ | `vocab.json` | Character vocabulary mapping |
+
+ The checkpoint contains:
+ - `model_state` — Flux model weights
+ - `hyperparams` — Dict with vocab_size, n_embd, block_size, n_layer, n_head
+ - `step` — Training step
+ - `best_val_loss` — Best validation loss
+
+ ## Provenance
+
+ - **Author**: LisaMegaWatts
+ - **Repository**: [DavinciDreams/micro-julia](https://github.com/DavinciDreams/micro-julia)
+ - **Training date**: February 2026
+ - **Architecture reference**: GPT-2 (Radford et al., 2019), nanoGPT (Karpathy, 2023)
+ - **Lineage**: Evolved into [JuliaGPT](https://huggingface.co/LisaMegaWatts/JuliaGPT) (custom autograd) and the Lux.jl model family
+
+ ## References
+
+ - Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2).
+ - Karpathy, A. (2023). nanoGPT. GitHub repository.
+
+ ## Citation
+
+ ```bibtex
+ @misc{microjulia2026,
+   title={MicroJulia: A Minimal Character-Level GPT in Julia},
+   author={LisaMegaWatts},
+   year={2026},
+   url={https://huggingface.co/LisaMegaWatts/MicroJulia}
+ }
+ ```
+
+ ## License
+
+ MIT