---
language:
- en
license: mit
library_name: flux
tags:
- julia
- flux-jl
- gpt-2
- character-level
- philosophy
- transformer
- text-generation
- layernorm
- gelu
- learned-position-embeddings
pipeline_tag: text-generation
---
# MicroJulia
A GPT-2 style character-level transformer trained on classical philosophy texts, implemented in Julia with Flux.jl. The **first model** in the Julia SLM lineage: a minimal proof of concept that established the training and serving infrastructure.
## Model Family Context
MicroJulia is the starting point of an architectural progression:
| Model | Generation | Architecture | Tokenizer | Framework |
|---|---|---|---|---|
| **MicroJulia** | **1st** | **GPT-2 (LayerNorm, GELU, learned pos)** | **Character-level** | **Flux.jl** |
| [JuliaFluxGPT](https://huggingface.co/LisaMegaWatts/JuliaFluxGPT) | 2nd | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) | BPE 2000 | Flux.jl |
| [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | 3rd | Modern Transformer (RMSNorm, SwiGLU, RoPE) | BPE 2000 | Lux.jl |
| [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | 3rd | Monarch Mixer (sub-quadratic) | BPE 2000 | Lux.jl |
| [SymbioSLM](https://huggingface.co/LisaMegaWatts/SymbioSLM) | 3rd | Symbiogenesis (3 organelles) | BPE 2000 | Lux.jl |
## Architecture
Classic GPT-2 design, deliberately minimal:
```
GPT (GPT-2 style)
+-- wte: Embedding(vocab_size -> n_embd) [token embeddings]
+-- wpe: Embedding(block_size -> n_embd) [learned position embeddings]
+-- drop: Dropout
+-- blocks x N:
| +-- ln1: LayerNorm(n_embd)
| +-- attn: CausalSelfAttention
| | +-- qkv: Dense(n_embd -> 3*n_embd) [fused Q/K/V projection]
| | +-- proj: Dense(n_embd -> n_embd)
| +-- ln2: LayerNorm(n_embd)
| +-- ffwd: FeedForward
| +-- Dense(n_embd -> 4*n_embd)
| +-- GELU
| +-- Dense(4*n_embd -> n_embd)
+-- ln_f: LayerNorm(n_embd)
+-- lm_head: Dense(n_embd -> vocab_size)
```
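For intuition, the sketch below shows how the pre-norm residual wiring of one block can be expressed in Flux.jl. It is an illustration under assumed names and dimensions, not the checkpoint's actual code; the `attn` argument stands in for the fused-QKV `CausalSelfAttention` layer in the diagram.
```julia
using Flux

# Illustrative pre-norm block wiring (assumed names, not the repository's exact code).
# `attn` is any layer mapping (n_embd, T, B) -> (n_embd, T, B).
function make_block(n_embd::Int; attn = identity, dropout = 0.1)
    ln1  = LayerNorm(n_embd)
    ln2  = LayerNorm(n_embd)
    ffwd = Chain(
        Dense(n_embd => 4n_embd, gelu),   # 4x expansion with GELU
        Dense(4n_embd => n_embd),
        Dropout(dropout),
    )
    # Pre-norm residuals: x + Attn(LN(x)), then x + FFN(LN(x))
    return x -> begin
        x = x .+ attn(ln1(x))
        x = x .+ ffwd(ln2(x))
        x
    end
end

block = make_block(64)                # attn = identity, just to exercise the wiring
x = randn(Float32, 64, 16, 2)         # (n_embd, T, B)
@assert size(block(x)) == size(x)
```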
### Key Design Choices (GPT-2 era)
| Component | MicroJulia (GPT-2) | Later Models (LLaMA-style) |
|---|---|---|
| Normalization | LayerNorm (with bias) | RMSNorm (no bias) |
| Activation | GELU | SwiGLU |
| Position encoding | Learned embeddings | RoPE |
| QKV projection | Fused single Dense | Separate Q, K, V |
| FFN | Standard 4x expansion | SwiGLU with 2/3-scaled hidden size |
| Output head | Separate lm_head | Weight-tied with embedding |
| Tokenizer | Character-level (28 characters) | BPE (2000 tokens) |
### Character-Level Tokenization
Uses a minimal character vocabulary:
```
a-z, space, period (28 characters)
```
Each character maps directly to a token ID. There is no subword segmentation: the model must learn word boundaries, morphology, and syntax from individual characters. A sketch of this mapping follows the trade-off list below.
**Trade-offs:**
- Simpler tokenizer implementation
- No OOV (out-of-vocabulary) issues
- Model must spend capacity on character-level patterns
- Less efficient than BPE for the same context window
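As a rough sketch of the character-to-ID mapping (the repository's `vocab.json` is authoritative; the index base and ordering here are assumptions), encoding and decoding reduce to two dictionary lookups:
```julia
# Illustrative character-level tokenizer for the assumed 28-character vocabulary.
const CHARS = vcat(collect('a':'z'), [' ', '.'])            # 28 characters
const STOI  = Dict(c => i for (i, c) in enumerate(CHARS))   # char -> token id (1-based here)
const ITOS  = Dict(i => c for (i, c) in enumerate(CHARS))   # token id -> char

encode(s::AbstractString) = [STOI[c] for c in lowercase(s)]
decode(ids::AbstractVector{<:Integer}) = String([ITOS[i] for i in ids])

ids = encode("know thyself.")
@assert decode(ids) == "know thyself."
```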
## Model Details
| Parameter | Value |
|---|---|
| Architecture | GPT-2 style (pre-norm Transformer) |
| Tokenizer | Character-level (28 characters) |
| Position encoding | Learned position embeddings |
| Normalization | LayerNorm |
| Activation | GELU |
| Output projection | Separate Dense (not weight-tied) |
| Framework | Julia + Flux.jl |
Exact dimensions (`vocab_size`, `n_embd`, `n_layer`, `n_head`, `block_size`) are stored in the checkpoint's `hyperparams` dict and loaded dynamically.
## Training
| | Value |
|---|---|
| Dataset | Classical philosophy texts |
| Tokenizer | Character-level mapping |
| Framework | Julia + Flux.jl |
| Hardware | Google Colab / NVIDIA GPU |
| Precision | Float32 |
## Implementation Notes
### Causal Masking
Uses a pre-computed additive upper-triangular mask (global constant):
```julia
using LinearAlgebra  # for triu
const CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)
```
Applied to attention scores before softmax.
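A small, self-contained illustration of the masking step (the dimension ordering here is an assumption; the repository may lay out the score tensor differently):
```julia
using Flux, LinearAlgebra    # softmax (via Flux/NNlib) and triu

T = 4
CAUSAL_MASK = triu(fill(-Inf32, T, T), 1)     # -Inf strictly above the diagonal

# Assumed layout for this sketch: scores[q, k] = dot(q_q, k_k) / sqrt(d),
# rows index queries, columns index keys.
scores = randn(Float32, T, T)
masked = scores .+ CAUSAL_MASK                # future keys (k > q) become -Inf
att    = softmax(masked; dims = 2)            # normalize over the key dimension
@assert all(att[1, 2:end] .== 0)              # query 1 attends only to key 1
```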
### Position Embeddings
Learned absolute position embeddings (not RoPE):
```julia
tok = wte(token_ids) # (C, T, B)
pos = wpe(1:T)       # (C, T), broadcast over the batch dimension
x = tok .+ pos
```
Limited to the trained `block_size`; there is no length extrapolation.
## Usage
### OpenAI-Compatible API
Served via [MicroJulia Space](https://huggingface.co/spaces/LisaMegaWatts/MicroJulia):
```bash
curl -X POST https://lisamegawatts-microjulia.hf.space/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "hello"}],
"stream": true
}'
```
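For completeness, a hypothetical Julia client using HTTP.jl and JSON3.jl (neither is part of this repository) that sends the same streaming request and prints the raw SSE chunks:
```julia
using HTTP, JSON3

# Hypothetical client for the Space's streaming endpoint (not part of this repo).
url  = "https://lisamegawatts-microjulia.hf.space/v1/chat/completions"
body = JSON3.write(Dict(
    "messages" => [Dict("role" => "user", "content" => "hello")],
    "stream"   => true,
))

HTTP.open("POST", url, ["Content-Type" => "application/json"]) do io
    write(io, body)
    HTTP.closewrite(io)                        # done sending the request body
    HTTP.startread(io)                         # begin reading the response
    while !eof(io)
        print(String(readavailable(io)))       # raw SSE chunks, as with curl
    end
    HTTP.closeread(io)
end
```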
## Files
| File | Description |
|---|---|
| `checkpoint.jld2` | Trained model weights + hyperparams (JLD2 format) |
| `vocab.json` | Character vocabulary mapping |
Checkpoint contains:
- `model_state`: Flux model weights
- `hyperparams`: Dict with `vocab_size`, `n_embd`, `block_size`, `n_layer`, `n_head`
- `step`: training step
- `best_val_loss`: best validation loss
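A sketch of inspecting the checkpoint with JLD2.jl, assuming the keys listed above (rebuilding the Flux model itself requires the architecture definition from the repository):
```julia
using JLD2

ckpt = JLD2.load("checkpoint.jld2")    # Dict of the top-level keys listed above

# Key type (String vs Symbol) inside hyperparams depends on how it was saved.
hp = ckpt["hyperparams"]
println("hyperparams: ", hp)
println("step = ", ckpt["step"], ", best_val_loss = ", ckpt["best_val_loss"])

# model_state holds the Flux weights; restoring them needs the GPT definition
# from the repository, e.g. (hypothetical constructor name):
#   model = build_gpt(hp)
#   Flux.loadmodel!(model, ckpt["model_state"])
```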
## Provenance
- **Author**: LisaMegaWatts
- **Repository**: [DavinciDreams/micro-julia](https://github.com/DavinciDreams/micro-julia)
- **Training date**: February 2026
- **Architecture reference**: GPT-2 (Radford et al., 2019), nanoGPT (Karpathy, 2023)
- **Lineage**: Evolved into [JuliaGPT](https://huggingface.co/LisaMegaWatts/JuliaGPT) (custom autograd) and the Lux.jl model family
## References
- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2).
- Karpathy, A. (2023). nanoGPT. GitHub repository.
## Citation
```bibtex
@misc{microjulia2026,
title={MicroJulia: A Minimal Character-Level GPT in Julia},
author={LisaMegaWatts},
year={2026},
url={https://huggingface.co/LisaMegaWatts/MicroJulia}
}
```
## License
MIT