MicroJulia

A GPT-2 style character-level transformer trained on classical philosophy texts, implemented in Julia with Flux.jl. The first model in the Julia SLM lineage: a minimal proof-of-concept that established the training and serving infrastructure.

Model Family Context

MicroJulia is the starting point of an architectural progression:

| Model | Generation | Architecture | Tokenizer | Framework |
|---|---|---|---|---|
| MicroJulia | 1st | GPT-2 (LayerNorm, GELU, learned pos) | Character-level | Flux.jl |
| JuliaFluxGPT | 2nd | LLaMA-style (RMSNorm, SwiGLU, RoPE, GQA) | BPE 2000 | Flux.jl |
| JuliaSLM | 3rd | Modern Transformer (RMSNorm, SwiGLU, RoPE) | BPE 2000 | Lux.jl |
| MonarchSLM | 3rd | Monarch Mixer (sub-quadratic) | BPE 2000 | Lux.jl |
| SymbioSLM | 3rd | Symbiogenesis (3 organelles) | BPE 2000 | Lux.jl |

Architecture

Classic GPT-2 design, deliberately minimal:

GPT (GPT-2 style)
+-- wte: Embedding(vocab_size -> n_embd)      [token embeddings]
+-- wpe: Embedding(block_size -> n_embd)      [learned position embeddings]
+-- drop: Dropout
+-- blocks x N:
|   +-- ln1: LayerNorm(n_embd)
|   +-- attn: CausalSelfAttention
|   |   +-- qkv: Dense(n_embd -> 3*n_embd)   [fused Q/K/V projection]
|   |   +-- proj: Dense(n_embd -> n_embd)
|   +-- ln2: LayerNorm(n_embd)
|   +-- ffwd: FeedForward
|       +-- Dense(n_embd -> 4*n_embd)
|       +-- GELU
|       +-- Dense(4*n_embd -> n_embd)
+-- ln_f: LayerNorm(n_embd)
+-- lm_head: Dense(n_embd -> vocab_size)
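As a hedged sketch of the tree above (the dimensions here are illustrative; the real values live in the checkpoint), one pre-norm block's feed-forward path can be assembled in Flux like this:

```julia
using Flux

n_embd = 64                       # illustrative; the real value is loaded from the checkpoint

# Feed-forward sub-block: 4x expansion with GELU, as in the tree above
ffwd = Chain(
    Dense(n_embd => 4n_embd, gelu),
    Dense(4n_embd => n_embd),
)

ln2 = LayerNorm(n_embd)

x = randn(Float32, n_embd, 10)    # (n_embd, T) for a single sequence
y = x .+ ffwd(ln2(x))             # pre-norm residual: x + FFN(LN(x))
size(y)                           # (64, 10)
```

Pre-norm (normalizing before the sub-layer rather than after) is the detail that distinguishes GPT-2 from the original Transformer and makes deep stacks trainable without warmup tricks.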

Key Design Choices (GPT-2 era)

| Component | MicroJulia (GPT-2) | Later Models (LLaMA-style) |
|---|---|---|
| Normalization | LayerNorm (with bias) | RMSNorm (no bias) |
| Activation | GELU | SwiGLU |
| Position encoding | Learned embeddings | RoPE |
| QKV projection | Fused single Dense | Separate Q, K, V |
| FFN | Standard 4x expansion | SwiGLU with 2/3-adjusted hidden size |
| Output head | Separate lm_head | Weight-tied with embedding |
| Tokenizer | Character-level (~28 chars) | BPE (2000 tokens) |

Character-Level Tokenization

Uses a minimal character vocabulary:

a-z, space, period (28 characters)

Each character maps directly to a token ID. There is no subword segmentation: the model must learn word boundaries, morphology, and syntax from individual characters.

Trade-offs:

  • Simpler tokenizer implementation
  • No OOV (out-of-vocabulary) issues
  • Model must spend capacity on character-level patterns
  • Less efficient than BPE for the same context window
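A character-level tokenizer over this vocabulary is a few lines of Julia. This is a minimal sketch; the actual vocab.json may assign IDs in a different order (e.g. 0-based):

```julia
# 28-character vocabulary: a-z, space, period (matches the count above)
chars = vcat('a':'z', [' ', '.'])
stoi = Dict(c => i for (i, c) in enumerate(chars))   # char -> 1-based token ID
itos = Dict(i => c for (i, c) in enumerate(chars))   # token ID -> char

encode(s) = [stoi[c] for c in s]
decode(ids) = String([itos[i] for i in ids])

ids = encode("hello world.")
decode(ids) == "hello world."    # round-trips losslessly
```

Because every input character is in the vocabulary by construction, encoding never fails, which is the "no OOV issues" trade-off noted above.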

Model Details

| Parameter | Value |
|---|---|
| Architecture | GPT-2 style (pre-norm Transformer) |
| Tokenizer | Character-level (~28 characters) |
| Position encoding | Learned position embeddings |
| Normalization | LayerNorm |
| Activation | GELU |
| Output projection | Separate Dense (not weight-tied) |
| Framework | Julia + Flux.jl |

Exact dimensions (vocab_size, n_embd, n_layer, n_head, block_size) are stored in the checkpoint hyperparams dict and loaded dynamically.
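The dynamic-loading pattern can be sketched with a JLD2 round trip. This demo file name and its values are made up for illustration; the real checkpoint additionally stores model_state and best_val_loss:

```julia
using JLD2

# Write a toy checkpoint with a hyperparams dict, mirroring the real layout
hp = Dict("vocab_size" => 28, "n_embd" => 64,
          "block_size" => 128, "n_layer" => 4, "n_head" => 4)
jldsave("demo_checkpoint.jld2"; hyperparams = hp, step = 1000)

# Load it back: JLD2.load returns a Dict keyed by the saved names
ckpt = JLD2.load("demo_checkpoint.jld2")
ckpt["hyperparams"]["vocab_size"]    # 28 -- used to size the model at load time
```

Storing dimensions alongside the weights means the serving code never hard-codes a model shape; it reconstructs the network from the dict before restoring weights.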

Training

| Parameter | Value |
|---|---|
| Dataset | Classical philosophy texts |
| Tokenizer | Character-level mapping |
| Framework | Julia + Flux.jl |
| Hardware | Google Colab / NVIDIA GPU |
| Precision | Float32 |

Implementation Notes

Causal Masking

Uses a pre-computed additive upper-triangular mask (global constant):

CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)

Applied to attention scores before softmax.
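Applying the mask is a single broadcast add before softmax. This sketch assumes rows index queries and columns index keys; the repo's actual layout may use the transposed convention, which would flip the softmax dims:

```julia
using Flux            # re-exports softmax from NNlib
using LinearAlgebra   # triu

block_size = 8
# Same construction as above: -Inf32 strictly above the diagonal, 0 elsewhere
CAUSAL_MASK = triu(fill(-Inf32, block_size, block_size), 1)

T = 4
scores = randn(Float32, T, T)              # raw attention scores for one head
masked = scores .+ CAUSAL_MASK[1:T, 1:T]   # future keys (j > i) pushed to -Inf
probs = softmax(masked; dims=2)            # normalize over keys per query row
```

After softmax, the -Inf entries become exactly zero probability, so query i can only attend to keys 1..i; slicing the precomputed mask to T×T lets shorter sequences reuse the same constant.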

Position Embeddings

Learned absolute position embeddings (not RoPE):

tok = wte(token_ids)    # (C, T, B)
pos = wpe(1:T)          # (C, T), broadcast over the batch dimension
x = tok .+ pos

Limited to the trained block_size; there is no length extrapolation beyond it.

Usage

OpenAI-Compatible API

Served via MicroJulia Space:

curl -X POST https://lisamegawatts-microjulia.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "hello"}],
    "stream": true
  }'

Files

| File | Description |
|---|---|
| checkpoint.jld2 | Trained model weights + hyperparams (JLD2 format) |
| vocab.json | Character vocabulary mapping |

Checkpoint contains:

  • model_state: Flux model weights
  • hyperparams: Dict with vocab_size, n_embd, block_size, n_layer, n_head
  • step: training step
  • best_val_loss: best validation loss

Provenance

  • Author: LisaMegaWatts
  • Repository: DavinciDreams/micro-julia
  • Training date: February 2026
  • Architecture reference: GPT-2 (Radford et al., 2019), nanoGPT (Karpathy, 2023)
  • Lineage: Evolved into JuliaGPT (custom autograd) and the Lux.jl model family

References

  • Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2).
  • Karpathy, A. (2023). nanoGPT. GitHub repository.

Citation

@misc{microjulia2026,
  title={MicroJulia: A Minimal Character-Level GPT in Julia},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/MicroJulia}
}

License

MIT
