Space LLM - 10M Parameter Astronomy Language Model
A custom decoder-only transformer trained from scratch on space/astronomy text.
Model Details
- Parameters: 10.4 million
- Architecture: Decoder-only transformer
  - 6 layers, 256 hidden dim, 8 attention heads
  - RoPE (Rotary Position Embedding)
  - SwiGLU activation
  - RMSNorm
  - Tied input/output embeddings
- Context length: 256 tokens
- Vocab size: 16,000 (SentencePiece BPE)
- Framework: JAX/Flax
- Training: TPU v3-8
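As a rough sanity check on the 10.4 million figure, the parameter count can be estimated from the dimensions listed above. This is only a sketch: the feed-forward width (`ffn_dim = 1024`, i.e. 4x the hidden size), the absence of bias terms, and the helper name `estimate_params` are assumptions, not values read from model_config.json.

```python
def estimate_params(vocab=16_000, d_model=256, n_layers=6, ffn_dim=1024):
    """Estimate parameters for a bias-free decoder-only transformer.

    ffn_dim is an assumed value (4 * d_model); check model_config.json.
    RoPE contributes no learned parameters, so it does not appear here.
    """
    embed = vocab * d_model           # shared by input and output (tied)
    attn = 4 * d_model * d_model      # Q, K, V and output projections
    swiglu = 3 * d_model * ffn_dim    # gate, up and down projections
    norms = 2 * d_model               # two RMSNorm scale vectors per layer
    final_norm = d_model              # RMSNorm before the output head
    return embed + n_layers * (attn + swiglu + norms) + final_norm


total = estimate_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate comes to 10,390,784, which is consistent with the 10.4 million listed. About 4.1 million of that is the tied embedding table.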
Training Data
~34 million tokens collected from:
- Wikipedia (space/astronomy articles via API)
- arXiv (astro-ph, gr-qc, hep-ph abstracts)
- NASA APIs (APOD, Image Library, NSSDCA planetary fact sheets)
- Comprehensive space knowledge base
Performance
- Best validation loss: 5.24
- Perplexity: 188
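The two numbers are related. If the validation loss is the mean per-token cross-entropy in nats (an assumption, though it is the usual convention), perplexity is its exponential. A minimal check:

```python
import math

val_loss = 5.24                   # best validation loss reported above
perplexity = math.exp(val_loss)   # assumes mean per-token cross-entropy in nats
print(f"perplexity = {perplexity:.1f}")
```

This gives roughly 188.7, which matches the reported 188 up to rounding.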
Files
- best.pkl - Best model checkpoint (Flax params, pickle format)
- final.pkl - Final model checkpoint
- model_config.json - Model architecture config
- space_tokenizer.model - SentencePiece tokenizer
- step_*.pkl - Intermediate checkpoints
Usage
import pickle, json
import jax, jax.numpy as jnp
from flax import linen as nn
import sentencepiece as spm
# Load config and weights
with open("model_config.json") as f:
    cfg = json.load(f)
with open("best.pkl", "rb") as f:
    params = pickle.load(f)["params"]
# Load tokenizer
sp = spm.SentencePieceProcessor(model_file="space_tokenizer.model")
# See https://github.com/korfalor-cloud/space-llm for full model code
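Once the model code from the repository is available, text generation reduces to a loop that repeatedly scores the next token. The following is a minimal greedy-decoding sketch, not the repository's implementation: `greedy_generate`, `next_token_logits`, and `toy_logits` are hypothetical names introduced here, and the stand-in scorer exists only so the example runs without the real weights.

```python
from typing import Callable, List, Optional, Sequence

CONTEXT_LEN = 256  # context length from Model Details


def greedy_generate(
    next_token_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    max_new_tokens: int = 32,
    eos_id: Optional[int] = None,
) -> List[int]:
    """Append the highest-scoring token until eos_id or the token budget.

    next_token_logits is a hypothetical callable: given token ids, it
    returns one score per vocabulary entry for the next position.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-CONTEXT_LEN:]   # the model only sees the last 256 tokens
        logits = next_token_logits(window)
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids


# Stand-in scorer (not the real model): always prefers (last token + 1)
# modulo a toy vocabulary of 10 entries.
def toy_logits(window: Sequence[int]) -> List[float]:
    target = (window[-1] + 1) % 10
    return [1.0 if i == target else 0.0 for i in range(10)]


print(greedy_generate(toy_logits, [3], max_new_tokens=4))
```

With the real model, `next_token_logits` would wrap the Flax module's forward pass using `params`, while `sp.encode` and `sp.decode` would convert between text and token ids.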
Source Code
Full training code: https://github.com/korfalor-cloud/space-llm
Limitations
This is a small (10M param) model trained on limited data (34M tokens). It learns space vocabulary and topic associations but does not produce fully coherent prose. It's an educational demonstration of training an LLM from scratch.