ghost-small-gen
A small (~45M parameter) decoder-only language model trained entirely from scratch in PyTorch, with no pretrained weights and no fine-tuning. It is the first generalist checkpoint in the GhostLM project: the training corpus broadened from cybersecurity-only to multi-domain, while cybersecurity remains the model's deepest specialty.
- Parameters: ~45M (6 layers, d_model 512, RoPE + SwiGLU + RMSNorm)
- Tokenizer: GPT-2 BPE (50,257) + 7 special tokens
- Training: 30,000 steps on a Mac M4 (MPS), final val_loss 3.76
- Corpus: a decontaminated 258.9M-token multi-domain corpus that is only 8.6% cybersecurity (general web, broad Wikipedia, code, math, instruction, and a cybersecurity layer), with 0.004% measured benchmark contamination
- Recipe: intra-document attention masking + a multi-stage domain curriculum
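As a rough sanity check on the scale of this run, the token budget can be estimated from the step count and corpus size above together with the batch size, gradient accumulation, and context length listed under Training details. This is a back-of-the-envelope sketch: it assumes every sequence is packed to the full 512 tokens and that "batch 16, grad-accum 4" means 64 sequences per optimizer step, neither of which the card states explicitly.

```python
# Values copied from this model card.
batch_size = 16            # sequences per micro-batch
grad_accum = 4             # micro-batches per optimizer step
context_len = 512          # tokens per sequence (assumed fully packed)
steps = 30_000             # optimizer steps
corpus_tokens = 258.9e6    # tokens in the training corpus

# Tokens consumed per optimizer step and over the whole run.
tokens_per_step = batch_size * grad_accum * context_len
total_tokens = tokens_per_step * steps

# How many times the run would cycle through the corpus under these assumptions.
passes = total_tokens / corpus_tokens

print(f"{tokens_per_step:,} tokens per optimizer step")  # 32,768
print(f"{total_tokens:,} tokens seen in total")          # 983,040,000
print(f"about {passes:.1f} passes over the corpus")      # about 3.8
```

Under these assumptions the model sees a little under one billion tokens, or just under four passes over the corpus. A curriculum that reweights domains would make the per-domain pass count uneven.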
Evaluation (honest, full benchmark sets)
Scores use debiased multi-permutation text scoring. In the "vs random" column, + means the 95% bootstrap CI lower bound is above the 25% random baseline (significantly above chance), and ~ means the CI includes the baseline (not distinguishable from chance). Peer numbers are published zero-shot references for the small-model class; harnesses differ, so treat them as context, not an exact comparison.
| Benchmark | n | ghost-small-gen (45M) | 95% CI | vs random | Peer reference |
|---|---|---|---|---|---|
| ARC-Easy | 2365 | 27.2% | 25.4-28.9 | + | Pythia-160M 43.5, 111M 34.8, 256M 37.6 |
| ARC-Challenge | 1165 | 24.3% | 22.1-26.6 | ~ | Pythia-160M 18.8, SmolLM2-360M 36.6 |
| OpenBookQA | 500 | 27.4% | 23.7-31.1 | ~ | 111M 27.8, 256M 25.4, LaMini-35M 26.2 |
| SecQA (cyber) | 210 | 34.3% | 28.5-40.6 | + | retained specialty |
| CTF eval (cyber) | 30 | 63.3% | 46.7-80.0 | + | retained specialty |
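The "95% CI" and "vs random" columns can be illustrated with a percentile bootstrap over per-question outcomes. The sketch below is one common way to compute such an interval, not necessarily the exact procedure used for this scorecard. The function names, the resample count, and the synthetic outcome list are all assumptions made for illustration; the outcomes are chosen only to match the reported 34.3% on n = 210 and are not the real per-item results.

```python
import random

def bootstrap_accuracy_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the questions with replacement and record each resample's accuracy.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    k = int(n_resamples * alpha / 2)  # number of resamples cut from each tail
    return sum(outcomes) / n, means[k], means[n_resamples - 1 - k]

def verdict(ci_low, baseline=0.25):
    """'+' if the CI lower bound clears the random baseline, otherwise '~'."""
    return "+" if ci_low > baseline else "~"

# Synthetic outcomes (hypothetical): 72 correct out of 210, about 34.3%.
outcomes = [1] * 72 + [0] * 138
acc, lo, hi = bootstrap_accuracy_ci(outcomes)
print(f"accuracy {acc:.1%}, 95% CI {lo:.1%} to {hi:.1%}, vs random: {verdict(lo)}")
```

The small CTF set (n = 30) shows why the interval matters: with so few questions the CI is wide even when the point estimate is high.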
What this says, calibrated for a 45M model trained from scratch on free compute:
- The generalist pivot worked. Three of five benchmarks (ARC-Easy, SecQA, CTF eval) are significantly above the 25% random baseline, on a corpus that is only 8.6% cybersecurity.
- Cybersecurity is fully retained and is the standout (SecQA 34.3%, CTF 63.3%).
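- It is competitive within its own size class: on OpenBookQA, 27.4% is ahead of the 256M (25.4) and LaMini-35M (26.2) references, and on ARC-Challenge, 24.3% is ahead of Pythia-160M (18.8), a roughly 3.5x larger model. Both scores have CIs that include the 25% baseline, so read these leads as indicative.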
- It does not clear the 35-45% competitive band on ARC-Easy (27.2%): above chance, not state of the art there.
This is a solid, defensible result for the size and the compute, not a "beats everything" claim.
Intended use and limitations
Research and education: a transparent, hand-written small model for studying from-scratch training, generalist corpus design, and cybersecurity-aware language modeling. It is a base model (not instruction-tuned), small, and will hallucinate. Do not use it for safety-critical decisions. The cybersecurity content is for defensive and educational understanding.
How to load
This is a custom architecture, not a transformers model. Load it with the
GhostLM code:
import json
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_model
from ghostlm.config import GhostLMConfig
from ghostlm.model import GhostLM
repo = "Ghostgim/ghost-small-gen"
# Read config.json and keep only the keys that GhostLMConfig defines.
with open(hf_hub_download(repo, "config.json")) as f:
    raw = json.load(f)
known = GhostLMConfig.__dataclass_fields__
cfg = GhostLMConfig(**{k: v for k, v in raw.items() if k in known})
# Build the model, load the safetensors weights, and switch to inference mode.
model = GhostLM(cfg)
load_model(model, hf_hub_download(repo, "model.safetensors"))
model.eval()
See the GhostLM repository for the model code, tokenizer, generation utilities, and the full scorecard.
Training details
- Base architecture: ghost-small-v0.5 (RoPE, SwiGLU, RMSNorm)
- Context length 512, batch size 16, gradient accumulation 4, learning rate 3e-4, 2,000-step warmup
- Intra-document attention masking so packed documents do not attend across EOS boundaries
- Multi-stage domain curriculum that shifts the data mixture across training (broad web early, code/math/knowledge upweighted later)
- Corpus decontaminated against every evaluation benchmark before training
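The intra-document attention masking listed above can be sketched as follows. This is a minimal pure-Python illustration of the idea, not the GhostLM implementation: the function name and the EOS id are hypothetical, and it assumes each EOS token belongs to the document it terminates. In a PyTorch model the same pattern would typically be built as a boolean tensor and passed to attention as a mask.

```python
def intra_document_mask(token_ids, eos_id):
    """Return mask[i][j] == True when position i may attend to position j.

    Attention is causal (j <= i) and restricted to tokens of the same packed
    document, so nothing attends across an EOS boundary.
    """
    # Assign each position the index of the document it belongs to.
    doc_index = []
    current = 0
    for tok in token_ids:
        doc_index.append(current)
        if tok == eos_id:  # assumption: EOS closes the current document
            current += 1
    n = len(token_ids)
    return [
        [j <= i and doc_index[i] == doc_index[j] for j in range(n)]
        for i in range(n)
    ]

EOS = 0  # hypothetical EOS id for this toy example
packed = [5, 6, EOS, 7, 8]  # two documents packed into one sequence
mask = intra_document_mask(packed, EOS)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# ...x.
# ...xx
```

The printed pattern is block-diagonal and lower-triangular: the second document (positions 3 and 4) never sees the first, which is what separates this from a plain causal mask over the packed sequence.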
License
MIT. Built and trained by Joe Munene.