ghost-small-gen

A small (~45M parameter) decoder-only language model trained entirely from scratch in PyTorch, with no pretrained weights and no fine-tuning. It is the first generalist checkpoint in the GhostLM project: the model broadened from a cybersecurity-only corpus into a small generalist while keeping cybersecurity as its deepest specialty.

Parameters: ~45M (6 layers, d_model 512, RoPE + SwiGLU + RMSNorm)
Tokenizer: GPT-2 BPE (50,257) + 7 special tokens
Training: 30,000 steps on a Mac M4 (MPS), final val_loss 3.76
Corpus: a decontaminated 258.9M-token multi-domain corpus that is only 8.6% cybersecurity (general web, broad Wikipedia, code, math, instruction, and a cybersecurity layer), with 0.004% measured benchmark contamination
Recipe: intra-document attention masking + a multi-stage domain curriculum
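
For a rough sense of scale, the step count above can be combined with the batch settings listed under Training details (context length 512, batch 16, grad-accum 4) to estimate how many tokens the model saw. This is a back-of-the-envelope sketch, not a figure reported by the training run. It assumes that "batch 16" counts sequences per micro-batch, that every sequence is packed to the full context length, and that a "step" is one optimizer step.

```python
# Figures taken from this card.
STEPS = 30_000
CONTEXT_LENGTH = 512
BATCH_SIZE = 16          # assumption: sequences per micro-batch
GRAD_ACCUM = 4           # micro-batches accumulated per optimizer step
CORPUS_TOKENS = 258_900_000


def training_token_budget(steps, context_length, batch_size, grad_accum):
    """Return (tokens per optimizer step, total tokens seen)."""
    tokens_per_step = context_length * batch_size * grad_accum
    return tokens_per_step, tokens_per_step * steps


tokens_per_step, total_tokens = training_token_budget(
    STEPS, CONTEXT_LENGTH, BATCH_SIZE, GRAD_ACCUM
)
# Assumption: full-length packed sequences, so this is an upper bound.
passes = total_tokens / CORPUS_TOKENS

print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"total tokens seen: {total_tokens:,}")
print(f"approximate passes over the corpus: {passes:.2f}")
```

Under these assumptions the run sees roughly 983M tokens, a little under four passes over the 258.9M-token corpus.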

Evaluation (honest, full benchmark sets)

Scores use debiased multi-permutation text scoring. + means the lower bound of the 95% bootstrap CI is above the 25% random baseline (significantly above chance); ~ means the CI includes 25% (not distinguishable from chance). Peer numbers are published zero-shot references for the small-model class, given in percent; harnesses differ, so treat them as context, not an exact comparison.

Benchmark	n	ghost-small-gen (45M)	95% CI	vs random	Peer reference
ARC-Easy	2365	27.2%	25.4-28.9	+	Pythia-160M 43.5, 111M 34.8, 256M 37.6
ARC-Challenge	1165	24.3%	22.1-26.6	~	Pythia-160M 18.8, SmolLM2-360M 36.6
OpenBookQA	500	27.4%	23.7-31.1	~	111M 27.8, 256M 25.4, LaMini-35M 26.2
SecQA (cyber)	210	34.3%	28.5-40.6	+	retained specialty
CTF eval (cyber)	30	63.3%	46.7-80.0	+	retained specialty
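
The "vs random" column can be illustrated with a percentile bootstrap over per-item correctness. The sketch below is a minimal illustration of that idea, not the GhostLM evaluation harness: the function names are hypothetical, the resampling scheme is a plain percentile bootstrap, and the outcome vector is synthetic (72 correct out of 210, chosen only to match the n and accuracy of the SecQA row). It does not include the multi-permutation debiasing step, so its interval need not match the table.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample n items with replacement and record the accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lo, hi


def marker(ci_low, baseline=0.25):
    """'+' if the CI lower bound clears the random baseline, else '~'."""
    return "+" if ci_low > baseline else "~"


# Synthetic outcomes, NOT the real per-item results.
outcomes = [1] * 72 + [0] * 138
acc, lo, hi = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={acc:.3f}  95% CI=({lo:.3f}, {hi:.3f})  vs random: {marker(lo)}")
```

The same check explains why a small set such as the 30-item CTF eval carries a wide interval: with few items, resampled accuracies vary a lot from draw to draw.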

What this says, calibrated for a 45M model trained from scratch on free compute:

The generalist pivot worked. Three of five benchmarks are statistically above the 25% random baseline, on a corpus only 8.6% cybersecurity.
Cybersecurity is fully retained and is the standout (SecQA 34.3%, CTF 63.3%).
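It is competitive within its own size class on point estimates: OpenBookQA (27.4%) is above the 256M (25.4%) and LaMini-35M (26.2%) survey peers, and ARC-Challenge (24.3%) is above Pythia-160M (18.8%), a roughly 3.5x larger model. Both scores are statistically indistinguishable from the 25% random baseline, so read these as parity with peers rather than a clear win.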
It does not clear the 35-45% competitive band on ARC-Easy (27.2%): above chance, not state of the art there.

This is a solid, defensible result for the size and the compute, not a "beats everything" claim.

Intended use and limitations

Research and education: a transparent, hand-written small model for studying from-scratch training, generalist corpus design, and cybersecurity-aware language modeling. It is a base model (not instruction-tuned), small, and will hallucinate. Do not use it for safety-critical decisions. The cybersecurity content is for defensive and educational understanding.

How to load

This is a custom architecture, not a transformers model. Load it with the GhostLM code:

import json

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_model

from ghostlm.config import GhostLMConfig
from ghostlm.model import GhostLM

repo = "Ghostgim/ghost-small-gen"

# Read the published config and keep only the keys GhostLMConfig defines.
with open(hf_hub_download(repo, "config.json")) as f:
    raw = json.load(f)
fields = GhostLMConfig.__dataclass_fields__
cfg = GhostLMConfig(**{k: v for k, v in raw.items() if k in fields})

# Build the model, load the safetensors weights, and switch to inference mode.
model = GhostLM(cfg)
load_model(model, hf_hub_download(repo, "model.safetensors"))
model.eval()

See the GhostLM repository for the model code, tokenizer, generation utilities, and the full scorecard.

Training details

Base architecture: ghost-small-v0.5 (RoPE, SwiGLU, RMSNorm)
Context length: 512, batch 16, grad-accum 4, lr 3e-4, 2000-step warmup
Intra-document attention masking so packed documents do not attend across EOS boundaries
Multi-stage domain curriculum that shifts the data mixture across training (broad web early, code/math/knowledge upweighted later)
Corpus decontaminated against every evaluation benchmark before training
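
The intra-document masking idea above can be sketched in plain Python. This is an illustrative construction, not the GhostLM implementation: the function name, the EOS id of 0, and the choice to treat EOS as the last token of its document are all assumptions made for the example. In a real model the resulting boolean pattern would be combined with the attention scores so that disallowed positions receive no weight.

```python
def intra_document_causal_mask(token_ids, eos_id):
    """Return mask[i][j] == True when position i may attend to position j.

    A position may attend only to itself and to earlier positions in the
    same document. EOS is treated as the last token of its document.
    """
    # Assign a document index to every position in the packed sequence.
    doc_ids = []
    current = 0
    for tok in token_ids:
        doc_ids.append(current)
        if tok == eos_id:
            current += 1  # the next token starts a new document

    n = len(token_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Two short documents packed together; 0 is a hypothetical EOS id.
packed = [11, 12, 0, 21, 22, 23]
mask = intra_document_causal_mask(packed, eos_id=0)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is two lower-triangular blocks, one per document: the second document's tokens never see the first document, which is the difference from an ordinary causal mask over the whole packed sequence.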

License

MIT. Built and trained by Joe Munene.

Safetensors: 45M params, tensor type F16