ghost-small-gen

A small (~45M parameter) decoder-only language model trained entirely from scratch in PyTorch, with no pretrained weights and no fine-tuning. It is the first generalist checkpoint in the GhostLM project: the training corpus broadened from cybersecurity-only to multi-domain, while cybersecurity remains the model's deepest specialty.

  • Parameters: ~45M (6 layers, d_model 512, RoPE + SwiGLU + RMSNorm)
  • Tokenizer: GPT-2 BPE (50,257) + 7 special tokens
  • Training: 30,000 steps on a Mac M4 (MPS), final val_loss 3.76
  • Corpus: a decontaminated 258.9M-token multi-domain corpus that is only 8.6% cybersecurity (general web, broad Wikipedia, code, math, instruction, and a cybersecurity layer), with 0.004% measured benchmark contamination
  • Recipe: intra-document attention masking + a multi-stage domain curriculum
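A few back-of-envelope numbers follow from the figures above, combined with the batch settings listed under Training details (batch 16, grad-accum 4, context 512). This is a rough sketch, not the project's own accounting. It assumes that val_loss is a mean per-token cross-entropy in nats, that the 8.6% share is measured in tokens, and that one "step" is one optimizer update; none of these is stated in this card.

```python
import math

# Figures quoted in this card.
VAL_LOSS = 3.76            # final validation loss
CORPUS_TOKENS = 258.9e6    # decontaminated corpus size in tokens
CYBER_SHARE = 0.086        # cybersecurity fraction of the corpus
STEPS = 30_000
BATCH, GRAD_ACCUM, CONTEXT = 16, 4, 512

# Assumption: val_loss is mean per-token cross-entropy in nats.
perplexity = math.exp(VAL_LOSS)

# Assumption: the 8.6% share is measured in tokens.
cyber_tokens = CORPUS_TOKENS * CYBER_SHARE

# Assumption: one step = one optimizer update over BATCH * GRAD_ACCUM sequences.
tokens_per_step = BATCH * GRAD_ACCUM * CONTEXT
tokens_seen = STEPS * tokens_per_step
passes = tokens_seen / CORPUS_TOKENS

print(f"perplexity      ~ {perplexity:.1f}")
print(f"cyber tokens    ~ {cyber_tokens / 1e6:.1f}M")
print(f"tokens per step = {tokens_per_step:,}")
print(f"tokens seen     ~ {tokens_seen / 1e6:.0f}M ({passes:.1f} passes over the corpus)")
```

Under those assumptions the validation perplexity is about 43, the cybersecurity layer is roughly 22M tokens, and training covers about 983M tokens, or close to four passes over the corpus. If a "step" instead means one micro-batch, the token total is four times smaller.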

Evaluation (honest, full benchmark sets)

Scoring is debiased multi-permutation text-scoring. In the "vs random" column, + means the lower bound of the 95% bootstrap CI is above the 25% random baseline (significantly above chance), and ~ means the CI includes 25%, so the result cannot be distinguished from chance. Peer numbers are published zero-shot references for the small-model class; harnesses differ, so treat them as context, not an exact comparison.

Benchmark         n     ghost-small-gen (45M)  95% CI (%)  vs random  Peer reference (%)
ARC-Easy          2365  27.2%                  25.4-28.9   +          Pythia-160M 43.5, 111M 34.8, 256M 37.6
ARC-Challenge     1165  24.3%                  22.1-26.6   ~          Pythia-160M 18.8, SmolLM2-360M 36.6
OpenBookQA        500   27.4%                  23.7-31.1   ~          111M 27.8, 256M 25.4, LaMini-35M 26.2
SecQA (cyber)     210   34.3%                  28.5-40.6   +          retained specialty
CTF eval (cyber)  30    63.3%                  46.7-80.0   +          retained specialty
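The confidence intervals and the "vs random" marker can be illustrated with a plain percentile bootstrap over per-item outcomes. This is a minimal sketch, not the GhostLM evaluation harness: the function names are hypothetical, the multi-permutation debiasing step is not modeled, and the per-item outcomes are synthetic (72 correct out of 210 reproduces the 34.3% SecQA point estimate, but the real per-item results are not given in this card).

```python
import random

def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean accuracy over per-item 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample items with replacement and record each resample's accuracy.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def vs_random(lo, baseline=0.25):
    """'+' if the CI lower bound clears the random baseline, else '~'."""
    return "+" if lo > baseline else "~"

# Synthetic outcomes: 72 correct of 210 items (34.3%), order does not matter.
outcomes = [1] * 72 + [0] * 138
lo, hi = bootstrap_accuracy_ci(outcomes)
print(f"accuracy {sum(outcomes) / len(outcomes):.1%}, "
      f"95% CI {lo:.1%}-{hi:.1%}, vs random: {vs_random(lo)}")
```

The exact bounds depend on the seed and the number of resamples, so they will not match the table digit for digit. The point is the decision rule: a benchmark earns + only when the whole interval sits above 25%.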

What this says, calibrated for a 45M model trained from scratch on free compute:

  • The generalist pivot worked. Three of five benchmarks are statistically above the 25% random baseline, on a corpus only 8.6% cybersecurity.
  • Cybersecurity is fully retained and is the standout (SecQA 34.3%, CTF 63.3%).
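  • It is competitive within its own size class on point estimates: OpenBookQA (27.4%) is above the 256M (25.4) and LaMini-35M (26.2) survey peers, and ARC-Challenge (24.3%) is above Pythia-160M (18.8), a roughly 3.5x larger model. Both of these results have CIs that include the 25% baseline, so they are not statistically significant wins.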
  • It does not clear the 35-45% competitive band on ARC-Easy (27.2%): above chance, not state of the art there.

This is a solid, defensible result for the size and the compute, not a "beats everything" claim.

Intended use and limitations

Research and education: a transparent, hand-written small model for studying from-scratch training, generalist corpus design, and cybersecurity-aware language modeling. It is a base model (not instruction-tuned), small, and will hallucinate. Do not use it for safety-critical decisions. The cybersecurity content is for defensive and educational understanding.

How to load

This is a custom architecture, not a transformers model. Load it with the GhostLM code:

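import json
from dataclasses import fields

from huggingface_hub import hf_hub_download
from safetensors.torch import load_model

from ghostlm.config import GhostLMConfig
from ghostlm.model import GhostLM

repo = "Ghostgim/ghost-small-gen"

# Build the config from config.json, keeping only the keys GhostLMConfig defines.
with open(hf_hub_download(repo, "config.json")) as fh:
    raw = json.load(fh)
known = {field.name for field in fields(GhostLMConfig)}
cfg = GhostLMConfig(**{k: v for k, v in raw.items() if k in known})

# Instantiate the architecture, then load the safetensors weights in place.
model = GhostLM(cfg)
load_model(model, hf_hub_download(repo, "model.safetensors"))
model.eval()  # inference mode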

See the GhostLM repository for the model code, tokenizer, generation utilities, and the full scorecard.

Training details

  • Base architecture: ghost-small-v0.5 (RoPE, SwiGLU, RMSNorm)
  • Context length: 512, batch 16, grad-accum 4, lr 3e-4, 2000-step warmup
  • Intra-document attention masking so packed documents do not attend across EOS boundaries
  • Multi-stage domain curriculum that shifts the data mixture across training (broad web early, code/math/knowledge upweighted later)
  • Corpus decontaminated against every evaluation benchmark before training
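The intra-document masking described above can be sketched as a causal mask that is additionally restricted to positions within the same packed document. This is an illustrative sketch in plain Python, not the GhostLM implementation: the function name is hypothetical, the EOS id of 0 is a placeholder, and treating the EOS token as the last token of the document it closes is an assumption.

```python
def intra_document_causal_mask(token_ids, eos_id):
    """Return mask[i][j] = True when position i may attend to position j."""
    # Assign each position a document index; the EOS token closes its document.
    doc_ids, doc = [], 0
    for tok in token_ids:
        doc_ids.append(doc)
        if tok == eos_id:
            doc += 1
    n = len(token_ids)
    # Combine the causal constraint (j <= i) with the same-document constraint.
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
            for i in range(n)]

# Two packed documents: [5, 6, EOS] and [7, 8]; EOS id 0 is a placeholder.
mask = intra_document_causal_mask([5, 6, 0, 7, 8], eos_id=0)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is block-diagonal and lower-triangular: the first three rows are `x....`, `xx...`, `xxx..`, and the last two are `...x.` and `...xx`, so the second document never sees the first. In PyTorch, a boolean mask with this meaning (True = may attend) can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`.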

License

MIT. Built and trained by Joe Munene.
