# GhostLM v0.9 chat (81M, from-scratch cybersecurity LM)
GhostLM is a multi-rung scale-ladder cybersecurity language model trained entirely from scratch in PyTorch. v0.9 chat is the bench winner of the ghost-small (45-81M) line on every multiple-choice benchmark we evaluated. It is also where the line saturates: at 81M parameters the model has the register of cybersec writing but not the facts in any retrievable form. The next rung is ghost-base (~360M, SmolLM2-360M shape), gated on rented GPU compute.
This repo holds the slim inference checkpoint (`best_model.pt`, 324 MB, model + config only, optimizer state stripped).
## v0.9.5 update (2026-05-08): nine differentiation bets, 1,505 templated SFT records ready
The strategic frame went from "six bets, three measured" (v0.9.4) to "nine bets, all shipped, 1,505 deterministic SFT records ready for the v1.0 GPU run." The new bets answer "what would make GhostLM exceptional, beyond what general-purpose small LMs offer?"
Strategic frame: docs/differentiation.md.
| Bet | Status | Result |
|---|---|---|
| 1. Tool-grounded SFT | training data ready | 424 templated traces, 98.6% acceptance under trace_quality_ok; ~10% "not found" injection trains lookup-failure acknowledgement. tool_use_synth.md |
| 2. Daily LoRA over fresh threat-intel | scaffolded | scripts/daily_finetune.py, ~1-2 GPU hr/day |
| 3. Custom 32K BPE | measured + settled | +4.0% on cyber, -2.5% on general vs GPT-2 BPE; +25-35% projection falsified. bpe_corpus_ablation.md |
| 4. Long context via RoPE NTK | scaffolded | scripts/extend_context_ntk.py, ~3-5 GPU hr |
| 5. MoE for ghost-1B+ | smoke validated | 100-step training PASS. moe_training_smoke.md; presets ghost-1b (2.1B/1.2B-active) and ghost-3b (6.0B/3.3B-active) |
| 6. Format-aware pretrain (STIX/YARA/Sigma/MISP) | measured baseline + training data ready | v0.9 baseline locked at 0/32 = 0% [Wilson 95% CI 0.0-10.7]. 560 templated records ready. format_baseline_v09.md, format_synth.md |
| 7. Code-for-security | NEW, training data ready | 12-pattern bank covering OWASP-Top-10 CWE classes (Python/JS/C); 48 records, 100% pass. code_security_synth.md |
| 8. Binary / hex literacy | NEW, training data ready, most novel bet | 15-pattern bank: PE/ELF/Mach-O/ZIP/PDF/OLE2/PNG file magic, UPX/Themida packers, NOP sleds + x64 syscall, PE Optional Header Magic + Machine, x64 execve('/bin/sh') shellcode; 44 records, 100% pass. No other small cybersec LM does this. binary_literacy_synth.md |
| 9. Provenance / cite tags | NEW, training data ready | 429 cite-augmented tool-use traces with <\|cite\|>{source_type}:{id}#field<\|/cite\|> inline in the answer; 99.8% acceptance under trace_with_cites_quality_ok. Stacks on bet 1 for ~853-record SFT corpus. provenance_synth.md |
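For concreteness, the bet-9 tag can be checked mechanically. A minimal sketch, assuming only the tag shape shown in the table; the real acceptance gate is trace_with_cites_quality_ok in the repo, and the regex, charsets, and example values here are illustrative:

```python
import re

# Matches the inline provenance tag from bet 9, e.g.
# <|cite|>cve:CVE-2021-44228#description<|/cite|>
# Character-class choices are assumptions, not the repo's actual rules.
CITE_RE = re.compile(r"<\|cite\|>(\w+):([^#<]+)#([\w.]+)<\|/cite\|>")

def extract_cites(answer: str) -> list[tuple[str, str, str]]:
    """Return (source_type, id, field) triples found in a model answer."""
    return CITE_RE.findall(answer)

print(extract_cites(
    "Log4Shell is tracked as CVE-2021-44228 "
    "<|cite|>cve:CVE-2021-44228#description<|/cite|>."
))
# [('cve', 'CVE-2021-44228', 'description')]
```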
### Combined templated-synth corpus
| Bet | Records | Acceptance |
|---|---|---|
| 1 (tool-use, plain) | 424 | 98.6% |
| 6 (STIX / YARA / Sigma / MISP) | 560 | 99.8% |
| 7 (code-for-security) | 48 | 100.0% |
| 8 (binary / hex literacy) | 44 | 100.0% |
| 9 (cite-augmented tool-use) | 429 | 99.8% |
| TOTAL | 1,505 | 99.4% |
That's the deterministic floor. LLM-distilled records on top (bet 1 production at ~$200, bet 6 production at ~$50-100 on Anthropic) bring the realistic ghost-base SFT mix to ~10K records for a few hundred dollars, with no GPU spend until the actual pretrain run.
The v0.9 chat checkpoint in this repo is unchanged; it's the baseline against which all bet measurements are made.
## Bench numbers
All benches run with debiased multi-permutation text-scoring on checkpointed CPU/GPU inference. Methodology in docs/ctibench_bias_finding.md.
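In outline, "debiased multi-permutation text-scoring" means each MCQ is scored under more than one ordering of the options and per-permutation accuracies are averaged, cancelling position bias. A minimal sketch, assuming a `model_logprob(prompt, continuation)` helper that is not part of this repo's public API (see the methodology doc for the real harness):

```python
def mcq_accuracy_debiased(model_logprob, question, options, gold_idx):
    # Score the question under two option orderings (original and
    # reversed) and average per-permutation correctness; a model that
    # always prefers position "A" scores ~random instead of inheriting
    # the bias.
    n = len(options)
    perms = [list(range(n)), list(reversed(range(n)))]
    hits = 0
    for perm in perms:
        prompt = question + "\n" + "\n".join(
            f"{chr(65 + i)}. {options[j]}" for i, j in enumerate(perm)
        )
        scores = [model_logprob(prompt, options[j]) for j in perm]
        if perm[scores.index(max(scores))] == gold_idx:
            hits += 1
    return hits / len(perms)
```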
| Benchmark | Records | v0.4 chat-v3 | v0.7 chat | v0.9 chat | Random |
|---|---|---|---|---|---|
| CTIBench MCQ (full split) | 2,500 | 27.6% | 27.2% | 28.9% | 25.0% |
| In-repo CTF MCQ eval | 30 | 50.0% | 50.0% | 59.2% | 25.0% |
| SecQA (external, n=210) | 210 | 35.0% | 37.6% | 39.3% | 25.0% |
| Free-form fact recall | 50 | 0/50 | 1/50 | 1/50 | 0/50 |
v0.9 wins every multiple-choice benchmark, by 1.3-9.2 pp over the next-best ghost-small checkpoint. The MCQ ranking holds across CTIBench, the in-repo CTF eval, and the external SecQA bench.
But free-form fact recall is at floor across the entire 81M ghost-small rung. A 50-question hand-written fact-recall set (CVE / CWE / MITRE / OWASP / crypto / protocol / misc), graded by substring match, scores 0-2% across every chat-tune in the line. The v0.9 model's one "hit" ("256" appearing in a SHA-256 question) is arguably spurious. MCQ wins measure register matching and topic distinctness, not factual recall.
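A minimal sketch of substring-match grading, and why that one hit is suspect (the in-repo grader may normalize differently):

```python
def substring_hit(reply: str, expected: str) -> bool:
    # A reply counts as a hit if the expected fact string appears
    # anywhere in it, case-insensitively.
    return expected.strip().lower() in reply.lower()

# The lone v0.9 "hit": any reply that merely echoes the algorithm name
# "SHA-256" contains the expected answer "256" as a substring.
print(substring_hit("SHA-256 is a cryptographic hash function.", "256"))  # True
```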
## Architecture
| Field | Value |
|---|---|
| Layers | 6 |
| d_model | 768 |
| Attention heads | 12 (head_dim 64) |
| FFN | SwiGLU, hidden = int(d_ff × 2/3) with d_ff = 4 × d_model, rounded to a multiple of 64 → 2048 |
| Normalization | RMSNorm |
| Position | RoPE (base 10000) |
| Vocab | 50,264 (GPT-2 50K BPE + 7 special tokens) |
| Context | 512 train, 1024 inference |
| Total params | ~81M |
Same architecture as ghost-small-v0.7. The 273M-token v0.9 corpus is what produces the bench delta over v0.7.
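The table maps directly onto a config object. An illustrative mirror with assumed field names (the real dataclass is `ghostlm.config.GhostLMConfig`, whose fields may differ):

```python
from dataclasses import dataclass

@dataclass
class GhostSmallShape:
    # Field names here are illustrative, not GhostLMConfig's actual ones.
    n_layers: int = 6
    d_model: int = 768
    n_heads: int = 12            # head_dim = 768 / 12 = 64
    d_ff: int = 3072             # 4 * d_model
    ffn_hidden: int = 2048       # SwiGLU: int(3072 * 2 / 3) = 2048, a multiple of 64
    rope_base: float = 10000.0
    vocab_size: int = 50_264     # 50,257 GPT-2 BPE tokens + 7 special tokens
    train_context: int = 512
    inference_context: int = 1024
```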
## Training data

Pretrain corpus: 273M tokens spanning:
- PRIMUS-Seed (Trend Micro AI Lab, Apache 2.0): curated cybersec text
- PRIMUS-FineWeb (Trend Micro AI Lab, ODC-By): TinyBERT-filtered cybersec subset of CommonCrawl
- NVD CVEs (NIST, public domain): full v2 description text
- MITRE ATT&CK + CWE + CAPEC (MITRE, custom permissive): technique / weakness / pattern descriptions
- OWASP (Top 10, ASVS, Cheat Sheets, WSTG; CC-BY-SA): web-app security guidance
- IETF RFCs (BCP 78, public): security-relevant RFCs
- CTFtime + Exploit-DB (open): real CTF write-ups and exploit POCs
- arXiv cs.CR: full-text academic papers
- fact-QA: ~11K Q&A pairs distilled by Qwen-14B from the corpus
Per-source breakdown in CORPUS.md.
Chat-tuning: SFT on 1,802 MCQ + small-talk + identity examples using the chat-v3 recipe. Three role tokens (`<|ghost_user|>`, `<|ghost_assistant|>`, `<|ghost_end|>`) added to the tokenizer.
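The exact serialization lives in `GhostTokenizer.format_chat_prompt`; a plausible layout, assumed here purely for illustration:

```python
# Assumed wire format for a single user turn awaiting a reply
# (the authoritative template is in ghostlm/tokenizer.py):
prompt = (
    "<|ghost_user|>What is XSS?<|ghost_end|>"
    "<|ghost_assistant|>"
)
# Generation then runs until the model emits <|ghost_end|>.
```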
## Intended use
- Educational: a transparent, hand-written reference implementation of a from-scratch decoder-only cybersecurity LM, trained on a curated corpus, with all code on GitHub and all recipes documented.
- Research: a bench artifact for "what does an 81M from-scratch cyber LM actually score on CTIBench / SecQA?" The honest answer (28.9% / 39.3%) is meaningful evidence about the parameter-count requirement for factual recall on cybersec MCQ.
## What this model is NOT for
- Anything that depends on factual recall. Free-form fact recall is at floor. CVE numbers, version chains, MITRE technique IDs, CVSS scores produced by this model are unreliable. Verify against authoritative sources.
- General-purpose tasks. Outside cybersecurity the model politely declines and returns to its domain. Do not expect it to summarize news, write code, or answer arbitrary questions.
- Production cybersec workflows. Not for incident response, threat hunting, or any decision that affects real systems.
## Loading
```python
import torch
from huggingface_hub import hf_hub_download
from ghostlm.config import GhostLMConfig
from ghostlm.model import GhostLM
from ghostlm.tokenizer import GhostTokenizer
from dataclasses import fields

# Pull weights
ckpt_path = hf_hub_download(
    repo_id="Ghostgim/GhostLM-v0.9-experimental",
    filename="best_model.pt",
)

# Load the checkpoint and rebuild the config from the saved dict,
# keeping only fields GhostLMConfig actually declares
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
saved = ckpt["config"]
config = GhostLMConfig(**{
    f.name: saved[f.name] for f in fields(GhostLMConfig) if f.name in saved
})
model = GhostLM(config)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = GhostTokenizer()  # GPT-2 BPE + 7 special tokens

# Multi-turn chat using the role tokens
turns = [{"role": "user", "content": "What is XSS?"}]
prompt_ids = tokenizer.format_chat_prompt(turns)
# ... see scripts/chat.py for full generation loop
```
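The snippet stops before generation; scripts/chat.py has the real loop. A minimal greedy sketch under stated assumptions: the model's forward returns next-token logits of shape (batch, seq, vocab), and the tokenizer exposes `decode` plus an end-of-turn token id (names guessed, not confirmed by the repo):

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt_ids, max_new_tokens=200, end_id=None):
    # Greedy decoding sketch; scripts/chat.py is the authoritative loop
    # and may sample instead. `end_id` would be the <|ghost_end|> token.
    ids = torch.tensor([prompt_ids], dtype=torch.long)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -1024:])         # stay within the 1024-token inference context
        next_id = int(logits[0, -1].argmax())  # pick the highest-probability next token
        if end_id is not None and next_id == end_id:
            break
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
    return tokenizer.decode(ids[0].tolist())
```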
The full code (architecture, tokenizer, generation, eval, training) is in the GhostLM GitHub repo.
## Live demo
huggingface.co/spaces/Ghostgim/ghostlm
Pulls these weights via hf_hub_download on first launch. CPU inference takes ~15-25 s per reply at the default 200-token cap. The demo is intentionally honest about the fact-recall floor; expect register-shaped output rather than reliable answers.
## Caveats
- Hallucination is the norm, not the exception. This is an 81M from-scratch model, not a fine-tuned 7B foundation model.
- MCQ wins do not imply factual recall. Test with the free-form fact-recall benchmark, not just CTIBench.
- Pretrain corpus is sub-Chinchilla. 273M tokens for 81M params is roughly 6× under the ~20-tokens-per-parameter Chinchilla heuristic; the chat tune partially compensates, but the model is undertrained relative to its capacity (arithmetic below).
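A two-line check of that last claim, using the common ~20-tokens-per-parameter reading of Chinchilla:

```python
params = 81e6
chinchilla_tokens = 20 * params           # heuristic optimum: ~1.62e9 tokens
print(chinchilla_tokens / 273e6)          # ≈ 5.9, i.e. ~6x under-trained
```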
## Citation
```bibtex
@misc{munene2026ghostlm,
  title        = {GhostLM: a from-scratch cybersecurity language model on a transparent scale ladder},
  author       = {Munene, Joe},
  year         = {2026},
  howpublished = {\url{https://github.com/joemunene-by/GhostLM}},
  note         = {v0.9.5 release; 81M-parameter chat checkpoint plus nine differentiation bets, 1505 templated SFT records}
}
```
## Roadmap
The next rung is ghost-base (~360M, SmolLM2-360M shape), gated on rented GPU compute. Acceptance gate:
- ≥40% on debiased CTIBench (full n=2500), OR
- ≥65% on the in-repo CTF MCQ eval, OR
- ≥30% on the 50-question free-form fact-recall set.
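Expressed as a check (thresholds from the list above; the function name is illustrative):

```python
def ghost_base_gate(ctibench: float, ctf: float, fact_recall: float) -> bool:
    # Any single bar clearing its threshold accepts the rung.
    return ctibench >= 0.40 or ctf >= 0.65 or fact_recall >= 0.30

# v0.9 chat for reference: fails all three bars.
print(ghost_base_gate(0.289, 0.592, 0.02))  # False
```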
The fact-recall bar is the truth metric. Spec at docs/ghost_base_spec.md; multi-year pathway through ghost-7B in docs/hardware_pathway.md.
After ghost-base lands, the differentiation bets compose on top of it: tool-use SFT (bet 1) on the fresh ghost-base, format-aware pretrain mix (bet 6) using the 560 templated records plus LLM-distilled traces, RoPE NTK context extension to 16K (bet 4), and eventually ghost-1B with native MoE from step 0 (bet 5). Sequencing detail in docs/differentiation.md.
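For bet 4, the usual NTK-aware trick rescales the RoPE base rather than interpolating positions. A sketch of the standard formula; whether scripts/extend_context_ntk.py uses exactly this variant is an assumption:

```python
def ntk_rope_base(base: float, scale: float, head_dim: int) -> float:
    # NTK-aware RoPE scaling: raise the base so high-frequency
    # dimensions are barely touched while low frequencies stretch
    # to cover the longer window.
    return base * scale ** (head_dim / (head_dim - 2))

# Extending the 1024-token inference window to 16K (scale = 16),
# with this model's head_dim of 64:
print(ntk_rope_base(10000.0, 16.0, 64))  # ≈ 175,000
```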
## License
Apache 2.0. Same license as the GhostLM source code.
Built by Joe Munene.