---
license: mit
language:
- en
tags:
- cybersecurity
- security
- cti
- mitre-attack
- cve
- chat
- from-scratch
- small-lm
library_name: pytorch
pipeline_tag: text-generation
model-index:
- name: GhostLM ghost-small chat-v3
results:
- task:
type: text-classification
name: Multiple-choice cyber-LLM benchmark
dataset:
type: AI4Sec/cti-bench
name: CTIBench MCQ
config: cti-mcq
split: test
metrics:
- type: accuracy
value: 0.369
name: accuracy (chat-v3)
- type: accuracy
value: 0.190
name: accuracy (chat-v2)
- type: accuracy
value: 0.178
name: accuracy (pretrain only, no chat)
---
# GhostLM
A small cybersecurity language model trained from scratch. Not a fine-tune
of an existing base — every parameter learned on a curated security
corpus. Currently shipping the v0.5.0 chat-tuned variant of `ghost-small`.
- **Repository:** [github.com/joemunene-by/GhostLM](https://github.com/joemunene-by/GhostLM)
- **Live demo:** [Ghostgim/ghostlm Space](https://huggingface.co/spaces/Ghostgim/ghostlm)
- **License:** MIT
## What this model is
`ghost-small` chat-v3 is a 45M-parameter decoder-only transformer trained
from random initialization on **12.56M tokens of cybersecurity text** —
NVD CVE descriptions, MITRE ATT&CK techniques, MITRE CAPEC patterns,
Exploit-DB entries, CTFtime writeups, arXiv cs.CR security research, plus
a small synthetic CTF-writeup augmentation. After 30,000 steps of
pretraining (`Phase 4`), it was supervised-fine-tuned for chat with a
mix of free-form Q&A, hand-written small-talk / identity / OOD-refusal
pairs, and templated MCQ examples (NVD CWE-class, MITRE tactic, common
acronyms). The chat tune adds three new role tokens
(`<|ghost_user|>`, `<|ghost_assistant|>`, `<|ghost_end|>`), appended after
the GPT-2 BPE vocabulary and its existing special tokens (50,261 → 50,264).
## Why a 45M from-scratch model
A 45M model is too small to be a general-purpose assistant. The thesis is
specialization: a focused security corpus + targeted SFT can match or
beat much larger general models on narrow security tasks at a fraction
of the size, while running on a laptop CPU. CTIBench results below are
the test of that thesis.
## Architecture
| Component | Value |
|---|---|
| Type | Decoder-only Transformer (GPT-2 family) |
| Layers | 6 |
| Hidden dim (`d_model`) | 512 |
| Heads | 8 (head dim 64) |
| FFN dim (`d_ff`) | 2048, GELU |
| Norm | LayerNorm, pre-norm |
| Positional encoding | Learned absolute |
| Vocabulary | 50,264 (GPT-2 BPE + 7 specials, incl. 3 chat-role tokens) |
| Context length | 1024 |
| Total params | ~45.2M |
| Tied input/output embeddings | yes |
The v0.5 architecture upgrade (RoPE / SwiGLU / RMSNorm) is wired in the
codebase but disabled by default for backward compatibility with this
checkpoint. A `ghost-small-v0.5` preset flips them on for the next
pretrain run, gated on the v0.4.2 corpus expansion (~50M tokens).
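A hypothetical config sketch of how such a preset toggle might look (field names are illustrative, not the repo's actual API; the defaults mirror the table above):

```python
from dataclasses import dataclass

@dataclass
class GhostConfig:
    # ghost-small defaults, matching the architecture table
    n_layers: int = 6
    d_model: int = 512
    n_heads: int = 8
    d_ff: int = 2048
    vocab_size: int = 50264
    max_seq_len: int = 1024
    # v0.5 upgrades: wired in the codebase but off by default
    # so this checkpoint keeps loading unchanged
    use_rope: bool = False
    use_swiglu: bool = False
    use_rmsnorm: bool = False

# Hypothetical ghost-small-v0.5 preset flipping the new components on
ghost_small_v05 = GhostConfig(use_rope=True, use_swiglu=True, use_rmsnorm=True)
```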
## Evaluation
Scored on **[CTIBench](https://huggingface.co/datasets/AI4Sec/cti-bench) MCQ**
(2,500 multiple-choice cyber threat-intelligence questions, scored by the
log-probability of A/B/C/D as the next token after `Answer:`):
| Checkpoint | n | Accuracy | Notes |
|---|---:|---:|---|
| ghost-small Phase 4 (pretrain only) | 2500 | **17.8%** | below random — completion model, doesn't follow MCQ format |
| ghost-small chat-v2 (free-form chat tune) | 2500 | **19.0%** | identity + OOD refusal works, no MCQ-format signal |
| ghost-small chat-v2 + RAG (top-4) | 2500 | **19.0%** | retrieval is neutral without RAFT-style training |
| **ghost-small chat-v3 (MCQ-tuned)** | 2500 | **36.9%** | **canonical** — 1.48× random |
`chat-v3` adds 1,802 templated MCQ examples (NVD CWE-class, MITRE tactic,
acronym definition) to the chat training mix. The assistant turn is the
bare letter A/B/C/D; in a 30% subset the letter is followed by a one-line
justification. This teaches the model to output a single letter after
`Answer:` rather than continuing into prose — the dominant failure mode
of small models on MCQ format.
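The letter-scoring scheme described above can be sketched as follows (an illustration of the technique, not the repo's eval harness; the toy logits and letter-to-token-id mapping are assumptions):

```python
import torch

def pick_letter(next_token_logits: torch.Tensor, letter_ids: dict) -> str:
    """Pick the MCQ answer by comparing the model's next-token logits
    over the four letter tokens after the `Answer:` prompt."""
    scores = {letter: next_token_logits[tid].item() for letter, tid in letter_ids.items()}
    return max(scores, key=scores.get)

# Toy example: a 10-"token" vocab with the letters at ids 0-3.
# Id 2 ("C") has the highest logit, so "C" wins.
logits = torch.tensor([0.1, 0.5, 2.3, 0.2, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
letter_ids = {"A": 0, "B": 1, "C": 2, "D": 3}
answer = pick_letter(logits, letter_ids)  # "C"
```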
Honest comparisons: 36.9% is well above random (25%) but well below the
85-95% that frontier models score on the same benchmark. The model was
trained on 12.56M tokens of pure cybersecurity text — about 1.4% of the
Chinchilla-optimal data budget for 45M parameters. The next bench bump is
expected to come from corpus expansion (`v0.4.2`) and the v0.5
architecture upgrade.
## Usage
### Direct use (no HF transformers integration)
GhostLM has a custom architecture — it does **not** use the
HuggingFace `transformers` library and is **not** auto-loadable via
`AutoModelForCausalLM`. You need the GhostLM repo itself.
```bash
git clone https://github.com/joemunene-by/GhostLM
cd GhostLM
pip install -r requirements.txt
```
Then download the weights from this Hub repo into `checkpoints/phase5_chat_v3/`:
```python
from huggingface_hub import hf_hub_download
from pathlib import Path
dest = Path("checkpoints/phase5_chat_v3")
dest.mkdir(parents=True, exist_ok=True)
hf_hub_download(
repo_id="Ghostgim/GhostLM",
filename="pytorch_model.pt",
local_dir=str(dest),
)
# Rename to match what the loader expects
(dest / "pytorch_model.pt").rename(dest / "best_model.pt")
```
### Chat REPL
```bash
PYTHONPATH=. python3 scripts/chat.py \
--checkpoint checkpoints/phase5_chat_v3/best_model.pt \
--temperature 0.7 --top-k 40 --top-p 0.95 \
--repetition-penalty 1.25
```
Chat format uses three special tokens:
```
<|ghost_user|>What is XSS?<|ghost_end|>
<|ghost_assistant|>Cross-Site Scripting is a vulnerability where ...<|ghost_end|>
```
Use the helper to build an inference-ready prompt:
```python
from ghostlm.tokenizer import GhostTokenizer
tok = GhostTokenizer()
prompt_ids = tok.format_chat_prompt([
{"role": "user", "content": "What is XSS?"},
])
# prompt_ids ends in <|ghost_assistant|> ready for generation
```
### MCP server for Claude Code / Claude Desktop
GhostLM ships an MCP server that exposes three tools — `ghostlm_query`,
`ghostlm_explain_cve`, `ghostlm_map_to_attack` — over stdio. Install
with:
```bash
claude mcp add ghostlm -- python3 /absolute/path/to/GhostLM/scripts/mcp_server.py \
--checkpoint /absolute/path/to/checkpoints/phase5_chat_v3/best_model.pt
```
Requires Python ≥ 3.10 + `pip install mcp torch tiktoken`.
## Training data
| Source | Records | License | Notes |
|---|---:|---|---|
| NVD CVE descriptions | 64,559 | Public domain (NIST) | sampled to ~3,500 chat Q&A + 1,000 MCQ pairs |
| Exploit-DB | 4,711 | GPLv2-compatible | sampled to 2,000 chat pairs |
| Synthetic CTF writeups | 2,847 | Custom — generated locally | turned into 2,847 chat pairs |
| arXiv cs.CR | 1,890 | arXiv non-exclusive (per paper) | abstracts only in v0.4 corpus |
| MITRE ATT&CK | 655 | Apache 2.0 | all 655 chat pairs + 655 MCQs |
| CAPEC | 563 | Public domain (CISA / MITRE) | all 563 chat pairs |
| CTFtime writeups | 451 | per-author (mostly CC) | all 451 chat pairs |
| Hand-written small_talk | 153 | MIT | greetings / identity / OOD refusals |
Total chat training set: ~17,000 records after 30× small_talk oversampling
and 2× MCQ oversampling. The `data/raw/` source files and the
`data/processed/train.jsonl` pretrain corpus are reproducible from the
collector scripts in the GitHub repo.
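The oversampling described above might be implemented along these lines (the 30× and 2× factors come from this card; the record schema and `source` field are illustrative assumptions):

```python
import random

# Oversampling factors from the model card; the record schema is assumed.
OVERSAMPLE = {"small_talk": 30, "mcq": 2}

def build_chat_mix(records: list, seed: int = 0) -> list:
    """Replicate each record by its source's oversampling factor,
    then shuffle the combined mix deterministically."""
    mix = []
    for rec in records:
        mix.extend([rec] * OVERSAMPLE.get(rec["source"], 1))
    random.Random(seed).shuffle(mix)
    return mix

# Toy mix: 30 + 2 + 1 = 33 records after oversampling
toy = [
    {"source": "small_talk", "text": "hi"},
    {"source": "mcq", "text": "Answer: A"},
    {"source": "cve", "text": "CVE-..."},
]
mix = build_chat_mix(toy)
```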
## Limitations
- **No general world knowledge.** Outside cybersecurity the model is
wrong, repetitive, or both. It will refuse politely on most OOD
topics ("what's the weather", "tell me a joke") but accuracy on
general questions is essentially zero.
- **Specific facts unreliable.** Exact CVE numbers, CVSS scores, dates,
and technique IDs are memorized incompletely — the model often
confabulates plausible-looking but wrong specifics. Always verify
against [NVD](https://nvd.nist.gov), [MITRE ATT&CK](https://attack.mitre.org),
or original vendor advisories.
- **Short coherence window.** 1024-token context, no RoPE — long
multi-turn conversations drift. The chat REPL trims old turns when
the running prompt overflows.
- **CTIBench 36.9% is well above random but well below larger models.**
This is expected at 45M parameters and 12.56M training tokens.
- **Repetition prone without penalty.** Use `--repetition-penalty 1.25`
in `chat.py` (the default) to avoid "Wifi Wifi Wifi…" loops.
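The context-trimming behaviour mentioned above can be sketched like this (illustrative only; `chat.py`'s actual logic and the 128-token generation budget are assumptions):

```python
def trim_history(turns: list, max_context: int = 1024, gen_budget: int = 128) -> list:
    """Drop the oldest turns until the running prompt plus a generation
    budget fits the 1024-token context window. Always keep the newest turn."""
    while len(turns) > 1 and sum(len(t) for t in turns) > max_context - gen_budget:
        turns = turns[1:]
    return turns

# Toy: ten 200-token turns; only the newest four fit in 1024 - 128 = 896
turns = [[0] * 200 for _ in range(10)]
kept = trim_history(turns)
```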
## Intended use
- Hands-on learning: explore how a small specialized LM behaves on a
narrow domain.
- Local cybersecurity Q&A as a complement to a larger general model
via the MCP server.
- Research baseline for cyber-LLM evaluation work — a small, fully
reproducible from-scratch model with published benchmark numbers.
**Out of scope:** production security advice, vulnerability triage,
incident response. The model is a research artifact — never act on its
output without verifying against authoritative sources.
## Citation
```bibtex
@misc{ghostlm-2026,
author = {Munene, Joe},
title = {GhostLM: A small cybersecurity language model trained from scratch},
year = {2026},
url = {https://github.com/joemunene-by/GhostLM},
}
```