---
license: mit
language:
- en
tags:
- cybersecurity
- security
- cti
- mitre-attack
- cve
- chat
- from-scratch
- small-lm
library_name: pytorch
pipeline_tag: text-generation
model-index:
- name: GhostLM ghost-small chat-v3
results:
- task:
type: text-classification
name: Multiple-choice cyber-LLM benchmark
dataset:
type: AI4Sec/cti-bench
name: CTIBench MCQ
config: cti-mcq
split: test
metrics:
- type: accuracy
value: 0.369
name: accuracy (chat-v3)
- type: accuracy
value: 0.190
name: accuracy (chat-v2)
- type: accuracy
value: 0.178
name: accuracy (pretrain only, no chat)
---
# GhostLM
A small cybersecurity language model trained from scratch. Not a fine-tune
of an existing base — every parameter learned on a curated security
corpus. Currently shipping the v0.5.0 chat-tuned variant of `ghost-small`.
- **Repository:** [github.com/joemunene-by/GhostLM](https://github.com/joemunene-by/GhostLM)
- **Live demo:** [Ghostgim/ghostlm Space](https://huggingface.co/spaces/Ghostgim/ghostlm)
- **License:** MIT
## What this model is
`ghost-small` chat-v3 is a 45M-parameter decoder-only transformer trained
from random initialization on **12.56M tokens of cybersecurity text** —
NVD CVE descriptions, MITRE ATT&CK techniques, MITRE CAPEC patterns,
Exploit-DB entries, CTFtime writeups, arXiv cs.CR security research, plus
a small synthetic CTF-writeup augmentation. After 30,000 steps of
pretraining (`Phase 4`), it was supervised-fine-tuned for chat with a
mix of free-form Q&A, hand-written small-talk / identity / OOD-refusal
pairs, and templated MCQ examples (NVD CWE-class, MITRE tactic, common
acronyms). The chat tune adds three new role tokens
(`<|ghost_user|>`, `<|ghost_assistant|>`, `<|ghost_end|>`), appended after
the GPT-2 BPE vocabulary and its existing special tokens (50,261 → 50,264).
## Why a 45M from-scratch model
A 45M model is too small to be a general-purpose assistant. The thesis is
specialization: a focused security corpus + targeted SFT can match or
beat much larger general models on narrow security tasks at a fraction
of the size, while running on a laptop CPU. CTIBench results below are
the test of that thesis.
## Architecture
| Component | Value |
|---|---|
| Type | Decoder-only Transformer (GPT-2 family) |
| Layers | 6 |
| Hidden dim (`d_model`) | 512 |
| Heads | 8 (head dim 64) |
| FFN dim (`d_ff`) | 2048, GELU |
| Norm | LayerNorm, pre-norm |
| Positional encoding | Learned absolute |
| Vocabulary | 50,264 (GPT-2 BPE + 7 specials, incl. 3 chat-role tokens) |
| Context length | 1024 |
| Total params | ~45.2M |
| Tied input/output embeddings | yes |
The v0.5 architecture upgrade (RoPE / SwiGLU / RMSNorm) is wired in the
codebase but disabled by default for backward compatibility with this
checkpoint. A `ghost-small-v0.5` preset flips them on for the next
pretrain run, gated on the v0.4.2 corpus expansion (~50M tokens).
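A hypothetical config sketch of how such a preset toggle might look (field names are illustrative, not the repo's actual API; the defaults mirror the table above):

```python
from dataclasses import dataclass

@dataclass
class GhostConfig:
    # ghost-small defaults, matching the architecture table
    n_layers: int = 6
    d_model: int = 512
    n_heads: int = 8
    d_ff: int = 2048
    vocab_size: int = 50264
    max_seq_len: int = 1024
    # v0.5 upgrades: wired in the codebase but off by default
    # so this checkpoint keeps loading unchanged
    use_rope: bool = False
    use_swiglu: bool = False
    use_rmsnorm: bool = False

# Hypothetical ghost-small-v0.5 preset flipping the new components on
ghost_small_v05 = GhostConfig(use_rope=True, use_swiglu=True, use_rmsnorm=True)
```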
## Evaluation
Scored on **[CTIBench](https://huggingface.co/datasets/AI4Sec/cti-bench) MCQ**
(2,500 multiple-choice cyber threat-intelligence questions, scored by the
log-probability of A/B/C/D as the next token after `Answer:`):
| Checkpoint | n | Accuracy | Notes |
|---|---:|---:|---|
| ghost-small Phase 4 (pretrain only) | 2500 | **17.8%** | below random — completion model, doesn't follow MCQ format |
| ghost-small chat-v2 (free-form chat tune) | 2500 | **19.0%** | identity + OOD refusal works, no MCQ-format signal |
| ghost-small chat-v2 + RAG (top-4) | 2500 | **19.0%** | retrieval is neutral without RAFT-style training |
| **ghost-small chat-v3 (MCQ-tuned)** | 2500 | **36.9%** | **canonical** — 1.48× random |
`chat-v3` adds 1,802 templated MCQ examples (NVD CWE-class, MITRE tactic,
acronym definition) to the chat training mix. The assistant turn is the
bare letter A/B/C/D; in a 30% subset the letter is followed by a one-line
justification. This teaches the model to output a single letter after
`Answer:` rather than continuing into prose — the dominant failure mode
of small models on MCQ format.
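The letter-scoring scheme described above can be sketched as follows (an illustration of the technique, not the repo's eval harness; the toy logits and letter-to-token-id mapping are assumptions):

```python
import torch

def pick_letter(next_token_logits: torch.Tensor, letter_ids: dict) -> str:
    """Pick the MCQ answer by comparing the model's next-token logits
    over the four letter tokens after the `Answer:` prompt."""
    scores = {letter: next_token_logits[tid].item() for letter, tid in letter_ids.items()}
    return max(scores, key=scores.get)

# Toy example: a 10-"token" vocab with the letters at ids 0-3.
# Id 2 ("C") has the highest logit, so "C" wins.
logits = torch.tensor([0.1, 0.5, 2.3, 0.2, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
letter_ids = {"A": 0, "B": 1, "C": 2, "D": 3}
answer = pick_letter(logits, letter_ids)  # "C"
```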
Honest comparisons: 36.9% is well above random (25%) but well below the
85-95% that frontier models score on the same benchmark. The model was
trained on 12.56M tokens of pure cybersecurity text — about 1.4% of the
Chinchilla-optimal data budget for 45M parameters. The next bench bump is
expected to come from corpus expansion (`v0.4.2`) and the v0.5
architecture upgrade.
## Usage
### Direct use (no HF transformers integration)
GhostLM has a custom architecture — it does **not** use the
HuggingFace `transformers` library and is **not** auto-loadable via
`AutoModelForCausalLM`. You need the GhostLM repo itself.
```bash
git clone https://github.com/joemunene-by/GhostLM
cd GhostLM
pip install -r requirements.txt
```
Then download the weights from this Hub repo into `checkpoints/phase5_chat_v3/`:
```python
from huggingface_hub import hf_hub_download
from pathlib import Path
dest = Path("checkpoints/phase5_chat_v3")
dest.mkdir(parents=True, exist_ok=True)
hf_hub_download(
repo_id="Ghostgim/GhostLM",
filename="pytorch_model.pt",
local_dir=str(dest),
)
# Rename to match what the loader expects
(dest / "pytorch_model.pt").rename(dest / "best_model.pt")
```
### Chat REPL
```bash
PYTHONPATH=. python3 scripts/chat.py \
--checkpoint checkpoints/phase5_chat_v3/best_model.pt \
--temperature 0.7 --top-k 40 --top-p 0.95 \
--repetition-penalty 1.25
```
Chat format uses three special tokens:
```
<|ghost_user|>What is XSS?<|ghost_end|>
<|ghost_assistant|>Cross-Site Scripting is a vulnerability where ...<|ghost_end|>
```
Use the helper to build an inference-ready prompt:
```python
from ghostlm.tokenizer import GhostTokenizer
tok = GhostTokenizer()
prompt_ids = tok.format_chat_prompt([
{"role": "user", "content": "What is XSS?"},
])
# prompt_ids ends in <|ghost_assistant|> ready for generation
```
### MCP server for Claude Code / Claude Desktop
GhostLM ships an MCP server that exposes three tools — `ghostlm_query`,
`ghostlm_explain_cve`, `ghostlm_map_to_attack` — over stdio. Install
with:
```bash
claude mcp add ghostlm -- python3 /absolute/path/to/GhostLM/scripts/mcp_server.py \
--checkpoint /absolute/path/to/checkpoints/phase5_chat_v3/best_model.pt
```
Requires Python ≥ 3.10 + `pip install mcp torch tiktoken`.
## Training data
| Source | Records | License | Notes |
|---|---:|---|---|
| NVD CVE descriptions | 64,559 | Public domain (NIST) | sampled to ~3,500 chat Q&A + 1,000 MCQ pairs |
| Exploit-DB | 4,711 | GPLv2-compatible | sampled to 2,000 chat pairs |
| Synthetic CTF writeups | 2,847 | Custom — generated locally | turned into 2,847 chat pairs |
| arXiv cs.CR | 1,890 | arXiv non-exclusive (per paper) | abstracts only in v0.4 corpus |
| MITRE ATT&CK | 655 | Apache 2.0 | all 655 chat pairs + 655 MCQs |
| CAPEC | 563 | Public domain (CISA / MITRE) | all 563 chat pairs |
| CTFtime writeups | 451 | per-author (mostly CC) | all 451 chat pairs |
| Hand-written small_talk | 153 | MIT | greetings / identity / OOD refusals |
Total chat training set: ~17,000 records after 30× small_talk oversampling
and 2× MCQ oversampling. The `data/raw/` source files and the
`data/processed/train.jsonl` pretrain corpus are reproducible from the
collector scripts in the GitHub repo.
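The oversampling described above might be implemented along these lines (the 30× and 2× factors come from this card; the record schema and `source` field are illustrative assumptions):

```python
import random

# Oversampling factors from the model card; the record schema is assumed.
OVERSAMPLE = {"small_talk": 30, "mcq": 2}

def build_chat_mix(records: list, seed: int = 0) -> list:
    """Replicate each record by its source's oversampling factor,
    then shuffle the combined mix deterministically."""
    mix = []
    for rec in records:
        mix.extend([rec] * OVERSAMPLE.get(rec["source"], 1))
    random.Random(seed).shuffle(mix)
    return mix

# Toy mix: 30 + 2 + 1 = 33 records after oversampling
toy = [
    {"source": "small_talk", "text": "hi"},
    {"source": "mcq", "text": "Answer: A"},
    {"source": "cve", "text": "CVE-..."},
]
mix = build_chat_mix(toy)
```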
## Limitations
- **No general world knowledge.** Outside cybersecurity the model is
wrong, repetitive, or both. It will refuse politely on most OOD
topics ("what's the weather", "tell me a joke") but accuracy on
general questions is essentially zero.
- **Specific facts unreliable.** Exact CVE numbers, CVSS scores, dates,
and technique IDs are memorized incompletely — the model often
confabulates plausible-looking but wrong specifics. Always verify
against [NVD](https://nvd.nist.gov), [MITRE ATT&CK](https://attack.mitre.org),
or original vendor advisories.
- **Short coherence window.** 1024-token context, no RoPE — long
multi-turn conversations drift. The chat REPL trims old turns when
the running prompt overflows.
- **CTIBench 36.9% is well above random but well below larger models.**
This is expected at 45M parameters and 12.56M training tokens.
- **Repetition prone without penalty.** Use `--repetition-penalty 1.25`
in `chat.py` (the default) to avoid "Wifi Wifi Wifi…" loops.
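The context-trimming behaviour mentioned above can be sketched like this (illustrative only; `chat.py`'s actual logic and the 128-token generation budget are assumptions):

```python
def trim_history(turns: list, max_context: int = 1024, gen_budget: int = 128) -> list:
    """Drop the oldest turns until the running prompt plus a generation
    budget fits the 1024-token context window. Always keep the newest turn."""
    while len(turns) > 1 and sum(len(t) for t in turns) > max_context - gen_budget:
        turns = turns[1:]
    return turns

# Toy: ten 200-token turns; only the newest four fit in 1024 - 128 = 896
turns = [[0] * 200 for _ in range(10)]
kept = trim_history(turns)
```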
## Intended use
- Hands-on learning: explore how a small specialized LM behaves on a
narrow domain.
- Local cybersecurity Q&A as a complement to a larger general model
via the MCP server.
- Research baseline for cyber-LLM evaluation work — a small, fully
reproducible from-scratch model with published benchmark numbers.
**Out of scope:** production security advice, vulnerability triage,
incident response. The model is a research artifact — never act on its
output without verifying against authoritative sources.
## Citation
```bibtex
@misc{ghostlm-2026,
author = {Munene, Joe},
title = {GhostLM: A small cybersecurity language model trained from scratch},
year = {2026},
url = {https://github.com/joemunene-by/GhostLM},
}
```