---
license: mit
language:
- en
tags:
- cybersecurity
- security
- cti
- mitre-attack
- cve
- chat
- from-scratch
- small-lm
library_name: pytorch
pipeline_tag: text-generation
model-index:
- name: GhostLM ghost-small chat-v3
  results:
  - task:
      type: text-classification
      name: Multiple-choice cyber-LLM benchmark
    dataset:
      type: AI4Sec/cti-bench
      name: CTIBench MCQ
      config: cti-mcq
      split: test
    metrics:
    - type: accuracy
      value: 0.369
      name: accuracy (chat-v3)
    - type: accuracy
      value: 0.190
      name: accuracy (chat-v2)
    - type: accuracy
      value: 0.178
      name: accuracy (pretrain only, no chat)
---

# GhostLM

A small cybersecurity language model trained from scratch. It is not a
fine-tune of an existing base model: every parameter was learned on a
curated security corpus. This repo currently ships the v0.5.0 chat-tuned
variant of `ghost-small`.

- **Repository:** [github.com/joemunene-by/GhostLM](https://github.com/joemunene-by/GhostLM)
- **Live demo:** [Ghostgim/ghostlm Space](https://huggingface.co/spaces/Ghostgim/ghostlm)
- **License:** MIT

## What this model is

`ghost-small` chat-v3 is a 45M-parameter decoder-only transformer trained
from random initialization on **12.56M tokens of cybersecurity text** —
NVD CVE descriptions, MITRE ATT&CK techniques, MITRE CAPEC patterns,
Exploit-DB entries, CTFtime writeups, arXiv cs.CR security research, plus
a small synthetic CTF-writeup augmentation. After 30,000 steps of
pretraining (`Phase 4`), it was supervised-fine-tuned for chat with a
mix of free-form Q&A, hand-written small-talk / identity / OOD-refusal
pairs, and templated MCQ examples (NVD CWE-class, MITRE tactic, common
acronyms). The chat tune adds three role tokens
(`<|ghost_user|>`, `<|ghost_assistant|>`, `<|ghost_end|>`), appended after
the base vocabulary (GPT-2 BPE plus four earlier special tokens), growing
it from 50,261 to 50,264 entries.

## Why a 45M from-scratch model

A 45M model is too small to be a general-purpose assistant. The thesis is
specialization: a focused security corpus + targeted SFT can match or
beat much larger general models on narrow security tasks at a fraction
of the size, while running on a laptop CPU. CTIBench results below are
the test of that thesis.

## Architecture

| | |
|---|---|
| Type | Decoder-only Transformer (GPT-2 family) |
| Layers | 6 |
| Hidden dim (`d_model`) | 512 |
| Heads | 8 (head dim 64) |
| FFN dim (`d_ff`) | 2048, GELU |
| Norm | LayerNorm, pre-norm |
| Positional encoding | Learned absolute |
| Vocabulary | 50,264 (GPT-2 BPE + 7 specials, incl. 3 chat-role tokens) |
| Context length | 1024 |
| Total params | ~45.2M |
| Tied input/output embeddings | yes |
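
As a rough cross-check of the ~45.2M figure, the dimensions in the table can be plugged into a parameter count. This is a minimal sketch that assumes GPT-2-style blocks (biases on every linear layer, two LayerNorms per block plus a final one); those details are not stated in this card, and `count_params` is an illustrative name, not a function from the repo.

```python
def count_params(vocab=50_264, d_model=512, n_layers=6, d_ff=2048, ctx=1024):
    """Estimate parameters for a GPT-2-style decoder with tied embeddings.

    Assumption (not confirmed by the model card): linear layers carry biases,
    and each block has two LayerNorms, with one more after the last block.
    """
    tok_emb = vocab * d_model                    # shared with the output head (tied)
    pos_emb = ctx * d_model                      # learned absolute positions
    attn = 4 * (d_model * d_model + d_model)     # Q, K, V and output projections
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    norms = 2 * (2 * d_model)                    # scale + shift for two LayerNorms
    final_norm = 2 * d_model
    return tok_emb + pos_emb + n_layers * (attn + ffn + norms) + final_norm


total = count_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate lands at about 45.2M, with the token embedding table accounting for more than half of the total.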

The v0.5 architecture upgrade (RoPE / SwiGLU / RMSNorm) is implemented in
the codebase but disabled by default to stay compatible with this
checkpoint. A `ghost-small-v0.5` preset enables all three for the next
pretraining run, which is gated on the v0.4.2 corpus expansion (~50M tokens).

## Evaluation

Scored on **[CTIBench](https://huggingface.co/datasets/AI4Sec/cti-bench) MCQ**
(2,500 multiple-choice cyber threat-intelligence questions, scored by the
log-probability of A/B/C/D as the next token after `Answer:`):

| Checkpoint | n | Accuracy | Notes |
|---|---:|---:|---|
| ghost-small Phase 4 (pretrain only) | 2500 | **17.8%** | below random — completion model, doesn't follow MCQ format |
| ghost-small chat-v2 (free-form chat tune) | 2500 | **19.0%** | identity + OOD refusal works, no MCQ-format signal |
| ghost-small chat-v2 + RAG (top-4) | 2500 | **19.0%** | retrieval is neutral without RAFT-style training |
| **ghost-small chat-v3 (MCQ-tuned)** | 2500 | **36.9%** | **canonical** — 1.48× random |
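
The scoring rule described above (compare the log-probability of each answer letter as the next token after `Answer:`) can be sketched in plain Python. The helper name `score_mcq` and the toy logits are hypothetical; the repo's actual evaluation script may differ in tokenization and tie-breaking.

```python
import math


def score_mcq(next_token_logits, choices=("A", "B", "C", "D")):
    """Pick the answer letter with the highest next-token log-probability.

    `next_token_logits` maps token strings to raw logits for the position
    right after "Answer:". Hypothetical helper, not the repo's eval code.
    """
    # Numerically stable log-softmax over all tokens we were given.
    peak = max(next_token_logits.values())
    log_z = peak + math.log(
        sum(math.exp(v - peak) for v in next_token_logits.values())
    )
    log_probs = {c: next_token_logits[c] - log_z for c in choices}
    best = max(choices, key=lambda c: log_probs[c])
    return best, log_probs


# Toy logits: the model prefers "C" but also likes a prose token.
logits = {"A": 1.2, "B": 0.3, "C": 2.1, "D": -0.5, " The": 1.9}
answer, log_probs = score_mcq(logits)
print(answer, {k: round(v, 3) for k, v in log_probs.items()})
```

Because only the four letters are compared, a checkpoint that would rather continue into prose still gets assigned an answer, just with little probability mass behind it.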

`chat-v3` adds 1,802 templated MCQ examples (NVD CWE-class, MITRE tactic,
acronym definition) to the chat training mix. The assistant turn is the
bare letter A/B/C/D, with a 30% subset followed by a one-line
justification. This teaches the model to output a single letter after
`Answer:` rather than continuing into prose — the dominant failure mode
of small models on MCQ format.
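
One way to picture such a training example is sketched below. Only the role tokens, the bare-letter answer, and the roughly 30% justification rate come from the description above; the question wording, the option layout, and the function `build_mcq_example` are assumptions made for illustration and are not the repo's template.

```python
import random

USER, ASSISTANT, END = "<|ghost_user|>", "<|ghost_assistant|>", "<|ghost_end|>"


def build_mcq_example(question, options, answer, justification, rng, p_justify=0.3):
    """Render one MCQ chat example (hypothetical layout, not the repo's template)."""
    prompt_lines = [question]
    prompt_lines += [f"{letter}) {text}" for letter, text in zip("ABCD", options)]
    prompt_lines.append("Answer:")
    prompt = "\n".join(prompt_lines)
    reply = answer
    # Roughly 30% of examples keep a one-line justification after the letter.
    if rng.random() < p_justify:
        reply = f"{answer}\n{justification}"
    return f"{USER}{prompt}{END}\n{ASSISTANT}{reply}{END}"


rng = random.Random(0)  # seeded so the sketch is repeatable
example = build_mcq_example(
    "Which ATT&CK tactic covers credential dumping?",
    ["Discovery", "Credential Access", "Impact", "Collection"],
    "B",
    "Credential dumping is a Credential Access technique.",
    rng,
)
print(example)
```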

Honest comparison: 36.9% is well above random (25%) but well below the
85-95% that frontier models score on the same benchmark. The model was
trained on 12.56M tokens of pure cybersecurity text, about 1.4% of the
Chinchilla-optimal data budget for 45M parameters. The next benchmark
gain is expected to come from corpus expansion (`v0.4.2`) and the v0.5
architecture upgrade.
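
The 1.4% figure is consistent with the common reading of the Chinchilla result as roughly 20 training tokens per parameter. That ratio is a rule of thumb assumed here, not a number stated in this card; the other values are taken from the text above.

```python
TOKENS_PER_PARAM = 20      # assumed Chinchilla rule of thumb
params = 45e6              # model size from this card
train_tokens = 12.56e6     # pretraining corpus size from this card

optimal_tokens = TOKENS_PER_PARAM * params
fraction = train_tokens / optimal_tokens
print(f"optimal ~{optimal_tokens / 1e6:.0f}M tokens, trained on {fraction:.1%}")

chat_v3_acc, random_acc = 0.369, 0.25
print(f"chat-v3 is {chat_v3_acc / random_acc:.2f}x random")
```

With these inputs the data budget comes out near 900M tokens, so the current corpus covers about 1.4% of it.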

## Usage

### Direct use (no HF transformers integration)

GhostLM has a custom architecture — it does **not** use the
HuggingFace `transformers` library and is **not** auto-loadable via
`AutoModelForCausalLM`. You need the GhostLM repo itself.

```bash
git clone https://github.com/joemunene-by/GhostLM
cd GhostLM
pip install -r requirements.txt
```

Then download the weights from this Hub repo into `checkpoints/phase5_chat_v3/`:

```python
from pathlib import Path

from huggingface_hub import hf_hub_download

dest = Path("checkpoints/phase5_chat_v3")
dest.mkdir(parents=True, exist_ok=True)

# hf_hub_download returns the local path of the downloaded file.
downloaded = Path(
    hf_hub_download(
        repo_id="Ghostgim/GhostLM",
        filename="pytorch_model.pt",
        local_dir=str(dest),
    )
)
# Rename to match what the loader expects; replace() also overwrites an
# older best_model.pt if the script is re-run.
downloaded.replace(dest / "best_model.pt")
```

### Chat REPL

```bash
PYTHONPATH=. python3 scripts/chat.py \
    --checkpoint checkpoints/phase5_chat_v3/best_model.pt \
    --temperature 0.7 --top-k 40 --top-p 0.95 \
    --repetition-penalty 1.25
```
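
For readers unfamiliar with the four sampling flags above, here is a minimal pure-Python sketch of how such knobs are commonly combined. The order of operations and the CTRL-style repetition penalty are assumptions; `sample_next` is a hypothetical function and does not describe what `scripts/chat.py` actually does internally.

```python
import math
import random


def sample_next(logits, history, temperature=0.7, top_k=40, top_p=0.95,
                repetition_penalty=1.25, rng=random):
    """Sample one token id from a list of raw logits (illustrative only)."""
    logits = list(logits)
    # Repetition penalty: make tokens that were already generated less likely.
    for tok in set(history):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    # Keep the top-k candidates, then apply temperature and softmax.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Nucleus (top-p): keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in zip(ranked, probs):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    toks, ps = zip(*kept)
    return rng.choices(toks, weights=ps, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 1.5, 0.1, -1.0]
print(sample_next(toy_logits, history=[0, 0], rng=rng))
```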

Chat format uses three special tokens:

```
<|ghost_user|>What is XSS?<|ghost_end|>
<|ghost_assistant|>Cross-Site Scripting is a vulnerability where ...<|ghost_end|>
```

Use the helper to build an inference-ready prompt:

```python
from ghostlm.tokenizer import GhostTokenizer
tok = GhostTokenizer()
prompt_ids = tok.format_chat_prompt([
    {"role": "user", "content": "What is XSS?"},
])
# prompt_ids ends in <|ghost_assistant|> ready for generation
```

### MCP server for Claude Code / Claude Desktop

GhostLM ships an MCP server that exposes three tools — `ghostlm_query`,
`ghostlm_explain_cve`, `ghostlm_map_to_attack` — over stdio. Install
with:

```bash
claude mcp add ghostlm -- python3 /absolute/path/to/GhostLM/scripts/mcp_server.py \
    --checkpoint /absolute/path/to/checkpoints/phase5_chat_v3/best_model.pt
```

Requires Python ≥ 3.10 and the `mcp`, `torch`, and `tiktoken` packages
(`pip install mcp torch tiktoken`).

## Training data

| Source | Records | License | Notes |
|---|---:|---|---|
| NVD CVE descriptions | 64,559 | Public domain (NIST) | sampled to ~3,500 chat Q&A + 1,000 MCQ pairs |
| Exploit-DB | 4,711 | GPLv2-compatible | sampled to 2,000 chat pairs |
| Synthetic CTF writeups | 2,847 | Custom — generated locally | turned into 2,847 chat pairs |
| arXiv cs.CR | 1,890 | arXiv non-exclusive (per paper) | abstracts only in v0.4 corpus |
| MITRE ATT&CK | 655 | Apache 2.0 | all 655 chat pairs + 655 MCQs |
| CAPEC | 563 | Public domain (CISA / MITRE) | all 563 chat pairs |
| CTFtime writeups | 451 | per-author (mostly CC) | all 451 chat pairs |
| Hand-written small_talk | 153 | MIT | greetings / identity / OOD refusals |

Total chat training set: ~17,000 records after 30× small_talk oversampling
and 2× MCQ oversampling. The `data/raw/` source files and the
`data/processed/train.jsonl` pretrain corpus are reproducible from the
collector scripts in the GitHub repo.

## Limitations

- **No general world knowledge.** Outside cybersecurity the model is
  wrong, repetitive, or both. It will refuse politely on most OOD
  topics ("what's the weather", "tell me a joke") but accuracy on
  general questions is essentially zero.
- **Specific facts unreliable.** Exact CVE numbers, CVSS scores, dates,
  and technique IDs are memorized incompletely — the model often
  confabulates plausible-looking but wrong specifics. Always verify
  against [NVD](https://nvd.nist.gov), [MITRE ATT&CK](https://attack.mitre.org),
  or original vendor advisories.
- **Short coherence window.** 1024-token context, no RoPE — long
  multi-turn conversations drift. The chat REPL trims old turns when
  the running prompt overflows.
- **CTIBench 36.9% is well above random but well below larger models.**
  This is expected at 45M parameters and 12.56M training tokens.
- **Repetition-prone without a penalty.** Use `--repetition-penalty 1.25`
  in `chat.py` (the default) to avoid "Wifi Wifi Wifi…" loops.
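
The turn trimming mentioned in the list above can be sketched as follows. This is an assumption-laden illustration rather than the REPL's code: `render` and `trim_turns` are hypothetical names, the newline between turns follows the format example in this card, whitespace splitting stands in for the real BPE token count, and the `reserve` budget for the reply is a made-up default.

```python
USER, ASSISTANT, END = "<|ghost_user|>", "<|ghost_assistant|>", "<|ghost_end|>"


def render(turns):
    """Join turns into the chat format, ending with an open assistant tag."""
    tags = {"user": USER, "assistant": ASSISTANT}
    body = "\n".join(f"{tags[t['role']]}{t['content']}{END}" for t in turns)
    return f"{body}\n{ASSISTANT}"


def trim_turns(turns, max_tokens=1024, reserve=128,
               count_tokens=lambda s: len(s.split())):
    """Drop the oldest turns until the prompt fits the context budget.

    `count_tokens` is a whitespace stand-in for the real tokenizer, and
    `reserve` (room left for the reply) is a made-up default.
    """
    kept = list(turns)
    while len(kept) > 1 and count_tokens(render(kept)) > max_tokens - reserve:
        kept.pop(0)  # the oldest turn goes first
    return kept


history = [
    {"role": "user", "content": "word " * 600},
    {"role": "assistant", "content": "word " * 600},
    {"role": "user", "content": "What is XSS?"},
]
trimmed = trim_turns(history)
print(len(trimmed), render(trimmed)[-40:])
```

Dropping whole turns from the front keeps the most recent question intact, at the cost of the model forgetting earlier context.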

## Intended use

- Hands-on learning: explore how a small specialized LM behaves on a
  narrow domain.
- Local cybersecurity Q&A as a complement to a larger general model
  via the MCP server.
- Research baseline for cyber-LLM evaluation work — a small, fully
  reproducible from-scratch model with published benchmark numbers.

**Out of scope:** production security advice, vulnerability triage,
incident response. The model is a research artifact — never act on its
output without verifying against authoritative sources.

## Citation

```bibtex
@misc{ghostlm-2026,
  author = {Munene, Joe},
  title  = {GhostLM: A small cybersecurity language model trained from scratch},
  year   = {2026},
  url    = {https://github.com/joemunene-by/GhostLM},
}
```