---
license: mit
language:
- en
tags:
- cybersecurity
- security
- cti
- mitre-attack
- cve
- chat
- from-scratch
- small-lm
library_name: pytorch
pipeline_tag: text-generation
model-index:
- name: GhostLM ghost-small chat-v3
  results:
  - task:
      type: text-classification
      name: Multiple-choice cyber-LLM benchmark
    dataset:
      type: AI4Sec/cti-bench
      name: CTIBench MCQ
      config: cti-mcq
      split: test
    metrics:
    - type: accuracy
      value: 0.369
      name: accuracy (chat-v3)
    - type: accuracy
      value: 0.190
      name: accuracy (chat-v2)
    - type: accuracy
      value: 0.178
      name: accuracy (pretrain only, no chat)
---

# GhostLM

A small cybersecurity language model trained from scratch. It is not a
fine-tune of an existing base model: every parameter was learned on a
curated security corpus. The currently shipping checkpoint is the v0.5.0
chat-tuned variant of `ghost-small`.

- **Repository:** [github.com/joemunene-by/GhostLM](https://github.com/joemunene-by/GhostLM)
- **Live demo:** [Ghostgim/ghostlm Space](https://huggingface.co/spaces/Ghostgim/ghostlm)
- **License:** MIT

## What this model is

`ghost-small` chat-v3 is a 45M-parameter decoder-only transformer trained
from random initialization on **12.56M tokens of cybersecurity text**:
NVD CVE descriptions, MITRE ATT&CK techniques, MITRE CAPEC patterns,
Exploit-DB entries, CTFtime writeups, arXiv cs.CR security research, plus
a small synthetic CTF-writeup augmentation. After 30,000 steps of
pretraining (`Phase 4`), it was supervised fine-tuned for chat on a
mix of free-form Q&A, hand-written small-talk / identity / OOD-refusal
pairs, and templated MCQ examples (NVD CWE-class, MITRE tactic, common
acronyms). The chat tune adds three role tokens
(`<|ghost_user|>`, `<|ghost_assistant|>`, `<|ghost_end|>`) after
the base GPT-2 BPE vocabulary, growing it from 50,261 to 50,264 entries.

## Why a 45M from-scratch model

A 45M model is too small to be a general-purpose assistant. The thesis is
specialization: a focused security corpus plus targeted SFT can match or
beat much larger general models on narrow security tasks at a fraction
of the size, while running on a laptop CPU. The CTIBench results below are
the test of that thesis.

## Architecture

| Component | Value |
|---|---|
| Type | Decoder-only Transformer (GPT-2 family) |
| Layers | 6 |
| Hidden dim (`d_model`) | 512 |
| Heads | 8 (head dim 64) |
| FFN dim (`d_ff`) | 2048, GELU |
| Norm | LayerNorm, pre-norm |
| Positional encoding | Learned absolute |
| Vocabulary | 50,264 (GPT-2 BPE + 7 specials, incl. 3 chat-role tokens) |
| Context length | 1024 |
| Total params | ~45.2M |
| Tied input/output embeddings | yes |

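The ~45.2M total can be sanity-checked from the table. A back-of-the-envelope count in plain Python (assuming biased linear layers; with tied embeddings the output head adds no parameters):

```python
# Rough parameter count for ghost-small from the architecture table.
vocab, d_model, d_ff, n_layers, ctx = 50_264, 512, 2048, 6, 1024

tok_emb = vocab * d_model                        # token embeddings (tied with head)
pos_emb = ctx * d_model                          # learned absolute positions
attn = 4 * (d_model * d_model + d_model)         # Q, K, V, O projections + biases
ffn = 2 * (d_model * d_ff) + d_ff + d_model      # up/down projections + biases
norms = 2 * 2 * d_model                          # two LayerNorms (scale + bias) per block
per_layer = attn + ffn + norms

total = tok_emb + pos_emb + n_layers * per_layer + 2 * d_model  # + final LayerNorm
print(f"{total / 1e6:.1f}M")  # → 45.2M
```

The embedding matrix alone is ~25.7M parameters, more than half the model, which is typical at this scale.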
The v0.5 architecture upgrade (RoPE / SwiGLU / RMSNorm) is wired into the
codebase but disabled by default for backward compatibility with this
checkpoint. A `ghost-small-v0.5` preset flips them on for the next
pretrain run, gated on the v0.4.2 corpus expansion (~50M tokens).

## Evaluation

Scored on **[CTIBench](https://huggingface.co/datasets/AI4Sec/cti-bench) MCQ**
(2,500 multiple-choice cyber threat-intelligence questions, scored by the
log-probability of A/B/C/D as the next token after `Answer:`):

| Checkpoint | n | Accuracy | Notes |
|---|---:|---:|---|
| ghost-small Phase 4 (pretrain only) | 2500 | **17.8%** | below random; completion model, doesn't follow MCQ format |
| ghost-small chat-v2 (free-form chat tune) | 2500 | **19.0%** | identity + OOD refusal works, no MCQ-format signal |
| ghost-small chat-v2 + RAG (top-4) | 2500 | **19.0%** | retrieval is neutral without RAFT-style training |
| **ghost-small chat-v3 (MCQ-tuned)** | 2500 | **36.9%** | **canonical**; 1.48× random |

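The letter-logprob scoring rule can be sketched as follows. This is an illustrative stand-in: the real harness reads next-token logits for the four letter tokens from the GhostLM checkpoint, whereas here they are passed in as a plain dict:

```python
import math

def pick_answer(next_token_logits: dict[str, float]) -> str:
    """Score an MCQ by comparing the model's logits for the four option
    letters as the single next token after 'Answer:'."""
    letters = ["A", "B", "C", "D"]
    # Softmax restricted to the four letters; the argmax is unchanged,
    # but the normalized probabilities are useful for reporting.
    z = max(next_token_logits[l] for l in letters)
    probs = {l: math.exp(next_token_logits[l] - z) for l in letters}
    total = sum(probs.values())
    probs = {l: p / total for l, p in probs.items()}
    return max(letters, key=lambda l: probs[l])

# Hypothetical logits for "A".."D" after the prompt's "Answer:"
assert pick_answer({"A": -1.2, "B": 0.7, "C": -0.3, "D": -2.0}) == "B"
```

Because only the four letter tokens are compared, the model never gets credit for rambling into prose; it must concentrate probability on a single letter.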
`chat-v3` adds 1,802 templated MCQ examples (NVD CWE-class, MITRE tactic,
acronym definition) to the chat training mix. The assistant turn is the
bare letter A/B/C/D, with a 30% subset followed by a one-line
justification. This teaches the model to output a single letter after
`Answer:` rather than continuing into prose, which is the dominant failure
mode of small models on MCQ format.

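A minimal sketch of how such a templated MCQ example could be serialized with the chat role tokens. The template wording and helper name are hypothetical; only the role-token format matches the card:

```python
def format_mcq_example(question: str, options: dict[str, str],
                       answer: str, justification: str = "") -> str:
    """Serialize one templated MCQ pair in the GhostLM chat format.
    The assistant turn is the bare letter, optionally followed by a
    one-line justification (used for ~30% of examples)."""
    opts = "\n".join(f"{k}) {v}" for k, v in sorted(options.items()))
    user = f"{question}\n{opts}\nAnswer:"
    assistant = f"{answer}. {justification}" if justification else answer
    return (f"<|ghost_user|>{user}<|ghost_end|>\n"
            f"<|ghost_assistant|>{assistant}<|ghost_end|>")

example = format_mcq_example(
    "Which MITRE ATT&CK tactic covers credential dumping?",
    {"A": "Persistence", "B": "Credential Access",
     "C": "Discovery", "D": "Exfiltration"},
    "B",
)
```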
For honest comparison: 36.9% is well above random (25%) but well below the
85-95% that frontier models score on the same benchmark. The model was
trained on 12.56M tokens of pure cybersecurity text, about 1.4% of the
Chinchilla-optimal data budget for 45M parameters. The next benchmark bump
is expected to come from corpus expansion (`v0.4.2`) and the v0.5
architecture upgrade.

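The 1.4% figure follows from the common Chinchilla rule of thumb of roughly 20 training tokens per parameter (a heuristic, not an exact scaling-law fit):

```python
params = 45.2e6
tokens_trained = 12.56e6
chinchilla_optimal = 20 * params          # ~20 tokens/param → ~904M tokens
fraction = tokens_trained / chinchilla_optimal
print(f"{fraction:.1%}")  # → 1.4%
```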
## Usage

### Direct use (no HF transformers integration)

GhostLM has a custom architecture: it does **not** use the
HuggingFace `transformers` library and is **not** auto-loadable via
`AutoModelForCausalLM`. You need the GhostLM repo itself.

```bash
git clone https://github.com/joemunene-by/GhostLM
cd GhostLM
pip install -r requirements.txt
```

Then download the weights from this Hub repo into `checkpoints/phase5_chat_v3/`:

```python
from pathlib import Path

from huggingface_hub import hf_hub_download

dest = Path("checkpoints/phase5_chat_v3")
dest.mkdir(parents=True, exist_ok=True)
hf_hub_download(
    repo_id="Ghostgim/GhostLM",
    filename="pytorch_model.pt",
    local_dir=str(dest),
)
# Rename to match what the loader expects
(dest / "pytorch_model.pt").rename(dest / "best_model.pt")
```

### Chat REPL

```bash
PYTHONPATH=. python3 scripts/chat.py \
  --checkpoint checkpoints/phase5_chat_v3/best_model.pt \
  --temperature 0.7 --top-k 40 --top-p 0.95 \
  --repetition-penalty 1.25
```

The chat format uses three special tokens:

```
<|ghost_user|>What is XSS?<|ghost_end|>
<|ghost_assistant|>Cross-Site Scripting is a vulnerability where ...<|ghost_end|>
```

Use the helper to build an inference-ready prompt:

```python
from ghostlm.tokenizer import GhostTokenizer

tok = GhostTokenizer()
prompt_ids = tok.format_chat_prompt([
    {"role": "user", "content": "What is XSS?"},
])
# prompt_ids ends in <|ghost_assistant|>, ready for generation
```

### MCP server for Claude Code / Claude Desktop

GhostLM ships an MCP server that exposes three tools over stdio:
`ghostlm_query`, `ghostlm_explain_cve`, and `ghostlm_map_to_attack`.
Install it with:

```bash
claude mcp add ghostlm -- python3 /absolute/path/to/GhostLM/scripts/mcp_server.py \
  --checkpoint /absolute/path/to/checkpoints/phase5_chat_v3/best_model.pt
```

Requires Python ≥ 3.10 and `pip install mcp torch tiktoken`.

## Training data

| Source | Records | License | Notes |
|---|---:|---|---|
| NVD CVE descriptions | 64,559 | Public domain (NIST) | sampled to ~3,500 chat Q&A + 1,000 MCQ pairs |
| Exploit-DB | 4,711 | GPLv2-compatible | sampled to 2,000 chat pairs |
| Synthetic CTF writeups | 2,847 | Custom (generated locally) | turned into 2,847 chat pairs |
| arXiv cs.CR | 1,890 | arXiv non-exclusive (per paper) | abstracts only in the v0.4 corpus |
| MITRE ATT&CK | 655 | Apache 2.0 | all 655 chat pairs + 655 MCQs |
| CAPEC | 563 | Public domain (CISA / MITRE) | all 563 chat pairs |
| CTFtime writeups | 451 | per-author (mostly CC) | all 451 chat pairs |
| Hand-written small_talk | 153 | MIT | greetings / identity / OOD refusals |

Total chat training set: ~17,000 records after 30× small_talk oversampling
and 2× MCQ oversampling. The `data/raw/` source files and the
`data/processed/train.jsonl` pretrain corpus are reproducible from the
collector scripts in the GitHub repo.

## Limitations

- **No general world knowledge.** Outside cybersecurity the model is
  wrong, repetitive, or both. It will refuse politely on most OOD
  topics ("what's the weather", "tell me a joke"), but accuracy on
  general questions is essentially zero.
- **Specific facts are unreliable.** Exact CVE numbers, CVSS scores, dates,
  and technique IDs are memorized incompletely; the model often
  confabulates plausible-looking but wrong specifics. Always verify
  against [NVD](https://nvd.nist.gov), [MITRE ATT&CK](https://attack.mitre.org),
  or original vendor advisories.
- **Short coherence window.** 1024-token context, no RoPE; long
  multi-turn conversations drift. The chat REPL trims old turns when
  the running prompt overflows.
- **CTIBench 36.9% is well above random but well below larger models.**
  This is expected at 45M parameters and 12.56M training tokens.
- **Prone to repetition without a penalty.** Use `--repetition-penalty 1.25`
  in `chat.py` (the default) to avoid "Wifi Wifi Wifi…" loops.

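A repetition penalty of the kind the REPL flag applies can be sketched like this. This shows the common CTRL-style rule; the repo's exact sampling code may differ:

```python
def apply_repetition_penalty(logits: dict[int, float],
                             generated: list[int],
                             penalty: float = 1.25) -> dict[int, float]:
    """Discourage tokens that already appeared in the output.
    CTRL-style rule: positive logits are divided by the penalty and
    negative logits multiplied, so both move toward 'less likely'."""
    out = dict(logits)
    for tok in set(generated):
        if tok in out:
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

penalized = apply_repetition_penalty({7: 2.0, 9: -1.0, 4: 0.5}, [7, 9])
# token 7: 2.0 -> 1.6, token 9: -1.0 -> -1.25, token 4 untouched
```

With `penalty > 1`, a token that has dominated recent output loses probability mass each step, which is what breaks the "Wifi Wifi Wifi…" loop.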
## Intended use

- Hands-on learning: explore how a small specialized LM behaves on a
  narrow domain.
- Local cybersecurity Q&A as a complement to a larger general model
  via the MCP server.
- Research baseline for cyber-LLM evaluation work: a small, fully
  reproducible from-scratch model with published benchmark numbers.

**Out of scope:** production security advice, vulnerability triage, and
incident response. The model is a research artifact; never act on its
output without verifying against authoritative sources.

## Citation

```bibtex
@misc{ghostlm-2026,
  author = {Munene, Joe},
  title  = {GhostLM: A small cybersecurity language model trained from scratch},
  year   = {2026},
  url    = {https://github.com/joemunene-by/GhostLM},
}
```
|