---
license: mit
language:
- en
tags:
- cybersecurity
- security
- cti
- mitre-attack
- cve
- chat
- from-scratch
- small-lm
library_name: pytorch
pipeline_tag: text-generation
model-index:
- name: GhostLM ghost-small chat-v3
  results:
  - task:
      type: text-classification
      name: Multiple-choice cyber-LLM benchmark
    dataset:
      type: AI4Sec/cti-bench
      name: CTIBench MCQ
      config: cti-mcq
      split: test
    metrics:
    - type: accuracy
      value: 0.369
      name: accuracy (chat-v3)
    - type: accuracy
      value: 0.190
      name: accuracy (chat-v2)
    - type: accuracy
      value: 0.178
      name: accuracy (pretrain only, no chat)
---

# GhostLM

A small cybersecurity language model trained from scratch. Not a fine-tune of an existing base: every parameter was learned on a curated security corpus. Currently shipping the v0.5.0 chat-tuned variant of `ghost-small`.

- **Repository:** [github.com/joemunene-by/GhostLM](https://github.com/joemunene-by/GhostLM)
- **Live demo:** [Ghostgim/ghostlm Space](https://huggingface.co/spaces/Ghostgim/ghostlm)
- **License:** MIT

## What this model is

`ghost-small` chat-v3 is a 45M-parameter decoder-only transformer trained from random initialization on **12.56M tokens of cybersecurity text**: NVD CVE descriptions, MITRE ATT&CK techniques, MITRE CAPEC patterns, Exploit-DB entries, CTFtime writeups, arXiv cs.CR security research, plus a small synthetic CTF-writeup augmentation.

After 30,000 steps of pretraining (`Phase 4`), it was supervised fine-tuned for chat on a mix of free-form Q&A, hand-written small-talk / identity / OOD-refusal pairs, and templated MCQ examples (NVD CWE-class, MITRE tactic, common acronyms). The chat tune adds three new role tokens (`<|ghost_user|>`, `<|ghost_assistant|>`, `<|ghost_end|>`) appended after the base GPT-2 BPE vocabulary (50,261 → 50,264).

## Why a 45M from-scratch model

A 45M model is too small to be a general-purpose assistant.
The thesis is specialization: a focused security corpus plus targeted SFT can match or beat much larger general models on narrow security tasks at a fraction of the size, while running on a laptop CPU. The CTIBench results below are the test of that thesis.

## Architecture

| | |
|---|---|
| Type | Decoder-only Transformer (GPT-2 family) |
| Layers | 6 |
| Hidden dim (`d_model`) | 512 |
| Heads | 8 (head dim 64) |
| FFN dim (`d_ff`) | 2048, GELU |
| Norm | LayerNorm, pre-norm |
| Positional encoding | Learned absolute |
| Vocabulary | 50,264 (GPT-2 BPE + 7 specials, incl. 3 chat-role tokens) |
| Context length | 1024 |
| Total params | ~45.2M |
| Tied input/output embeddings | yes |

The v0.5 architecture upgrade (RoPE / SwiGLU / RMSNorm) is wired into the codebase but disabled by default for backward compatibility with this checkpoint. A `ghost-small-v0.5` preset flips them on for the next pretrain run, gated on the v0.4.2 corpus expansion (~50M tokens).

## Evaluation

Scored on **[CTIBench](https://huggingface.co/datasets/AI4Sec/cti-bench) MCQ** (2,500 multiple-choice cyber threat-intelligence questions, scored by the log-probability of A/B/C/D as the next token after `Answer:`):

| Checkpoint | n | Accuracy | Notes |
|---|---:|---:|---|
| ghost-small Phase 4 (pretrain only) | 2500 | **17.8%** | below random: completion model, doesn't follow MCQ format |
| ghost-small chat-v2 (free-form chat tune) | 2500 | **19.0%** | identity + OOD refusal works, no MCQ-format signal |
| ghost-small chat-v2 + RAG (top-4) | 2500 | **19.0%** | retrieval is neutral without RAFT-style training |
| **ghost-small chat-v3 (MCQ-tuned)** | 2500 | **36.9%** | **canonical**: 1.48× random |

`chat-v3` adds 1,802 templated MCQ examples (NVD CWE-class, MITRE tactic, acronym definition) to the chat training mix. The assistant turn is the bare letter A/B/C/D, with a 30% subset followed by a one-line justification.
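The log-probability scoring rule described above can be sketched in isolation. Nothing here calls the real checkpoint: the logits dict stands in for one forward pass over a prompt ending in `Answer:`, and `pick_mcq_answer` is a hypothetical helper name, not part of the GhostLM API.

```python
import math


def pick_mcq_answer(next_token_logits: dict) -> str:
    """Return the candidate letter with the highest log-probability
    as the next token. Softmax is monotonic, so argmax over raw logits
    equals argmax over log-probs; the normalization is shown only to
    make the 'log-probability' framing explicit."""
    z = max(next_token_logits.values())
    log_norm = z + math.log(
        sum(math.exp(v - z) for v in next_token_logits.values())
    )
    log_probs = {letter: v - log_norm for letter, v in next_token_logits.items()}
    return max(log_probs, key=log_probs.get)


# With a real model, the dict would hold the logits of the tokens
# "A"/"B"/"C"/"D" at the position right after "Answer:".
print(pick_mcq_answer({"A": 1.2, "B": 3.4, "C": 0.1, "D": -2.0}))  # → B
```

Because only the argmax over four tokens matters, no sampling is involved: the evaluation is deterministic for a fixed checkpoint.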
This bare-letter format teaches the model to output a single letter after `Answer:` rather than continuing into prose — the dominant failure mode of small models on MCQ format.

Honest comparison: 36.9% is well above random (25%) but well below the 85-95% that frontier models score on the same benchmark. The model was trained on 12.56M tokens of pure cybersecurity text, about 1.4% of the Chinchilla-optimal data budget for 45M parameters. The next benchmark bump is expected to come from corpus expansion (`v0.4.2`) and the v0.5 architecture upgrade.

## Usage

### Direct use (no HF transformers integration)

GhostLM has a custom architecture — it does **not** use the HuggingFace `transformers` library and is **not** auto-loadable via `AutoModelForCausalLM`. You need the GhostLM repo itself.

```bash
git clone https://github.com/joemunene-by/GhostLM
cd GhostLM
pip install -r requirements.txt
```

Then download the weights from this Hub repo into `checkpoints/phase5_chat_v3/`:

```python
from pathlib import Path

from huggingface_hub import hf_hub_download

dest = Path("checkpoints/phase5_chat_v3")
dest.mkdir(parents=True, exist_ok=True)

hf_hub_download(
    repo_id="Ghostgim/GhostLM",
    filename="pytorch_model.pt",
    local_dir=str(dest),
)

# Rename to match what the loader expects
(dest / "pytorch_model.pt").rename(dest / "best_model.pt")
```

### Chat REPL

```bash
PYTHONPATH=. python3 scripts/chat.py \
  --checkpoint checkpoints/phase5_chat_v3/best_model.pt \
  --temperature 0.7 --top-k 40 --top-p 0.95 \
  --repetition-penalty 1.25
```

The chat format uses three special tokens:

```
<|ghost_user|>What is XSS?<|ghost_end|>
<|ghost_assistant|>Cross-Site Scripting is a vulnerability where ...<|ghost_end|>
```

Use the helper to build an inference-ready prompt:

```python
from ghostlm.tokenizer import GhostTokenizer

tok = GhostTokenizer()
prompt_ids = tok.format_chat_prompt([
    {"role": "user", "content": "What is XSS?"},
])
# prompt_ids ends in <|ghost_assistant|>, ready for generation
```

### MCP server for Claude Code / Claude Desktop

GhostLM ships an MCP server that exposes three tools — `ghostlm_query`, `ghostlm_explain_cve`, `ghostlm_map_to_attack` — over stdio. Install with:

```bash
claude mcp add ghostlm -- python3 /absolute/path/to/GhostLM/scripts/mcp_server.py \
  --checkpoint /absolute/path/to/checkpoints/phase5_chat_v3/best_model.pt
```

Requires Python ≥ 3.10 and `pip install mcp torch tiktoken`.

## Training data

| Source | Records | License | Notes |
|---|---:|---|---|
| NVD CVE descriptions | 64,559 | Public domain (NIST) | sampled to ~3,500 chat Q&A + 1,000 MCQ pairs |
| Exploit-DB | 4,711 | GPLv2-compatible | sampled to 2,000 chat pairs |
| Synthetic CTF writeups | 2,847 | Custom, generated locally | turned into 2,847 chat pairs |
| arXiv cs.CR | 1,890 | arXiv non-exclusive (per paper) | abstracts only in v0.4 corpus |
| MITRE ATT&CK | 655 | Apache 2.0 | all 655 chat pairs + 655 MCQs |
| CAPEC | 563 | Public domain (CISA / MITRE) | all 563 chat pairs |
| CTFtime writeups | 451 | per-author (mostly CC) | all 451 chat pairs |
| Hand-written small_talk | 153 | MIT | greetings / identity / OOD refusals |

Total chat training set: ~17,000 records after 30× small_talk oversampling and 2× MCQ oversampling. The `data/raw/` source files and the `data/processed/train.jsonl` pretrain corpus are reproducible from the collector scripts in the GitHub repo.
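The decoding flags used by the chat REPL above (`--temperature`, `--top-k`, `--top-p`, `--repetition-penalty`) compose in a standard order: penalize repeated tokens, scale by temperature, softmax, then truncate to top-k and top-p before sampling. The sketch below illustrates that standard pipeline with plain list indices as token ids; it is not a copy of the loop in `scripts/chat.py`, and `sample_next` is a hypothetical name.

```python
import math
import random


def sample_next(logits, prev_tokens, temperature=0.7, top_k=40, top_p=0.95,
                repetition_penalty=1.25, rng=None):
    """One decoding step. `logits` is a list of raw scores indexed by
    token id; `prev_tokens` are ids generated so far."""
    rng = rng or random.Random(0)
    scores = list(logits)
    # Repetition penalty (CTRL-style): shrink positive scores and
    # amplify negative ones for tokens already generated.
    for t in set(prev_tokens):
        scores[t] = (scores[t] / repetition_penalty if scores[t] > 0
                     else scores[t] * repetition_penalty)
    # Temperature scaling, then a numerically stable softmax.
    scores = [s / temperature for s in scores]
    z = max(scores)
    probs = [math.exp(s - z) for s in scores]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Top-k: keep only the k most probable token ids.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    # Top-p: cut the sorted list once cumulative mass exceeds top_p.
    kept, cum = [], 0.0
    for t in order:
        kept.append(t)
        cum += probs[t]
        if cum >= top_p:
            break
    # Sample from the renormalized surviving candidates.
    mass = sum(probs[t] for t in kept)
    r, acc = rng.random() * mass, 0.0
    for t in kept:
        acc += probs[t]
        if acc >= r:
            return t
    return kept[-1]
```

In a real loop this step repeats until the model emits `<|ghost_end|>` or the 1024-token context fills; the repetition penalty is what suppresses the "Wifi Wifi Wifi…" failure mode noted in the limitations below.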
## Limitations

- **No general world knowledge.** Outside cybersecurity the model is wrong, repetitive, or both. It will refuse politely on most OOD topics ("what's the weather", "tell me a joke"), but accuracy on general questions is essentially zero.
- **Specific facts unreliable.** Exact CVE numbers, CVSS scores, dates, and technique IDs are memorized incompletely; the model often confabulates plausible-looking but wrong specifics. Always verify against [NVD](https://nvd.nist.gov), [MITRE ATT&CK](https://attack.mitre.org), or original vendor advisories.
- **Short coherence window.** 1024-token context, no RoPE, so long multi-turn conversations drift. The chat REPL trims old turns when the running prompt overflows.
- **CTIBench 36.9% is well above random but well below larger models.** This is expected at 45M parameters and 12.56M training tokens.
- **Repetition-prone without penalty.** Use `--repetition-penalty 1.25` in `chat.py` (the default) to avoid "Wifi Wifi Wifi…" loops.

## Intended use

- Hands-on learning: explore how a small specialized LM behaves on a narrow domain.
- Local cybersecurity Q&A as a complement to a larger general model via the MCP server.
- Research baseline for cyber-LLM evaluation work: a small, fully reproducible from-scratch model with published benchmark numbers.

**Out of scope:** production security advice, vulnerability triage, incident response. The model is a research artifact; never act on its output without verifying against authoritative sources.

## Citation

```bibtex
@misc{ghostlm-2026,
  author = {Munene, Joe},
  title  = {GhostLM: A small cybersecurity language model trained from scratch},
  year   = {2026},
  url    = {https://github.com/joemunene-by/GhostLM},
}
```