cybersec-mpnet-finetuned

A cybersecurity-domain sentence embedding model fine-tuned from sentence-transformers/all-mpnet-base-v2 on 2,179 CTI-Bench training pairs from two sources:

CTI-ATE — real threat intelligence report excerpts mapped to MITRE ATT&CK technique descriptions (from the STIX bundle)
CTI-MCQ: 2,500 multiple-choice cybersecurity questions covering ATT&CK, CWE, and CAPEC; 500 were held out as the evaluation set before training (to prevent data leakage) and the remaining 2,000 were used for training
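
The hold-out step for CTI-MCQ can be sketched as follows. This is a minimal illustration rather than the project's actual script: the function name `split_holdout`, the seed, and the placeholder question IDs are all assumptions.

```python
import random

def split_holdout(items, n_eval, seed=42):
    """Shuffle a copy of `items` and split off `n_eval` evaluation items.

    The split is made before any training so that evaluation questions
    never appear in the training set. `seed` is an arbitrary choice.
    """
    if not 0 <= n_eval <= len(items):
        raise ValueError("n_eval must be between 0 and len(items)")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)

# Placeholder IDs standing in for the 2,500 CTI-MCQ questions.
questions = [f"mcq-{i}" for i in range(2500)]
train_set, eval_set = split_holdout(questions, n_eval=500)

# No question may appear on both sides of the split.
assert not set(train_set) & set(eval_set)
print(len(train_set), len(eval_set))  # 2000 500
```

Fixing the seed makes the split reproducible, so the same 500 questions stay out of training across runs.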

Built as part of an Agentic Cybersecurity Threat Intelligence & Incident Response Advisor capstone project — a LangGraph-based pipeline that ingests raw security logs, enriches IoCs via threat feeds (AbuseIPDB, VirusTotal, ThreatFox), maps to MITRE ATT&CK, and generates grounded incident response playbooks.

Requirements

sentence-transformers>=5.5.1

This model was trained with sentence-transformers 5.5.1. Older versions will fail to load due to internal module path changes introduced in 5.5.x. If you see a ModuleNotFoundError: No module named 'sentence_transformers.base' error, upgrade:

pip install "sentence-transformers>=5.5.1"

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sindsub/cybersec-mpnet-finetuned")

# Encode cybersecurity text
embeddings = model.encode([
    "T1059.001 - PowerShell command execution via encoded payload",
    "Adversary used scheduled task for persistence after initial access",
    "CVE-2024-1234: Remote code execution in Apache HTTP Server",
])

# Semantic similarity
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity)  # Higher for semantically related CTI text

Intended Use

RAG retrieval over MITRE ATT&CK, CISA KEV, NVD CVE, and NIST SP 800-61 knowledge bases
Semantic search across threat intelligence documents
RAGAS-style faithfulness evaluation — cosine similarity between LLM-generated playbook steps and their cited source passages
IoC and technique clustering in threat correlation pipelines

Training Details

Parameter	Value
Base model	`sentence-transformers/all-mpnet-base-v2`
Training set	CTI-Bench — 2,179 QA pairs
Domains	MITRE ATT&CK, CVE/CWE, CAPEC, threat intelligence
Loss function	MultipleNegativesRankingLoss (MNRL)
Epochs	4
Batch size	16
Training loss	0.41 → 0.08
Hardware	CPU-only
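
For readers unfamiliar with MultipleNegativesRankingLoss, the sketch below computes the same kind of objective in plain Python on toy vectors: each anchor's paired positive is the correct answer, and every other positive in the batch acts as a negative. The scale of 20 and the use of cosine similarity are assumed to match the sentence-transformers defaults. The 2-D vectors and function names are illustrative only, and this is not the training code used for this model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over the batch, where row i's correct column is i."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Scaled similarity of this anchor to every positive in the batch.
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # positive i is close to anchor i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positive i is close to the other anchor

print(mnrl_loss(anchors, aligned) < mnrl_loss(anchors, swapped))  # True
```

Because the negatives come from the other pairs in the batch, the batch size of 16 means each anchor is contrasted against 15 in-batch negatives per step.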

Evaluation Results

Self-retrieval proxy evaluation across four ChromaDB collections (25 chunks sampled per collection, query = chunk's own text):

Collection	Hit@1	Hit@3	Hit@5
MITRE ATT&CK techniques	100%	100%	100%
NVD CVE records	0%	100%	100%
CISA KEV entries	0%	100%	100%
NIST SP 800-61 procedures	0%	0%	100%
Overall	25%	75%	100%

Overall Hit@5 is 1.000, which exceeds the ≥ 0.90 target. The evaluation ran 100 queries (25 per collection) on CPU only.
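
The Hit@k figures in the table can be read as "the fraction of queries whose source chunk appears in the top k results". A minimal sketch of that calculation is shown here. The chunk IDs and helper names are made up for illustration, and the actual evaluation script may differ.

```python
def hit_at_k(ranked_ids, expected_id, k):
    """Return 1 if expected_id is among the first k retrieved IDs, else 0."""
    return int(expected_id in ranked_ids[:k])

def hit_rate(results, k):
    """Average Hit@k over (ranked_ids, expected_id) pairs."""
    return sum(hit_at_k(ranked, expected, k) for ranked, expected in results) / len(results)

# Toy retrieval results with made-up chunk IDs.
results = [
    (["a", "b", "c"], "a"),  # expected chunk at rank 1
    (["b", "a", "c"], "a"),  # expected chunk at rank 2
    (["b", "c", "a"], "a"),  # expected chunk at rank 3
    (["b", "c", "d"], "a"),  # expected chunk not retrieved
]
print(hit_rate(results, 1))  # 0.25
print(hit_rate(results, 3))  # 0.75

# The "Overall" Hit@1 row matches the unweighted mean of the four
# per-collection Hit@1 values reported in the table.
hit1_per_collection = [1.0, 0.0, 0.0, 0.0]
overall_hit1 = sum(hit1_per_collection) / len(hit1_per_collection)
print(overall_hit1)  # 0.25
```

With 25 queries in every collection, the mean over collections equals the mean over all 100 queries.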

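RAGAS-style faithfulness (cosine similarity between generated playbook step embeddings and cited source embeddings): 0.548, up from the 0.143 Jaccard score recorded for the base model on the same data. The two figures use different metrics (embedding cosine versus token-set Jaccard), so the comparison is indicative rather than like-for-like.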
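
To make the two measures concrete, the sketch below computes an embedding-based faithfulness score and a token-based Jaccard score on toy inputs. Several details are assumptions, since the source does not specify them: per-pair cosines are averaged with a plain mean, Jaccard uses lowercased whitespace tokens, and the 2-D vectors stand in for real `model.encode` output.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def faithfulness(embedding_pairs):
    """Mean cosine similarity over (step_embedding, source_embedding) pairs."""
    scores = [cosine_similarity(step, src) for step, src in embedding_pairs]
    return sum(scores) / len(scores)

def jaccard_similarity(text_a, text_b):
    """Token-set overlap: |A & B| / |A | B| on lowercased whitespace tokens."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Toy embeddings: one perfectly aligned pair, one orthogonal pair.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
print(faithfulness(pairs))  # 0.5

step = "Isolate the affected host from the network"
source = "Containment: disconnect the compromised system from the network"
print(jaccard_similarity(step, source))  # 0.3
```

Jaccard only rewards shared surface tokens, whereas cosine over embeddings can score paraphrases such as "isolate" and "disconnect" as similar.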

Model Card

Developed by: Sindhu S
Model type: Sentence Transformer (bi-encoder)
Language: English
License: Apache 2.0
Parameters: 109M (inherited from all-mpnet-base-v2)
Size: ~418 MB (safetensors)

Limitations

Fine-tuned on CTI-Bench which is primarily MITRE ATT&CK / CVE focused — may not generalise as well to other security domains (e.g., physical security, fraud)
Training was CPU-only with a small batch size; GPU training with larger batches and more epochs may improve quality further
Hit@1 is low (25% overall, and 0% on the NVD CVE, CISA KEV, and NIST SP 800-61 collections), which is attributed to dense, similar entries; a reranker is recommended for precision-critical applications
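
A retrieve-then-rerank step of the kind recommended above could look roughly like this. The sketch is hedged: `rerank` and `token_overlap` are hypothetical helpers, the passages are invented, and the token-overlap scorer is only a stand-in for a real second-stage model.

```python
def rerank(query, candidates, score_fn, top_n=1):
    """Re-order first-stage candidates by a second-stage relevance score.

    candidates: passages returned by the embedding search (e.g. the top 5).
    score_fn:   callable (query, passage) -> float; higher means more relevant.
    """
    scored = sorted(candidates, key=lambda p: score_fn(query, p), reverse=True)
    return scored[:top_n]

def token_overlap(query, passage):
    """Stand-in scorer for illustration only: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

query = "remote code execution in Apache HTTP Server"
candidates = [
    "Privilege escalation in Linux kernel",
    "Remote code execution in Apache Tomcat",
    "Remote code execution in Apache HTTP Server mod_proxy",
]
best = rerank(query, candidates, token_overlap)
print(best)  # ['Remote code execution in Apache HTTP Server mod_proxy']
```

In practice `score_fn` would more likely wrap a cross-encoder, for example one loaded with `CrossEncoder` from sentence-transformers and scored via its `predict` method, applied only to the handful of candidates that the embedding search returns.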
