cybersec-mpnet-finetuned

A cybersecurity-domain sentence embedding model fine-tuned from sentence-transformers/all-mpnet-base-v2 on 2,179 CTI-Bench training pairs from two sources:

  • CTI-ATE: real threat intelligence report excerpts mapped to MITRE ATT&CK technique descriptions (from the STIX bundle)
  • CTI-MCQ: 2,500 cybersecurity multiple-choice questions covering ATT&CK, CWE, and CAPEC; 500 were held out as the evaluation set before training (to prevent data leakage) and the remaining 2,000 were used for training
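
The CTI-MCQ hold-out step can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name `split_holdout`, the seed, and the placeholder records are assumptions, and only the 2,500 / 500 / 2,000 counts come from the description above.

```python
import random

def split_holdout(records, n_eval, seed=42):
    """Shuffle a copy of `records` and carve off `n_eval` items for evaluation.

    The split happens before any training pairs are built, so evaluation
    questions cannot leak into the training set.
    """
    if n_eval > len(records):
        raise ValueError("n_eval exceeds the number of records")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)

# Placeholder records standing in for the 2,500 MCQ questions (hypothetical data).
questions = [f"question-{i}" for i in range(2500)]
train_set, eval_set = split_holdout(questions, n_eval=500)
print(len(train_set), len(eval_set))
```

Fixing the seed keeps the evaluation set stable across reruns, which matters if results from different training runs are compared.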

Built as part of an Agentic Cybersecurity Threat Intelligence & Incident Response Advisor capstone project, a LangGraph-based pipeline that ingests raw security logs, enriches IoCs via threat feeds (AbuseIPDB, VirusTotal, ThreatFox), maps to MITRE ATT&CK, and generates grounded incident response playbooks.

Requirements

sentence-transformers>=5.5.1

This model was trained with sentence-transformers 5.5.1. Older versions will fail to load due to internal module path changes introduced in 5.5.x. If you see a ModuleNotFoundError: No module named 'sentence_transformers.base' error, upgrade:

pip install "sentence-transformers>=5.5.1"
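
If it is unclear which version is installed, a check along these lines can run before loading the model. It is a sketch using only the standard library: the helper names `parse_version` and `check_sentence_transformers` are illustrative, and the simplified parser treats pre-release suffixes (for example `rc1`) as equal to the final release.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (5, 5, 1)  # minimum stated in the Requirements section

def parse_version(text):
    """Turn '5.5.1' into (5, 5, 1); non-digit suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)

def check_sentence_transformers():
    """Return (ok, message) describing whether the installed version is new enough."""
    try:
        installed = version("sentence-transformers")
    except PackageNotFoundError:
        return False, "sentence-transformers is not installed"
    ok = parse_version(installed) >= MIN_VERSION
    return ok, f"sentence-transformers {installed} ({'ok' if ok else 'too old'})"

print(check_sentence_transformers()[1])
```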

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sindsub/cybersec-mpnet-finetuned")

# Encode cybersecurity text
embeddings = model.encode([
    "T1059.001 - PowerShell command execution via encoded payload",
    "Adversary used scheduled task for persistence after initial access",
    "CVE-2024-1234: Remote code execution in Apache HTTP Server",
])

# Semantic similarity
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity)  # Higher for semantically related CTI text

Intended Use

  • RAG retrieval over MITRE ATT&CK, CISA KEV, NVD CVE, and NIST SP 800-61 knowledge bases
  • Semantic search across threat intelligence documents
  • RAGAS-style faithfulness evaluation: cosine similarity between LLM-generated playbook steps and their cited source passages
  • IoC and technique clustering in threat correlation pipelines

Training Details

| Parameter | Value |
| --- | --- |
| Base model | sentence-transformers/all-mpnet-base-v2 |
| Training set | CTI-Bench (2,179 QA pairs) |
| Domains | MITRE ATT&CK, CVE/CWE, CAPEC, threat intelligence |
| Loss function | MultipleNegativesRankingLoss (MNRL) |
| Epochs | 4 |
| Batch size | 16 |
| Training loss | 0.41 → 0.08 |
| Hardware | CPU-only |
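
To make the loss function concrete, the sketch below reimplements the idea behind MultipleNegativesRankingLoss in plain Python: each anchor is scored against every positive in the batch, and the other pairs' positives serve as negatives. It is an illustration rather than the library's code. The `scale=20.0` value and the use of cosine similarity are assumptions about the defaults, and the 2-D vectors are toy data. With a batch size of 16 and no explicit hard negatives, each anchor would see 15 in-batch negatives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D "embeddings": aligned pairs give a loss near zero,
# swapped pairs give a large loss.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
print(mnrl_loss(anchors, aligned), mnrl_loss(anchors, swapped))
```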

Evaluation Results

Self-retrieval proxy evaluation across four ChromaDB collections (25 chunks sampled per collection, query = chunk's own text):

| Collection | Hit@1 | Hit@3 | Hit@5 |
| --- | --- | --- | --- |
| MITRE ATT&CK techniques | 100% | 100% | 100% |
| NVD CVE records | 0% | 100% | 100% |
| CISA KEV entries | 0% | 100% | 100% |
| NIST SP 800-61 procedures | 0% | 0% | 100% |
| Overall | 25% | 75% | 100% |

Hit@5 = 1.000, which exceeds the ≥ 0.90 target. Runtime: 100 queries, CPU-only.
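
For reference, Hit@k and the "Overall" row can be reproduced with a few lines of Python. The helper names are illustrative and the per-collection figures are copied from the table above. Because every collection contributes the same number of queries (25), the overall score is the plain mean of the per-collection scores.

```python
def hit_at_k(ranked_ids, target_id, k):
    """1 if the target appears among the first k retrieved ids, else 0."""
    return int(target_id in ranked_ids[:k])

def mean_hit_at_k(results, k):
    """Average Hit@k over (ranked_ids, target_id) pairs."""
    return sum(hit_at_k(r, t, k) for r, t in results) / len(results)

# Per-collection Hit@k values from the evaluation table, as fractions.
per_collection = {
    "MITRE ATT&CK techniques":   {1: 1.0, 3: 1.0, 5: 1.0},
    "NVD CVE records":           {1: 0.0, 3: 1.0, 5: 1.0},
    "CISA KEV entries":          {1: 0.0, 3: 1.0, 5: 1.0},
    "NIST SP 800-61 procedures": {1: 0.0, 3: 0.0, 5: 1.0},
}

# Equal sample sizes per collection, so a simple mean gives the overall row.
overall = {
    k: sum(c[k] for c in per_collection.values()) / len(per_collection)
    for k in (1, 3, 5)
}
print(overall)  # {1: 0.25, 3: 0.75, 5: 1.0}
```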

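RAGAS-style faithfulness (cosine similarity between generated playbook step embeddings and cited source embeddings): 0.548, against a Jaccard score of 0.143 reported for the base model on the same data. The two figures come from different metrics (embedding cosine similarity versus token-set overlap), so the gap is indicative rather than a like-for-like comparison.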
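
The two metrics named above are computed quite differently, as the sketch below shows. Jaccard works on token sets, while cosine similarity works on embedding vectors. The example sentences and the whitespace tokenisation are assumptions for illustration; they are not taken from the evaluation data.

```python
import math

def jaccard(text_a, text_b):
    """Token-set overlap |A ∩ B| / |A ∪ B| on lowercased whitespace tokens."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Hypothetical playbook step and cited passage: a paraphrase with few shared words.
step = "Isolate the affected host from the network"
source = "Disconnect compromised systems from the network"
print(jaccard(step, source))

# In practice u and v would come from model.encode(step) and model.encode(source).
```

A paraphrased step can share few tokens with its source and still sit close to it in embedding space, which is why the two scores are not on the same scale.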

Model Card

  • Developed by: Sindhu S
  • Model type: Sentence Transformer (bi-encoder)
  • Language: English
  • License: Apache 2.0
  • Parameters: 109M (inherited from all-mpnet-base-v2)
  • Size: ~418 MB (safetensors)

Limitations

  • Fine-tuned on CTI-Bench, which is primarily MITRE ATT&CK / CVE focused, so it may not generalise as well to other security domains (e.g., physical security, fraud)
  • Training was CPU-only with a small batch size; GPU training with larger batches and more epochs may improve quality further
  • Hit@1 is low (25% overall, 0% on three of the four collections) where collections contain dense, similar entries; a reranker is recommended for precision-critical applications
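
The reranker recommendation amounts to a two-stage pipeline: take the bi-encoder's top-5 candidates (where Hit@5 is 100%) and reorder them with a more precise scorer. The sketch below shows the shape of that second stage only. `rerank` and `token_overlap` are hypothetical helpers, the candidate strings are placeholders, and a real deployment would likely replace the toy scorer with a cross-encoder.

```python
def rerank(query, candidates, score_fn, top_n=1):
    """Re-order first-stage candidates by a second-stage relevance score."""
    scored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:top_n]

def token_overlap(query, doc):
    """Toy stand-in scorer: fraction of query tokens that appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

# Placeholder candidates standing in for the bi-encoder's top hits.
query = "remote code execution apache http server"
first_stage_hits = [
    "Denial of service in Apache HTTP Server",
    "Remote code execution in Nginx",
    "Remote code execution in Apache HTTP Server",
]
print(rerank(query, first_stage_hits, token_overlap, top_n=1))
```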