cybersec-mpnet-finetuned
A cybersecurity-domain sentence embedding model fine-tuned from sentence-transformers/all-mpnet-base-v2 on 2,179 CTI-Bench training pairs from two sources:
- CTI-ATE: real threat intelligence report excerpts mapped to MITRE ATT&CK technique descriptions (from the STIX bundle)
- CTI-MCQ: 2,500 cybersecurity multiple-choice questions covering ATT&CK, CWE, and CAPEC; 500 were held out as the evaluation set before training (to prevent data leakage), and the remaining 2,000 were used for training
Built as part of an Agentic Cybersecurity Threat Intelligence & Incident Response Advisor capstone project: a LangGraph-based pipeline that ingests raw security logs, enriches IoCs via threat feeds (AbuseIPDB, VirusTotal, ThreatFox), maps to MITRE ATT&CK, and generates grounded incident response playbooks.
Requirements
sentence-transformers>=5.5.1
This model was trained with sentence-transformers 5.5.1. Older versions will fail to load due to internal module path changes introduced in 5.5.x. If you see a ModuleNotFoundError: No module named 'sentence_transformers.base' error, upgrade:
pip install "sentence-transformers>=5.5.1"
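To confirm the installed version before loading the model, a check along these lines may help. It is a minimal sketch using only the standard library; the helper names `parse_version` and `meets_minimum` are illustrative (not part of sentence-transformers), and the simple parser only understands plain numeric versions such as 5.5.1.

```python
from importlib.metadata import PackageNotFoundError, version

MINIMUM = "5.5.1"  # minimum version stated in this model card


def parse_version(text: str) -> tuple:
    """Turn '5.5.1' into (5, 5, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MINIMUM) -> bool:
    """Compare versions component by component."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = version("sentence-transformers")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("sentence-transformers is not installed")
elif not meets_minimum(installed):
    print(f"Found {installed}; upgrade to >= {MINIMUM}")
else:
    print(f"Found {installed}; OK")
```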
Usage
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sindsub/cybersec-mpnet-finetuned")
# Encode cybersecurity text
embeddings = model.encode([
    "T1059.001 - PowerShell command execution via encoded payload",
    "Adversary used scheduled task for persistence after initial access",
    "CVE-2024-1234: Remote code execution in Apache HTTP Server",
])
# Semantic similarity
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity) # Higher for semantically related CTI text
Intended Use
- RAG retrieval over MITRE ATT&CK, CISA KEV, NVD CVE, and NIST SP 800-61 knowledge bases
- Semantic search across threat intelligence documents
- RAGAS-style faithfulness evaluation: cosine similarity between LLM-generated playbook steps and their cited source passages
- IoC and technique clustering in threat correlation pipelines
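The retrieval and semantic-search uses listed above come down to ranking passages by cosine similarity against a query embedding. The sketch below illustrates that ranking step in plain Python. The three-dimensional vectors are toy stand-ins for `model.encode(...)` output (real embeddings from this model family have 768 dimensions), and `top_k` is an illustrative helper, not part of any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus, k=3):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy vectors standing in for model.encode(...) output (assumption).
corpus = {
    "T1059.001 PowerShell": [0.9, 0.1, 0.0],
    "T1053.005 Scheduled Task": [0.1, 0.9, 0.1],
    "NIST SP 800-61 containment": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stand-in for an encoded PowerShell-related query

for doc_id, score in top_k(query, corpus, k=2):
    print(f"{score:.3f}  {doc_id}")
```

In a real pipeline the vector store (ChromaDB in the evaluation described in this card) would perform this ranking; the sketch only shows what is being computed.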
Training Details
| Parameter | Value |
|---|---|
| Base model | sentence-transformers/all-mpnet-base-v2 |
| Training set | CTI-Bench, 2,179 QA pairs |
| Domains | MITRE ATT&CK, CVE/CWE, CAPEC, threat intelligence |
| Loss function | MultipleNegativesRankingLoss (MNRL) |
| Epochs | 4 |
| Batch size | 16 |
| Training loss | 0.41 → 0.08 |
| Hardware | CPU-only |
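MultipleNegativesRankingLoss treats every other positive in a batch as a negative for a given anchor: it is a cross-entropy over scaled cosine similarities in which the matching pair sits on the diagonal. With a batch size of 16, each anchor is therefore contrasted against 15 in-batch negatives. The following is a plain-Python illustration of that idea on toy vectors, not the library's implementation or the training script used for this model; the scale of 20 mirrors the sentence-transformers default.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mnrl(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D "embeddings": each anchor is close to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
shuffled = list(reversed(aligned))  # every anchor now paired with the wrong text

print(f"aligned pairs:  {mnrl(anchors, aligned):.4f}")
print(f"shuffled pairs: {mnrl(anchors, shuffled):.4f}")
```

The loss is near zero when each anchor is most similar to its own positive and grows large when the pairing is wrong, which is the signal the fine-tuning optimises.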
Evaluation Results
Self-retrieval proxy evaluation across four ChromaDB collections (25 chunks sampled per collection, query = chunk's own text):
| Collection | Hit@1 | Hit@3 | Hit@5 |
|---|---|---|---|
| MITRE ATT&CK techniques | 100% | 100% | 100% |
| NVD CVE records | 0% | 100% | 100% |
| CISA KEV entries | 0% | 100% | 100% |
| NIST SP 800-61 procedures | 0% | 0% | 100% |
| Overall | 25% | 75% | 100% |
Hit@5 = 1.000, which exceeds the ≥ 0.90 target. Evaluation run: 100 queries (4 collections × 25 chunks), CPU-only.
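For reference, Hit@k counts a query as a hit when the expected chunk appears among the first k retrieved results. The sketch below shows one way to compute it and checks that the Overall row is consistent with an unweighted mean of the four collection rows (equivalent to pooling, since each collection contributes 25 queries). The function names and toy ids are illustrative; the per-collection rates are copied from the table.

```python
def hit_at_k(ranked_ids, expected_id, k):
    """True if the expected chunk appears among the first k retrieved ids."""
    return expected_id in ranked_ids[:k]


def hit_rate(results, k):
    """Fraction of (ranked_ids, expected_id) queries that score a hit at k."""
    hits = sum(hit_at_k(ranked, expected, k) for ranked, expected in results)
    return hits / len(results)


# Toy retrieval results (hypothetical ids): expected chunk ranked 1st, 2nd, 4th.
toy = [
    (["a", "b", "c", "d", "e"], "a"),
    (["x", "b", "c", "d", "e"], "b"),
    (["x", "y", "z", "c", "e"], "c"),
]
print([round(hit_rate(toy, k), 3) for k in (1, 3, 5)])

# Per-collection rates from the table above, as fractions.
per_collection = {
    "MITRE ATT&CK techniques": {1: 1.0, 3: 1.0, 5: 1.0},
    "NVD CVE records": {1: 0.0, 3: 1.0, 5: 1.0},
    "CISA KEV entries": {1: 0.0, 3: 1.0, 5: 1.0},
    "NIST SP 800-61 procedures": {1: 0.0, 3: 0.0, 5: 1.0},
}
overall = {
    k: sum(rates[k] for rates in per_collection.values()) / len(per_collection)
    for k in (1, 3, 5)
}
print(overall)  # {1: 0.25, 3: 0.75, 5: 1.0}
```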
RAGAS-style faithfulness (cosine similarity between generated playbook step embeddings and cited source embeddings): 0.548, a significant improvement over the 0.143 Jaccard score of the base model on the same data.
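The card does not spell out how the faithfulness score is aggregated, so the sketch below makes two assumptions: each playbook step is paired one-to-one with its cited source, and the per-pair cosine similarities are averaged. A token-set Jaccard function is included for contrast, also as an assumption about what "Jaccard" refers to here. The vectors are toy stand-ins for `model.encode(...)` output.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def faithfulness(step_vecs, source_vecs):
    """Mean cosine similarity over (step, cited source) embedding pairs."""
    scores = [cosine(step, src) for step, src in zip(step_vecs, source_vecs)]
    return sum(scores) / len(scores)


def jaccard(text_a, text_b):
    """Token-set overlap: |A ∩ B| / |A ∪ B| on lowercased whitespace tokens."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy embeddings (assumption): first pair identical, second pair 45 degrees apart.
steps = [[1.0, 0.0], [0.0, 1.0]]
sources = [[1.0, 0.0], [1.0, 1.0]]
print(f"faithfulness: {faithfulness(steps, sources):.3f}")

print(jaccard("isolate the affected host", "isolate the host from the network"))
```

Cosine similarity works on dense embeddings while Jaccard works on surface tokens, so the two scores sit on different scales and are best read as rough indicators rather than a like-for-like comparison.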
Model Card
- Developed by: Sindhu S
- Model type: Sentence Transformer (bi-encoder)
- Language: English
- License: Apache 2.0
- Parameters: 109M (inherited from all-mpnet-base-v2)
- Size: ~418 MB (safetensors)
Limitations
- Fine-tuned on CTI-Bench, which is primarily MITRE ATT&CK / CVE focused, so it may not generalise as well to other security domains (e.g., physical security, fraud)
- Training was CPU-only with a small batch size; GPU training with larger batches and more epochs may improve quality further
- Hit@1 is low (25% overall; 0% on the NVD CVE, CISA KEV, and NIST SP 800-61 collections, which contain dense, similar entries); a reranker is recommended for precision-critical applications
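The reranking recommendation can be sketched as a second pass over the bi-encoder's top results: retrieve a handful of candidates, score each (query, passage) pair with a more precise model, and reorder. In the sketch below `rerank` and `overlap_scorer` are illustrative names, and the scorer is a deliberately crude token-overlap stand-in so the example runs without downloads. In practice `score_fn` could be the `predict` method of a sentence-transformers `CrossEncoder`, which takes a list of text pairs and returns one score per pair. The candidate texts are made up for illustration.

```python
def rerank(query, candidates, score_fn, top_n=1):
    """Reorder candidates by score_fn, which maps (query, text) pairs to scores."""
    scores = score_fn([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


def overlap_scorer(pairs):
    """Stand-in scorer (assumption): share of query tokens found in the passage."""
    scores = []
    for query, text in pairs:
        query_tokens = set(query.lower().split())
        text_tokens = set(text.lower().split())
        scores.append(len(query_tokens & text_tokens) / len(query_tokens))
    return scores


query = "Apache HTTP Server remote code execution"
# Hypothetical top results from the bi-encoder, with the best match not ranked first.
candidates = [
    "Apache Tomcat denial of service",
    "Remote code execution in Microsoft Exchange Server",
    "Remote code execution in Apache HTTP Server via path traversal",
]

for text, score in rerank(query, candidates, overlap_scorer, top_n=3):
    print(f"{score:.2f}  {text}")
```

Because Hit@5 is 100% in the evaluation above, reranking only the top five candidates would be enough to recover the correct chunk on that benchmark.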