# CVE-BERT-DMSV: CVE Embedding Model using SentenceTransformers

## Model Card for sushanrai/CVE_BERT_DMSV
- Model Name: CVE-BERT-DMSV
- Model Type: SentenceTransformer for semantic search over CVE descriptions
- Base Model: google-bert/bert-base-uncased
- Training Framework: SentenceTransformers
- Fine-tuned By: HACK_DMSV
- License: Apache 2.0
## Model Description
This model is a fine-tuned version of google-bert/bert-base-uncased using the SentenceTransformers framework. It is trained on Common Vulnerabilities and Exposures (CVE) data for semantic search and similarity tasks. The model maps CVE descriptions into dense vector embeddings to facilitate information retrieval, similarity detection, and clustering.
## Intended Use
- Semantic search on CVE descriptions
- Finding similar vulnerabilities
- Classifying new security issues based on semantic similarity
- Recommending related vulnerabilities or CVEs
- Input Format: a natural language query or CVE description sentence
- Output Format: a 768-dimensional dense vector embedding
## Example Usage (Python)

```python
import torch
from sentence_transformers import SentenceTransformer, util

# Load the model and the precomputed CVE embeddings
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("sushanrai/CVE_BERT_DMSV", device=device)
data = torch.load("cve_embeddings.pt")
cve_embeddings = data["embeddings"]
cve_texts = data["cve_texts"]

# Encode a query
query = "buffer overflow in FTP server"
query_embedding = model.encode(query, convert_to_tensor=True)

# Semantic search: rank all CVEs by cosine similarity to the query
cos_scores = util.pytorch_cos_sim(query_embedding, cve_embeddings)[0]
top_results = torch.topk(cos_scores, k=5)
for score, idx in zip(top_results.values, top_results.indices):
    print(f"CVE: {cve_texts[idx]}, Score: {score:.4f}")
```
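The usage example above assumes a precomputed `cve_embeddings.pt` file. A minimal sketch of how such a file could be built, using random tensors as stand-ins for real `model.encode` output and invented placeholder texts (the real curation pipeline is not published with this card):

```python
import torch

# Invented example texts; in practice these would be real CVE descriptions
cve_texts = [
    "Buffer overflow in an FTP server allows remote code execution",
    "Cross-site scripting vulnerability in a web login form",
]

# Stand-in for: model.encode(cve_texts, convert_to_tensor=True)
cve_embeddings = torch.randn(len(cve_texts), 768)

# Save embeddings and texts together so row i always matches text i
torch.save({"embeddings": cve_embeddings, "cve_texts": cve_texts},
           "cve_embeddings.pt")

# Round-trip check: the layout matches what the usage example loads
data = torch.load("cve_embeddings.pt")
assert data["embeddings"].shape == (len(cve_texts), 768)
assert data["cve_texts"] == cve_texts
```

Storing texts and embeddings in one file keeps the index aligned: the row order of the tensor is the only link back to the descriptions.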
## Training Data
The model was trained using CVE description texts curated from the NVD (National Vulnerability Database). The training samples consist of similar and dissimilar CVE pairs, designed to teach the model to distinguish relevant vulnerabilities.
## Training Objective
Fine-tuning was done using the MultipleNegativesRankingLoss, a contrastive loss suitable for semantic search and retrieval tasks. This enables the model to learn meaningful vector representations that place similar descriptions closer in vector space.
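This is not the library's internal implementation, but the core idea of MultipleNegativesRankingLoss can be sketched in plain PyTorch: scaled cosine similarities between anchor and positive embeddings, where every other positive in the batch serves as an in-batch negative, optimized with cross-entropy over the similarity matrix (the scale of 20 here mirrors a common default):

```python
import torch
import torch.nn.functional as F

def mnr_loss(anchors, positives, scale=20.0):
    """Sketch of MultipleNegativesRankingLoss with in-batch negatives.

    positives[i] is the matching pair for anchors[i]; every positives[j]
    with j != i acts as a negative for anchors[i].
    """
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    scores = a @ p.T * scale               # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0))  # correct match sits on the diagonal
    return F.cross_entropy(scores, labels)

torch.manual_seed(0)
emb = torch.randn(4, 768)

# Matching pairs yield a near-zero loss...
low = mnr_loss(emb, emb)
# ...while deliberately misaligned pairs are penalized
high = mnr_loss(emb, emb.roll(1, dims=0))
```

Because negatives come for free from the rest of the batch, only (similar, similar) pairs need to be curated; dissimilar pairs are implicit.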
## Evaluation
The model has been tested with various security-related queries, and shows high relevance in top-k matches (e.g., k=5). In the example below, a query about "buffer overflow in FTP server" returned:
- FTP bounce attack CVE (score: 0.8083)
- getcwd() descriptor leak (score: 0.7683)
- FTP PASV DoS (score: 0.7493)
## Limitations
- The model may not generalize well outside of CVE/NVD-style data.
- Embeddings should be periodically updated as new CVEs are introduced.
- Sensitive to spelling or grammar errors in the input.
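The second limitation above can be addressed by periodically re-encoding newly published CVEs and appending them to the saved index. A minimal sketch, assuming the same `{"embeddings", "cve_texts"}` layout as the usage example and using random tensors as stand-ins for real `model.encode` output:

```python
import torch

# Existing index (stand-in tensor; in practice loaded from cve_embeddings.pt)
index = {
    "embeddings": torch.randn(100, 768),
    "cve_texts": [f"CVE description {i}" for i in range(100)],  # placeholders
}

# Newly published CVE descriptions to fold in (invented example)
new_texts = ["Heap overflow in an example FTP daemon allows a remote crash"]
# Stand-in for: model.encode(new_texts, convert_to_tensor=True)
new_embeddings = torch.randn(len(new_texts), 768)

# Append, keeping embedding rows aligned with their texts, then persist
index["embeddings"] = torch.cat([index["embeddings"], new_embeddings], dim=0)
index["cve_texts"].extend(new_texts)
torch.save(index, "cve_embeddings.pt")

assert index["embeddings"].shape[0] == len(index["cve_texts"]) == 101
```

Only the new descriptions are encoded; existing embeddings are reused, so the update cost scales with the number of new CVEs rather than the index size.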
## Citation
```bibtex
@misc{sushanrai2025cvebert,
  title={CVE-BERT-DMSV: A SentenceTransformer Model for Semantic Search over CVEs},
  author={HACKDMSV},
  year={2025},
  url={https://huggingface.co/sushanrai/CVE_BERT_DMSV}
}
```
## Tags

`sentence-transformers` `cve` `cybersecurity` `semantic-search` `bert` `vulnerability`