
CVE-BERT-DMSV: CVE Embedding Model using SentenceTransformers

Model Card for sushanrai/CVE_BERT_DMSV


Model Description

This model is a fine-tuned version of google-bert/bert-base-uncased using the SentenceTransformers framework. It is trained on Common Vulnerabilities and Exposures (CVE) data for semantic search and similarity tasks. The model maps CVE descriptions into dense vector embeddings to facilitate information retrieval, similarity detection, and clustering.


Intended Use

  • Semantic search on CVE descriptions
  • Finding similar vulnerabilities
  • Classifying new security issues based on semantic similarity
  • Recommending related vulnerabilities or CVEs

  • Input: a natural language query or CVE description sentence
  • Output: a 768-dimensional dense vector embedding


Example Usage (Python)

import torch
from sentence_transformers import SentenceTransformer, util

# Load the model and a precomputed embedding cache
# (cve_embeddings.pt is assumed to hold {"embeddings": Tensor, "cve_texts": list})
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("sushanrai/CVE_BERT_DMSV", device=device)
data = torch.load("cve_embeddings.pt", map_location=device)
cve_embeddings = data["embeddings"]
cve_texts = data["cve_texts"]

# Encode a query into the same 768-dimensional space
query = "buffer overflow in FTP server"
query_embedding = model.encode(query, convert_to_tensor=True)

# Semantic search: rank cached CVEs by cosine similarity
cos_scores = util.cos_sim(query_embedding, cve_embeddings)[0]
top_results = torch.topk(cos_scores, k=5)

for score, idx in zip(top_results.values, top_results.indices):
    print(f"CVE: {cve_texts[idx]}, Score: {score:.4f}")

Training Data

The model was trained using CVE description texts curated from the NVD (National Vulnerability Database). The training samples consist of similar and dissimilar CVE pairs, designed to teach the model to distinguish relevant vulnerabilities.


Training Objective

Fine-tuning was done using the MultipleNegativesRankingLoss, a contrastive loss suitable for semantic search and retrieval tasks. This enables the model to learn meaningful vector representations that place similar descriptions closer in vector space.
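The intuition behind the loss can be illustrated with a toy computation (this is a stand-in, not the library implementation): for a batch of (anchor, positive) embedding pairs, each anchor should score highest against its own positive, with the other positives in the batch acting as in-batch negatives.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the MultipleNegativesRankingLoss objective.
# Embeddings are random stand-ins; real ones would come from the model.
torch.manual_seed(0)
anchors = F.normalize(torch.randn(4, 768), dim=1)     # stand-in anchor embeddings
positives = F.normalize(torch.randn(4, 768), dim=1)   # stand-in positive embeddings

scale = 20.0                            # similarity scale (library default)
scores = scale * anchors @ positives.T  # (4, 4) cosine-similarity matrix
labels = torch.arange(4)                # diagonal entries are the matching pairs
loss = F.cross_entropy(scores, labels)  # cross-entropy over in-batch candidates
print(loss.item())
```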


Evaluation

The model has been tested with various security-related queries and shows high relevance in the top-k matches (e.g., k=5). For example, the query "buffer overflow in FTP server" returned:

  1. FTP bounce attack CVE (Score: 0.8083)
  2. getcwd() descriptor leak (Score: 0.7683)
  3. FTP PASV DoS (Score: 0.7493)

Limitations

  • The model may not generalize well outside of CVE/NVD-style data.
  • Embeddings should be periodically updated as new CVEs are introduced.
  • Performance is sensitive to spelling or grammar errors in the input.

Citation

@misc{sushanrai2025cvebert,
  title={CVE-BERT-DMSV: A SentenceTransformer Model for Semantic Search over CVEs},
  author={HACKDMSV},
  year={2025},
  url={https://huggingface.co/sushanrai/CVE_BERT_DMSV}
}

Tags

sentence-transformers cve cybersecurity semantic-search bert vulnerability

Model size: 0.1B parameters
Tensor type: F32 (Safetensors)