
CVE-BERT-DMSV: CVE Embedding Model using SentenceTransformers

Model Card for sushanrai/CVE_BERT_DMSV


Model Description

This model is a fine-tuned version of google-bert/bert-base-uncased using the SentenceTransformers framework. It is trained on Common Vulnerabilities and Exposures (CVE) data for semantic search and similarity tasks. The model maps CVE descriptions into dense vector embeddings to facilitate information retrieval, similarity detection, and clustering.


Intended Use

  • Semantic search on CVE descriptions
  • Finding similar vulnerabilities
  • Classifying new security issues based on semantic similarity
  • Recommending related vulnerabilities or CVEs

  • Input: a natural language query or CVE description sentence
  • Output: a 768-dimensional dense vector embedding


Example Usage (Python)

import torch
from sentence_transformers import SentenceTransformer, util

# Load the model and a precomputed embedding cache
# (cve_embeddings.pt is assumed to hold {"embeddings": Tensor, "cve_texts": list})
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("sushanrai/CVE_BERT_DMSV", device=device)
data = torch.load("cve_embeddings.pt", map_location=device)
cve_embeddings = data["embeddings"]
cve_texts = data["cve_texts"]

# Encode a query into the same 768-dimensional space
query = "buffer overflow in FTP server"
query_embedding = model.encode(query, convert_to_tensor=True)

# Semantic search: rank cached CVEs by cosine similarity
cos_scores = util.cos_sim(query_embedding, cve_embeddings)[0]
top_results = torch.topk(cos_scores, k=5)

for score, idx in zip(top_results.values, top_results.indices):
    print(f"CVE: {cve_texts[idx]}, Score: {score:.4f}")

Training Data

The model was trained using CVE description texts curated from the NVD (National Vulnerability Database). The training samples consist of similar and dissimilar CVE pairs, designed to teach the model to distinguish relevant vulnerabilities.


Training Objective

Fine-tuning was done using the MultipleNegativesRankingLoss, a contrastive loss suitable for semantic search and retrieval tasks. This enables the model to learn meaningful vector representations that place similar descriptions closer in vector space.
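The intuition behind the loss can be illustrated with a toy computation (this is a stand-in, not the library implementation): for a batch of (anchor, positive) embedding pairs, each anchor should score highest against its own positive, with the other positives in the batch acting as in-batch negatives.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the MultipleNegativesRankingLoss objective.
# Embeddings are random stand-ins; real ones would come from the model.
torch.manual_seed(0)
anchors = F.normalize(torch.randn(4, 768), dim=1)     # stand-in anchor embeddings
positives = F.normalize(torch.randn(4, 768), dim=1)   # stand-in positive embeddings

scale = 20.0                            # similarity scale (library default)
scores = scale * anchors @ positives.T  # (4, 4) cosine-similarity matrix
labels = torch.arange(4)                # diagonal entries are the matching pairs
loss = F.cross_entropy(scores, labels)  # cross-entropy over in-batch candidates
print(loss.item())
```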


Evaluation

The model has been tested with various security-related queries and shows high relevance in the top-k matches (e.g., k=5). For example, the query "buffer overflow in FTP server" returned:

  1. FTP bounce attack CVE (Score: 0.8083)
  2. getcwd() descriptor leak (Score: 0.7683)
  3. FTP PASV DoS (Score: 0.7493)

Limitations

  • The model may not generalize well outside of CVE/NVD-style data.
  • Embeddings should be periodically updated as new CVEs are introduced.
  • Performance is sensitive to spelling or grammar errors in the input.

Citation

@misc{sushanrai2025cvebert,
  title={CVE-BERT-DMSV: A SentenceTransformer Model for Semantic Search over CVEs},
  author={HACKDMSV},
  year={2025},
  url={https://huggingface.co/sushanrai/CVE_BERT_DMSV}
}

Tags

sentence-transformers cve cybersecurity semantic-search bert vulnerability

Model size: 0.1B parameters
Tensor type: F32 (Safetensors)