# DeBERTa-v3-Base for CVE → CWE Classification
Fine-tuned DeBERTa-v3-Base model for predicting Common Weakness Enumeration (CWE) IDs from Common Vulnerabilities and Exposures (CVE) descriptions.
## Model Details
- **Base Model:** microsoft/deberta-v3-base (86M parameters)
- **Task:** Multi-class text classification (695 CWE classes)
- **Training Dataset:** stasvinokur/cve-and-cwe-dataset-1999-2025
- **Cleaned Dataset:** LorenzoNava/cve-cwe-dataset-cleaned (225,144 samples)
## Training Configuration

### Hardware
- GPUs: 4x NVIDIA L4 (24GB each, 96GB total VRAM)
- Precision: bfloat16 (bf16)
### Hyperparameters

```python
learning_rate = 2e-5
num_train_epochs = 10
per_device_train_batch_size = 8   # 32 total across 4 GPUs
gradient_accumulation_steps = 1
warmup_ratio = 0.1
lr_scheduler_type = "cosine"
weight_decay = 0.01
max_sequence_length = 256
optimizer = "paged_adamw_8bit"
gradient_checkpointing = False    # disabled for stability
```
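As a sanity check, the effective batch size and warmup horizon implied by these settings can be derived with a little arithmetic. The sample count comes from the Training Details below; the exact split rounding is an assumption:

```python
import math

# Figures taken from this card: 4 GPUs, 90/10 split of 225,144 samples.
per_device_batch = 8
num_gpus = 4
grad_accum = 1
train_samples = int(225_144 * 0.9)   # ~202,629 training samples
epochs = 10
warmup_ratio = 0.1

effective_batch = per_device_batch * num_gpus * grad_accum   # 32
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
# → 32 6333 63330 6333
```

So roughly the first epoch's worth of optimizer steps is spent in warmup before the cosine decay takes over.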
## Training Details
- Total samples: 225,144 (after filtering "NVD-CWE-Other")
- Train/Val split: 90/10
- Early stopping: patience=5 on F1 score
- Evaluation metric: Weighted F1 score
- Training time: ~5-6 hours on 4x L4 GPUs
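For reference, the support-weighted F1 used for model selection weights each class's F1 by its share of the evaluation set. A minimal pure-Python sketch (the actual evaluation presumably uses a library such as scikit-learn's `f1_score` with `average="weighted"`):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted F1: per-class F1 weighted by class frequency."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (n / total) * f1
    return score

# Tiny illustrative example, not real model output.
y_true = ["CWE-79", "CWE-79", "CWE-89"]
y_pred = ["CWE-79", "CWE-89", "CWE-89"]
print(round(weighted_f1(y_true, y_pred), 3))  # → 0.667
```

Weighting by support matters here because the 695 CWE classes are highly imbalanced, so a macro average would be dominated by rare classes.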
## Dataset Preparation
The original dataset contained 280,694 samples, including 55,550 samples (19.79%) labeled as "NVD-CWE-Other" (non-standard CWE classification).
**Cleaning process:**
- Removed samples with `CWE-ID = "NVD-CWE-Other"`
- Removed samples with missing/null CWE-IDs
- Kept only the standard `CWE-XXXX` format (numeric IDs)
- Final dataset: 225,144 samples with 695 unique CWE classes
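The filtering rules above can be sketched as a simple predicate. This is a hypothetical re-implementation for illustration; the actual preprocessing script is not part of this card:

```python
import re

# Standard numeric CWE label, e.g. "CWE-79".
CWE_PATTERN = re.compile(r"^CWE-\d+$")

def keep_sample(cwe_id):
    """Return True only for rows with a standard CWE-XXXX label."""
    if cwe_id is None:
        return False                      # drop missing/null CWE-IDs
    cwe_id = cwe_id.strip()
    if cwe_id == "NVD-CWE-Other":
        return False                      # drop the non-standard bucket
    return bool(CWE_PATTERN.match(cwe_id))

labels = ["CWE-79", "NVD-CWE-Other", None, "CWE-89", "NVD-CWE-noinfo"]
print([l for l in labels if keep_sample(l)])  # → ['CWE-79', 'CWE-89']
```

Note that the anchored regex also drops other non-standard labels such as "NVD-CWE-noinfo", which is consistent with keeping only numeric `CWE-XXXX` IDs.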
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("LorenzoNava/deberta-v3-base-cve-cwe-classifier")
tokenizer = AutoTokenizer.from_pretrained("LorenzoNava/deberta-v3-base-cve-cwe-classifier")
model.eval()

# Example CVE description
cve_description = """
A buffer overflow vulnerability in the web server component allows
remote attackers to execute arbitrary code via a crafted HTTP request.
"""

# Tokenize and predict (no gradients needed at inference time)
inputs = tokenizer(cve_description, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=-1).item()
predicted_cwe = model.config.id2label[predicted_class]
print(f"Predicted CWE: {predicted_cwe}")
```
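For triage it is often useful to inspect the top few candidate CWEs rather than only the argmax. A stdlib-only sketch of that post-processing step, run here on hypothetical logits and an illustrative four-class label map (the real model's `id2label` has 695 entries):

```python
import math

def top_k_predictions(logits, id2label, k=3):
    """Softmax over raw logits, then return the k most likely labels."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]          # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Illustrative logits and label map, not real model output.
logits = [0.2, 3.1, 1.4, -0.5]
id2label = {0: "CWE-79", 1: "CWE-119", 2: "CWE-89", 3: "CWE-20"}
for label, p in top_k_predictions(logits, id2label):
    print(f"{label}: {p:.3f}")
```

With the real model, the same function can be applied to `outputs.logits[0].tolist()` and `model.config.id2label` to surface runner-up weaknesses for manual review.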
## Performance
| Metric | Score |
|---|---|
| Accuracy | TBD |
| Weighted F1 | TBD |
| Training Loss | TBD |
| Validation Loss | TBD |
*Metrics will be updated after training completes.*
## Training Script
The model was trained using the following configuration:
```bash
python3 train.py \
  --model deberta-v3-base \
  --epochs 10 \
  --batch-size 32 \
  --learning-rate 2e-5 \
  --max-length 256 \
  --early-stopping 5
```
The full training script is included in the model repository as `train.py`.
## CWE Classes
The model predicts from 695 unique CWE classes including:
- CWE-79 (Cross-site Scripting)
- CWE-89 (SQL Injection)
- CWE-119 (Buffer Errors)
- CWE-20 (Improper Input Validation)
- CWE-200 (Information Exposure)
- ... and 690 more
## Use Cases
- Automated vulnerability classification from CVE descriptions
- Security assessment and triage
- Weakness pattern identification in vulnerability reports
- CVE database enrichment and standardization
## Limitations
- Trained only on CVE descriptions (English text)
- Performance may vary on non-CVE vulnerability descriptions
- Does not predict "NVD-CWE-Other" or other non-standard classifications
- Limited to CWEs present in training data (695 classes)
## Citation

```bibtex
@misc{deberta-v3-cve-cwe-2024,
  author    = {Berghem - Smart Information Security},
  title     = {DeBERTa-v3-Base for CVE-CWE Classification},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/LorenzoNava/deberta-v3-base-cve-cwe-classifier}
}
```
## License
MIT License - See LICENSE file
## Developed By
Berghem - Smart Information Security
For issues or questions, visit the model repository.