---
license: cc-by-4.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- cybersecurity
- vulnerability
- cwe
- cve
- nvd
- roberta
base_model: FacebookAI/roberta-base
datasets:
- xamxte/cve-to-cwe
metrics:
- accuracy
- f1
model-index:
- name: cwe-classifier-roberta-base
  results:
  - task:
      type: text-classification
      name: CWE Classification
    dataset:
      name: cve-to-cwe (test split)
      type: xamxte/cve-to-cwe
      split: test
    metrics:
    - name: Top-1 Accuracy
      type: accuracy
      value: 0.8744
    - name: Top-3 Accuracy
      type: accuracy
      value: 0.9467
    - name: Macro F1
      type: f1
      value: 0.6071
  - task:
      type: text-classification
      name: CWE Classification (CTI-Bench)
    dataset:
      name: CTI-Bench cti-rcm
      type: xashru/cti-bench
    metrics:
    - name: Strict Top-1
      type: accuracy
      value: 0.756
    - name: Hierarchy-aware Top-1
      type: accuracy
      value: 0.865
---

# CWE Classifier (RoBERTa-base)

A fine-tuned RoBERTa-base model that maps CVE (Common Vulnerabilities and Exposures) descriptions to CWE (Common Weakness Enumeration) categories. 125M parameters, 205 CWE classes.

## Performance

### Internal Test Set (27,780 agreement-filtered samples)

| Metric | Score |
|--------|-------|
| Top-1 Accuracy | **87.4%** |
| Top-3 Accuracy | **94.7%** |
| Macro F1 | **0.607** |
| Weighted F1 | 0.872 |

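For reference, metrics of this kind can be computed from raw logits with NumPy and scikit-learn. The arrays below are toy stand-ins for real test-set outputs, not the actual evaluation code:

```python
# Toy logits/labels illustrating top-1, top-3, and macro-F1 computation.
import numpy as np
from sklearn.metrics import f1_score

logits = np.array([
    [2.0, 0.5, 0.1, 0.0],  # pred 0, true 0: top-1 hit
    [0.2, 1.5, 1.4, 0.1],  # pred 1, true 1: top-1 hit
    [0.3, 0.9, 0.8, 0.1],  # pred 1, true 2: top-1 miss, top-3 hit
    [1.0, 0.9, 0.8, 0.1],  # pred 0, true 3: top-3 miss as well
])
labels = np.array([0, 1, 2, 3])

preds = logits.argmax(axis=1)
top1 = (preds == labels).mean()

# Top-3 accuracy: true label anywhere among the 3 highest-scoring classes
top3_sets = np.argsort(logits, axis=1)[:, -3:]
top3 = np.mean([labels[i] in top3_sets[i] for i in range(len(labels))])

macro_f1 = f1_score(labels, preds, average="macro")
print(top1, top3, macro_f1)  # 0.5 0.75 0.333...
```

Macro F1 averages per-class F1 without weighting by class frequency, which is why it drops sharply when rare classes are missed.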
### CTI-Bench External Benchmark (NeurIPS 2024, zero training overlap)

| Benchmark | Strict Top-1 | Hierarchy-aware Top-1 |
|-----------|--------------|-----------------------|
| cti-rcm (2023-2024 CVEs) | 75.6% | **86.5%** |
| cti-rcm-2021 (2011-2021 CVEs) | 71.8% | **82.8%** |

### Comparison on CTI-Bench cti-rcm (strict exact match)

All scores below use the official CTI-Bench evaluation protocol: strict exact CWE ID match.

| Model | Params | Type | Top-1 Accuracy | Source |
|-------|--------|------|----------------|--------|
| [Sec-Gemini v1](https://security.googleblog.com/2025/04/google-launches-sec-gemini-v1-new.html) (Google)* | — | closed | ~86% | Google Security Blog |
| [SecLM](https://security.googlecloudcommunity.com/community-blog-42/fueling-ai-innovation-in-secops-products-the-seclm-platform-and-sec-gemini-research-pipeline-3997) (Google)* | — | closed | ~85% | Google Cloud Blog |
| **This model** | **125M** | **open** | **75.6%** | — |
| [Foundation-Sec-8B-Reasoning](https://arxiv.org/abs/2601.21051) (Cisco) | 8B | open | 75.3% | arXiv 2601.21051 |
| [GPT-4](https://arxiv.org/abs/2406.07599) | ~1.7T | closed | 72.0% | CTI-Bench paper |
| [Foundation-Sec-8B](https://arxiv.org/abs/2504.21039) (Cisco) | 8B | open | 72.0% (±1.7%) | arXiv 2504.21039 |
| [WhiteRabbitNeo-V2-70B](https://arxiv.org/abs/2504.21039) | 70B | open | 71.1% | arXiv 2504.21039 |
| [Foundation-Sec-8B-Instruct](https://arxiv.org/abs/2601.21051) (Cisco) | 8B | open | 70.4% | arXiv 2601.21051 |
| [Llama-Primus](https://huggingface.co/trend-cybertron/Llama-Primus-Base) (Trend Micro) | 8B | open | 67.8% | HuggingFace |
| [GPT-3.5](https://arxiv.org/abs/2406.07599) | ~175B | closed | 67.2% | CTI-Bench paper |
| [Gemini 1.5](https://arxiv.org/abs/2406.07599) | — | closed | 66.6% | CTI-Bench paper |
| [LLaMA3-70B](https://arxiv.org/abs/2406.07599) | 70B | open | 65.9% | CTI-Bench paper |
| [LLaMA3-8B](https://arxiv.org/abs/2406.07599) | 8B | open | 44.7% | CTI-Bench paper |

*\*Sec-Gemini and SecLM scores are approximate, estimated from published comparison charts. Exact values were not reported.*

**Competitive with the best open-weight models** at 64x fewer parameters (125M vs 8B). Note: the 0.3pp difference vs Cisco Foundation-Sec-8B-Reasoning is not statistically significant (95% CIs overlap on n=1000). The Cisco models are general-purpose LLMs; ours is a task-specific encoder.

### TF-IDF baseline comparison

A TF-IDF + Logistic Regression baseline reaches 84.9% top-1 on the same test set, but only 45.2% Macro F1 vs our 60.7% — a **+15.5pp Macro F1 gap** showing the model's advantage on rare CWE classes that keyword matching cannot handle.

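A minimal sketch of this kind of baseline, with illustrative hyperparameters and toy data rather than the configuration actually benchmarked:

```python
# Sketch: TF-IDF features + Logistic Regression as a CWE-classification
# baseline. Texts and labels here are toy examples for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "SQL injection in login form via username parameter",
    "Stack-based buffer overflow in PNG parser",
    "Improper neutralization of SQL commands in search endpoint",
    "Heap buffer overflow when parsing crafted image files",
]
labels = ["CWE-89", "CWE-121", "CWE-89", "CWE-122"]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
print(baseline.predict(["SQL injection via the id parameter"]))  # ['CWE-89']
```

Keyword overlap carries common classes like CWE-89 easily; it is the long tail of rare CWEs, with little distinctive vocabulary, where this approach falls behind the fine-tuned encoder.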
### Hierarchy-aware evaluation (supplementary)

This model predicts specific child CWEs (e.g., CWE-121 Stack Buffer Overflow) while CTI-Bench ground truth often uses generic parent CWEs (e.g., CWE-119 Buffer Overflow). When parent↔child equivalences are counted as correct:

| Benchmark | Strict Top-1 | Hierarchy-aware Top-1 |
|-----------|--------------|-----------------------|
| cti-rcm (2023-2024 CVEs) | 75.6% | 86.5% (+10.9pp) |
| cti-rcm-2021 (2011-2021 CVEs) | 71.8% | 82.8% (+11.0pp) |

*Note: Other models in the table above were evaluated with strict matching only. Hierarchy-aware scores are not directly comparable and are shown separately for transparency.*

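The matching rule can be sketched as follows. The small parent map here is illustrative only; a real evaluation would use the full CWE hierarchy from MITRE's CWE definitions:

```python
# Sketch of hierarchy-aware scoring: a prediction counts as correct if it
# matches the ground truth exactly OR the two IDs are in a direct
# parent/child relationship. PARENT below is a tiny illustrative subset.
PARENT = {
    "CWE-121": "CWE-119",  # Stack-based Buffer Overflow -> buffer weakness
    "CWE-122": "CWE-119",  # Heap-based Buffer Overflow  -> buffer weakness
}

def hierarchy_match(pred: str, truth: str) -> bool:
    if pred == truth:
        return True
    # Count parent<->child equivalences as correct, in either direction.
    return PARENT.get(pred) == truth or PARENT.get(truth) == pred

assert hierarchy_match("CWE-121", "CWE-119")  # child predicted, parent in ground truth
assert hierarchy_match("CWE-119", "CWE-122")  # parent predicted, child in ground truth
assert not hierarchy_match("CWE-89", "CWE-119")
```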
## Usage

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="xamxte/cwe-classifier-roberta-base", top_k=3)

result = classifier("A SQL injection vulnerability in the login page allows remote attackers to execute arbitrary SQL commands via the username parameter.")
print(result)
# [[{'label': 'CWE-89', 'score': 0.95}, {'label': 'CWE-564', 'score': 0.02}, ...]]
```

### Manual inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import json

model_name = "xamxte/cwe-classifier-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Load label map
from huggingface_hub import hf_hub_download
label_map_path = hf_hub_download(repo_id=model_name, filename="cwe_label_map.json")
with open(label_map_path) as f:
    label_map = json.load(f)
id_to_label = {v: k for k, v in label_map.items()}

# Predict
text = "CVE Description: A buffer overflow in the PNG parser allows remote code execution via crafted image files."
inputs = tokenizer(text, return_tensors="pt", max_length=384, truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities before reporting scores
probs = torch.softmax(logits, dim=-1)
top3 = torch.topk(probs, 3)
for score, idx in zip(top3.values[0], top3.indices[0]):
    print(f"{id_to_label[idx.item()]}: {score.item():.3f}")
```

## Training

- **Base model:** FacebookAI/roberta-base (125M params)
- **Dataset:** [xamxte/cve-to-cwe](https://huggingface.co/datasets/xamxte/cve-to-cwe) — 234,770 training samples with Claude Sonnet 4.6 refined labels
- **Training method:** Two-phase fine-tuning
  - Phase 1: Freeze first 8/12 transformer layers, train classifier head (4 epochs, lr=1e-4)
  - Phase 2: Unfreeze all layers, full fine-tuning (9 epochs, lr=2e-5)
- **Key hyperparameters:** max_length=384, batch_size=32, label_smoothing=0.1, cosine scheduler, bf16
- **Hardware:** NVIDIA RTX 5080 (16GB), ~4 hours total
- **Framework:** HuggingFace Transformers + PyTorch

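The phase-1 freezing pattern can be sketched as below. The exact training script is not published, so this only illustrates the layer-freezing idea; a randomly initialized RoBERTa-base-shaped model is used so the snippet runs without downloading weights:

```python
# Sketch: freeze the embeddings and the first 8 of 12 encoder layers,
# leaving the top 4 layers and the classification head trainable.
from transformers import RobertaConfig, RobertaForSequenceClassification

# RobertaConfig defaults match roberta-base (12 layers, hidden size 768).
config = RobertaConfig(num_labels=205)
model = RobertaForSequenceClassification(config)

for param in model.roberta.embeddings.parameters():
    param.requires_grad = False
for layer in model.roberta.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({trainable / total:.1%})")
```

Freezing the lower layers first stabilizes the randomly initialized head before phase 2 unfreezes everything at a lower learning rate.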
## Label Quality

Training labels were refined using Claude Sonnet 4.6 via the Anthropic Batch API (~$395 total cost). The test/validation sets contain only agreement-filtered samples where NVD and Sonnet labels agree (73.1% exact match; 84.5% with hierarchy-aware matching). This biases evaluation toward unambiguous cases — real-world accuracy on arbitrary NVD entries will be lower. See the [dataset card](https://huggingface.co/datasets/xamxte/cve-to-cwe) for details.

## CWE Hierarchy

This model predicts **specific (child) CWE categories** where possible. For example, buffer overflows are classified as CWE-121 (Stack) or CWE-122 (Heap) rather than the generic CWE-119. This provides more actionable information for vulnerability triage, but means strict-match accuracy understates performance on benchmarks whose ground truth uses parent CWEs.

## Limitations

- **205 CWE classes only**: Covers the most common CWEs in NVD. Rare CWEs not in the training set will be mapped to the closest known class.
- **English only**: Trained on English CVE descriptions from NVD.
- **Description-based**: Uses only the text description, not CVSS scores, CPE, or other metadata.
- **Single-label**: Predicts one CWE per CVE, though some vulnerabilities may involve multiple weakness types.

## Paper

📄 **[Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs](https://arxiv.org/abs/2603.14911)**

## Citation

```bibtex
@article{mosievskiy2026cwe,
  title={Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs},
  author={Mosievskiy, Nikita},
  journal={arXiv preprint arXiv:2603.14911},
  year={2026},
  url={https://huggingface.co/xamxte/cwe-classifier-roberta-base}
}
```