---
license: cc-by-4.0
language:
  - en
library_name: transformers
pipeline_tag: text-classification
tags:
  - cybersecurity
  - vulnerability
  - cwe
  - cve
  - nvd
  - roberta
base_model: FacebookAI/roberta-base
datasets:
  - xamxte/cve-to-cwe
metrics:
  - accuracy
  - f1
model-index:
  - name: cwe-classifier-roberta-base
    results:
      - task:
          type: text-classification
          name: CWE Classification
        dataset:
          name: cve-to-cwe (test split)
          type: xamxte/cve-to-cwe
          split: test
        metrics:
          - name: Top-1 Accuracy
            type: accuracy
            value: 0.8744
          - name: Top-3 Accuracy
            type: accuracy
            value: 0.9467
          - name: Macro F1
            type: f1
            value: 0.6071
      - task:
          type: text-classification
          name: CWE Classification (CTI-Bench)
        dataset:
          name: CTI-Bench cti-rcm
          type: xashru/cti-bench
        metrics:
          - name: Strict Top-1
            type: accuracy
            value: 0.756
          - name: Hierarchy-aware Top-1
            type: accuracy
            value: 0.865
---

# CWE Classifier (RoBERTa-base)

A fine-tuned RoBERTa-base model (125M parameters) that maps CVE (Common Vulnerabilities and Exposures) descriptions to one of 205 CWE (Common Weakness Enumeration) categories.

## Performance

### Internal Test Set (27,780 agreement-filtered samples)

| Metric | Score |
|--------|-------|
| Top-1 Accuracy | **87.4%** |
| Top-3 Accuracy | **94.7%** |
| Macro F1 | **0.607** |
| Weighted F1 | 0.872 |
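
Top-k accuracy counts a sample as correct when the gold CWE appears among the model's k highest-scoring labels. A minimal sketch of the computation (toy logits, not the actual evaluation code):

```python
import numpy as np

def top_k_accuracy(logits: np.ndarray, gold: np.ndarray, k: int) -> float:
    """Fraction of rows whose gold label index is among the k largest logits."""
    topk = np.argsort(logits, axis=1)[:, -k:]  # indices of the k highest scores per row
    return float(np.mean([g in row for g, row in zip(gold, topk)]))

logits = np.array([
    [0.1, 2.0, 0.5],   # gold 1 -> top-1 hit
    [1.5, 0.2, 1.0],   # gold 2 -> top-1 miss, top-2 hit
])
gold = np.array([1, 2])
print(top_k_accuracy(logits, gold, 1))  # 0.5
print(top_k_accuracy(logits, gold, 2))  # 1.0
```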

### CTI-Bench External Benchmark (NeurIPS 2024, zero training overlap)

| Benchmark | Strict Top-1 | Hierarchy-aware Top-1 |
|-----------|--------------|-----------------------|
| cti-rcm (2023-2024 CVEs) | 75.6% | **86.5%** |
| cti-rcm-2021 (2011-2021 CVEs) | 71.8% | **82.8%** |

### Comparison on CTI-Bench cti-rcm (strict exact match)

All scores below use the official CTI-Bench evaluation protocol: strict exact CWE ID match.

| Model | Params | Type | Top-1 Accuracy | Source |
|-------|--------|------|---------------|--------|
| [Sec-Gemini v1](https://security.googleblog.com/2025/04/google-launches-sec-gemini-v1-new.html) (Google)* | — | closed | ~86% | Google Security Blog |
| [SecLM](https://security.googlecloudcommunity.com/community-blog-42/fueling-ai-innovation-in-secops-products-the-seclm-platform-and-sec-gemini-research-pipeline-3997) (Google)* | — | closed | ~85% | Google Cloud Blog |
| **This model** | **125M** | **open** | **75.6%** | — |
| [Foundation-Sec-8B-Reasoning](https://arxiv.org/abs/2601.21051) (Cisco) | 8B | open | 75.3% | arXiv 2601.21051 |
| [GPT-4](https://arxiv.org/abs/2406.07599) | ~1.7T | closed | 72.0% | CTI-Bench paper |
| [Foundation-Sec-8B](https://arxiv.org/abs/2504.21039) (Cisco) | 8B | open | 72.0% (±1.7%) | arXiv 2504.21039 |
| [WhiteRabbitNeo-V2-70B](https://arxiv.org/abs/2504.21039) | 70B | open | 71.1% | arXiv 2504.21039 |
| [Foundation-Sec-8B-Instruct](https://arxiv.org/abs/2601.21051) (Cisco) | 8B | open | 70.4% | arXiv 2601.21051 |
| [Llama-Primus](https://huggingface.co/trend-cybertron/Llama-Primus-Base) (Trend Micro) | 8B | open | 67.8% | HuggingFace |
| [GPT-3.5](https://arxiv.org/abs/2406.07599) | ~175B | closed | 67.2% | CTI-Bench paper |
| [Gemini 1.5](https://arxiv.org/abs/2406.07599) | — | closed | 66.6% | CTI-Bench paper |
| [LLaMA3-70B](https://arxiv.org/abs/2406.07599) | 70B | open | 65.9% | CTI-Bench paper |
| [LLaMA3-8B](https://arxiv.org/abs/2406.07599) | 8B | open | 44.7% | CTI-Bench paper |

*\*Sec-Gemini and SecLM scores are approximate, estimated from published comparison charts. Exact values were not reported.*

**Competitive with the best open-weight models** with 64x fewer parameters (125M vs 8B). Note: the 0.3pp difference versus Cisco Foundation-Sec-8B-Reasoning is not statistically significant (95% CIs overlap at n=1000). The Cisco models are general-purpose LLMs; this model is a task-specific encoder.

### TF-IDF baseline comparison

A TF-IDF + Logistic Regression baseline reaches 84.9% top-1 on the same test set, but only 45.2% Macro F1 vs our 60.7% — a **+15.5pp Macro F1 gap** showing the model's advantage on rare CWE classes that keyword matching cannot handle.
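
A baseline of that shape can be sketched as follows; the feature settings, hyperparameters, and training examples here are illustrative assumptions, not the baseline's actual configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative TF-IDF + Logistic Regression baseline; hyperparameters are assumed.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=100_000, sublinear_tf=True),
    LogisticRegression(max_iter=1000, C=1.0),
)

# Toy training data standing in for the full CVE corpus.
train_texts = [
    "SQL injection in login form allows arbitrary SQL via the username parameter.",
    "Stack-based buffer overflow in PNG parser allows remote code execution.",
]
train_labels = ["CWE-89", "CWE-121"]

baseline.fit(train_texts, train_labels)
print(baseline.predict(["SQL injection via crafted query string."])[0])  # CWE-89
```

Keyword overlap is enough for frequent, distinctive classes like CWE-89, which is why such a baseline posts a strong top-1 score while collapsing on rare CWEs.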

### Hierarchy-aware evaluation (supplementary)

This model predicts specific child CWEs (e.g., CWE-121 Stack-based Buffer Overflow) while CTI-Bench ground truth often uses generic parent CWEs (e.g., CWE-119 Improper Restriction of Operations within the Bounds of a Memory Buffer). When parent↔child equivalences are counted as correct:

| Benchmark | Strict Top-1 | Hierarchy-aware Top-1 |
|-----------|--------------|-----------------------|
| cti-rcm (2023-2024 CVEs) | 75.6% | 86.5% (+10.9pp) |
| cti-rcm-2021 (2011-2021 CVEs) | 71.8% | 82.8% (+11.0pp) |

*Note: Other models in the table above were evaluated with strict matching only. Hierarchy-aware scores are not directly comparable and are shown separately for transparency.*
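
A minimal sketch of the matching rule, assuming a child-to-parent lookup table (the entries below are illustrative examples, not the full MITRE hierarchy the evaluation would use):

```python
# Hypothetical child -> parent CWE map; a real evaluation would load the
# full ChildOf relationships from the MITRE CWE hierarchy.
CWE_PARENT = {
    "CWE-121": "CWE-119",  # Stack-based Buffer Overflow -> memory-buffer parent
    "CWE-122": "CWE-119",  # Heap-based Buffer Overflow -> memory-buffer parent
}

def hierarchy_match(predicted: str, gold: str) -> bool:
    """Correct if the labels are equal or in a direct parent/child relationship."""
    if predicted == gold:
        return True
    return CWE_PARENT.get(predicted) == gold or CWE_PARENT.get(gold) == predicted

print(hierarchy_match("CWE-121", "CWE-119"))  # child predicted, parent gold -> True
print(hierarchy_match("CWE-121", "CWE-122"))  # siblings do not match -> False
```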

## Usage

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="xamxte/cwe-classifier-roberta-base", top_k=3)

result = classifier("A SQL injection vulnerability in the login page allows remote attackers to execute arbitrary SQL commands via the username parameter.")
print(result)
# [[{'label': 'CWE-89', 'score': 0.95}, {'label': 'CWE-564', 'score': 0.02}, ...]]
```

### Manual inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import json

model_name = "xamxte/cwe-classifier-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Load label map
from huggingface_hub import hf_hub_download
label_map_path = hf_hub_download(repo_id=model_name, filename="cwe_label_map.json")
with open(label_map_path) as f:
    label_map = json.load(f)
id_to_label = {v: k for k, v in label_map.items()}

# Predict
text = "CVE Description: A buffer overflow in the PNG parser allows remote code execution via crafted image files."
inputs = tokenizer(text, return_tensors="pt", max_length=384, truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities so the printed scores are interpretable
probs = torch.softmax(logits, dim=-1)

top3 = torch.topk(probs, 3)
for score, idx in zip(top3.values[0], top3.indices[0]):
    print(f"{id_to_label[idx.item()]}: {score.item():.3f}")
```

## Training

- **Base model:** FacebookAI/roberta-base (125M params)
- **Dataset:** [xamxte/cve-to-cwe](https://huggingface.co/datasets/xamxte/cve-to-cwe) — 234,770 training samples with Claude Sonnet 4.6 refined labels
- **Training method:** Two-phase fine-tuning
  - Phase 1: Freeze first 8/12 transformer layers, train classifier head (4 epochs, lr=1e-4)
  - Phase 2: Unfreeze all layers, full fine-tuning (9 epochs, lr=2e-5)
- **Key hyperparameters:** max_length=384, batch_size=32, label_smoothing=0.1, cosine scheduler, bf16
- **Hardware:** NVIDIA RTX 5080 (16GB), ~4 hours total
- **Framework:** HuggingFace Transformers + PyTorch
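
The Phase 1 layer freeze can be sketched like this; the parameter-name patterns follow the standard roberta-base module layout, and a randomly initialised model stands in for the pretrained checkpoint (training-loop details omitted):

```python
from transformers import RobertaConfig, RobertaForSequenceClassification

# roberta-base-shaped model, randomly initialised to avoid a checkpoint download;
# the real run starts from FacebookAI/roberta-base weights.
model = RobertaForSequenceClassification(RobertaConfig(num_labels=205))

# Phase 1: freeze embeddings and the first 8 of 12 encoder layers, so only
# layers 8-11 and the classifier head receive gradients.
frozen_prefixes = ["roberta.embeddings"] + [
    f"roberta.encoder.layer.{i}." for i in range(8)
]
for name, param in model.named_parameters():
    if any(name.startswith(p) for p in frozen_prefixes):
        param.requires_grad = False

phase1_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Phase 1 trainable params: {phase1_trainable:,} / {total:,}")

# Phase 2: unfreeze everything for full fine-tuning at the lower learning rate.
for param in model.parameters():
    param.requires_grad = True
```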

## Label Quality

Training labels were refined using Claude Sonnet 4.6 via the Anthropic Batch API (~$395 total cost). The test/validation sets contain only agreement-filtered samples where NVD and Sonnet labels agree (73.1% exact match; 84.5% with hierarchy-aware matching). This biases evaluation toward unambiguous cases — real-world accuracy on arbitrary NVD entries will be lower. See the [dataset card](https://huggingface.co/datasets/xamxte/cve-to-cwe) for details.

## CWE Hierarchy

This model predicts **specific (child) CWE categories** where possible. For example, buffer overflows are classified as CWE-121 (Stack) or CWE-122 (Heap) rather than the generic CWE-119. This provides more actionable information for vulnerability triage, but means strict accuracy on benchmarks using parent CWEs appears lower than actual performance.

## Limitations

- **205 CWE classes only**: Covers the most common CWEs in NVD. Rare CWEs not in the training set will be mapped to the closest known class.
- **English only**: Trained on English CVE descriptions from NVD.
- **Description-based**: Uses only the text description, not CVSS scores, CPE, or other metadata.
- **Single-label**: Predicts one CWE per CVE, though some vulnerabilities may involve multiple weakness types.

## Paper

📄 **[Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs](https://arxiv.org/abs/2603.14911)**

## Citation

```bibtex
@article{mosievskiy2026cwe,
  title={Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs},
  author={Mosievskiy, Nikita},
  journal={arXiv preprint arXiv:2603.14911},
  year={2026},
  url={https://huggingface.co/xamxte/cwe-classifier-roberta-base}
}
```