---
license: apache-2.0
language:
- en
base_model:
- cisco-ai/SecureBERT2.0-base
pipeline_tag: text-classification
library_name: transformers
---

# Model Card for cisco-ai/SecureBERT2.0-code-vuln-detection

The **ModernBERT Code Vulnerability Detection Model** is a fine-tuned variant of **SecureBERT 2.0**, designed to detect potential vulnerabilities in source code.
It leverages cybersecurity-aware representations learned by SecureBERT 2.0 and applies supervised fine-tuning for binary classification (vulnerable vs. non-vulnerable).

---

## Model Details

### Model Description

This model classifies source code snippets as either **vulnerable** or **non-vulnerable** using the ModernBERT architecture.
It is fine-tuned for **code-level security analysis**, extending the capabilities of SecureBERT 2.0.

- **Developed by:** Cisco AI
- **Model type:** Sequence classification
- **Architecture:** `ModernBertForSequenceClassification`
- **Number of labels:** 2
- **Language:** English (source code tokens)
- **License:** Apache-2.0
- **Finetuned from model:** [cisco-ai/SecureBERT2.0-base](https://huggingface.co/cisco-ai/SecureBERT2.0-base)

### Model Sources

- **Repository:** [https://huggingface.co/cisco-ai/SecureBERT2.0-code-vuln-detection](https://huggingface.co/cisco-ai/SecureBERT2.0-code-vuln-detection)
- **Paper:** [arXiv:2510.00240](https://arxiv.org/abs/2510.00240)

---

## Uses

### Direct Use

- Automatic vulnerability classification for source code snippets
- Static analysis pipeline integration for pre-screening code risks
- Feature extraction for downstream vulnerability detection tasks

### Downstream Use

Can be integrated into:
- Secure code review systems
- CI/CD vulnerability scanners
- Security IDE extensions

### Out-of-Scope Use

- Non-code or natural language text classification
- Runtime or dynamic vulnerability detection
- Automated patch generation or remediation suggestion

---

## Bias, Risks, and Limitations

- The model may **overfit** to syntactic patterns from training datasets and miss logical vulnerabilities.
- **False negatives** (missed vulnerabilities) or **false positives** (benign code flagged as vulnerable) may occur.
- Training data may not include all programming languages or frameworks.

### Recommendations

Use this model **as an assistive tool**, not as a replacement for expert manual code review.
Cross-check its findings against other tools before making security-critical decisions.

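One way to act on the recommendation above is to treat the model as a single vote among several tools and send disagreements to a human reviewer. The sketch below is a minimal illustration under assumptions: the `triage` function, its category names, and the unanimity rule are hypothetical and are not part of the model or its release. With `triage(True, [False, True])` the tools disagree, so the result is `"review"`.

```python
def triage(model_flag, other_tool_flags):
    """Combine this model's verdict with verdicts from other tools.

    Returns "flag" when the model and every other tool report a problem,
    "review" when at least one tool reports a problem but not all agree
    (or when no second tool is available to confirm the model), and
    "no-signal" when nothing was reported. The names and the rule are
    hypothetical choices for illustration.
    """
    others = [bool(flag) for flag in other_tool_flags]
    if model_flag and others and all(others):
        return "flag"
    if model_flag or any(others):
        return "review"
    return "no-signal"  # not proof of safety, only absence of findings


print(triage(True, [False, True]))
```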

---

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Hugging Face Hub ID of the model
model_dir = "cisco-ai/SecureBERT2.0-code-vuln-detection"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()  # inference mode (disables dropout)

# Example input code snippet
example_code = """
static void FUNC_0(WmallDecodeCtx *VAR_0, int VAR_1, int VAR_2, int16_t VAR_3, int16_t VAR_4)
{
    int16_t icoef;
    int VAR_5 = VAR_0->cdlms[VAR_1][VAR_2].VAR_5;
    int16_t range = 1 << (VAR_0->bits_per_sample - 1);
    int VAR_6 = VAR_0->bits_per_sample > 16 ? 4 : 2;
    if (VAR_3 > VAR_4) {
        for (icoef = 0; icoef < VAR_0->cdlms[VAR_1][VAR_2].order; icoef++)
            VAR_0->cdlms[VAR_1][VAR_2].coefs[icoef] +=
                VAR_0->cdlms[VAR_1][VAR_2].lms_updates[icoef + VAR_5];
    } else {
        for (icoef = 0; icoef < VAR_0->cdlms[VAR_1][VAR_2].order; icoef++)
            VAR_0->cdlms[VAR_1][VAR_2].coefs[icoef] -=
                VAR_0->cdlms[VAR_1][VAR_2].lms_updates[icoef];
    }
    VAR_0->cdlms[VAR_1][VAR_2].VAR_5--;
}
"""

# Tokenize (truncating to the 1024-token maximum sequence length) and run the model
inputs = tokenizer(
    example_code,
    return_tensors="pt",
    truncation=True,
    max_length=1024,
    padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=-1).item()

# Look up the label name stored in the model config; this card does not
# document which class ID corresponds to "vulnerable"
predicted_label = model.config.id2label[predicted_class]
print(f"Predicted class ID: {predicted_class} ({predicted_label})")
```

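The snippet above reports only the arg-max class. When a confidence score or a custom decision threshold is needed, the logits can be converted to probabilities with a softmax. The following is a minimal sketch in plain Python, not part of the model's API: `softmax` and `decide` are illustrative helpers, the logits are made up, and `positive_id` is an assumption the caller must supply after checking `model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, positive_id, threshold=0.5):
    """Return (probability of the positive class, whether it is flagged).

    positive_id is an assumption supplied by the caller: this card does
    not document which class ID means "vulnerable".
    """
    prob = softmax(logits)[positive_id]
    return prob, prob >= threshold


# Made-up logits for a single snippet (not real model output)
prob, flagged = decide([0.3, 1.2], positive_id=1, threshold=0.6)
print(f"p(positive)={prob:.3f} flagged={flagged}")
```

With the real model, `outputs.logits[0].tolist()` yields the two-element list that `decide` expects. Raising the threshold trades recall for precision, so it is worth tuning on labeled data from the target codebase.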
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Internal validation split from annotated open-source vulnerability datasets.

#### Factors

Evaluated across:
- Programming language types (C, C++, Python)
- Vulnerability categories (buffer overflow, injection, logic error)

#### Metrics

- Accuracy
- Precision
- Recall
- F1-score

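For reference, the four metrics listed above can be computed from the counts of a binary confusion matrix. The sketch below assumes that "vulnerable" is treated as the positive class; the function name and the example counts are hypothetical and do not come from the evaluation reported in this card.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives, with "vulnerable" assumed to be the positive class.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for 200 snippets, for illustration only
metrics = binary_metrics(tp=60, fp=35, tn=65, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision answers "how many flagged snippets were really vulnerable", while recall answers "how many vulnerable snippets were flagged"; F1 is their harmonic mean.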
### Results

| Model | Accuracy | F1 | Recall | Precision |
|:------|:--------:|:--:|:------:|:---------:|
| **CodeBERT** | 0.627 | 0.372 | 0.241 | **0.821** |
| **CyBERT** | 0.459 | **0.630** | **1.000** | 0.459 |
| **SecureBERT 2.0** | **0.655** | 0.616 | 0.602 | 0.630 |

Bold values mark the best score in each column.

#### Summary

SecureBERT 2.0 achieves the highest accuracy (0.655) and the most even balance between precision (0.630) and recall (0.602) among the compared models.
CyBERT achieves the highest recall (1.000, detecting all vulnerabilities) and a marginally higher F1 (0.630), but its low precision (0.459) indicates many false positives.
Conversely, CodeBERT exhibits strong precision (0.821) but poor recall (0.241), missing a large portion of true vulnerabilities.
Overall, SecureBERT 2.0 delivers **more consistent and stable performance across all metrics**, reflecting its stronger domain adaptation from cybersecurity-focused pretraining.

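As a quick sanity check on the table, F1 can be recomputed from each model's reported precision and recall. The sketch below also includes a hypothetical baseline that flags every sample, because such a baseline produces recall 1.000 with accuracy equal to precision, the same pattern as the CyBERT row. That reading is an inference from the reported numbers, not something the evaluation states. The helper names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table
reported = {
    "CodeBERT": (0.821, 0.241, 0.372),
    "CyBERT": (0.459, 1.000, 0.630),
    "SecureBERT 2.0": (0.630, 0.602, 0.616),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{name}: reported F1={f1:.3f}, recomputed={recomputed:.3f}")


def always_positive_metrics(prevalence):
    """Metrics of a hypothetical classifier that flags every sample.

    prevalence is the fraction of samples that are truly vulnerable.
    """
    return {
        "accuracy": prevalence,   # only the truly vulnerable samples are correct
        "precision": prevalence,  # every sample is predicted positive
        "recall": 1.0,            # no vulnerable sample is missed
        "f1": f1_score(prevalence, 1.0),
    }
```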

---

## Environmental Impact

- **Hardware Type:** 8× A100 GPU cluster
- **Hours used:** [Information Not Available]
- **Cloud Provider:** [Information Not Available]
- **Compute Region:** [Information Not Available]
- **Carbon Emitted:** [Estimate Not Available]

Carbon footprint can be estimated using the [Machine Learning Impact Calculator](https://mlco2.github.io/impact#compute).

---

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** ModernBERT (SecureBERT 2.0 backbone)
- **Objective:** Binary classification
- **Max sequence length:** 1024 tokens
- **Parameters:** ~150M
- **Tensor type:** F32

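Because the maximum sequence length is 1024 tokens, longer source files are cut off when truncation is enabled. This card does not describe how longer inputs should be handled; one common workaround, sketched below under that caveat, is to split the token IDs into overlapping windows and score each window separately. The function name, the `stride` default, and the windowing scheme are assumptions for illustration.

```python
def sliding_windows(token_ids, max_len=1024, stride=256):
    """Split token_ids into overlapping windows of at most max_len tokens.

    Consecutive windows share `stride` tokens so that code spanning a
    window boundary is seen whole at least once. max_len counts content
    tokens only; any special tokens the tokenizer adds would need to be
    budgeted for by passing a smaller max_len.
    """
    if max_len <= 0 or not 0 <= stride < max_len:
        raise ValueError("require max_len > 0 and 0 <= stride < max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]

    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Stand-in token IDs for a 2500-token file (not real tokenizer output)
windows = sliding_windows(list(range(2500)))
print([len(w) for w in windows])
```

Each window would then be scored on its own and the per-window results combined, for example by flagging the file if any window is flagged. That aggregation rule is likewise an assumption and should be validated on labeled data.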
### Compute Infrastructure

- **Framework:** Transformers (PyTorch)
- **Precision:** fp16 mixed precision
- **Hardware:** 8 GPUs
- **Checkpoint Format:** Safetensors

---

## Citation

**BibTeX:**
```bibtex
@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}
```

---

## Model Card Authors

Cisco AI

## Model Card Contact

For inquiries, please contact [ai-threat-intel@cisco.com](mailto:ai-threat-intel@cisco.com).