---
license: apache-2.0
language:
- en
base_model:
- cisco-ai/SecureBERT2.0-base
pipeline_tag: text-classification
library_name: transformers
---
# Model Card for CiscoAI/SecureBERT2.0-code-vuln-detection
The **ModernBERT Code Vulnerability Detection Model** is a fine-tuned variant of **SecureBERT 2.0**, designed to detect potential vulnerabilities in source code.
It leverages cybersecurity-aware representations learned by SecureBERT 2.0 and applies supervised fine-tuning for binary classification (vulnerable vs. non-vulnerable).
---
## Model Details
### Model Description
This model classifies source code snippets as either **vulnerable** or **non-vulnerable** using the ModernBERT architecture.
It is fine-tuned for **code-level security analysis**, extending the capabilities of SecureBERT 2.0.
- **Developed by:** Cisco AI
- **Model type:** Sequence classification
- **Architecture:** `ModernBertForSequenceClassification`
- **Number of labels:** 2
- **Language:** English (source code tokens)
- **License:** Apache-2.0
- **Finetuned from model:** [cisco-ai/SecureBERT2.0-base](https://huggingface.co/cisco-ai/SecureBERT2.0-base)
### Model Sources
- **Repository:** [https://huggingface.co/cisco-ai/SecureBERT2.0-code-vuln-detection](https://huggingface.co/cisco-ai/SecureBERT2.0-code-vuln-detection)
- **Paper:** [arXiv:2510.00240](https://arxiv.org/abs/2510.00240)
---
## Uses
### Direct Use
- Automatic vulnerability classification for source code snippets
- Static analysis pipeline integration for pre-screening code risks
- Feature extraction for downstream vulnerability detection tasks
### Downstream Use
Can be integrated into:
- Secure code review systems
- CI/CD vulnerability scanners
- Security IDE extensions
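In a CI/CD pipeline, the model typically acts as a pre-screening filter that flags snippets for human review. A minimal sketch of that wiring, where `prescreen` and the label convention (1 = vulnerable) are illustrative assumptions, not part of this release — check the model's `config.id2label` for the actual mapping:

```python
from typing import Callable, Iterable


def prescreen(snippets: Iterable[str],
              classify: Callable[[str], int],
              vulnerable_label: int = 1) -> list[int]:
    """Return indices of snippets flagged as vulnerable.

    `classify` is any callable mapping a code snippet to a class ID,
    e.g. a wrapper around this model's tokenizer + forward pass.
    The positive-label convention here is an assumption.
    """
    return [i for i, code in enumerate(snippets)
            if classify(code) == vulnerable_label]
```

A scanner would then fail the build, or post a review comment, for each flagged index.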
### Out-of-Scope Use
- Non-code or natural language text classification
- Runtime or dynamic vulnerability detection
- Automated patch generation or remediation suggestion
---
## Bias, Risks, and Limitations
- The model may **overfit** to syntactic patterns from training datasets and miss logical vulnerabilities.
- **False negatives** (missed vulnerabilities) or **false positives** (benign code flagged as vulnerable) may occur.
- Training data may not include all programming languages or frameworks.
### Recommendations
Use this model **as an assistive tool**, not as a replacement for expert manual code review.
Cross-validate its findings with other analysis tools before making security-critical decisions.
---
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Path to the model
model_dir = "cisco-ai/SecureBERT2.0-code-vuln-detection"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
# Example input code snippet
example_code = """
static void FUNC_0(WmallDecodeCtx *VAR_0, int VAR_1, int VAR_2, int16_t VAR_3, int16_t VAR_4)
{
    int16_t icoef;
    int VAR_5 = VAR_0->cdlms[VAR_1][VAR_2].VAR_5;
    int16_t range = 1 << (VAR_0->bits_per_sample - 1);
    int VAR_6 = VAR_0->bits_per_sample > 16 ? 4 : 2;
    if (VAR_3 > VAR_4) {
        for (icoef = 0; icoef < VAR_0->cdlms[VAR_1][VAR_2].order; icoef++)
            VAR_0->cdlms[VAR_1][VAR_2].coefs[icoef] +=
                VAR_0->cdlms[VAR_1][VAR_2].lms_updates[icoef + VAR_5];
    } else {
        for (icoef = 0; icoef < VAR_0->cdlms[VAR_1][VAR_2].order; icoef++)
            VAR_0->cdlms[VAR_1][VAR_2].coefs[icoef] -=
                VAR_0->cdlms[VAR_1][VAR_2].lms_updates[icoef];
    }
    VAR_0->cdlms[VAR_1][VAR_2].VAR_5--;
}
"""
# Tokenize and run model
inputs = tokenizer(example_code, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
outputs = model(**inputs)
logits = outputs.logits
predicted_class = torch.argmax(logits, dim=-1).item()
print(f"Predicted class ID: {predicted_class}")
# Map the class ID to its human-readable label via the model config
print(f"Predicted label: {model.config.id2label[predicted_class]}")
```
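The `argmax` above discards how confident the model is; applying a softmax to the logits yields per-class probabilities that can drive a review threshold. A small helper sketch (what to do with the confidence is up to the caller):

```python
import torch


def predict_with_confidence(logits: torch.Tensor) -> tuple[int, float]:
    """Convert sequence-classification logits of shape [1, num_labels]
    into (predicted class ID, softmax probability of that class)."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = torch.max(probs, dim=-1)
    return pred.item(), conf.item()
```

For example, low-confidence predictions could be routed to manual review instead of being auto-flagged.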
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Internal validation split from annotated open-source vulnerability datasets.
#### Factors
Evaluated across:
- Programming language types (C, C++, Python)
- Vulnerability categories (buffer overflow, injection, logic error)
#### Metrics
- Accuracy
- Precision
- Recall
- F1-score
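These metrics follow their standard binary-classification definitions, with the "vulnerable" class treated as positive. A dependency-free sketch for reproducing them from model predictions:

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels,
    treating class 1 (vulnerable) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```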
### Results
| Model | Accuracy | F1 | Recall | Precision |
|:------|:---------:|:---:|:-------:|:-----------:|
| **CodeBERT** | 0.627 | 0.372 | 0.241 | 0.821 |
| **CyBERT** | 0.459 | 0.630 | 1.000 | 0.459 |
| **SecureBERT 2.0** | **0.655** | **0.616** | **0.602** | **0.630** |
#### Summary
SecureBERT 2.0 demonstrates the best **overall balance of accuracy, F1, and precision** among the compared models.
While CyBERT achieves the highest recall (detecting all vulnerabilities), it suffers from low precision, indicating many false positives.
Conversely, CodeBERT exhibits strong precision but poor recall, missing a large portion of true vulnerabilities.
SecureBERT 2.0 achieves **more consistent and stable performance across all metrics**, reflecting its stronger domain adaptation from cybersecurity-focused pretraining.
---
## Environmental Impact
- **Hardware Type:** 8× A100 GPU cluster
- **Hours used:** [Information Not Available]
- **Cloud Provider:** [Information Not Available]
- **Compute Region:** [Information Not Available]
- **Carbon Emitted:** [Estimate Not Available]
Carbon footprint can be estimated using the [Machine Learning Impact Calculator](https://mlco2.github.io/impact#compute).
---
## Technical Specifications
### Model Architecture and Objective
- **Architecture:** ModernBERT (SecureBERT 2.0 backbone)
- **Objective:** Binary classification
- **Max sequence length:** 1024 tokens
- **Parameters:** ~150M
- **Tensor type:** F32
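Inputs beyond the 1024-token limit are silently truncated by `tokenizer(..., truncation=True)`. A common workaround, sketched here as an assumption rather than part of this release, is to classify overlapping windows of token IDs and aggregate (e.g. flag the file if any window is flagged):

```python
def sliding_windows(token_ids: list[int], max_len: int = 1024,
                    stride: int = 512) -> list[list[int]]:
    """Split a token ID sequence into overlapping windows so files longer
    than the model's max sequence length can still be scanned whole.
    Consecutive windows overlap by (max_len - stride) tokens, so code
    spanning a boundary appears intact in at least one window."""
    if len(token_ids) <= max_len:
        return [token_ids]
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return windows
```

Each window is then padded/batched and passed through the model like the single-snippet example above.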
### Compute Infrastructure
- **Framework:** Transformers (PyTorch)
- **Precision:** fp16 mixed precision
- **Hardware:** 8 GPUs
- **Checkpoint Format:** Safetensors
---
## Citation
**BibTeX:**
```bibtex
@article{aghaei2025securebert,
title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
journal={arXiv preprint arXiv:2510.00240},
year={2025}
}
```
---
## Model Card Authors
Cisco AI
## Model Card Contact
For inquiries, please contact [ai-threat-intel@cisco.com](mailto:ai-threat-intel@cisco.com)