Update README.md

README.md CHANGED
@@ -7,35 +7,88 @@ base_model:
pipeline_tag: text-classification
library_name: transformers
---
# ModernBERT Code Vulnerability Detection Model

## Model Details

- Model type: `classification`
- Number of labels: 2
- Architecture: ModernBertForSequenceClassification
- Hugging Face compatible directory: contains model weights, config, and tokenizer files.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Path to
model_dir = "CiscoAITeam/SecureBERT2.0-code-vuln-detection"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)

model.eval()

# Example input code snippet (string)
example_code = """
{
    int16_t icoef;
    int VAR_5 = VAR_0->cdlms[VAR_1][VAR_2].VAR_5;
@@ -51,51 +104,97 @@ example_code = """
            VAR_0->cdlms[VAR_1][VAR_2].lms_updates[icoef];
    }
    VAR_0->cdlms[VAR_1][VAR_2].VAR_5--;
    VAR_0->cdlms[VAR_1][VAR_2].lms_prevvalues[VAR_5] = av_clip(VAR_3, -range, range - 1);
    if (VAR_3 > VAR_4)
        VAR_0->cdlms[VAR_1][VAR_2].lms_updates[VAR_5] = VAR_0->update_speed[VAR_1];
    else if (VAR_3 < VAR_4)
        VAR_0->cdlms[VAR_1][VAR_2].lms_updates[VAR_5] = -VAR_0->update_speed[VAR_1];

    VAR_0->cdlms[VAR_1][VAR_2].lms_updates[VAR_5 + VAR_0->cdlms[VAR_1][VAR_2].order >> 4] >>= 2;
    VAR_0->cdlms[VAR_1][VAR_2].lms_updates[VAR_5 + VAR_0->cdlms[VAR_1][VAR_2].order >> 3] >>= 1;

    if (VAR_0->cdlms[VAR_1][VAR_2].VAR_5 == 0) {
        memcpy(VAR_0->cdlms[VAR_1][VAR_2].lms_prevvalues + VAR_0->cdlms[VAR_1][VAR_2].order,
               VAR_0->cdlms[VAR_1][VAR_2].lms_prevvalues,
               VAR_6 * VAR_0->cdlms[VAR_1][VAR_2].order);
        memcpy(VAR_0->cdlms[VAR_1][VAR_2].lms_updates + VAR_0->cdlms[VAR_1][VAR_2].order,
               VAR_0->cdlms[VAR_1][VAR_2].lms_updates,
               VAR_6 * VAR_0->cdlms[VAR_1][VAR_2].order);
        VAR_0->cdlms[VAR_1][VAR_2].VAR_5 = VAR_0->cdlms[VAR_1][VAR_2].order;
    }
}
"""

# Tokenize
inputs = tokenizer(example_code, return_tensors="pt", truncation=True, padding=True)

# Run model
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Get predicted class
predicted_class = torch.argmax(logits, dim=-1).item()

print(f"Predicted class ID: {predicted_class}")
```

Reference:

```
@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}
```
pipeline_tag: text-classification
library_name: transformers
---

# Model Card for CiscoAITeam/SecureBERT2.0-code-vuln-detection

The **ModernBERT Code Vulnerability Detection Model** is a fine-tuned variant of **SecureBERT 2.0**, designed to detect potential vulnerabilities in source code.
It leverages cybersecurity-aware representations learned by SecureBERT 2.0 and applies supervised fine-tuning for binary classification (vulnerable vs. non-vulnerable).

---

## Model Details

### Model Description

This model classifies source code snippets as either **vulnerable** or **non-vulnerable** using the ModernBERT architecture.
It is fine-tuned for **code-level security analysis**, extending the capabilities of SecureBERT 2.0.

- **Developed by:** Cisco AI Team
- **Model type:** Sequence classification
- **Architecture:** `ModernBertForSequenceClassification`
- **Number of labels:** 2
- **Language:** English (source code tokens)
- **License:** Apache-2.0
- **Finetuned from model:** [CiscoAITeam/SecureBERT2.0-base](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base)

### Model Sources

- **Repository:** [https://huggingface.co/CiscoAITeam/SecureBERT2.0-code-vuln-detection](https://huggingface.co/CiscoAITeam/SecureBERT2.0-code-vuln-detection)
- **Paper:** [arXiv:2510.00240](https://arxiv.org/abs/2510.00240)

---

## Uses

### Direct Use

- Automatic vulnerability classification for source code snippets
- Static analysis pipeline integration for pre-screening code risks
- Feature extraction for downstream vulnerability detection tasks (see the sketch after this list)
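
For the feature-extraction use listed above, one possible recipe is to request the encoder's hidden states and pool them into a fixed-size vector. This is a minimal sketch: the `embed` helper and the mean-pooling choice are illustrative assumptions, not an official API of this model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_dir = "CiscoAITeam/SecureBERT2.0-code-vuln-detection"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

def embed(code: str) -> torch.Tensor:
    """Hypothetical helper: mean-pool the last hidden layer over non-padding tokens."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    last = out.hidden_states[-1]                   # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
    return (last * mask).sum(dim=1) / mask.sum(dim=1)

vector = embed("int main(void) { return 0; }")
print(vector.shape)  # torch.Size([1, hidden_size])
```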

### Downstream Use

Can be integrated into:

- Secure code review systems
- CI/CD vulnerability scanners (see the sketch after this list)
- Security IDE extensions
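
To illustrate the CI/CD-scanner integration, here is a hypothetical pre-screening script. The `*.c` file walk, the 0.5 threshold, the assumption that label index 1 means "vulnerable", and the exit-code convention are all illustrative choices, not part of the model release; long files are simply truncated to the model's maximum input length here, whereas a production scanner would chunk them.

```python
import pathlib
import sys

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "CiscoAITeam/SecureBERT2.0-code-vuln-detection"
THRESHOLD = 0.5  # assumed decision threshold; tune on your own data

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

flagged = []
for path in pathlib.Path(sys.argv[1]).rglob("*.c"):  # scan C sources under the given directory
    inputs = tokenizer(path.read_text(errors="ignore"), return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    score = probs[0, 1].item()  # assumes index 1 is the "vulnerable" class
    if score > THRESHOLD:
        flagged.append((path, score))

for path, score in flagged:
    print(f"{path}: potentially vulnerable (score {score:.2f})")
sys.exit(1 if flagged else 0)  # non-zero exit fails the pipeline step
```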

### Out-of-Scope Use

- Non-code or natural-language text classification
- Runtime or dynamic vulnerability detection
- Automated patch generation or remediation suggestions

---

## Bias, Risks, and Limitations

- The model may **overfit** to syntactic patterns in its training data and miss logical vulnerabilities.
- **False negatives** (missed vulnerabilities) and **false positives** (benign code flagged as vulnerable) may occur.
- The training data may not cover all programming languages or frameworks.

### Recommendations

Use this model **as an assistive tool**, not as a replacement for expert manual code review.
Cross-validate its findings with other tools before making security-critical decisions.

---

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Path to the model
model_dir = "CiscoAITeam/SecureBERT2.0-code-vuln-detection"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()  # inference mode: disables dropout

# Example input code snippet (truncated excerpt; /* ... */ marks elided lines)
example_code = """
static void FUNC_0(WmallDecodeCtx *VAR_0, int VAR_1, int VAR_2, int16_t VAR_3, int16_t VAR_4)
{
    int16_t icoef;
    int VAR_5 = VAR_0->cdlms[VAR_1][VAR_2].VAR_5;
    /* ... */
            VAR_0->cdlms[VAR_1][VAR_2].lms_updates[icoef];
    }
    VAR_0->cdlms[VAR_1][VAR_2].VAR_5--;
}
"""

# Tokenize and run the model
inputs = tokenizer(example_code, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Get the predicted class ID (argmax over the two logits)
predicted_class = torch.argmax(logits, dim=-1).item()

print(f"Predicted class ID: {predicted_class}")
```
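
The script above prints only a numeric class ID. `transformers` checkpoints usually carry an `id2label` mapping in `config.json` (defaulting to generic `LABEL_0`/`LABEL_1` names when none is defined), and the exact naming for this model is not documented here, so verify it before relying on it. Continuing from the variables in the block above:

```python
# Reuses `model`, `logits`, and `predicted_class` from the quick-start block.
probs = torch.softmax(logits, dim=-1)[0]
# id2label may be generic (LABEL_0/LABEL_1); check the model's config.json
# to confirm which index corresponds to "vulnerable".
label = model.config.id2label[predicted_class]
print(f"Prediction: {label} (confidence {probs[predicted_class].item():.2%})")
```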

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

An internal validation split from annotated open-source vulnerability datasets.

#### Factors

Performance was evaluated across:

- Programming languages (C, C++, Python)
- Vulnerability categories (buffer overflow, injection, logic error)

#### Metrics

- Accuracy
- Precision
- Recall
- F1-score
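
For reference, metrics like these can be computed with scikit-learn along the following lines. This is a sketch: `eval_samples` is a hypothetical stand-in for the unpublished validation split (the two toy entries and their labels are placeholders), and it reuses the `tokenizer` and `model` loaded in the quick-start example.

```python
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical placeholder for a labeled evaluation set:
# (code_string, label) pairs with label 1 = vulnerable.
eval_samples = [
    ('strcpy(buf, user_input);', 1),
    ('size_t n = strnlen(s, sizeof(buf) - 1);', 0),
]

preds, labels = [], []
for code, label in eval_samples:
    inputs = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    preds.append(int(torch.argmax(logits, dim=-1)))
    labels.append(label)

accuracy = accuracy_score(labels, preds)
precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```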

### Results

| Model | Accuracy | F1 | Recall | Precision |
|:------|:--------:|:--:|:------:|:---------:|
| **CodeBERT** | 0.627 | 0.372 | 0.241 | 0.821 |
| **CyBERT** | 0.459 | 0.630 | 1.000 | 0.459 |
| **SecureBERT 2.0** | **0.655** | **0.616** | **0.602** | **0.630** |

#### Summary

SecureBERT 2.0 shows the best **overall balance of accuracy, F1, and precision** among the compared models.
CyBERT achieves the highest recall (it flags every vulnerability in the test set) but suffers from low precision, producing many false positives.
Conversely, CodeBERT offers strong precision but poor recall, missing a large share of true vulnerabilities.
SecureBERT 2.0 delivers **more consistent performance across all metrics**, reflecting stronger domain adaptation from its cybersecurity-focused pretraining.

---

## Environmental Impact

- **Hardware Type:** 8× A100 GPU cluster
- **Hours Used:** [Information Not Available]
- **Cloud Provider:** [Information Not Available]
- **Compute Region:** [Information Not Available]
- **Carbon Emitted:** [Estimate Not Available]

Carbon footprint can be estimated with the [Machine Learning Impact Calculator](https://mlco2.github.io/impact#compute).

---

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** ModernBERT (SecureBERT 2.0 backbone)
- **Objective:** Binary classification
- **Max sequence length:** 1024 tokens
- **Parameters:** ~150M
- **Tensor type:** F32
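
These figures can be sanity-checked against the published checkpoint. A quick sketch, assuming the standard `transformers` config fields (confirm against the model's `config.json`):

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

model_dir = "CiscoAITeam/SecureBERT2.0-code-vuln-detection"

config = AutoConfig.from_pretrained(model_dir)
print(config.num_labels)               # expected: 2
print(config.max_position_embeddings)  # expected: at least 1024

model = AutoModelForSequenceClassification.from_pretrained(model_dir)
print(f"~{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```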

### Compute Infrastructure

- **Framework:** Transformers (PyTorch)
- **Precision:** fp16 mixed precision
- **Hardware:** 8 GPUs
- **Checkpoint Format:** Safetensors

---

## Citation

**BibTeX:**

```bibtex
@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}
```