cisco-ehsan committed on
Commit
ea797ed
·
verified ·
1 Parent(s): 49a33f6

Update README.md

Files changed (1)
  1. README.md +144 -45
README.md CHANGED
@@ -7,35 +7,88 @@ base_model:
7
  pipeline_tag: text-classification
8
  library_name: transformers
9
  ---
10
- # ModernBERT Code Vulnerability Detection Model
11
 
12
- This is a ModernBERT model fine-tuned for **Code Vulnerability Detection**.
13
- It is built on top of **SecureBERT 2.0**.
 
 
 
 
14
 
15
  ## Model Details
16
- - Model type: `classification`
17
- - Number of labels: 2
18
- - Architecture: ModernBertForSequenceClassification
19
- - Hugging Face compatible directory: contains model weights, config, and tokenizer files.
20
 
21
- ## Usage Example
22
  ```python
23
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
24
  import torch
25
 
26
- # Path to your converted Hugging Face model folder
27
  model_dir = "CiscoAITeam/SecureBERT2.0-code-vuln-detection"
28
 
29
  # Load tokenizer and model
30
  tokenizer = AutoTokenizer.from_pretrained(model_dir)
31
  model = AutoModelForSequenceClassification.from_pretrained(model_dir)
32
 
33
- # Put model in evaluation mode
34
- model.eval()
35
-
36
- # Example input code snippet (string)
37
  example_code = """
38
- static void FUNC_0(WmallDecodeCtx *VAR_0, int VAR_1, int VAR_2, int16_t VAR_3, int16_t VAR_4)
39
  {
40
  int16_t icoef;
41
  int VAR_5 = VAR_0->cdlms[VAR_1][VAR_2].VAR_5;
@@ -51,51 +104,97 @@ example_code = """
51
  VAR_0->cdlms[VAR_1][VAR_2].lms_updates[icoef];
52
  }
53
  VAR_0->cdlms[VAR_1][VAR_2].VAR_5--;
54
- VAR_0->cdlms[VAR_1][VAR_2].lms_prevvalues[VAR_5] = av_clip(VAR_3, -range, range - 1);
55
- if (VAR_3 > VAR_4)
56
- VAR_0->cdlms[VAR_1][VAR_2].lms_updates[VAR_5] = VAR_0->update_speed[VAR_1];
57
- else if (VAR_3 < VAR_4)
58
- VAR_0->cdlms[VAR_1][VAR_2].lms_updates[VAR_5] = -VAR_0->update_speed[VAR_1];
59
-
60
- VAR_0->cdlms[VAR_1][VAR_2].lms_updates[VAR_5 + VAR_0->cdlms[VAR_1][VAR_2].order >> 4] >>= 2;
61
- VAR_0->cdlms[VAR_1][VAR_2].lms_updates[VAR_5 + VAR_0->cdlms[VAR_1][VAR_2].order >> 3] >>= 1;
62
-
63
- if (VAR_0->cdlms[VAR_1][VAR_2].VAR_5 == 0) {
64
-
65
- memcpy(VAR_0->cdlms[VAR_1][VAR_2].lms_prevvalues + VAR_0->cdlms[VAR_1][VAR_2].order,
66
- VAR_0->cdlms[VAR_1][VAR_2].lms_prevvalues,
67
- VAR_6 * VAR_0->cdlms[VAR_1][VAR_2].order);
68
- memcpy(VAR_0->cdlms[VAR_1][VAR_2].lms_updates + VAR_0->cdlms[VAR_1][VAR_2].order,
69
- VAR_0->cdlms[VAR_1][VAR_2].lms_updates,
70
- VAR_6 * VAR_0->cdlms[VAR_1][VAR_2].order);
71
- VAR_0->cdlms[VAR_1][VAR_2].VAR_5 = VAR_0->cdlms[VAR_1][VAR_2].order;
72
- }
73
  }
74
-
75
-
76
-
77
  """
78
 
79
- # Tokenize input
80
  inputs = tokenizer(example_code, return_tensors="pt", truncation=True, padding=True)
81
-
82
- # Run model
83
  with torch.no_grad():
84
  outputs = model(**inputs)
85
  logits = outputs.logits
86
-
87
- # Get predicted class
88
- predicted_class = torch.argmax(logits, dim=-1).item()
89
 
90
  print(f"Predicted class ID: {predicted_class}")
91
  ```
92
- Reference:
93
 
94
- ```
95
  @article{aghaei2025securebert,
96
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
97
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
98
  journal={arXiv preprint arXiv:2510.00240},
99
  year={2025}
100
  }
101
- ```
 
 
7
  pipeline_tag: text-classification
8
  library_name: transformers
9
  ---
 
10
 
11
+ # Model Card for CiscoAITeam/SecureBERT2.0-code-vuln-detection
12
+
13
+ The **ModernBERT Code Vulnerability Detection Model** is a fine-tuned variant of **SecureBERT 2.0**, designed to detect potential vulnerabilities in source code.
14
+ It leverages cybersecurity-aware representations learned by SecureBERT 2.0 and applies supervised fine-tuning for binary classification (vulnerable vs. non-vulnerable).
15
+
16
+ ---
17
 
18
  ## Model Details
 
 
 
 
19
 
20
+ ### Model Description
21
+
22
+ This model classifies source code snippets as either **vulnerable** or **non-vulnerable** using the ModernBERT architecture.
23
+ It is fine-tuned for **code-level security analysis**, extending the capabilities of SecureBERT 2.0.
24
+
25
+ - **Developed by:** Cisco AI Team
26
+ - **Model type:** Sequence classification
27
+ - **Architecture:** `ModernBertForSequenceClassification`
28
+ - **Number of labels:** 2
29
+ - **Language:** English (source code tokens)
30
+ - **License:** Apache-2.0
31
+ - **Finetuned from model:** [CiscoAITeam/SecureBERT2.0-base](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base)
32
+
33
+ ### Model Sources
34
+
35
+ - **Repository:** [https://huggingface.co/CiscoAITeam/SecureBERT2.0-code-vuln-detection](https://huggingface.co/CiscoAITeam/SecureBERT2.0-code-vuln-detection)
36
+ - **Paper:** [arXiv:2510.00240](https://arxiv.org/abs/2510.00240)
37
+
38
+ ---
39
+
40
+ ## Uses
41
+
42
+ ### Direct Use
43
+
44
+ - Automatic vulnerability classification for source code snippets
45
+ - Static analysis pipeline integration for pre-screening code risks
46
+ - Feature extraction for downstream vulnerability detection tasks
47
+
48
+ ### Downstream Use
49
+
50
+ Can be integrated into:
51
+ - Secure code review systems
52
+ - CI/CD vulnerability scanners
53
+ - Security IDE extensions
54
+
55
+ ### Out-of-Scope Use
56
+
57
+ - Non-code or natural language text classification
58
+ - Runtime or dynamic vulnerability detection
59
+ - Automated patch generation or remediation suggestion
60
+
61
+ ---
62
+
63
+ ## Bias, Risks, and Limitations
64
+
65
+ - The model may **overfit** to syntactic patterns from training datasets and miss logical vulnerabilities.
66
+ - **False negatives** (missed vulnerabilities) or **false positives** (benign code flagged as vulnerable) may occur.
67
+ - Training data may not include all programming languages or frameworks.
68
+
69
+ ### Recommendations
70
+
71
+ Users should use this model **as an assistive tool**, not as a replacement for expert manual code review.
72
+ Cross-check its predictions against other analysis tools before making security-critical decisions.
73
+
74
+ ---
75
+
76
+ ## How to Get Started with the Model
77
+
78
  ```python
79
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
80
  import torch
81
 
82
+ # Hugging Face Hub model ID
83
  model_dir = "CiscoAITeam/SecureBERT2.0-code-vuln-detection"
84
 
85
  # Load tokenizer and model
86
  tokenizer = AutoTokenizer.from_pretrained(model_dir)
87
  model = AutoModelForSequenceClassification.from_pretrained(model_dir)
88
 
89
+ # Example input code snippet
 
 
 
90
  example_code = """
91
+ static void FUNC_0(WmallDecodeCtx *VAR_0, int VAR_1, int VAR_2, int16_t VAR_3, int16_t VAR_4)
92
  {
93
  int16_t icoef;
94
  int VAR_5 = VAR_0->cdlms[VAR_1][VAR_2].VAR_5;
 
104
  VAR_0->cdlms[VAR_1][VAR_2].lms_updates[icoef];
105
  }
106
  VAR_0->cdlms[VAR_1][VAR_2].VAR_5--;
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
107
  }
 
 
 
108
  """
109
 
110
+ # Tokenize and run model
111
  inputs = tokenizer(example_code, return_tensors="pt", truncation=True, padding=True)
 
 
112
  with torch.no_grad():
113
  outputs = model(**inputs)
114
  logits = outputs.logits
115
+ predicted_class = torch.argmax(logits, dim=-1).item()
 
 
116
 
117
  print(f"Predicted class ID: {predicted_class}")
118
  ```
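
The snippet above prints only a class ID. As a minimal sketch of how the two logits could be turned into a labeled probability, the following uses only the standard library so it runs without downloading the model. The label names and their order in `ASSUMED_LABELS` are assumptions: the card does not state which ID means vulnerable, so check `model.config.id2label` on the loaded model before relying on it.

```python
import math

# ASSUMPTION: the ID-to-label order is not stated in the model card.
ASSUMED_LABELS = {0: "non-vulnerable", 1: "vulnerable"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret(logits, labels=ASSUMED_LABELS):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits for illustration; in practice pass outputs.logits[0].tolist().
label, prob = interpret([-0.4, 1.1])
print(f"{label} (p={prob:.3f})")
```

Reporting the probability alongside the class ID makes it possible to apply a decision threshold other than 0.5 when false positives and false negatives carry different costs.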
 
119
 
120
+ ## Evaluation
121
+
122
+ ### Testing Data, Factors & Metrics
123
+
124
+ #### Testing Data
125
+
126
+ Internal validation split from annotated open-source vulnerability datasets.
127
+
128
+ #### Factors
129
+
130
+ Evaluated across:
131
+ - Programming language types (C, C++, Python)
132
+ - Vulnerability categories (buffer overflow, injection, logic error)
133
+
134
+ #### Metrics
135
+
136
+ - Accuracy
137
+ - Precision
138
+ - Recall
139
+ - F1-score
140
+
141
+ ### Results
142
+
143
+ | Model | Accuracy | F1 | Recall | Precision |
144
+ |:------|:---------:|:---:|:-------:|:-----------:|
145
+ | **CodeBERT** | 0.627 | 0.372 | 0.241 | 0.821 |
146
+ | **CyBERT** | 0.459 | 0.630 | 1.000 | 0.459 |
147
+ | **SecureBERT 2.0** | **0.655** | **0.616** | **0.602** | **0.630** |
148
+
149
+ #### Summary
150
+
151
+ SecureBERT 2.0 achieves the highest accuracy (0.655) and the closest balance between precision (0.630) and recall (0.602) among the compared models; its F1 (0.616) is slightly below CyBERT's (0.630).
152
+ CyBERT reaches perfect recall (1.000), but its precision (0.459) equals its accuracy, which is consistent with flagging nearly every sample as vulnerable and therefore producing many false positives.
153
+ Conversely, CodeBERT exhibits strong precision but poor recall, missing a large portion of true vulnerabilities.
154
+ SecureBERT 2.0 achieves **more consistent and stable performance across all metrics**, reflecting its stronger domain adaptation from cybersecurity-focused pretraining.
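
As a quick consistency check on the results table, F1 can be recomputed from the reported precision and recall as their harmonic mean. The sketch below copies the table values verbatim; the `reported` dictionary and the `f1_score` helper are illustrative names, not part of the model's code.

```python
# Precision, recall, and F1 as reported in the results table.
reported = {
    "CodeBERT": {"precision": 0.821, "recall": 0.241, "f1": 0.372},
    "CyBERT": {"precision": 0.459, "recall": 1.000, "f1": 0.630},
    "SecureBERT 2.0": {"precision": 0.630, "recall": 0.602, "f1": 0.616},
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for name, row in reported.items():
    recomputed = f1_score(row["precision"], row["recall"])
    print(f"{name}: reported F1={row['f1']:.3f}, recomputed F1={recomputed:.3f}")
```

The recomputed values agree with the table to within rounding of the three-decimal inputs.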
155
+
156
+ ---
157
+
158
+ ## Environmental Impact
159
+
160
+ - **Hardware Type:** 8× A100 GPU cluster
161
+ - **Hours used:** [Information Not Available]
162
+ - **Cloud Provider:** [Information Not Available]
163
+ - **Compute Region:** [Information Not Available]
164
+ - **Carbon Emitted:** [Estimate Not Available]
165
+
166
+ Carbon footprint can be estimated using the [Machine Learning Impact Calculator](https://mlco2.github.io/impact#compute).
167
+
168
+ ---
169
+
170
+ ## Technical Specifications
171
+
172
+ ### Model Architecture and Objective
173
+
174
+ - **Architecture:** ModernBERT (SecureBERT 2.0 backbone)
175
+ - **Objective:** Binary classification
176
+ - **Max sequence length:** 1024 tokens
177
+ - **Parameters:** ~150M
178
+ - **Tensor type:** F32
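
Because inputs longer than the 1024-token limit are cut off when `truncation=True` is used, long functions may need to be scored in pieces. The following is a minimal sketch of one way to split a token-ID list into overlapping windows. The `sliding_windows` helper, the 256-token overlap, and the idea of scoring each window separately are assumptions for illustration, not the authors' method.

```python
def sliding_windows(token_ids, max_len=1024, overlap=256):
    """Split a token-ID list into overlapping windows of at most max_len.

    ASSUMPTION: special tokens count toward the model's limit, so callers
    may want to pass a slightly smaller max_len to leave room for them.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window reached the end of the input
        start += step
    return windows


# Placeholder IDs standing in for a tokenized 2500-token function.
ids = list(range(2500))
windows = sliding_windows(ids)
print(len(windows), [len(w) for w in windows])
```

How per-window predictions are combined (for example, flagging the function if any window is flagged) is a design choice the card does not specify.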
179
+
180
+ ### Compute Infrastructure
181
+
182
+ - **Framework:** Transformers (PyTorch)
183
+ - **Precision:** fp16 mixed precision
184
+ - **Hardware:** 8 GPUs
185
+ - **Checkpoint Format:** Safetensors
186
+
187
+ ---
188
+
189
+ ## Citation
190
+
191
+ **BibTeX:**
192
+ ```bibtex
193
  @article{aghaei2025securebert,
194
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
195
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
196
  journal={arXiv preprint arXiv:2510.00240},
197
  year={2025}
198
  }
199
+ ```
200
+