mulliken committed on
Commit
4623741
·
verified ·
1 Parent(s): 3d09990

Upload README.md with huggingface_hub

  1. README.md +185 -91
README.md CHANGED
@@ -4,6 +4,8 @@ tags:
4
  - security
5
  - cyber-security
6
  - CWE
 
 
7
  license: apache-2.0
8
  datasets:
9
  - zefang-liu/cve-and-cwe-mapping-dataset
@@ -11,44 +13,52 @@ language:
11
  - en
12
  metrics:
13
  - accuracy
 
14
  base_model:
15
  - distilbert/distilbert-base-uncased
16
  pipeline_tag: text-classification
 
 
 
 
 
 
 
 
 
 
 
 
 
17
  ---
18
 
19
- # Model Card for Model ID
20
-
21
- <!-- Provide a quick summary of what the model is/does. -->
22
- This model is designed to allow you to predict a single CWE given your description of a vulnerability.
23
 
 
24
 
25
  ## Model Details
26
 
27
  ### Model Description
28
 
29
- <!-- Provide a longer summary of what this model is. -->
30
- The model takes in text and predicts a single CWE. On it's last training run it saw the following:
31
-
32
- - Training Loss: 1.158700
33
- - Validation Loss: 1.199677
34
- - Accuracy: 0.71136
35
- - F1: 0.229855
36
 
37
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
 
 
 
 
 
38
 
39
  - **Developed by:** [mulliken](https://huggingface.co/mulliken)
40
- - **Model type:** BERT
41
  - **Language(s) (NLP):** English
42
  - **License:** Apache 2.0
43
  - **Finetuned from model:** [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
44
 
45
- ### Model Sources [optional]
46
 
47
- <!-- Provide the basic links for the model. -->
48
-
49
- - **Repository:** [More Information Needed]
50
- - **Paper [optional]:** [More Information Needed]
51
- - **Demo [optional]:** [More Information Needed]
52
 
53
  ## Uses
54
 
@@ -56,66 +66,140 @@ This is the model card of a 🤗 transformers model that has been pushed on the
56
 
57
  ### Direct Use
58
 
59
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
60
-
61
- [More Information Needed]
 
 
62
 
63
- ### Downstream Use [optional]
64
 
65
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
66
-
67
- [More Information Needed]
 
 
68
 
69
  ### Out-of-Scope Use
70
 
71
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
72
-
73
- [More Information Needed]
 
 
74
 
75
  ## Bias, Risks, and Limitations
76
 
77
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
 
 
 
78
 
79
- [More Information Needed]
 
 
 
80
 
81
  ### Recommendations
82
 
83
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
84
-
85
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 
86
 
87
  ## How to Get Started with the Model
88
 
89
- Use the code below to get started with the model.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
90
 
91
- [More Information Needed]
 
 
 
 
 
 
 
 
 
 
 
92
 
93
  ## Training Details
94
 
95
  ### Training Data
96
 
97
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
98
-
99
- [More Information Needed]
 
 
 
 
100
 
101
  ### Training Procedure
102
 
103
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
104
 
105
- #### Preprocessing [optional]
106
 
107
- [More Information Needed]
 
 
 
108
 
 
 
 
109
 
110
  #### Training Hyperparameters
111
 
112
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
 
 
 
 
 
 
113
 
114
- #### Speeds, Sizes, Times [optional]
115
 
116
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
117
-
118
- [More Information Needed]
 
 
 
119
 
120
  ## Evaluation
121
 
@@ -125,92 +209,102 @@ Use the code below to get started with the model.
125
 
126
  #### Testing Data
127
 
128
- <!-- This should link to a Dataset Card if possible. -->
129
-
130
- [More Information Needed]
131
-
132
- #### Factors
133
-
134
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
135
-
136
- [More Information Needed]
137
 
138
  #### Metrics
139
 
140
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
141
-
142
- [More Information Needed]
143
 
144
  ### Results
145
 
146
- [More Information Needed]
 
 
 
 
 
 
 
147
 
148
  #### Summary
149
 
 
150
 
151
 
152
- ## Model Examination [optional]
153
 
154
- <!-- Relevant interpretability work for the model goes here -->
155
 
156
- [More Information Needed]
 
 
 
157
 
158
  ## Environmental Impact
159
 
160
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
161
-
162
  Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
163
 
164
- - **Hardware Type:** [More Information Needed]
165
- - **Hours used:** [More Information Needed]
166
- - **Cloud Provider:** [More Information Needed]
167
- - **Compute Region:** [More Information Needed]
168
- - **Carbon Emitted:** [More Information Needed]
169
 
170
  ## Technical Specifications [optional]
171
 
172
  ### Model Architecture and Objective
173
 
174
- [More Information Needed]
 
 
 
 
175
 
176
  ### Compute Infrastructure
177
 
178
- [More Information Needed]
179
 
180
  #### Hardware
181
 
182
- [More Information Needed]
 
183
 
184
  #### Software
185
 
186
- [More Information Needed]
187
-
188
- ## Citation [optional]
189
-
190
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
191
-
192
- **BibTeX:**
193
-
194
- [More Information Needed]
195
 
196
- **APA:**
197
 
198
- [More Information Needed]
199
 
200
- ## Glossary [optional]
 
 
 
 
 
 
 
 
201
 
202
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
203
 
204
- [More Information Needed]
 
 
 
 
205
 
206
- ## More Information [optional]
207
 
208
- [More Information Needed]
209
 
210
- ## Model Card Authors [optional]
211
 
212
- [More Information Needed]
213
 
214
  ## Model Card Contact
215
 
216
- [More Information Needed]
 
4
  - security
5
  - cyber-security
6
  - CWE
7
+ - vulnerability-classification
8
+ - cve
9
  license: apache-2.0
10
  datasets:
11
  - zefang-liu/cve-and-cwe-mapping-dataset
 
13
  - en
14
  metrics:
15
  - accuracy
16
+ - f1
17
  base_model:
18
  - distilbert/distilbert-base-uncased
19
  pipeline_tag: text-classification
20
+ model-index:
21
+ - name: cwe-predictor
22
+   results:
23
+   - task:
24
+       type: text-classification
25
+       name: CWE Classification
26
+     metrics:
27
+     - type: accuracy
28
+       value: 0.727207
29
+       name: Validation Accuracy
30
+     - type: f1
31
+       value: 0.251264
32
+       name: Macro F1 Score
33
  ---
34
 
35
+ # CWE Predictor - Vulnerability Classification Model
 
 
 
36
 
37
+ This model classifies vulnerability descriptions into Common Weakness Enumeration (CWE) categories. It's designed to help security professionals and developers quickly identify the type of vulnerability based on textual descriptions.
38
 
39
  ## Model Details
40
 
41
  ### Model Description
42
 
43
+ This is a fine-tuned DistilBERT model that predicts CWE (Common Weakness Enumeration) categories from vulnerability descriptions. The model was trained on a comprehensive dataset of CVE descriptions mapped to their corresponding CWE identifiers.
 
 
 
 
 
 
44
 
45
+ **Key Features:**
46
+ - Classifies vulnerabilities into 232 distinct CWE categories
47
+ - Trained on 111,640 vulnerability descriptions
48
+ - Achieves 72.72% accuracy on validation set
49
+ - Macro F1 score of 0.251, well below accuracy, which indicates uneven performance across classes (rare CWE categories are predicted less reliably)
50
+ - Lightweight and fast inference using DistilBERT architecture
51
 
52
  - **Developed by:** [mulliken](https://huggingface.co/mulliken)
53
+ - **Model type:** DistilBERT (Transformer-based classifier)
54
  - **Language(s) (NLP):** English
55
  - **License:** Apache 2.0
56
  - **Finetuned from model:** [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
57
 
58
+ ### Model Sources
59
 
60
+ - **Hugging Face Model:** [mulliken/cwe-predictor](https://huggingface.co/mulliken/cwe-predictor)
61
+ - **Dataset:** [CVE and CWE Mapping Dataset](https://huggingface.co/datasets/zefang-liu/cve-and-cwe-mapping-dataset)
 
 
 
62
 
63
  ## Uses
64
 
 
66
 
67
  ### Direct Use
68
 
69
+ This model can be used directly for:
70
+ - **Vulnerability Triage:** Automatically classify security vulnerabilities reported in bug bounty programs or security audits
71
+ - **Security Analysis:** Categorize CVE descriptions to understand vulnerability patterns
72
+ - **Automated Security Reporting:** Generate CWE classifications for vulnerability reports
73
+ - **Security Research:** Analyze trends in vulnerability types across codebases
74
 
75
+ ### Downstream Use
76
 
77
+ The model can be integrated into:
78
+ - Security scanning tools and SAST/DAST platforms
79
+ - Vulnerability management systems
80
+ - Security information and event management (SIEM) systems
81
+ - DevSecOps pipelines for automated vulnerability classification
82
 
83
  ### Out-of-Scope Use
84
 
85
+ This model should NOT be used for:
86
+ - Medical or safety-critical systems without additional validation
87
+ - Security assessment on its own (predictions should complement human expertise)
88
+ - Classifying non-English vulnerability descriptions
89
+ - Real-time security detection (model is designed for post-discovery classification)
90
 
91
  ## Bias, Risks, and Limitations
92
 
93
+ ### Known Limitations
94
+ - **Class Imbalance:** Some CWE categories are underrepresented in the training data, which may lead to lower accuracy for rare vulnerability types
95
+ - **Temporal Bias:** Model trained on historical CVE data may not recognize newer vulnerability patterns
96
+ - **Language Limitation:** Only trained on English descriptions
97
+ - **Context Loss:** Input is limited to 512 tokens; longer descriptions are truncated
98
 
99
+ ### Risks
100
+ - Misclassifications could cause a vulnerability to be triaged under the wrong weakness type
101
+ - Should not replace human security expertise
102
+ - May not generalize well to proprietary or domain-specific vulnerability descriptions
103
 
104
  ### Recommendations
105
 
106
+ - Always use this model as a supplementary tool alongside human security expertise
107
+ - Validate predictions for critical security decisions
108
+ - Consider retraining or fine-tuning for domain-specific applications
109
+ - Monitor model performance over time as new vulnerability types emerge
110
 
111
  ## How to Get Started with the Model
112
 
113
+ ### Installation
114
+
115
+ ```bash
116
+ pip install transformers torch
117
+ ```
118
+
119
+ ### Quick Start
120
+
121
+ ```python
122
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
123
+ import torch
124
+
125
+ # Load model and tokenizer
126
+ model = AutoModelForSequenceClassification.from_pretrained("mulliken/cwe-predictor")
127
+ tokenizer = AutoTokenizer.from_pretrained("mulliken/cwe-predictor")
128
+
129
+ # Prediction function
130
+ def predict_cwe(text: str) -> str:
131
+     encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
132
+     with torch.no_grad():
133
+         logits = model(**encoded).logits
134
+     pred_id = torch.argmax(logits, dim=-1).item()
135
+     return model.config.id2label[pred_id]
136
+
137
+ # Example usage
138
+ vuln_description = "Buffer overflow in the authentication module allows remote attackers to execute arbitrary code."
139
+ cwe_prediction = predict_cwe(vuln_description)
140
+ print(f"Predicted CWE: {cwe_prediction}")
141
+ ```
142
+
143
+ ### Example Predictions
144
 
145
+ ```python
146
+ examples = [
147
+     "SQL injection vulnerability in login form allows attackers to bypass authentication",
148
+     "Cross-site scripting (XSS) vulnerability in comment section",
149
+     "Path traversal vulnerability allows reading arbitrary files",
150
+     "Integer overflow in image processing library causes memory corruption"
151
+ ]
152
+
153
+ for desc in examples:
154
+     print(f"Description: {desc}")
155
+     print(f"Predicted CWE: {predict_cwe(desc)}\n")
156
+ ```
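
With roughly 73% accuracy over 232 classes, it can help to inspect more than the single top prediction. The sketch below is an illustration rather than part of the original card: `top_k_cwes` is a hypothetical helper that applies a softmax to one row of logits and returns the `k` most probable labels. It is plain Python so it reads without the model loaded; with the Quick Start objects you would pass `model(**encoded).logits[0].tolist()` and `model.config.id2label`. The logits and labels in the demo are made up.

```python
import math
from typing import Dict, List, Tuple


def top_k_cwes(
    logits: List[float], id2label: Dict[int, str], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k most probable labels with their softmax probabilities."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy logits and labels, invented for illustration (not real model output).
toy_labels = {0: "CWE-79", 1: "CWE-89", 2: "CWE-22"}
for label, prob in top_k_cwes([0.5, 2.0, -1.0], toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

Showing the runner-up labels gives a reviewer a short list to check when the top prediction looks wrong.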
157
 
158
  ## Training Details
159
 
160
  ### Training Data
161
 
162
+ The model was trained on the [CVE and CWE Mapping Dataset](https://huggingface.co/datasets/zefang-liu/cve-and-cwe-mapping-dataset), which contains:
163
+ - CVE descriptions from the National Vulnerability Database (NVD)
164
+ - Corresponding CWE classifications
165
+ - Dataset size: 124,045 examples after filtering
166
+ - Training set: 111,640 examples
167
+ - Validation set: 12,405 examples
168
+ - Number of CWE classes: 232 (after removing generic categories like "NVD-CWE-Other" and "NVD-CWE-noinfo")
169
 
170
  ### Training Procedure
171
 
172
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
173
 
174
+ #### Preprocessing
175
 
176
+ 1. **Data Cleaning:**
177
+    - Removed entries with missing descriptions or CWE IDs
178
+    - Filtered out generic CWE categories ("NVD-CWE-Other", "NVD-CWE-noinfo")
179
+    - Removed CWE categories with only one example, since a stratified split needs at least two examples per class
180
 
181
+ 2. **Tokenization:**
182
+    - Used the DistilBERT tokenizer with max_length=512
183
+    - Applied truncation for longer descriptions
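
The cleaning steps above can be sketched as follows. This is a minimal illustration, not the author's actual script: the record fields `description` and `cwe` and the helper name `clean_records` are assumptions, and the toy rows are invented.

```python
from collections import Counter
from typing import Dict, List, Optional

# Generic NVD placeholder labels named in the preprocessing steps.
GENERIC_LABELS = {"NVD-CWE-Other", "NVD-CWE-noinfo"}


def clean_records(records: List[Dict[str, Optional[str]]]) -> List[Dict[str, Optional[str]]]:
    """Apply the three data-cleaning filters to a list of records."""
    # 1. Drop rows with a missing description or CWE ID, and
    # 2. drop rows labelled with a generic NVD placeholder.
    kept = [
        r for r in records
        if r.get("description") and r.get("cwe") and r["cwe"] not in GENERIC_LABELS
    ]
    # 3. Drop classes with a single example so every class can appear
    #    on both sides of a stratified split.
    counts = Counter(r["cwe"] for r in kept)
    return [r for r in kept if counts[r["cwe"]] > 1]


# Toy rows (hypothetical field names and values).
rows = [
    {"description": "XSS in comment form", "cwe": "CWE-79"},
    {"description": "Stored XSS in profile page", "cwe": "CWE-79"},
    {"description": "Unspecified issue", "cwe": "NVD-CWE-noinfo"},
    {"description": None, "cwe": "CWE-89"},
    {"description": "Use after free in parser", "cwe": "CWE-416"},
]
print([r["cwe"] for r in clean_records(rows)])
```

A 90/10 stratified split of the cleaned rows could then be produced with scikit-learn's `train_test_split(..., test_size=0.1, stratify=labels)`; whether the original run used that function is not stated here.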
184
 
185
  #### Training Hyperparameters
186
 
187
+ - **Learning rate:** 2e-5
188
+ - **Batch size:** 2 per device with gradient accumulation of 8 (effective batch size: 16)
189
+ - **Number of epochs:** 1
190
+ - **Weight decay:** 0.01
191
+ - **Optimizer:** AdamW
192
+ - **Training regime:** fp32 with gradient checkpointing
193
+ - **Evaluation strategy:** Every 1000 steps
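
For readers who want to reproduce the setup, the listed values can be collected under the parameter names used by Hugging Face `TrainingArguments`. This is a sketch under assumptions: the dictionary name `training_config` and the helper `effective_batch_size` are illustrative, a single device is assumed, and the original training script is not shown in this card.

```python
# Hyperparameters from the list above, keyed by TrainingArguments parameter names.
training_config = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 1,
    "weight_decay": 0.01,
    "gradient_checkpointing": True,
    "eval_steps": 1000,
}


def effective_batch_size(config: dict, num_devices: int = 1) -> int:
    """Examples consumed per optimizer step (single device assumed by default)."""
    return (
        config["per_device_train_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_config))  # 16
```

These keys could be passed on as `TrainingArguments(output_dir=..., **training_config)` together with a step-based evaluation strategy; the name of that last argument differs between Transformers versions, so it is left out of the dictionary.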
194
 
195
+ #### Training Performance
196
 
197
+ - **Total training time:** ~78 minutes (4,712 seconds) for the single training epoch
198
+ - **Training steps:** 13,956
199
+ - **Training samples per second:** 23.691
200
+ - **Final training loss:** 1.134700
201
+ - **Best validation loss:** 1.082806 (at step 6000)
202
+ - **Model size:** ~268MB
203
 
204
  ## Evaluation
205
 
 
209
 
210
  #### Testing Data
211
 
212
+ Validation set of 12,405 examples (a 10% stratified split of the 124,045-example filtered dataset)
 
 
 
 
 
 
 
 
213
 
214
  #### Metrics
215
 
216
+ - **Accuracy:** Overall correctness of predictions
217
+ - **Macro F1 Score:** Unweighted mean of the per-class F1 scores (gives every CWE type equal weight, so weak performance on rare classes is visible)
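
To see how accuracy and macro F1 can diverge under class imbalance, here is a small self-contained sketch. The helper names and the toy labels are illustrative assumptions; the reported metrics were presumably computed with a library such as scikit-learn, which is listed among the dependencies.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: nine examples of a common class, one of a rare class,
# and a classifier that always predicts the common class.
y_true = ["CWE-79"] * 9 + ["CWE-416"]
y_pred = ["CWE-79"] * 10
print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # 0.900
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")  # 0.474
```

The rare class scores an F1 of zero and halves the macro average, even though nine of ten predictions are correct.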
 
218
 
219
  ### Results
220
 
221
+ | Step | Training Loss | Validation Loss | Accuracy | Macro F1 |
222
+ |------|--------------|-----------------|----------|----------|
223
+ | 1000 | 1.044600 | 1.252940 | 0.704716 | 0.220344 |
224
+ | 2000 | 1.158700 | 1.188677 | 0.711326 | 0.229855 |
225
+ | 3000 | 1.119900 | 1.159229 | 0.719226 | 0.235295 |
226
+ | 4000 | 1.112600 | 1.119924 | 0.720193 | 0.242404 |
227
+ | 5000 | 1.110300 | 1.111053 | 0.722934 | 0.244389 |
228
+ | 6000 | 1.134700 | 1.082806 | 0.727207 | 0.251264 |
229
 
230
  #### Summary
231
 
232
+ The model achieves 72.72% accuracy on the validation set with a macro F1 score of 0.251. The gap between the two reflects the difficulty of classifying across 232 CWE categories with uneven representation in the dataset: frequent categories dominate accuracy, while rare ones pull the macro average down.
233
 
234
 
 
235
 
236
+ ## Model Examination
237
 
238
+ The model uses standard DistilBERT attention mechanisms to process vulnerability descriptions. Key observations:
239
+ - The model learns to identify security-related keywords and patterns
240
+ - Attention weights typically focus on vulnerability-specific terms (e.g., "overflow", "injection", "traversal")
241
+ - Performance varies by CWE category based on training data representation
242
 
243
  ## Environmental Impact
244
 
 
 
245
  Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
246
 
247
+ - **Hardware Type:** Apple Silicon (M-series chip)
248
+ - **Hours used:** ~1.3 hours
249
+ - **Cloud Provider:** Local training (no cloud provider)
250
+ - **Compute Region:** N/A (local)
251
+ - **Carbon Emitted:** Not measured; expected to be small given ~1.3 hours of local training on Apple Silicon (~15 W TDP)
252
 
253
  ## Technical Specifications [optional]
254
 
255
  ### Model Architecture and Objective
256
 
257
+ - **Base Architecture:** DistilBERT (distilbert-base-uncased)
258
+ - **Task:** Multi-class text classification
259
+ - **Number of labels:** 232 CWE categories
260
+ - **Objective:** Cross-entropy loss for sequence classification
261
+ - **Architecture modifications:** Added classification head with 232 output classes
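
A classification head of this shape is what `AutoModelForSequenceClassification` attaches when it is given a label count. The sketch below shows one way to set that up; `build_label_maps` and `build_classifier` are hypothetical helpers, and the sorted label order is an assumption (the order used for the published model is defined by its `config.json`).

```python
from typing import Dict, List, Tuple


def build_label_maps(cwe_ids: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Map each distinct CWE ID to a class index and back (sorted order assumed)."""
    unique = sorted(set(cwe_ids))
    id2label = dict(enumerate(unique))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def build_classifier(id2label: Dict[int, str], label2id: Dict[str, int]):
    """Load the base DistilBERT checkpoint with a fresh classification head."""
    # Imported lazily so the mapping helper works without transformers installed.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        "distilbert/distilbert-base-uncased",
        num_labels=len(id2label),  # 232 for the full label set
        id2label=id2label,
        label2id=label2id,
    )


# Toy label list for illustration.
id2label, label2id = build_label_maps(["CWE-79", "CWE-89", "CWE-79"])
print(id2label)
```

Storing `id2label` in the model config is what lets inference code return a CWE string instead of a bare class index.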
262
 
263
  ### Compute Infrastructure
264
 
265
+ Local machine with Apple Silicon processor
266
 
267
  #### Hardware
268
 
269
+ - **Device:** Apple Silicon (MPS backend)
270
+ - **Memory management:** PYTORCH_MPS_HIGH_WATERMARK_RATIO set to 0.0
271
 
272
  #### Software
273
 
274
+ - **Framework:** PyTorch with Hugging Face Transformers
275
+ - **Python version:** 3.x
276
+ - **Key libraries:** transformers, torch, datasets, scikit-learn, pandas, numpy
 
 
 
 
 
 
277
 
278
+ ## Citation
279
 
280
+ If you use this model in your research, please cite:
281
 
282
+ ```bibtex
283
+ @misc{mulliken2024cwepredictor,
284
+   author = {mulliken},
285
+   title = {CWE Predictor: A DistilBERT Model for Vulnerability Classification},
286
+   year = {2024},
287
+   publisher = {Hugging Face},
288
+   howpublished = {\url{https://huggingface.co/mulliken/cwe-predictor}}
289
+ }
290
+ ```
291
 
292
+ ## Glossary
293
 
294
+ - **CWE (Common Weakness Enumeration):** A community-developed list of software and hardware weakness types
295
+ - **CVE (Common Vulnerabilities and Exposures):** A list of publicly disclosed cybersecurity vulnerabilities
296
+ - **NVD (National Vulnerability Database):** U.S. government repository of vulnerability management data
297
+ - **Macro F1:** The unweighted mean of F1 scores calculated for each class independently
298
+ - **SAST/DAST:** Static/Dynamic Application Security Testing
299
 
300
+ ## More Information
301
 
302
+ For questions, issues, or contributions, please visit the [Hugging Face model page](https://huggingface.co/mulliken/cwe-predictor).
303
 
304
+ ## Model Card Authors
305
 
306
+ - [mulliken](https://huggingface.co/mulliken)
307
 
308
  ## Model Card Contact
309
 
310
+ Please use the Hugging Face model repository's discussion section for questions and feedback: [mulliken/cwe-predictor](https://huggingface.co/mulliken/cwe-predictor/discussions)