Shoriful025 committed on
Commit
03a21c4
·
verified ·
1 Parent(s): 591f0ea

Create README.md

Files changed (1)
  1. README.md +62 -0
README.md ADDED
@@ -0,0 +1,62 @@
+ ---
+ language:
+ - en
+ - code
+ tags:
+ - security
+ - vulnerability-detection
+ - codebert
+ - classification
+ license: mit
+ ---
+
+ # codebert_vulnerability_scanner
+
+ ## Overview
+
+ `codebert_vulnerability_scanner` is a RoBERTa-based model, fine-tuned from Microsoft's CodeBERT, that detects potential security vulnerabilities in source code snippets. It treats vulnerability detection as a binary classification task, labeling each snippet as either `SAFE` or `VULNERABLE`.
+
+ ## Model Architecture
+
+ This model uses the `RobertaForSequenceClassification` architecture. The base model, CodeBERT, was pre-trained on the CodeSearchNet dataset (a large collection of function-level code across multiple programming languages); this model was then fine-tuned on a curated dataset of C and C++ functions labeled with Common Weakness Enumeration (CWE) categories, such as buffer overflows and memory leaks.
+
+ - **Base Model:** `microsoft/codebert-base`
+ - **Head:** A classification head (a dense layer with tanh activation followed by a linear output projection) applied to the final hidden state of the first (`<s>`) token.
+ - **Input:** Source code functions (tokenized).
+ - **Output:** Logits for two classes: SAFE (0) and VULNERABLE (1).
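
The two output logits are unnormalized scores, so a softmax is needed to read them as class probabilities. The following is a minimal sketch of that step, assuming the class order listed above (0 = SAFE, 1 = VULNERABLE). The helper names `softmax` and `logits_to_prediction` and the example logits are hypothetical and are not part of the released model. The sketch uses plain Python to stay dependency-free; with PyTorch tensors, `torch.softmax(logits, dim=-1)` performs the same normalization.

```python
import math

# Class order as stated above: index 0 = SAFE, index 1 = VULNERABLE.
ID2LABEL = {0: "SAFE", 1: "VULNERABLE"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_prediction(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[class_id], probs[class_id]


# Hypothetical logits for one snippet (illustrative values, not real model output).
label, prob = logits_to_prediction([-0.5, 1.5])
print(f"{label}: {prob:.3f}")
```

Reporting the probability alongside the label lets a downstream workflow apply its own threshold, for example flagging only snippets whose `VULNERABLE` probability exceeds a chosen cutoff.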
+
+ ## Intended Use
+
+ This model is intended primarily for DevSecOps workflows and static analysis research.
+
+ - **Automated Code Review:** Scanning pull requests for high-risk code patterns before merging.
+ - **Security Auditing:** Quickly analyzing large legacy codebases to prioritize manual security reviews.
+ - **Research:** Benchmarking against traditional static application security testing (SAST) tools.
+
+ ### How to use
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Placeholder repository ID; replace with the actual model path.
+ model_name = "your_username/codebert_vulnerability_scanner"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Example C function snippet
+ code_snippet = """
+ void vulnerable_function(char *user_input) {
+     char buffer[64];
+     strcpy(buffer, user_input); // Potential buffer overflow
+ }
+ """
+
+ # Truncate to 512 tokens, the maximum input length of CodeBERT.
+ inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, max_length=512)
+
+ # Run inference without tracking gradients.
+ with torch.no_grad():
+     logits = model(**inputs).logits
+
+ # Index of the highest-scoring class for the single input.
+ predicted_class_id = logits.argmax(dim=-1).item()
+ labels = model.config.id2label
+ print(f"Prediction: {labels[predicted_class_id]}")
+ # Expected output: VULNERABLE
+ ```