---
language:
- code
library_name: transformers
pipeline_tag: text-classification
tags:
- code-review
- bug-detection
- codebert
- python
- security
- static-analysis
datasets:
- code_search_net
base_model: microsoft/codebert-base
metrics:
- f1
- accuracy
---
|
| 20 |
|
| 21 |
+
# 🔍 CodeSheriff Bug Classifier
|
| 22 |
|
| 23 |
+
A fine-tuned **CodeBERT** model that classifies Python code snippets into five bug categories. Built as the classification engine inside [CodeSheriff](https://github.com/jayansh21/CodeSheriff) — an AI system that automatically reviews GitHub pull requests.
|
| 24 |
|
| 25 |
+
**Base model:** `microsoft/codebert-base` · **Task:** 5-class sequence classification · **Language:** Python
|
| 26 |
|
| 27 |
+
---
|
| 28 |
|
| 29 |
+
## Labels
|
|
|
|
|
|
|
|
|
|
| 30 |
|
| 31 |
+
| ID | Label | Example |
|
| 32 |
+
|----|-------|---------|
|
| 33 |
+
| 0 | Clean | Well-formed code, no issues |
|
| 34 |
+
| 1 | Null Reference Risk | `result.fetchone().name` without a None check |
|
| 35 |
+
| 2 | Type Mismatch | `"Error: " + error_code` where `error_code` is an int |
|
| 36 |
+
| 3 | Security Vulnerability | `"SELECT * FROM users WHERE id = " + user_id` |
|
| 37 |
+
| 4 | Logic Flaw | `for i in range(len(items) + 1)` |
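The Logic Flaw example in the table is the classic off-by-one; it runs fine until the final index, which makes it easy to miss in review:

```python
items = ["a", "b", "c"]
seen, error = [], None

try:
    for i in range(len(items) + 1):  # off-by-one: iterates one past the end
        seen.append(items[i])
except IndexError as exc:
    error = exc

print(seen)   # ['a', 'b', 'c']
print(error)  # list index out of range
```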

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("jayansh21/codesheriff-bug-classifier")
model = AutoModelForSequenceClassification.from_pretrained("jayansh21/codesheriff-bug-classifier")

LABELS = {
    0: "Clean",
    1: "Null Reference Risk",
    2: "Type Mismatch",
    3: "Security Vulnerability",
    4: "Logic Flaw",
}

code = """
def get_user(uid):
    query = "SELECT * FROM users WHERE id=" + uid
    return db.execute(query)
"""

inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
pred = logits.argmax(dim=-1).item()
confidence = probs[0][pred].item()

print(f"{LABELS[pred]} ({confidence:.1%})")
# Security Vulnerability (99.3%)
```
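To surface the full class distribution rather than only the top label, sort the softmax output. The sketch below uses dummy logits in place of a real `model(**inputs).logits` call so it runs standalone:

```python
import torch

LABELS = ["Clean", "Null Reference Risk", "Type Mismatch",
          "Security Vulnerability", "Logic Flaw"]

# Dummy logits standing in for model(**inputs).logits, shape [1, 5].
logits = torch.tensor([[-1.2, 0.3, -0.5, 4.1, -0.8]])

probs = torch.softmax(logits, dim=-1)[0]
ranked = sorted(zip(LABELS, probs.tolist()),
                key=lambda kv: kv[1], reverse=True)
for label, p in ranked:
    print(f"{label:<24}{p:.1%}")
```

With real model outputs, a ranked breakdown like this is often more useful in a PR comment than a single label.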

---

## Training

**Dataset:** [CodeSearchNet](https://huggingface.co/datasets/code_search_net) Python split with heuristic labeling, augmented with seed templates for underrepresented classes. Final training set: 4,600 samples across all five classes, with a stratified 80/10/10 train/val/test split.

**Key hyperparameters:**

| Parameter | Value |
|-----------|-------|
| Epochs | 4 |
| Effective batch size | 16 (8 × 2 grad accum) |
| Learning rate | 2e-5 |
| Optimizer | AdamW + linear warmup |
| Max token length | 512 |
| Class weighting | Yes — balanced |
| Hardware | NVIDIA RTX 3050 (4GB) |
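The balanced class weighting above can be reproduced by handing per-class weights to the loss function. A minimal sketch with hypothetical class counts (the exact training distribution is not published here):

```python
import torch
import torch.nn.functional as F

# Hypothetical class counts (illustrative only), summing to the
# stated 4,600 training samples.
counts = torch.tensor([2000.0, 800.0, 600.0, 600.0, 600.0])

# "Balanced" weighting: n_samples / (n_classes * count_c),
# so underrepresented classes contribute more to the loss.
weights = counts.sum() / (len(counts) * counts)

# One dummy batch in place of real model outputs and labels.
logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
loss = F.cross_entropy(logits, labels, weight=weights)
```

This mirrors scikit-learn's `class_weight="balanced"` heuristic; with the HuggingFace `Trainer`, the same weights would go into a `compute_loss` override.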

---

## Evaluation

Test set: 840 samples (stratified).

| Class | Precision | Recall | F1 | Support |
|------------------------|-----------|--------|------|---------|
| Clean | 0.92 | 0.88 | 0.90 | 450 |
| Null Reference Risk | 0.63 | 0.78 | 0.70 | 120 |
| Type Mismatch | 0.96 | 0.95 | 0.95 | 75 |
| Security Vulnerability | 0.99 | 0.92 | 0.95 | 75 |
| Logic Flaw | 0.96 | 0.97 | 0.97 | 120 |
| **Macro avg** | **0.89** | **0.90** | **0.89** | 840 |

**Confusion matrix:**

```
                  Clean  NullRef  TypeMis  SecVuln  Logic
Actual Clean    [  394      52       1        1       2  ]
Actual NullRef  [   23      93       1        0       3  ]
Actual TypeMis  [    3       1      71        0       0  ]
Actual SecVuln  [    4       1       1       69       0  ]
Actual Logic    [    3       0       0        0     117  ]
```

Logic Flaw and Security Vulnerability are the strongest classes — both have clear lexical patterns. Null Reference Risk is the weakest (precision 0.63) because null-risk code closely resembles clean code structurally. Most misclassifications there are false positives rather than missed bugs.
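The per-class figures can be recomputed straight from the confusion matrix, which doubles as a sanity check that the two tables agree:

```python
# Confusion matrix from above: rows = actual, columns = predicted.
CM = [
    [394, 52,  1,  1,   2],  # Clean
    [ 23, 93,  1,  0,   3],  # Null Reference Risk
    [  3,  1, 71,  0,   0],  # Type Mismatch
    [  4,  1,  1, 69,   0],  # Security Vulnerability
    [  3,  0,  0,  0, 117],  # Logic Flaw
]

f1s = []
for c in range(len(CM)):
    tp = CM[c][c]
    precision = tp / sum(row[c] for row in CM)  # column sum
    recall = tp / sum(CM[c])                    # row sum
    f1s.append(2 * precision * recall / (precision + recall))

macro_f1 = sum(f1s) / len(f1s)
print([round(f, 2) for f in f1s])  # [0.9, 0.7, 0.95, 0.95, 0.97]
print(round(macro_f1, 2))          # 0.89
```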

---

## Limitations

- **Python only** — not trained on other languages
- **Function-level input** — works best on 5–50 line snippets
- **Heuristic labels** — training data was pattern-matched, not expert-annotated
- **Not a SAST replacement** — a probabilistic classifier, not a sound static-analysis tool
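Given the function-level input limitation, whole files are best split into per-function snippets before classification. A sketch using the standard-library `ast` module (the classifier call itself is left as a comment, since it follows the Usage section verbatim):

```python
import ast
import textwrap

def extract_functions(source: str) -> list[str]:
    """Return the source of each top-level (async) function in a file."""
    lines = source.splitlines()
    return [
        "\n".join(lines[node.lineno - 1:node.end_lineno])
        for node in ast.parse(source).body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

source = textwrap.dedent("""
    def get_user(uid):
        query = "SELECT * FROM users WHERE id=" + uid
        return db.execute(query)

    def add(a, b):
        return a + b
""")

for snippet in extract_functions(source):
    # Each snippet would be passed through tokenizer/model as in Usage.
    print(snippet)
```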

---

## Links

- GitHub: [jayansh21/CodeSheriff](https://github.com/jayansh21/CodeSheriff)
- Live demo: [huggingface.co/spaces/jayansh21/CodeSheriff](https://huggingface.co/spaces/jayansh21/CodeSheriff)