---
language:
- code
library_name: transformers
pipeline_tag: text-classification
tags:
- code-review
- bug-detection
- codebert
- python
- security
- static-analysis
datasets:
- code_search_net
base_model: microsoft/codebert-base
metrics:
- f1
- accuracy
---

# 🔍 CodeSheriff Bug Classifier

A fine-tuned **CodeBERT** model that classifies Python code snippets into five classes: clean code plus four bug categories. It is the classification engine inside [CodeSheriff](https://github.com/jayansh21/CodeSheriff), an AI system that automatically reviews GitHub pull requests.

**Base model:** `microsoft/codebert-base` · **Task:** 5-class sequence classification · **Language:** Python

---

## Labels

| ID | Label | Example |
|----|-------|---------|
| 0 | Clean | Well-formed code, no issues |
| 1 | Null Reference Risk | `result.fetchone().name` without a None check |
| 2 | Type Mismatch | `"Error: " + error_code` where `error_code` is an int |
| 3 | Security Vulnerability | `"SELECT * FROM users WHERE id = " + user_id` |
| 4 | Logic Flaw | `for i in range(len(items) + 1)` |

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("jayansh21/codesheriff-bug-classifier")
model = AutoModelForSequenceClassification.from_pretrained("jayansh21/codesheriff-bug-classifier")

LABELS = {
    0: "Clean",
    1: "Null Reference Risk",
    2: "Type Mismatch",
    3: "Security Vulnerability",
    4: "Logic Flaw",
}

code = """
def get_user(uid):
    query = "SELECT * FROM users WHERE id=" + uid
    return db.execute(query)
"""

# Truncate to the 512-token limit used during training.
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
pred = logits.argmax(dim=-1).item()
confidence = probs[0][pred].item()

print(f"{LABELS[pred]} ({confidence:.1%})")
# Security Vulnerability (99.3%)
```

---

## Training

**Dataset:** [CodeSearchNet](https://huggingface.co/datasets/code_search_net) Python split with heuristic labeling, augmented with seed templates for underrepresented classes. Final training set: 4,600 balanced samples across all five classes. Stratified 80/10/10 train/val/test split.
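
To make the stratified 80/10/10 split concrete, here is a minimal standard-library sketch. It is not the project's actual data pipeline: the helper name `stratified_split`, the seed, the rounding rule, and the toy data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (snippet, label) pairs into train/val/test, label by label.

    Hypothetical helper: splitting each label group separately keeps
    the class proportions the same in all three subsets.
    """
    by_label = defaultdict(list)
    for snippet, label in samples:
        by_label[label].append((snippet, label))

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Toy data: 100 placeholder snippets for each of the five label IDs.
toy = [(f"snippet_{label}_{i}", label) for label in range(5) for i in range(100)]
train, val, test = stratified_split(toy)
print(len(train), len(val), len(test))  # prints: 400 50 50
```

Each label contributes 80, 10, and 10 samples to the three subsets, so class proportions are identical across train, validation, and test.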

**Key hyperparameters:**

| Parameter | Value |
|-----------|-------|
| Epochs | 4 |
| Effective batch size | 16 (8 × 2 grad accum) |
| Learning rate | 2e-5 |
| Optimizer | AdamW + linear warmup |
| Max token length | 512 |
| Class weighting | Yes (balanced) |
| Hardware | NVIDIA RTX 3050 (4 GB) |
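
The table does not say how the "balanced" class weights were computed. One common convention is the heuristic `n_samples / (n_classes * class_count)`, and the sketch below assumes it. The class counts are borrowed from the test-set support column in the Evaluation section purely for illustration; the actual training counts are not given in this card, and `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(counts):
    """Return n_samples / (n_classes * count) for every class.

    Assumed convention, not confirmed by the model card: rarer
    classes receive proportionally larger loss weights.
    """
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Illustrative counts only (test-set support from the Evaluation table).
counts = {
    "Clean": 450,
    "Null Reference Risk": 120,
    "Type Mismatch": 75,
    "Security Vulnerability": 75,
    "Logic Flaw": 120,
}

weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label:<24}{weight:.2f}")
```

With these counts the majority class (Clean) is down-weighted and the two smallest classes are weighted most heavily. Weights of this kind are typically passed to a weighted cross-entropy loss, for example through the `weight` argument of `torch.nn.CrossEntropyLoss`.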

---

## Evaluation

Test set: 840 samples (stratified).

| Class | Precision | Recall | F1 | Support |
|-------|-----------|--------|----|---------|
| Clean | 0.92 | 0.88 | 0.90 | 450 |
| Null Reference Risk | 0.63 | 0.78 | 0.70 | 120 |
| Type Mismatch | 0.96 | 0.95 | 0.95 | 75 |
| Security Vulnerability | 0.99 | 0.92 | 0.95 | 75 |
| Logic Flaw | 0.96 | 0.97 | 0.97 | 120 |
| **Macro avg** | **0.89** | **0.90** | **0.89** | |

**Confusion matrix** (rows: actual class, columns: predicted class):

```
                     Clean NullRef TypeMis SecVuln   Logic
Actual Clean    [      394      52       1       1       2 ]
Actual NullRef  [       23      93       1       0       3 ]
Actual TypeMis  [        3       1      71       0       0 ]
Actual SecVuln  [        4       1       1      69       0 ]
Actual Logic    [        3       0       0       0     117 ]
```
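
The per-class figures in the table can be recomputed from the confusion matrix. The sketch below does this in plain Python; the matrix values are copied from above, while the helper name `per_class_metrics` is an assumption for illustration.

```python
LABELS = ["Clean", "NullRef", "TypeMis", "SecVuln", "Logic"]

# Rows are actual classes, columns are predicted classes.
CONFUSION = [
    [394, 52, 1, 1, 2],
    [23, 93, 1, 0, 3],
    [3, 1, 71, 0, 0],
    [4, 1, 1, 69, 0],
    [3, 0, 0, 0, 117],
]


def per_class_metrics(matrix):
    """Return a (precision, recall, f1) tuple for each class index."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(n))  # column sum
        actual_k = sum(matrix[k])                              # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


metrics = per_class_metrics(CONFUSION)
for name, (precision, recall, f1) in zip(LABELS, metrics):
    print(f"{name:<8} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

macro_f1 = sum(f1 for _, _, f1 in metrics) / len(metrics)
correct = sum(CONFUSION[k][k] for k in range(len(CONFUSION)))
total = sum(sum(row) for row in CONFUSION)
accuracy = correct / total
print(f"macro F1={macro_f1:.2f} accuracy={accuracy:.3f}")
```

The same matrix also gives the overall accuracy, which the table does not list: 744 of the 840 test samples lie on the diagonal, or about 0.886.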

Logic Flaw (F1 0.97) and Security Vulnerability (precision 0.99) are the strongest classes; both have clear lexical patterns. Null Reference Risk is the weakest (precision 0.63) because null-risk code closely resembles clean code structurally: 52 of its 54 false positives are Clean snippets. Most misclassifications for this class are false positives (54) rather than missed bugs (27).
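
Because Null Reference Risk predictions are the least precise, a consumer of the model may want to demand more confidence before surfacing them. The following is a minimal sketch of such gating and is not part of the published model: `should_flag`, `MIN_CONFIDENCE`, and the threshold values are hypothetical and untuned.

```python
import math

LABELS = {
    0: "Clean",
    1: "Null Reference Risk",
    2: "Type Mismatch",
    3: "Security Vulnerability",
    4: "Logic Flaw",
}

# Hypothetical, untuned thresholds: stricter for the low-precision class.
MIN_CONFIDENCE = {1: 0.90}
DEFAULT_MIN_CONFIDENCE = 0.50


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def should_flag(logits):
    """Return (label, confidence, flag) for one snippet's raw logits.

    flag is True only for a non-Clean prediction whose confidence
    meets the threshold configured for that class.
    """
    probs = softmax(logits)
    pred = max(range(len(probs)), key=lambda i: probs[i])
    confidence = probs[pred]
    threshold = MIN_CONFIDENCE.get(pred, DEFAULT_MIN_CONFIDENCE)
    flag = pred != 0 and confidence >= threshold
    return LABELS[pred], confidence, flag


# A weak Null Reference Risk prediction is suppressed ...
print(should_flag([1.0, 1.5, 0.0, 0.0, 0.0]))
# ... while a confident Security Vulnerability prediction is surfaced.
print(should_flag([0.0, 0.0, 0.0, 6.0, 0.0]))
```

Raising the threshold for one class trades recall for precision on that class only, so any real value should be chosen on held-out data rather than taken from this sketch.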

---

## Limitations

- **Python only:** not trained on other languages
- **Function-level input:** works best on 5–50 line snippets
- **Heuristic labels:** training data was pattern-matched, not expert-annotated
- **Not a SAST replacement:** a probabilistic classifier, not a sound static analysis tool
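
Since the model works best on function-level input, a whole file can be cut into function snippets before classification. The sketch below uses Python's standard `ast` module for that. The helper `extract_functions` is hypothetical, and its default bounds simply mirror the 5–50 line range mentioned above; it is not how CodeSheriff itself is confirmed to prepare input.

```python
import ast


def extract_functions(source, min_lines=5, max_lines=50):
    """Yield (name, snippet) for each function within the line bounds.

    Hypothetical helper: requires Python 3.8+ for end_lineno and
    ast.get_source_segment.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            n_lines = node.end_lineno - node.lineno + 1
            if min_lines <= n_lines <= max_lines:
                yield node.name, ast.get_source_segment(source, node)


SOURCE = '''
def short():
    return 1

def get_user(db, uid):
    query = "SELECT * FROM users WHERE id=" + uid
    row = db.execute(query).fetchone()
    if row is None:
        return None
    return row.name
'''

snippets = list(extract_functions(SOURCE))
for name, snippet in snippets:
    print(f"--- {name} ({snippet.count(chr(10)) + 1} lines)")
    print(snippet)
```

Here `short` is skipped because it is only two lines long, and `get_user` is returned as a six-line snippet ready to pass to the tokenizer.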

---

## Links

- GitHub: [jayansh21/CodeSheriff](https://github.com/jayansh21/CodeSheriff)
- Live demo: [huggingface.co/spaces/jayansh21/CodeSheriff](https://huggingface.co/spaces/jayansh21/CodeSheriff)