---
language:
  - code
library_name: transformers
pipeline_tag: text-classification
tags:
  - code-review
  - bug-detection
  - codebert
  - python
  - security
  - static-analysis
datasets:
  - code_search_net
base_model: microsoft/codebert-base
metrics:
  - f1
  - accuracy
---

πŸ” CodeSheriff Bug Classifier

A fine-tuned CodeBERT model that classifies Python code snippets into five bug categories. Built as the classification engine inside CodeSheriff, an AI system that automatically reviews GitHub pull requests.

**Base model:** `microsoft/codebert-base` · **Task:** 5-class sequence classification · **Language:** Python


## Labels

| ID | Label | Example |
|----|-------|---------|
| 0 | Clean | Well-formed code, no issues |
| 1 | Null Reference Risk | `result.fetchone().name` without a `None` check |
| 2 | Type Mismatch | `"Error: " + error_code` where `error_code` is an `int` |
| 3 | Security Vulnerability | `"SELECT * FROM users WHERE id = " + user_id` |
| 4 | Logic Flaw | `for i in range(len(items) + 1)` |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("jayansh21/codesheriff-bug-classifier")
model = AutoModelForSequenceClassification.from_pretrained("jayansh21/codesheriff-bug-classifier")

LABELS = {
    0: "Clean",
    1: "Null Reference Risk",
    2: "Type Mismatch",
    3: "Security Vulnerability",
    4: "Logic Flaw"
}

code = """
def get_user(uid):
    query = "SELECT * FROM users WHERE id=" + uid
    return db.execute(query)
"""

inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
pred = logits.argmax(dim=-1).item()
confidence = probs[0][pred].item()

print(f"{LABELS[pred]} ({confidence:.1%})")
# Security Vulnerability (99.3%)
```

## Training

Dataset: CodeSearchNet Python split with heuristic labeling, augmented with seed templates for underrepresented classes. Final training set: 4,600 balanced samples across all five classes. Stratified 80/10/10 train/val/test split.
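The actual labeling rules are not published; as an illustration of the kind of pattern matching that can produce heuristic labels, here is a minimal sketch (the regexes below are invented for this example, not the model's real heuristics):

```python
import re

# Illustrative patterns only -- the actual labeling heuristics are not published.
# Label IDs match the table above: 1 = Null Reference Risk,
# 3 = Security Vulnerability, 4 = Logic Flaw, 0 = Clean (no match).
HEURISTICS = [
    (3, re.compile(r'(SELECT|INSERT|DELETE|UPDATE)[^"\']*["\']\s*\+')),  # string-built SQL
    (1, re.compile(r'\.fetchone\(\)\.\w+')),                             # attribute access on a possible None
    (4, re.compile(r'range\(len\([^)]+\)\s*\+\s*1\)')),                  # off-by-one loop bound
]

def heuristic_label(snippet: str) -> int:
    """Return the first matching bug class, else 0 (Clean)."""
    for label, pattern in HEURISTICS:
        if pattern.search(snippet):
            return label
    return 0
```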

Key hyperparameters:

| Parameter | Value |
|-----------|-------|
| Epochs | 4 |
| Effective batch size | 16 (8 × 2 grad accum) |
| Learning rate | 2e-5 |
| Optimizer | AdamW + linear warmup |
| Max token length | 512 |
| Class weighting | Yes (balanced) |
| Hardware | NVIDIA RTX 3050 (4 GB) |
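The card says only that weighting was "balanced"; a minimal sketch assuming sklearn-style balanced weights (`n_samples / (num_classes * count_c)`), which could then be passed to `torch.nn.CrossEntropyLoss(weight=...)`:

```python
from collections import Counter

def balanced_weights(labels, num_classes):
    """sklearn-style 'balanced' weights: n_samples / (num_classes * count_c).

    Rare classes get weights > 1, frequent classes < 1, so the loss
    penalizes mistakes on underrepresented bug categories more heavily.
    """
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[c]) for c in range(num_classes)]

# e.g. loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights))
```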

## Evaluation

Test set: 840 samples (stratified).

| Class | Precision | Recall | F1 | Support |
|-------|-----------|--------|----|---------|
| Clean | 0.92 | 0.88 | 0.90 | 450 |
| Null Reference Risk | 0.63 | 0.78 | 0.70 | 120 |
| Type Mismatch | 0.96 | 0.95 | 0.95 | 75 |
| Security Vulnerability | 0.99 | 0.92 | 0.95 | 75 |
| Logic Flaw | 0.96 | 0.97 | 0.97 | 120 |
| **Macro avg** | 0.89 | 0.90 | 0.89 | 840 |

Confusion matrix:

```
                 Clean  NullRef  TypeMis  SecVuln  Logic
Actual Clean   [  394      52        1        1      2  ]
Actual NullRef [   23      93        1        0      3  ]
Actual TypeMis [    3       1       71        0      0  ]
Actual SecVuln [    4       1        1       69      0  ]
Actual Logic   [    3       0        0        0    117  ]
```
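The per-class numbers above can be recomputed directly from this matrix as a sanity check (rows are actual labels, columns are predictions):

```python
# Confusion matrix from the table above (rows = actual, cols = predicted).
CM = [
    [394, 52, 1, 1, 2],
    [23, 93, 1, 0, 3],
    [3, 1, 71, 0, 0],
    [4, 1, 1, 69, 0],
    [3, 0, 0, 0, 117],
]

def per_class_metrics(cm):
    """Return (precision, recall, f1) per class from a confusion matrix."""
    n = len(cm)
    out = []
    for c in range(n):
        tp = cm[c][c]
        predicted_c = sum(cm[r][c] for r in range(n))  # column sum: predicted as c
        actual_c = sum(cm[c])                          # row sum: actually c
        precision = tp / predicted_c
        recall = tp / actual_c
        f1 = 2 * precision * recall / (precision + recall)
        out.append((precision, recall, f1))
    return out

# e.g. Null Reference Risk: precision 93/147 ≈ 0.63, recall 93/120 ≈ 0.78,
# matching the table above.
```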

Logic Flaw and Security Vulnerability are the strongest classes; both have clear lexical patterns. Null Reference Risk is the weakest (precision 0.63) because null-risk code closely resembles clean code structurally. Most misclassifications there are false positives rather than missed bugs.
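Given that error profile, a downstream consumer might gate low-confidence predictions before flagging them. A hypothetical sketch (`should_flag` and the threshold values are invented here for illustration, not part of the model):

```python
def should_flag(label, confidence, thresholds=None):
    """Decide whether a prediction is confident enough to surface.

    Threshold values are illustrative placeholders, not tuned numbers:
    a higher bar for the false-positive-prone Null Reference Risk class,
    a default bar for everything else. 'Clean' is never flagged.
    """
    thresholds = thresholds or {"Null Reference Risk": 0.85}
    return label != "Clean" and confidence >= thresholds.get(label, 0.5)
```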


## Limitations

- **Python only**: not trained on other languages
- **Function-level input**: works best on 5–50 line snippets
- **Heuristic labels**: training data was pattern-matched, not expert-annotated
- **Not a SAST replacement**: probabilistic classifier, not a sound static analysis tool
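Since the model expects function-level input, a source file can be split into per-function snippets before classification. A sketch using the standard `ast` module (`ast.get_source_segment` requires Python ≥ 3.8; `function_snippets` is a name invented here):

```python
import ast

def function_snippets(source: str):
    """Yield the source text of each function in a module, one snippet
    per (possibly async) function, suitable as classifier input."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            yield ast.get_source_segment(source, node)
```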

## Links