---
language:
- code
library_name: transformers
pipeline_tag: text-classification
tags:
- code-review
- bug-detection
- codebert
- python
- security
- static-analysis
datasets:
- code_search_net
base_model: microsoft/codebert-base
metrics:
- f1
- accuracy
---

# CodeSheriff Bug Classifier

A fine-tuned **CodeBERT** model that classifies Python code snippets into five bug categories. Built as the classification engine inside [CodeSheriff](https://github.com/jayansh21/CodeSheriff), an AI system that automatically reviews GitHub pull requests.

**Base model:** `microsoft/codebert-base` · **Task:** 5-class sequence classification · **Language:** Python

---

## Labels

| ID | Label | Example |
|----|-------|---------|
| 0 | Clean | Well-formed code, no issues |
| 1 | Null Reference Risk | `result.fetchone().name` without a None check |
| 2 | Type Mismatch | `"Error: " + error_code` where `error_code` is an int |
| 3 | Security Vulnerability | `"SELECT * FROM users WHERE id = " + user_id` |
| 4 | Logic Flaw | `for i in range(len(items) + 1)` |

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the fine-tuned classifier from the Hub
tokenizer = AutoTokenizer.from_pretrained("jayansh21/codesheriff-bug-classifier")
model = AutoModelForSequenceClassification.from_pretrained("jayansh21/codesheriff-bug-classifier")

# Class IDs as listed in the Labels table
LABELS = {
    0: "Clean",
    1: "Null Reference Risk",
    2: "Type Mismatch",
    3: "Security Vulnerability",
    4: "Logic Flaw"
}

# Snippet to classify: builds a SQL query by string concatenation
code = """
def get_user(uid):
    query = "SELECT * FROM users WHERE id=" + uid
    return db.execute(query)
"""

# Tokenize, truncating to the 512-token limit used in training
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and pick the most likely class
probs = torch.softmax(logits, dim=-1)
pred = logits.argmax(dim=-1).item()
confidence = probs[0][pred].item()

print(f"{LABELS[pred]} ({confidence:.1%})")
# Security Vulnerability (99.3%)
```

---

## Training

**Dataset:** [CodeSearchNet](https://huggingface.co/datasets/code_search_net) Python
split with heuristic labeling, augmented with seed templates for underrepresented classes. Final training set: 4,600 balanced samples across all five classes. Stratified 80/10/10 train/val/test split.

**Key hyperparameters:**

| Parameter | Value |
|-----------|-------|
| Epochs | 4 |
| Effective batch size | 16 (8 × 2 grad accum) |
| Learning rate | 2e-5 |
| Optimizer | AdamW + linear warmup |
| Max token length | 512 |
| Class weighting | Yes (balanced) |
| Hardware | NVIDIA RTX 3050 (4 GB) |

---

## Evaluation

Test set: 840 samples (stratified).

| Class | Precision | Recall | F1 | Support |
|-------|-----------|--------|----|---------|
| Clean | 0.92 | 0.88 | 0.90 | 450 |
| Null Reference Risk | 0.63 | 0.78 | 0.70 | 120 |
| Type Mismatch | 0.96 | 0.95 | 0.95 | 75 |
| Security Vulnerability | 0.99 | 0.92 | 0.95 | 75 |
| Logic Flaw | 0.96 | 0.97 | 0.97 | 120 |
| **Macro avg** | **0.89** | **0.90** | **0.89** | |

**Confusion matrix** (rows are actual labels, columns are predicted labels):

```
                   Clean NullRef TypeMis SecVuln   Logic
Actual Clean   [     394      52       1       1       2 ]
Actual NullRef [      23      93       1       0       3 ]
Actual TypeMis [       3       1      71       0       0 ]
Actual SecVuln [       4       1       1      69       0 ]
Actual Logic   [       3       0       0       0     117 ]
```

Logic Flaw and Security Vulnerability are the strongest classes; both have clear lexical patterns. Null Reference Risk is the weakest (precision 0.63) because null-risk code closely resembles clean code structurally. Most of its errors are false positives (52 clean snippets flagged as Null Reference Risk) rather than missed bugs (23 null-risk snippets predicted as Clean).

---

## Limitations

- **Python only**: not trained on other languages
- **Function-level input**: works best on 5–50 line snippets
- **Heuristic labels**: training data was pattern-matched, not expert-annotated
- **Not a SAST replacement**: probabilistic classifier, not a sound static analysis tool

---

## Links

- GitHub: [jayansh21/CodeSheriff](https://github.com/jayansh21/CodeSheriff)
- Live demo: [huggingface.co/spaces/jayansh21/CodeSheriff](https://huggingface.co/spaces/jayansh21/CodeSheriff)
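
---

## Reproducing the metrics

The per-class precision, recall, and F1 values in the Evaluation section can be recomputed from the confusion matrix reported there. The following is a minimal sketch in plain Python, assuming rows are actual labels and columns are predicted labels (as the `Actual ...` row names suggest). The helper `per_class_metrics` is a hypothetical name introduced for this illustration and is not part of the CodeSheriff codebase.

```python
# Confusion matrix copied from the Evaluation section.
# Assumption: rows = actual class, columns = predicted class.
LABELS = [
    "Clean",
    "Null Reference Risk",
    "Type Mismatch",
    "Security Vulnerability",
    "Logic Flaw",
]
CONFUSION = [
    [394, 52, 1, 1, 2],
    [23, 93, 1, 0, 3],
    [3, 1, 71, 0, 0],
    [4, 1, 1, 69, 0],
    [3, 0, 0, 0, 117],
]


def per_class_metrics(matrix):
    """Return one (precision, recall, f1, support) tuple per class."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted = sum(matrix[row][k] for row in range(n))  # column sum
        support = sum(matrix[k])                              # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1, support))
    return results


metrics = per_class_metrics(CONFUSION)
for label, (p, r, f1, support) in zip(LABELS, metrics):
    print(f"{label:<24} P={p:.2f} R={r:.2f} F1={f1:.2f} n={support}")

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(m[2] for m in metrics) / len(metrics)
print(f"Macro F1 = {macro_f1:.2f}")
```

Because macro averaging weights every class equally, the weak Null Reference Risk class lowers the headline score as much as any other class would, even though it makes up only 120 of the 840 test samples.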