# CodeGenDetect-CodeBERT
- **Model Name:** `azherali/CodeGenDetect-CodeBert`
- **Task:** Code Generation Detection (Human- vs. Machine-Generated Code)
- **Languages Supported:** C++, Java, Python
- **Base Model:** CodeBERT
- **Author:** Azher Ali
---
## 📌 Model Overview
`CodeGenDetect-CodeBert` is a transformer-based classification model designed to distinguish **human-written code** from **machine-generated code** produced by Large Language Models (LLMs). The model is fine-tuned on multilingual source code data spanning **C++**, **Java**, and **Python**, making it suitable for real-world, cross-language code analysis tasks.
Built on top of **CodeBERT**, the model leverages contextual and structural representations of source code to capture subtle stylistic, syntactic, and semantic patterns that differentiate human-authored code from AI-generated code.
---
## 🎯 Intended Use Cases
This model is well-suited for:
- **Academic integrity & plagiarism detection**
- **LLM-generated code identification**
- **Code authenticity verification**
- **Research on AI-generated programming artifacts**
- **Code forensics and auditing pipelines**
---
## 🧠 Model Details
- **Architecture:** Transformer-based (CodeBERT)
- **Task Type:** Binary Sequence Classification
- **Labels:**
  - `0` → Human-written code
  - `1` → Machine-generated (LLM) code
- **Input:** Source code as plain text
- **Output:** Class probabilities and predicted label (see the sketch below)
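
The following minimal sketch shows how raw logits map onto this label convention. The display names in `ID2LABEL` are taken from the table above, not from the checkpoint's own config, which may not define `id2label` entries:

```python
import torch

# Label convention documented above; the display names are illustrative,
# since the checkpoint's config may not define id2label entries.
ID2LABEL = {0: "Human-written", 1: "Machine-generated (LLM)"}

def decode(logits: torch.Tensor) -> tuple[str, float]:
    """Map a (1, 2) logits tensor to a label and its probability."""
    probs = torch.softmax(logits, dim=-1)[0]
    pred = int(probs.argmax())
    return ID2LABEL[pred], float(probs[pred])
```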
---
## 🌐 Supported Programming Languages
The model has been trained and evaluated on code written in:
- **C++**
- **Java**
- **Python**
It generalizes across these languages by learning language-agnostic code patterns while still capturing language-specific constructs.
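
As a quick illustration of cross-language use, the sketch below scores one toy snippet per supported language. The snippets are illustrative inputs only, and the loading boilerplate mirrors the Example Usage section further down:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "azherali/CodeGenDetect-CodeBert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Toy snippets, one per supported language (illustrative inputs only).
snippets = {
    "C++":    "int add(int a, int b) { return a + b; }",
    "Java":   "static int add(int a, int b) { return a + b; }",
    "Python": "def add(a, b):\n    return a + b",
}

for lang, code in snippets.items():
    inputs = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    pred = logits.argmax(dim=-1).item()
    print(f"{lang}: {'Machine-generated' if pred == 1 else 'Human-written'}")
```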
---
## πŸ‹οΈ Training Summary
- **Training Objective:** Binary cross-entropy loss for classification
- **Tokenization:** CodeBERT tokenizer with fixed-length padding and truncation (sketched below)
- **Optimization:** Fine-tuned end-to-end from the pretrained CodeBERT base model
- **Evaluation Metrics:** Accuracy, Precision, Recall, F1-score
The training data includes both human-written code and code generated by modern LLMs to ensure realistic detection performance.
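
A minimal tokenization sketch consistent with the setup above. Note that `max_length=512` is an assumption (CodeBERT's architectural limit); the card does not state the sequence length used in training:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("azherali/CodeGenDetect-CodeBert")

# Fixed-length padding + truncation, as described above.
# NOTE: max_length=512 is an assumption (CodeBERT's maximum), not a
# value documented in this card.
encoded = tokenizer(
    "def add(a, b):\n    return a + b",
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 512])
```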
---
## 🚀 Example Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "azherali/CodeGenDetect-CodeBert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode

code_snippet = """
def add(a, b):
    return a + b
"""

# Tokenize the snippet; long inputs are truncated to the model's limit.
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Label 1 = machine-generated, label 0 = human-written (see Model Details).
prediction = torch.argmax(outputs.logits, dim=1).item()
label = "Machine-generated" if prediction == 1 else "Human-written"
print(label)
```
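
Scoring several snippets in one batch is a natural extension of the example above. This sketch reuses the `tokenizer` and `model` already loaded; `padding=True` pads the batch to its longest member:

```python
batch = [
    "def add(a, b):\n    return a + b",
    "int main() { return 0; }",
]

inputs = tokenizer(batch, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

for code, p in zip(batch, probs):
    label = "Machine-generated" if p[1] > p[0] else "Human-written"
    print(f"{code!r}: {label} (P(machine) = {p[1]:.3f})")
```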