# CodeGenDetect-CodeBERT

**Model Name:** `azherali/CodeGenDetect-CodeBert`
**Task:** Code Generation Detection (Human- vs. Machine-Generated Code)
**Languages Supported:** C++, Java, Python
**Base Model:** CodeBERT
**Author:** Azher Ali

---
## Model Overview

`CodeGenDetect-CodeBert` is a transformer-based classification model designed to distinguish **human-written code** from **machine-generated code** produced by Large Language Models (LLMs). The model is fine-tuned on multilingual source code data spanning **C++**, **Java**, and **Python**, making it suitable for real-world, cross-language code analysis tasks.

Built on top of **CodeBERT**, the model leverages contextual and structural representations of source code to capture subtle stylistic, syntactic, and semantic patterns that differentiate human-authored code from AI-generated code.

---
## Intended Use Cases

This model is well-suited for:

- **Academic integrity & plagiarism detection**
- **LLM-generated code identification**
- **Code authenticity verification**
- **Research on AI-generated programming artifacts**
- **Code forensics and auditing pipelines**

---
## Model Details

- **Architecture:** Transformer-based (CodeBERT)
- **Task Type:** Binary Sequence Classification
- **Labels:**
  - `0` → Human-written code
  - `1` → Machine-generated (LLM) code
- **Input:** Source code as plain text
- **Output:** Class probabilities and predicted label
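A tiny sketch for keeping that label convention explicit in code (the `ID2LABEL` dict below is defined here for illustration; the checkpoint's own config may expose the same information via `id2label`):

```python
from transformers import AutoConfig

# Label convention from this card (defined here for illustration).
ID2LABEL = {0: "Human-written", 1: "Machine-generated (LLM)"}

# The checkpoint's config may carry its own mapping; generic checkpoints
# often report placeholder names such as LABEL_0 / LABEL_1.
config = AutoConfig.from_pretrained("azherali/CodeGenDetect-CodeBert")
print(config.id2label)
```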
---
## Supported Programming Languages

The model has been trained and evaluated on code written in:

- **C++**
- **Java**
- **Python**

It generalizes across these languages by learning language-agnostic code patterns while still capturing language-specific constructs.

---
## Training Summary

- **Training Objective:** Binary cross-entropy loss for classification
- **Tokenization:** CodeBERT tokenizer with fixed-length padding and truncation
- **Optimization:** Supervised fine-tuning of the CodeBERT base checkpoint
- **Evaluation Metrics:** Accuracy, Precision, Recall, F1-score

The training data includes both human-written code and code generated by modern LLMs to ensure realistic detection performance.
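For reference, here is a minimal fine-tuning sketch that reproduces a comparable setup with the Hugging Face `Trainer`. It illustrates the recipe above rather than the exact training script: the dataset files, column names, and hyperparameters are placeholders. Note that `num_labels=2` yields two-class cross-entropy, which is equivalent to the binary cross-entropy objective described here.

```python
# Minimal fine-tuning sketch (illustrative placeholders, not the original script).
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=2,  # 0 = human-written, 1 = machine-generated
)

# Placeholder dataset: JSONL files with a "code" string and a 0/1 "label".
data = load_dataset(
    "json", data_files={"train": "train.jsonl", "validation": "valid.jsonl"}
)

def tokenize(batch):
    # Fixed-length padding and truncation, as described above.
    return tokenizer(
        batch["code"], padding="max_length", truncation=True, max_length=512
    )

data = data.map(tokenize, batched=True).rename_column("label", "labels")

args = TrainingArguments(
    output_dir="codegendetect-codebert",
    num_train_epochs=3,               # illustrative hyperparameters
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
)
trainer.train()
```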
---

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "azherali/CodeGenDetect-CodeBert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

code_snippet = """
def add(a, b):
    return a + b
"""

inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs)

prediction = torch.argmax(outputs.logits, dim=1).item()
label = "Machine-generated" if prediction == 1 else "Human-written"
print(label)
```
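Since the model's output includes class probabilities, a softmax over the logits recovers them. Continuing from the variables defined in the snippet above:

```python
import torch.nn.functional as F

# Index 0 = human-written, index 1 = machine-generated (see Model Details).
probs = F.softmax(outputs.logits, dim=-1).squeeze(0)
print(f"Human-written: {probs[0].item():.3f}, "
      f"Machine-generated: {probs[1].item():.3f}")
```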