# CodeGenDetect-CodeBERT
**Model Name:** `azherali/CodeGenDetect-CodeBert`
**Task:** Code Generation Detection (Human vs Machine Generated Code)
**Languages Supported:** C++, Java, Python
**Base Model:** CodeBERT
**Author:** Azher Ali
---
## Model Overview
`CodeGenDetect-CodeBert` is a transformer-based classification model designed to distinguish **human-written code** from **machine-generated code** produced by Large Language Models (LLMs). The model is fine-tuned on multilingual source code data spanning **C++**, **Java**, and **Python**, making it suitable for real-world, cross-language code analysis tasks.
Built on top of **CodeBERT**, the model leverages contextual and structural representations of source code to capture subtle stylistic, syntactic, and semantic patterns that differentiate human-authored code from AI-generated code.
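For a quick sanity check, the model can also be loaded through the generic `pipeline` API; a minimal sketch, noting that unless the hosted config defines an `id2label` mapping, the pipeline reports the default `LABEL_0` / `LABEL_1` names (see the label scheme under Model Details):

```python
from transformers import pipeline

# Hedged quickstart: assumes the checkpoint loads as a standard
# sequence-classification model. Without an id2label mapping in the
# hosted config, labels appear as LABEL_0 (human) / LABEL_1 (machine).
detector = pipeline("text-classification", model="azherali/CodeGenDetect-CodeBert")
print(detector("int main() { return 0; }"))
# e.g. [{'label': 'LABEL_1', 'score': 0.87}]  (illustrative output only)
```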
---
## Intended Use Cases
This model is well-suited for:
- **Academic integrity & plagiarism detection**
- **LLM-generated code identification**
- **Code authenticity verification**
- **Research on AI-generated programming artifacts**
- **Code forensics and auditing pipelines**
---
## Model Details
- **Architecture:** Transformer-based (CodeBERT)
- **Task Type:** Binary Sequence Classification
- **Labels:**
  - `0` → Human-generated code
  - `1` → Machine-generated (LLM) code
- **Input:** Source code as plain text
- **Output:** Class probabilities and predicted label
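If the published config does not already ship this mapping, human-readable names can be attached locally so downstream tooling prints them instead of raw indices; a small sketch, assuming the standard `transformers` config attributes:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "azherali/CodeGenDetect-CodeBert"
)
# Attach human-readable names to the two class indices described above.
model.config.id2label = {0: "Human-written", 1: "Machine-generated"}
model.config.label2id = {"Human-written": 0, "Machine-generated": 1}
```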
---
## Supported Programming Languages
The model has been trained and evaluated on code written in:
- **C++**
- **Java**
- **Python**
It generalizes across these languages by learning language-agnostic code patterns while still capturing language-specific constructs.
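In practice this means snippets from all three languages can be scored together in a single padded batch; a sketch, assuming the label scheme above:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "azherali/CodeGenDetect-CodeBert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

snippets = [
    'std::cout << "hello" << std::endl;',  # C++
    'System.out.println("hello");',        # Java
    'print("hello")',                      # Python
]

# Pad to the longest snippet in the batch; truncate to the model's input limit.
inputs = tokenizer(snippets, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
for code, pred in zip(snippets, logits.argmax(dim=-1).tolist()):
    print(pred, code)  # 0 = human-written, 1 = machine-generated
```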
---
## Training Summary
- **Training Objective:** Binary cross-entropy loss for classification
- **Tokenization:** CodeBERT tokenizer with fixed-length padding and truncation
- **Optimization:** Standard fine-tuning of the pretrained CodeBERT encoder with a sequence-classification head
- **Evaluation Metrics:** Accuracy, Precision, Recall, F1-score
The training data includes both human-written code and code generated by modern LLMs to ensure realistic detection performance.
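For orientation, a fine-tuning loop matching this description might look like the following sketch with the Hugging Face `Trainer`; the base checkpoint, toy data, and hyperparameters here are illustrative placeholders, not the author's actual recipe.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Hypothetical sketch of the training setup, NOT the author's actual script.
# A real run would use a large labeled corpus of C++/Java/Python snippets.
base = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# Toy placeholder data: 0 = human-written, 1 = machine-generated.
train_ds = Dataset.from_dict({
    "code": ["def add(a, b):\n    return a + b", "int main() { return 0; }"],
    "label": [0, 1],
})

def tokenize(batch):
    # Fixed-length padding and truncation, as described above.
    return tokenizer(batch["code"], padding="max_length", truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="codegendetect-finetune",
    per_device_train_batch_size=16,  # placeholder hyperparameters
    num_train_epochs=3,
    learning_rate=2e-5,
)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```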
---
## Example Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "azherali/CodeGenDetect-CodeBert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode

code_snippet = """
def add(a, b):
    return a + b
"""

# Tokenize; truncation guards against snippets longer than the model's input limit.
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and pick the predicted label.
probs = torch.softmax(outputs.logits, dim=-1)
prediction = torch.argmax(probs, dim=-1).item()
label = "Machine-generated" if prediction == 1 else "Human-written"
print(label, probs[0, prediction].item())
```