---
language:
- en
- code
license: other
tags:
- pytorch
- text-classification
- code-classification
- graphcodebert
- ai-generated-content-detection
datasets:
- CodeNet
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
---

# Model Card: GraphCodeBERT for AI-Generated Code Detection

This model is a binary classifier fine-tuned to detect whether a given piece of source code was written by a human or generated by a Large Language Model (LLM) such as ChatGPT.

---

## 1. Model Details

* **Model Name:** GraphCodeBERT Binary Classifier for GenAI Code Detection
* **Base Model:** `microsoft/graphcodebert-base`
* **Architecture:** `GraphCodeBERTForClassification` (a base GraphCodeBERT encoder coupled with a custom PyTorch linear classification head).
* **Developers/Authors:** Pachanitha Saeheng
* **Model Type:** Text/Code Classification (Binary: Human vs. AI-Generated).
* **Language(s):** Source code (focused primarily on Java).

---

## 2. Intended Use

* **Primary Use Case:** To classify source code and determine its origin (human-written = class 0, AI-generated = class 1).

---

## 3. Training Data & Preprocessing

* **Dataset Used:** The model was fine-tuned on a custom, extended dataset: it combines data collected from the original `GPTSniffer` research with additional source code samples drawn from the `CodeNet` dataset, yielding paired samples of human-written code and their AI-generated counterparts.
* **Preprocessing:** Input code must be tokenized using the standard Hugging Face `AutoTokenizer` configured for `microsoft/graphcodebert-base`.
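A rough sketch of that preprocessing step, assuming a `max_length` of 512 and `"max_length"` padding (both are illustrative choices; the card only specifies which tokenizer to use):

```python
from transformers import AutoTokenizer

# Tokenize a code snippet the way the base model expects.
# max_length=512 and padding="max_length" are assumptions, not
# requirements stated in this card.
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")

code = "public int add(int a, int b) { return a + b; }"
enc = tokenizer(
    code,
    truncation=True,
    max_length=512,
    padding="max_length",
    return_tensors="pt",
)
```

`enc["input_ids"]` and `enc["attention_mask"]` are the two tensors the model's `forward` expects.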

---

## 4. Model Architecture & Hyperparameters

* **Encoder:** `AutoModel.from_pretrained("microsoft/graphcodebert-base")`, which captures both the semantic representation and the structural data flow of the code.
* **Classifier Head:** A custom PyTorch `nn.Linear` layer that maps the base model's `hidden_size` to `num_labels = 2`.
* **Optimizer:** AdamW with a learning rate of `5e-5`.
* **Batch Size:** 8 (for both training and testing).
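A minimal sketch of one training step under those hyperparameters, with a random 768-dimensional batch standing in for the encoder's `[CLS]` outputs (the loss function and number of epochs are not stated in this card; cross-entropy is assumed):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size, num_labels, batch_size = 768, 2, 8   # batch size 8, as above
classifier = nn.Linear(hidden_size, num_labels)   # the classification head
optimizer = torch.optim.AdamW(classifier.parameters(), lr=5e-5)
loss_fn = nn.CrossEntropyLoss()  # assumed loss; not specified in the card

# Stand-ins for one batch: encoder [CLS] outputs and binary labels.
cls_embeddings = torch.randn(batch_size, hidden_size)
labels = torch.randint(0, num_labels, (batch_size,))

optimizer.zero_grad()
logits = classifier(cls_embeddings)   # shape: (8, 2)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```

In the full setup, `cls_embeddings` would instead come from the GraphCodeBERT encoder inside the custom module shown in section 5.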

---

## 5. How to Load and Use the Model

Because this model uses a custom PyTorch class wrapper, you must define the class before loading the `.pth` weights.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# 1. Define the custom architecture
class GraphCodeBERTForClassification(nn.Module):
    def __init__(self, model):
        super(GraphCodeBERTForClassification, self).__init__()
        self.model = model
        self.classifier = nn.Linear(self.model.config.hidden_size, 2)  # 2 classes

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]  # [CLS] token representation
        logits = self.classifier(cls_output)
        return logits

# 2. Load the base model and tokenizer
base_model = AutoModel.from_pretrained("microsoft/graphcodebert-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")

# 3. Initialize the classification model and load the fine-tuned weights
model = GraphCodeBERTForClassification(base_model)
model.load_state_dict(torch.load("Detect_AI.pth", map_location=torch.device("cpu")))
model.eval()
```
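Once the weights are loaded, inference is tokenize → forward → argmax. A sketch of the final step, with illustrative logits standing in for a real `model(input_ids, attention_mask)` call (the 0 = human / 1 = AI convention follows the Intended Use section):

```python
import torch

# Stand-in logits for one snippet; in practice these come from
# model(enc["input_ids"], enc["attention_mask"]) under torch.no_grad().
logits = torch.tensor([[0.3, 2.1]])

probs = torch.softmax(logits, dim=-1)   # class probabilities
pred = int(logits.argmax(dim=-1).item())
label = "AI-generated" if pred == 1 else "Human-written"
print(label, round(probs[0, pred].item(), 3))
```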