---
language:
- en
- code
license: other
tags:
- pytorch
- text-classification
- code-classification
- graphcodebert
- ai-generated-content-detection
datasets:
- CodeNet
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
---

# Model Card: GraphCodeBERT for AI-Generated Code Detection

This model is a binary classifier fine-tuned to detect whether a given piece of source code was written by a human or generated by a Large Language Model (LLM) such as ChatGPT.

---

## 1. Model Details

* **Model Name:** GraphCodeBERT Binary Classifier for GenAI Code Detection
* **Base Model:** `microsoft/graphcodebert-base`
* **Architecture:** `GraphCodeBERTForClassification` (a base GraphCodeBERT encoder coupled with a custom PyTorch linear classification head).
* **Developers/Authors:** Pachanitha Saeheng
* **Model Type:** Text/Code Classification (Binary: Human vs. AI-Generated).
* **Language(s):** Source code (primarily Java).

---

## 2. Intended Use

* **Primary Use Case:** Classifying source code to determine its origin (human-written = class 0, AI-generated = class 1).

---

## 3. Training Data & Preprocessing

* **Dataset Used:** The model was fine-tuned on a custom, extended dataset. It builds on data collected for the original `GPTSniffer` research and expands it with additional source code samples drawn from the `CodeNet` dataset, yielding a dataset of paired samples: human-written code and its AI-generated counterparts.
* **Preprocessing:** Input code must be tokenized with the standard Hugging Face `AutoTokenizer` configured for `microsoft/graphcodebert-base`.

---

## 4. Model Architecture & Hyperparameters

* **Encoder:** `AutoModel.from_pretrained("microsoft/graphcodebert-base")` is used to capture the semantic representation and structural data flow of the code.
* **Classifier Head:** A custom PyTorch `nn.Linear` layer that maps the base model's `hidden_size` to `num_labels = 2`.
* **Optimizer:** AdamW with a learning rate of `5e-5`.
* **Batch Size:** 8 (for both training and testing).

---

## 5. How to Load and Use the Model

Because this model uses a custom PyTorch class wrapper, you must define the class before loading the `.pth` weights.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# 1. Define the custom architecture
class GraphCodeBERTForClassification(nn.Module):
    def __init__(self, model):
        super(GraphCodeBERTForClassification, self).__init__()
        self.model = model
        self.classifier = nn.Linear(self.model.config.hidden_size, 2)  # 2 classes

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]  # [CLS] token representation
        logits = self.classifier(cls_output)
        return logits

# 2. Load the base model and tokenizer
base_model = AutoModel.from_pretrained("microsoft/graphcodebert-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")

# 3. Initialize the classification model and load the fine-tuned weights
model = GraphCodeBERTForClassification(base_model)
model.load_state_dict(torch.load("Detect_AI.pth", map_location=torch.device("cpu")))
model.eval()
```
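The snippet below is a minimal end-to-end inference sketch. The Java sample, the `max_length=512` truncation setting, and the variable names are illustrative assumptions, not part of the model card. Note that without the fine-tuned `Detect_AI.pth` weights (the loading step is shown commented out here for a self-contained, runnable sketch), the classification head is randomly initialized and the predicted label is meaningless.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Same custom wrapper as in the loading snippet above
class GraphCodeBERTForClassification(nn.Module):
    def __init__(self, model):
        super(GraphCodeBERTForClassification, self).__init__()
        self.model = model
        self.classifier = nn.Linear(self.model.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]
        return self.classifier(cls_output)

base_model = AutoModel.from_pretrained("microsoft/graphcodebert-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = GraphCodeBERTForClassification(base_model)
# In practice, load the fine-tuned weights first:
# model.load_state_dict(torch.load("Detect_AI.pth", map_location="cpu"))
model.eval()

# Illustrative Java sample to classify
code_snippet = """
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, world!");
    }
}
"""

# Tokenize and run inference; 512 is the encoder's maximum sequence length
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(inputs["input_ids"], inputs["attention_mask"])
pred = torch.argmax(logits, dim=-1).item()

print("AI-generated" if pred == 1 else "Human-written")
```

The logits have shape `(batch_size, 2)`; class 0 corresponds to human-written code and class 1 to AI-generated code, matching the label convention in the Intended Use section.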