---
language:
- en
- code
license: other
tags:
- pytorch
- text-classification
- code-classification
- graphcodebert
- ai-generated-content-detection
datasets:
- CodeNet
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
---

# Model Card: GraphCodeBERT for AI-Generated Code Detection

This model is a binary classifier fine-tuned to detect whether a given piece of source code was written by a human or generated by a Large Language Model (LLM) such as ChatGPT. 

---

## 1. Model Details
* **Model Name:** GraphCodeBERT Binary Classifier for GenAI Code Detection
* **Base Model:** `microsoft/graphcodebert-base`
* **Architecture:** `GraphCodeBERTForClassification` (a base GraphCodeBERT encoder coupled with a custom PyTorch linear classification head).
* **Developers/Authors:** Pachanitha Saeheng
* **Model Type:** Text/Code Classification (Binary: Human vs. AI-Generated).
* **Language(s):** Source Code (focused primarily on Java).

---

## 2. Intended Use
* **Primary Use Case:** To classify source code samples and determine their origin (human-written = class 0, AI-generated = class 1).
  
---

## 3. Training Data & Preprocessing
* **Dataset Used:** The model was fine-tuned on a custom, extended dataset: it combines the data collected for the original `GPTSniffer` research with additional source code samples drawn from the `CodeNet` dataset, yielding paired samples of human-written code and their AI-generated counterparts.
* **Preprocessing:** Input code must be tokenized using the standard Hugging Face `AutoTokenizer` configured for `microsoft/graphcodebert-base`.
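
The preprocessing step above can be sketched as follows; the Java snippet and the 512-token maximum length are illustrative assumptions, since the card does not state the sequence length used during training:

```python
from transformers import AutoTokenizer

# Tokenizer matching the base encoder
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")

# Example Java snippet (illustrative)
code = "public static int add(int a, int b) { return a + b; }"

# max_length=512 is an assumption, not confirmed by the card
enc = tokenizer(
    code,
    truncation=True,
    padding="max_length",
    max_length=512,
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # torch.Size([1, 512])
```

The resulting `input_ids` and `attention_mask` tensors are exactly the two arguments the classifier's `forward` method expects.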

---

## 4. Model Architecture & Hyperparameters
* **Encoder:** `AutoModel.from_pretrained("microsoft/graphcodebert-base")` is used to capture the semantic representation and structural data-flow of the code.
* **Classifier Head:** A custom PyTorch `nn.Linear` layer that maps the base model's `hidden_size` to `num_labels = 2`.
* **Optimizer:** AdamW optimizer with a learning rate of `5e-5`.
* **Batch Size:** 8 (during training and testing).
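
The hyperparameters above can be illustrated with a minimal training step. The tiny stand-in "encoder output" below is purely illustrative (real training feeds the encoder's `[CLS]` embedding, with `hidden_size` 768, into the head); the batch size and learning rate match the card:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

BATCH_SIZE = 8        # as reported in the card
LEARNING_RATE = 5e-5  # AdamW learning rate from the card
HIDDEN_SIZE = 16      # stand-in for the encoder's hidden_size (768 in practice)

# Stand-in classifier head: hidden_size -> 2 classes, as in the card
classifier = nn.Linear(HIDDEN_SIZE, 2)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=LEARNING_RATE)
loss_fn = nn.CrossEntropyLoss()

# One illustrative step on random features standing in for [CLS] embeddings
features = torch.randn(BATCH_SIZE, HIDDEN_SIZE)
labels = torch.randint(0, 2, (BATCH_SIZE,))  # 0 = human, 1 = AI-generated

optimizer.zero_grad()
logits = classifier(features)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```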

---

## 5. How to Load and Use the Model
Because this model uses a custom PyTorch class wrapper, you must define the class before loading the `.pth` weights.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# 1. Define the custom architecture
class GraphCodeBERTForClassification(nn.Module):
    def __init__(self, model):
        super(GraphCodeBERTForClassification, self).__init__()
        self.model = model
        self.classifier = nn.Linear(self.model.config.hidden_size, 2) # 2 classes

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :] # Extract [CLS] token representation
        logits = self.classifier(cls_output)
        return logits

# 2. Load the base model and tokenizer
base_model = AutoModel.from_pretrained("microsoft/graphcodebert-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")

# 3. Initialize the classification model and load weights
model = GraphCodeBERTForClassification(base_model)
model.load_state_dict(torch.load("Detect_AI.pth", map_location=torch.device('cpu')))
model.eval()
```
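
Once the model returns logits, the predicted class follows from a softmax and argmax. The logits below are hypothetical values standing in for real output of `model(input_ids, attention_mask)`:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for one snippet (real values come from the model)
logits = torch.tensor([[0.3, 1.2]])

probs = F.softmax(logits, dim=-1)        # class probabilities
pred = int(torch.argmax(probs, dim=-1))  # 0 = human-written, 1 = AI-generated
label = "AI-generated" if pred == 1 else "human-written"
print(label)  # AI-generated
```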