---
language:
- en
- code
license: other
tags:
- pytorch
- text-classification
- code-classification
- graphcodebert
- ai-generated-content-detection
datasets:
- CodeNet
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
---

# Model Card: GraphCodeBERT for AI-Generated Code Detection

This model is a binary classifier fine-tuned to detect whether a given piece of source code was written by a human or generated by a Large Language Model (LLM) such as ChatGPT.

---

## 1. Model Details

* **Model Name:** GraphCodeBERT Binary Classifier for GenAI Code Detection
* **Base Model:** `microsoft/graphcodebert-base`
* **Architecture:** `GraphCodeBERTForClassification` (a base GraphCodeBERT encoder coupled with a custom PyTorch linear classification head).
* **Developers/Authors:** Pachanitha Saeheng
* **Model Type:** Text/Code Classification (Binary: Human vs. AI-Generated).
* **Language(s):** Source code (focused primarily on Java).

---

## 2. Intended Use

* **Primary Use Case:** To classify source code and determine its origin (human-written = class 0, AI-generated = class 1).

---

## 3. Training Data & Preprocessing

* **Dataset Used:** The model was fine-tuned on a custom, extended dataset: it combines data collected from the original `GPTSniffer` research with additional source code samples drawn from the `CodeNet` dataset, yielding paired samples of human-written code and their AI-generated counterparts.
* **Preprocessing:** Input code must be tokenized using the standard Hugging Face `AutoTokenizer` configured for `microsoft/graphcodebert-base`.
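A rough sketch of that preprocessing step, assuming a `max_length` of 512 and `"max_length"` padding (both are illustrative choices; the card only specifies which tokenizer to use):

```python
from transformers import AutoTokenizer

# Tokenize a code snippet the way the base model expects.
# max_length=512 and padding="max_length" are assumptions, not
# requirements stated in this card.
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")

code = "public int add(int a, int b) { return a + b; }"
enc = tokenizer(
    code,
    truncation=True,
    max_length=512,
    padding="max_length",
    return_tensors="pt",
)
```

`enc["input_ids"]` and `enc["attention_mask"]` are the two tensors the model's `forward` expects.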

---

## 4. Model Architecture & Hyperparameters

* **Encoder:** `AutoModel.from_pretrained("microsoft/graphcodebert-base")`, which captures both the semantic representation and the structural data flow of the code.
* **Classifier Head:** A custom PyTorch `nn.Linear` layer that maps the base model's `hidden_size` to `num_labels = 2`.
* **Optimizer:** AdamW with a learning rate of `5e-5`.
* **Batch Size:** 8 (for both training and testing).
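A minimal sketch of one training step under those hyperparameters, with a random 768-dimensional batch standing in for the encoder's `[CLS]` outputs (the loss function and number of epochs are not stated in this card; cross-entropy is assumed):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size, num_labels, batch_size = 768, 2, 8   # batch size 8, as above
classifier = nn.Linear(hidden_size, num_labels)   # the classification head
optimizer = torch.optim.AdamW(classifier.parameters(), lr=5e-5)
loss_fn = nn.CrossEntropyLoss()  # assumed loss; not specified in the card

# Stand-ins for one batch: encoder [CLS] outputs and binary labels.
cls_embeddings = torch.randn(batch_size, hidden_size)
labels = torch.randint(0, num_labels, (batch_size,))

optimizer.zero_grad()
logits = classifier(cls_embeddings)   # shape: (8, 2)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```

In the full setup, `cls_embeddings` would instead come from the GraphCodeBERT encoder inside the custom module shown in section 5.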

---

## 5. How to Load and Use the Model

Because this model uses a custom PyTorch class wrapper, you must define the class before loading the `.pth` weights.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# 1. Define the custom architecture
class GraphCodeBERTForClassification(nn.Module):
    def __init__(self, model):
        super(GraphCodeBERTForClassification, self).__init__()
        self.model = model
        self.classifier = nn.Linear(self.model.config.hidden_size, 2)  # 2 classes

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]  # [CLS] token representation
        logits = self.classifier(cls_output)
        return logits

# 2. Load the base model and tokenizer
base_model = AutoModel.from_pretrained("microsoft/graphcodebert-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")

# 3. Initialize the classification model and load the fine-tuned weights
model = GraphCodeBERTForClassification(base_model)
model.load_state_dict(torch.load("Detect_AI.pth", map_location=torch.device("cpu")))
model.eval()
```
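Once the weights are loaded, inference is tokenize → forward → argmax. A sketch of the final step, with illustrative logits standing in for a real `model(input_ids, attention_mask)` call (the 0 = human / 1 = AI convention follows the Intended Use section):

```python
import torch

# Stand-in logits for one snippet; in practice these come from
# model(enc["input_ids"], enc["attention_mask"]) under torch.no_grad().
logits = torch.tensor([[0.3, 2.1]])

probs = torch.softmax(logits, dim=-1)   # class probabilities
pred = int(logits.argmax(dim=-1).item())
label = "AI-generated" if pred == 1 else "Human-written"
print(label, round(probs[0, pred].item(), 3))
```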