---
language: en
license: apache-2.0
tags:
- text-classification
- ai-generated-text-detection
- roberta
- adversarial-training
metrics:
- roc_auc
datasets:
- liamdugan/raid
---

# ADAL: AI-Generated Text Detection using Adversarial Learning

Adversarially trained AI-generated text detector based on the RADAR framework ([Hu et al., NeurIPS 2023](https://arxiv.org/abs/2307.03838)), extended with a multi-evasion attack pool for robust detection.

## Overview

ADAL is an adversarially trained AI-generated text detector based on the RADAR framework (Hu et al., NeurIPS 2023), extended to the RAID benchmark with multi-generator training and a multi-evasion attack pool. The system trains a detector (RoBERTa-large) and a paraphraser (T5-base) in an adversarial game: the paraphraser learns to rewrite AI-generated text so that it evades detection, while the detector learns to remain robust against those rewrites. The result is a detector that generalises across 11 AI generators and maintains high AUROC under five distinct evasion attacks.

Best result: **macro AUROC 0.9951** across all 11 RAID generators, robust to all attack types.

## Training

- **Base model**: `roberta-large`
- **Dataset**: [RAID](https://huggingface.co/datasets/liamdugan/raid) (Dugan et al., ACL 2024)
- **Evasion attacks seen during training**: t5_paraphrase, synonym_replacement, homoglyphs, article_deletion, misspelling
- **Best macro AUROC**: 0.9951
- **Generators**: chatgpt, gpt2, gpt3, gpt4, cohere, cohere-chat, llama-chat, mistral, mistral-chat, mpt, mpt-chat

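For intuition, two of the attacks in the pool above (homoglyphs and article_deletion) can be approximated in a few lines. This is an illustrative sketch only: the function names and the substitution table are hypothetical stand-ins, not the attack implementations used in training.

```python
import random
import re

random.seed(0)  # reproducible character swaps

# Hypothetical Latin -> Cyrillic homoglyph table (a small subset)
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}

def homoglyph_attack(text, rate=0.5):
    """Swap some Latin letters for visually identical Cyrillic ones."""
    return "".join(
        HOMOGLYPHS[c] if c in HOMOGLYPHS and random.random() < rate else c
        for c in text
    )

def article_deletion_attack(text):
    """Drop the English articles 'a', 'an', 'the'."""
    return re.sub(r"\b(a|an|the)\s+", "", text, flags=re.IGNORECASE)

print(article_deletion_attack("the cat sat on a mat"))  # cat sat on mat
```

Attacks like these change the surface form while leaving the meaning (and the AI origin) intact, which is exactly why a detector trained only on clean text tends to fail on them.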
## Architecture

```
RAID train split (attack='none')
               │
               ▼
┌────────────┐     ┌───────────────────────────────────┐
│  xm (AI)   │────▶│ Gσ – Paraphraser (T5-base)        │────▶ xp_ppo
└────────────┘     │ ramsrigouthamg/t5_paraphraser     │
                   └───────────────────────────────────┘
                                    │
                          PPO reward R(xp, φ)
                                    │
┌────────────┐     ┌───────────────────────────────────┐
│ xh (human) │────▶│ Dφ – Detector (RoBERTa-large)     │────▶ AUROC
│  xm (AI)   │────▶│ roberta-large                     │
│  xp_ppo    │────▶│ (trained via reweighted           │
│  xp_det_k  │────▶│  logistic loss)                   │
└────────────┘     └───────────────────────────────────┘
```

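As a rough, framework-free sketch of the two objectives in the diagram: the detector minimises a logistic loss over human, AI, and paraphrased samples (the weight `w_para` below is a hypothetical knob standing in for the reweighting), and the paraphraser's PPO reward is taken to be the detector's predicted human-probability of the paraphrase, in the spirit of RADAR. This is a simplified illustration, not the training code.

```python
import math

def sigmoid(z):
    """Detector logit -> P(human)."""
    return 1.0 / (1.0 + math.exp(-z))

def detector_loss(human_logits, ai_logits, para_logits, w_para=1.0):
    """Reweighted logistic loss sketch: push P(human) up on human text,
    down on AI text and on paraphrased AI text (the latter reweighted)."""
    loss = 0.0
    for z in human_logits:            # human samples
        loss += -math.log(sigmoid(z))
    for z in ai_logits:               # AI samples
        loss += -math.log(1.0 - sigmoid(z))
    for z in para_logits:             # paraphrased AI samples, reweighted
        loss += -w_para * math.log(1.0 - sigmoid(z))
    n = len(human_logits) + len(ai_logits) + len(para_logits)
    return loss / n

def ppo_reward(para_logit):
    """Paraphraser reward: the detector's P(human) for the paraphrase --
    high reward means the rewrite fooled the detector."""
    return sigmoid(para_logit)
```

A well-separating detector (high logits on human text, low on AI and paraphrased text) drives the loss toward zero, while the paraphraser is simultaneously rewarded for pushing its outputs' logits back up.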
## Usage

```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch

tokenizer = RobertaTokenizer.from_pretrained("Shushant/ADAL_AI_Detector")
model = RobertaForSequenceClassification.from_pretrained("Shushant/ADAL_AI_Detector")
model.eval()

text = "Your text here."
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)[0]
print(f"P(human)={probs[1]:.3f} P(AI)={probs[0]:.3f}")  # index 0 = AI, index 1 = human
```

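The macro AUROC reported above is averaged over generators; per generator it can be computed from the detector's P(AI) scores. A rank-based sketch (in practice you would use `sklearn.metrics.roc_auc_score`; the scores below are toy values):

```python
def auroc(ai_scores, human_scores):
    """Probability that a random AI sample gets a higher P(AI) score
    than a random human sample (ties count half)."""
    wins = ties = 0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1
            elif a == h:
                ties += 1
    return (wins + 0.5 * ties) / (len(ai_scores) * len(human_scores))

print(auroc([0.9, 0.8], [0.1, 0.2]))  # perfectly separated -> 1.0
```

AUROC is threshold-free, which is why it is the headline metric here: it measures ranking quality rather than accuracy at one fixed cut-off.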
## Label mapping
- Index 0 → AI-generated
- Index 1 → Human-written

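Given the index order above, a tiny helper (the name `to_label` is a hypothetical convenience, not part of the model) avoids mixing the two columns up when post-processing outputs:

```python
ID2LABEL = {0: "AI-generated", 1: "Human-written"}

def to_label(probs):
    """probs is [P(AI), P(human)], the order produced by the softmax above."""
    return ID2LABEL[max(range(len(probs)), key=lambda i: probs[i])]

print(to_label([0.8, 0.2]))  # AI-generated
```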
## Author

**Shushanta Pudasaini**
PhD Researcher, Technological University Dublin
Supervisors: Dr. Marisa Llorens Salvador · Dr. Luis Miralles-Pechuán · Dr. David Lillis