---
language: en
license: apache-2.0
tags:
- text-classification
- ai-generated-text-detection
- roberta
- adversarial-training
metrics:
- roc_auc
datasets:
- liamdugan/raid
---
# ADAL: AI-Generated Text Detection using Adversarial Learning
Adversarially trained AI-generated text detector based on the RADAR framework
([Hu et al., NeurIPS 2023](https://arxiv.org/abs/2307.03838)), extended with
a multi-evasion attack pool for robust detection.
## Overview
ADAL extends the RADAR framework to the RAID benchmark with multi-generator training and a multi-evasion attack pool. The system trains a detector (RoBERTa-large) and a paraphraser (T5-base) in an adversarial game: the paraphraser learns to rewrite AI-generated text so that it evades detection, while the detector learns to stay robust against those rewrites. The result is a detector that generalises across 11 AI generators and maintains high AUROC under five distinct evasion attacks.
Best result: **macro AUROC 0.9951** across all 11 RAID generators, robust to all attack types.
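To make the adversarial game concrete, here is a minimal single-round sketch in the spirit of RADAR: the paraphraser rewrites AI text, is rewarded when the detector scores the rewrite as human, and the detector then trains on human, AI, and paraphrased samples. This is illustrative rather than the ADAL training code, and the simple REINFORCE-style reward computation stands in for the actual PPO update.

```python
# Illustrative single round of the RADAR-style adversarial game (not the
# ADAL training code); the reward shown would feed a PPO step in practice.
import torch
import torch.nn.functional as F
from transformers import (AutoModelForSeq2SeqLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

det_tok = AutoTokenizer.from_pretrained("roberta-large")
detector = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)
par_tok = AutoTokenizer.from_pretrained("ramsrigouthamg/t5_paraphraser")
paraphraser = AutoModelForSeq2SeqLM.from_pretrained("ramsrigouthamg/t5_paraphraser")

def p_human(texts):
    """Detector probability that each text is human-written (index 1)."""
    enc = det_tok(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        return torch.softmax(detector(**enc).logits, dim=-1)[:, 1]

# 1) Paraphraser rewrites AI-generated text: xm -> xp.
xm = ["An example AI-generated sentence to be paraphrased."]
enc = par_tok(["paraphrase: " + t for t in xm], return_tensors="pt", padding=True)
xp = par_tok.batch_decode(
    paraphraser.generate(**enc, do_sample=True, max_new_tokens=64),
    skip_special_tokens=True,
)

# 2) Paraphraser reward: high when the detector mistakes xp for human text.
reward = torch.log(p_human(xp) + 1e-8)  # R(xp, φ) in the architecture diagram
# ... a PPO update of the paraphraser (e.g. via trl's PPOTrainer) would go here.

# 3) Detector update: human text -> label 1; AI text and paraphrases -> label 0.
texts = ["A sentence actually written by a person."] + xm + xp
labels = torch.tensor([1] + [0] * (len(xm) + len(xp)))
enc = det_tok(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
F.cross_entropy(detector(**enc).logits, labels).backward()  # optimizer step omitted
```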
## Training
- **Base model**: `roberta-large`
- **Dataset**: [RAID](https://huggingface.co/datasets/liamdugan/raid) (Dugan et al., ACL 2024); a data-selection sketch follows this list
- **Evasion attacks seen during training**: t5_paraphrase, synonym_replacement, homoglyphs, article_deletion, misspelling
- **Best macro AUROC**: 0.9951
- **Generators**: chatgpt, gpt2, gpt3, gpt4, cohere, cohere-chat, llama-chat,
mistral, mistral-chat, mpt, mpt-chat
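As referenced above, a hedged sketch of selecting the corresponding RAID rows with the `datasets` library. The column names (`model`, `attack`) follow the RAID schema, but the config/split layout should be checked against the dataset card, and the full train split is large.

```python
# Hedged sketch of selecting the RAID training rows described above; the
# default config/split and the "human" label for human rows are assumptions.
from datasets import load_dataset

GENERATORS = {
    "chatgpt", "gpt2", "gpt3", "gpt4", "cohere", "cohere-chat",
    "llama-chat", "mistral", "mistral-chat", "mpt", "mpt-chat",
}

raid = load_dataset("liamdugan/raid", split="train")  # assumed default config
ai_pool = raid.filter(lambda r: r["model"] in GENERATORS and r["attack"] == "none")
human_pool = raid.filter(lambda r: r["model"] == "human")
print(f"{len(ai_pool)} AI texts, {len(human_pool)} human texts")
```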
## Architecture
```
RAID train split (attack='none')
              │
              ▼
┌─────────────┐     ┌─────────────────────────────────┐
│ xm (AI)     │────▶│ Gσ · Paraphraser (T5-base)      │──▶ xp_ppo
└─────────────┘     │ ramsrigouthamg/t5_paraphraser   │
                    └─────────────────────────────────┘
                                   │
                          PPO reward R(xp, φ)
                                   │
┌─────────────┐     ┌─────────────────────────────────┐
│ xh (human)  │────▶│ Dφ · Detector (RoBERTa-large)   │──▶ AUROC
│ xm (AI)     │────▶│ roberta-large                   │
│ xp_ppo      │────▶│ (trained via reweighted         │
│ xp_det_k    │────▶│  logistic loss)                 │
└─────────────┘     └─────────────────────────────────┘
```
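The card does not spell out the reweighted logistic loss; read alongside RADAR, it is plausibly a per-stream weighted cross-entropy over the four detector inputs in the diagram. A minimal sketch, assuming equal default weights (the actual `w_*` values are not documented here):

```python
# Hedged reading of the "reweighted logistic loss" above: a per-stream
# weighted cross-entropy over the detector's four inputs. The w_* weights
# are assumptions, not values documented in this card.
import torch
import torch.nn.functional as F

def reweighted_logistic_loss(logits_h, logits_m, logits_p, logits_k,
                             w_h=1.0, w_m=1.0, w_p=1.0, w_k=1.0):
    """Each logits_* is an [N, 2] detector output; index 0 = AI, index 1 = human."""
    def stream(logits, label):
        target = torch.full((logits.size(0),), label, dtype=torch.long)
        return F.cross_entropy(logits, target)
    return (w_h * stream(logits_h, 1)     # xh: human-written
            + w_m * stream(logits_m, 0)   # xm: raw AI generations
            + w_p * stream(logits_p, 0)   # xp_ppo: current paraphraser outputs
            + w_k * stream(logits_k, 0))  # xp_det_k: additional paraphrase pool
```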
## Usage
```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch

# Load the released detector checkpoint.
tokenizer = RobertaTokenizer.from_pretrained("Shushant/ADAL_AI_Detector")
model = RobertaForSequenceClassification.from_pretrained("Shushant/ADAL_AI_Detector")
model.eval()

text = "Your text here."
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)[0]

# Index 0 = AI-generated, index 1 = human-written (see label mapping below).
print(f"P(human)={probs[1]:.3f} P(AI)={probs[0]:.3f}")
```
## Label mapping
- Index 0 → AI-generated
- Index 1 → Human-written
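A small batch helper applying this mapping, reusing the `tokenizer` and `model` from the Usage snippet above; the `detect` name and the 0.5 threshold are illustrative, not part of the model's API.

```python
# Illustrative batched scoring using the label mapping above; reuses
# tokenizer/model from the Usage snippet. Helper name and threshold are ours.
import torch

def detect(texts, threshold=0.5):
    enc = tokenizer(texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)
    return [("AI" if p_ai > threshold else "human", float(p_ai))
            for p_ai in probs[:, 0]]  # index 0 = P(AI)

print(detect(["Your text here.", "Another sample to score."]))
```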
## Author
**Shushanta Pudasaini**  
PhD Researcher, Technological University Dublin  
Supervisors: Dr. Marisa Llorens Salvador · Dr. Luis Miralles-Pechuán · Dr. David Lillis