---
language: en
license: apache-2.0
tags:
- text-classification
- ai-generated-text-detection
- roberta
- adversarial-training
metrics:
- roc_auc
datasets:
- liamdugan/raid
---

# ADAL: AI-Generated Text Detection using Adversarial Learning

Adversarially trained AI-generated text detector based on the RADAR framework
([Hu et al., NeurIPS 2023](https://arxiv.org/abs/2307.03838)), extended with
a multi-evasion attack pool for robust detection.

## Overview

ADAL extends the RADAR adversarial training framework (Hu et al., NeurIPS 2023) to the RAID benchmark with multi-generator training and a multi-evasion attack pool. The system trains a detector (RoBERTa-large) and a paraphraser (T5-base) in an adversarial game: the paraphraser learns to rewrite AI-generated text so that it evades detection, while the detector learns to remain robust against those rewrites. The result is a detector that generalises across 11 AI generators and maintains high AUROC under five distinct evasion attacks.

Best result: **macro AUROC 0.9951** across all 11 RAID generators, robust to all five attack types.

## Training

- **Base model**: `roberta-large`
- **Dataset**: [RAID](https://huggingface.co/datasets/liamdugan/raid) (Dugan et al., ACL 2024)
- **Evasion attacks seen during training**: t5_paraphrase, synonym_replacement, homoglyphs, article_deletion, misspelling
- **Best macro AUROC**: 0.9951
- **Generators**: chatgpt, gpt2, gpt3, gpt4, cohere, cohere-chat, llama-chat,
  mistral, mistral-chat, mpt, mpt-chat
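
To make the attack pool concrete, here is a minimal sketch of one attack type, a homoglyph substitution that swaps Latin letters for visually near-identical Cyrillic codepoints. The mapping below is a small illustrative example written for this card, not the exact substitution table used during training.

```python
# Illustrative homoglyph attack: replace Latin letters with visually
# near-identical Cyrillic codepoints. This small mapping is an example
# for this card, not the exact table used in ADAL training.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
    "c": "\u0441",  # Cyrillic small es
}

def homoglyph_attack(text: str) -> str:
    """Return text with mapped Latin characters replaced by homoglyphs."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

attacked = homoglyph_attack("people choose peace")
```

The attacked string renders almost identically on screen but tokenises very differently, which is why a detector trained only on clean text tends to fail on it.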

## Architecture
 
```
RAID train split (attack='none')
        β”‚
        β–Ό
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚  xm (AI)   │─────▢│  GΟƒ β€” Paraphraser (T5-base)     │──▢ xp_ppo
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚  ramsrigouthamg/t5_paraphraser  β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                        β”‚
                              PPO reward R(xp, Ο†)
                                        β”‚
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚  xh (human)│─────▢│  DΟ• β€” Detector (RoBERTa-large)  │──▢ AUROC
   β”‚  xm (AI)   │─────▢│  roberta-large                  β”‚
   β”‚  xp_ppo    │─────▢│  (trained via reweighted        β”‚
   β”‚  xp_det_k  │─────▢│   logistic loss)                β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
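
The PPO reward R(xp, Ο†) in the diagram can be sketched as follows: in RADAR-style training the paraphraser is rewarded when the current detector assigns its output a high human probability. The snippet below is a minimal illustration using plain Python in place of the real RoBERTa detector; the function names are ours, and the log-probability reward shown is one common shaping choice rather than the exact formula from the released code.

```python
import math

def human_prob(logit_ai: float, logit_human: float) -> float:
    """Softmax over the detector's two logits.
    Index 1 = human, matching this model's label mapping."""
    m = max(logit_ai, logit_human)  # shift for numerical stability
    ea, eh = math.exp(logit_ai - m), math.exp(logit_human - m)
    return eh / (ea + eh)

def ppo_reward(logit_ai: float, logit_human: float) -> float:
    """Illustrative RADAR-style reward for a paraphrase xp: the
    log-probability that the detector labels xp as human. Higher
    reward means the paraphrase evades the detector more effectively."""
    return math.log(human_prob(logit_ai, logit_human))
```

The detector is then updated on xh, xm, and the sampled paraphrases so that this reward becomes harder to earn, which is the adversarial game described above.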

## Usage

```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch

tokenizer = RobertaTokenizer.from_pretrained("Shushant/ADAL_AI_Detector")
model     = RobertaForSequenceClassification.from_pretrained("Shushant/ADAL_AI_Detector")
model.eval()

text = "Your text here."
enc  = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)[0]
print(f"P(human)={probs[1]:.3f}  P(AI)={probs[0]:.3f}")
```

## Label mapping
- Index 0 β†’ AI-generated
- Index 1 β†’ Human-written
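
A small helper (ours, not part of the released model) that applies this index convention to the probability pair produced by the usage snippet above:

```python
def decode_prediction(probs) -> str:
    """Map a [P(AI), P(human)] pair to a label string, following the
    mapping above: index 0 = AI-generated, index 1 = human-written."""
    p_ai, p_human = probs
    return "human-written" if p_human >= p_ai else "AI-generated"
```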

## Author

**Shushanta Pudasaini**  
PhD Researcher, Technological University Dublin  
Supervisors: Dr. Marisa Llorens Salvador Β· Dr. Luis Miralles-PechuΓ‘n Β· Dr. David Lillis