---
language: en
license: apache-2.0
tags:
- text-classification
- ai-generated-text-detection
- roberta
- adversarial-training
metrics:
- roc_auc
datasets:
- liamdugan/raid
---

# ADAL: AI-Generated Text Detection using Adversarial Learning

Adversarially trained AI-generated text detector based on the RADAR framework ([Hu et al., NeurIPS 2023](https://arxiv.org/abs/2307.03838)), extended with a multi-evasion attack pool for robust detection.

## Overview

ADAL is an adversarially trained AI-generated text detector based on the RADAR framework (Hu et al., NeurIPS 2023), extended to the RAID benchmark with multi-generator training and a multi-evasion attack pool. The system trains a detector (RoBERTa-large) and a paraphraser (T5-base) in an adversarial game: the paraphraser learns to rewrite AI-generated text so that it evades detection, while the detector learns to remain robust against those rewrites. The result is a detector that generalises across 11 AI generators and maintains high AUROC under five distinct evasion attacks.

Best result: **macro AUROC 0.9951** across all 11 RAID generators, robust to all attack types.

## Training

- **Base model**: `roberta-large`
- **Dataset**: [RAID](https://huggingface.co/datasets/liamdugan/raid) (Dugan et al., ACL 2024)
- **Evasion attacks seen during training**: t5_paraphrase, synonym_replacement, homoglyphs, article_deletion, misspelling
- **Best macro AUROC**: 0.9951
- **Generators**: chatgpt, gpt2, gpt3, gpt4, cohere, cohere-chat, llama-chat, mistral, mistral-chat, mpt, mpt-chat

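For intuition, two of the attacks in the pool above (homoglyphs and article_deletion) can be approximated in a few lines. This is an illustrative sketch only: the function names and the substitution table are hypothetical stand-ins, not the attack implementations used in training.

```python
import random
import re

random.seed(0)  # reproducible character swaps

# Hypothetical Latin -> Cyrillic homoglyph table (a small subset)
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}

def homoglyph_attack(text, rate=0.5):
    """Swap some Latin letters for visually identical Cyrillic ones."""
    return "".join(
        HOMOGLYPHS[c] if c in HOMOGLYPHS and random.random() < rate else c
        for c in text
    )

def article_deletion_attack(text):
    """Drop the English articles 'a', 'an', 'the'."""
    return re.sub(r"\b(a|an|the)\s+", "", text, flags=re.IGNORECASE)

print(article_deletion_attack("the cat sat on a mat"))  # cat sat on mat
```

Attacks like these change the surface form while leaving the meaning (and the AI origin) intact, which is exactly why a detector trained only on clean text tends to fail on them.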
## Architecture

```
RAID train split (attack='none')
               │
               ▼
┌────────────┐     ┌───────────────────────────────────┐
│  xm (AI)   │────▶│ Gσ – Paraphraser (T5-base)        │────▶ xp_ppo
└────────────┘     │ ramsrigouthamg/t5_paraphraser     │
                   └───────────────────────────────────┘
                                    │
                          PPO reward R(xp, φ)
                                    │
┌────────────┐     ┌───────────────────────────────────┐
│ xh (human) │────▶│ Dφ – Detector (RoBERTa-large)     │────▶ AUROC
│  xm (AI)   │────▶│ roberta-large                     │
│  xp_ppo    │────▶│ (trained via reweighted           │
│  xp_det_k  │────▶│  logistic loss)                   │
└────────────┘     └───────────────────────────────────┘
```

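As a rough, framework-free sketch of the two objectives in the diagram: the detector minimises a logistic loss over human, AI, and paraphrased samples (the weight `w_para` below is a hypothetical knob standing in for the reweighting), and the paraphraser's PPO reward is taken to be the detector's predicted human-probability of the paraphrase, in the spirit of RADAR. This is a simplified illustration, not the training code.

```python
import math

def sigmoid(z):
    """Detector logit -> P(human)."""
    return 1.0 / (1.0 + math.exp(-z))

def detector_loss(human_logits, ai_logits, para_logits, w_para=1.0):
    """Reweighted logistic loss sketch: push P(human) up on human text,
    down on AI text and on paraphrased AI text (the latter reweighted)."""
    loss = 0.0
    for z in human_logits:            # human samples
        loss += -math.log(sigmoid(z))
    for z in ai_logits:               # AI samples
        loss += -math.log(1.0 - sigmoid(z))
    for z in para_logits:             # paraphrased AI samples, reweighted
        loss += -w_para * math.log(1.0 - sigmoid(z))
    n = len(human_logits) + len(ai_logits) + len(para_logits)
    return loss / n

def ppo_reward(para_logit):
    """Paraphraser reward: the detector's P(human) for the paraphrase --
    high reward means the rewrite fooled the detector."""
    return sigmoid(para_logit)
```

A well-separating detector (high logits on human text, low on AI and paraphrased text) drives the loss toward zero, while the paraphraser is simultaneously rewarded for pushing its outputs' logits back up.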
## Usage

```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch

tokenizer = RobertaTokenizer.from_pretrained("Shushant/ADAL_AI_Detector")
model = RobertaForSequenceClassification.from_pretrained("Shushant/ADAL_AI_Detector")
model.eval()

text = "Your text here."
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)[0]
print(f"P(human)={probs[1]:.3f} P(AI)={probs[0]:.3f}")  # index 0 = AI, index 1 = human
```

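The macro AUROC reported above is averaged over generators; per generator it can be computed from the detector's P(AI) scores. A rank-based sketch (in practice you would use `sklearn.metrics.roc_auc_score`; the scores below are toy values):

```python
def auroc(ai_scores, human_scores):
    """Probability that a random AI sample gets a higher P(AI) score
    than a random human sample (ties count half)."""
    wins = ties = 0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1
            elif a == h:
                ties += 1
    return (wins + 0.5 * ties) / (len(ai_scores) * len(human_scores))

print(auroc([0.9, 0.8], [0.1, 0.2]))  # perfectly separated -> 1.0
```

AUROC is threshold-free, which is why it is the headline metric here: it measures ranking quality rather than accuracy at one fixed cut-off.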
## Label mapping
- Index 0 → AI-generated
- Index 1 → Human-written

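Given the index order above, a tiny helper (the name `to_label` is a hypothetical convenience, not part of the model) avoids mixing the two columns up when post-processing outputs:

```python
ID2LABEL = {0: "AI-generated", 1: "Human-written"}

def to_label(probs):
    """probs is [P(AI), P(human)], the order produced by the softmax above."""
    return ID2LABEL[max(range(len(probs)), key=lambda i: probs[i])]

print(to_label([0.8, 0.2]))  # AI-generated
```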
## Author

**Shushanta Pudasaini**
PhD Researcher, Technological University Dublin
Supervisors: Dr. Marisa Llorens Salvador · Dr. Luis Miralles-Pechuán · Dr. David Lillis