## Datasets & Training Data

### AI-Generated vs. Human-Written Paired Corpora

The most foundational resource for training text humanization models is the **Human ChatGPT Comparison Corpus (HC3)** [HC3 Dataset](https://huggingface.co/datasets/Hello-SimpleAI/HC3) (accessed 2026-06-29, confidence: High), introduced by Guo et al. (2023) [Paper](https://hf.co/papers/2301.07597). HC3 contains 48,600 question-answer pairs across six domains (Reddit ELI5, finance, medicine, Wikipedia, open QA, and computer science), where each question receives both human-written and ChatGPT-generated answers. The dataset structure (`question`, `human_answers`, `chatgpt_answers`) makes it well suited for training sequence-to-sequence humanization models: the ChatGPT column serves as input and the human column as target. HC3 is available in both English and Chinese and is hosted on Hugging Face with 131.7K downloads and 218 likes. A follow-up work, **HC3 Plus** [Paper](https://hf.co/papers/2309.02731), extends this to semantic-invariant tasks such as summarization and translation, which the authors demonstrate are more challenging for detectors, making it a stronger testbed for humanization quality (Su et al., 2023, confidence: Medium).

The **RAID benchmark** [RAID Dataset](https://huggingface.co/datasets/liamdugan/raid) (accessed 2026-06-29, confidence: High) is one of the largest and most rigorous evaluation resources for AI-generated text detection and, by extension, humanization quality. Developed by Dugan et al. (2024) [Paper](https://hf.co/papers/2405.07940), RAID contains over 10 million documents spanning 11 LLMs, 11 genres (from news to creative writing), 4 decoding strategies, and 12 adversarial attacks, including paraphrasing, homoglyph substitution, and word insertion. For humanization research, RAID's `attack` column directly indexes perturbation strategies that attempt to evade detection, making it possible to filter for attack-surviving generations as training targets.
The dataset uses a consistent schema (`model`, `decoding`, `attack`, `domain`, `generation`) that could be adapted to the humanization training format by pairing `generation` (AI text) with its corresponding attack-processed or human-written counterpart.

The **M4 dataset** [M4 GitHub](https://github.com/mbzuai-nlp/M4) (accessed 2026-06-29, confidence: High), introduced by Wang et al. (2024) [Paper](https://hf.co/papers/2305.14902), is a multi-generator, multi-domain, and multi-lingual benchmark that won the Best Resource Paper Award at EACL 2024 and served as the foundation for SemEval 2024 Task 8. M4 covers ChatGPT, davinci-003, GPT-4, and other LLMs across domains including Wikipedia, Reddit, arXiv abstracts, and peer reviews, in multiple languages. The cross-domain and cross-generator evaluation framework revealed that detectors struggle to generalize, a property that directly motivates the need for robust humanization. Its multilingual scope makes it a strong resource for developing language-agnostic humanization approaches.

The **Defactify Text Dataset** [Defactify](https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Text_Dataset) (accessed 2026-06-29, confidence: Medium), presented by Roy et al. (2025) [Paper](https://hf.co/papers/2510.22874), pairs 73,193 authentic New York Times articles with synthetic versions generated by Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4o. Crucially, the human and synthetic versions share the same prompt (the article abstract), giving near-meaning-preserving pairs that are well suited to style transfer. Its moderate scale (73K samples) makes it practical for fine-tuning, and the baseline detection accuracy of only 58.35% indicates that detection is already difficult in this domain.

The **MAGE dataset** [MAGE](https://huggingface.co/datasets/yaful/MAGE) (accessed 2026-06-29, confidence: Medium) provides 436,600 text samples (319K train, 57K validation, 61K test) collected in the wild, with binary labels for human vs.
machine origin. While not paired like HC3, its diversity (covering news, stories, and scientific writing from multiple LLMs) makes it useful for training discriminators that humanization models must fool.

The **AI Peer Review Detection Benchmark** [IntelLabs Dataset](https://huggingface.co/datasets/IntelLabs/AI-Peer-Review-Detection-Benchmark) (accessed 2026-06-29, confidence: High), created by Yu et al. (2025) [Paper](https://hf.co/papers/2502.19614), is the largest corpus of paired human-AI reviews, comprising 788,984 reviews written by humans and five LLMs (GPT-4o, Claude Sonnet 3.5, Gemini 1.5 Pro, Qwen 2.5 72B, Llama 3.1 70B) for ICLR and NeurIPS papers across 8 years. With 76K calibration (training) samples and 287K test samples, this dataset captures a domain where humanization is particularly sensitive (academic writing) and provides exact human/AI parallel reviews of the same paper, making it a strong source of paired training data.

### Benchmarks for AI Text Detection Evasion

**MGTBench** [MGTBench GitHub](https://github.com/xinleihe/MGTBench) (accessed 2026-06-29, confidence: High), proposed by He et al. (2023) [Paper](https://hf.co/papers/2303.14822), was the first unified benchmark framework for machine-generated text detection. It evaluates detectors against adversarially crafted perturbations of ChatGPT outputs and demonstrated that small perturbations can evade even the strongest detectors. For humanization, MGTBench provides the canonical evaluation protocol: apply humanization, then measure the drop in detector recall.

**RADAR** [Paper](https://hf.co/papers/2307.03838) (Hu et al., 2023, confidence: Medium) jointly trains a paraphraser and a detector through adversarial learning, evaluated on 8 LLMs across 4 datasets. Its paraphraser-training framework, in which the paraphraser learns to evade the detector, serves as a direct blueprint for training humanization models with adversarial objectives rather than paired human data.

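
The evaluation protocol mentioned for MGTBench (humanize, then measure the drop in detector recall) can be sketched in a few lines. This is a minimal illustration, not MGTBench's own code: `detect` and `humanize` are hypothetical callables, the detector is assumed to return an AI-probability in [0, 1], and the 0.5 threshold is an arbitrary choice.

```python
from typing import Callable, Sequence


def detector_recall(texts: Sequence[str],
                    detect: Callable[[str], float],
                    threshold: float = 0.5) -> float:
    """Share of AI-written texts whose detector score reaches the threshold."""
    if not texts:
        raise ValueError("need at least one text")
    flagged = sum(1 for text in texts if detect(text) >= threshold)
    return flagged / len(texts)


def recall_drop(ai_texts: Sequence[str],
                humanize: Callable[[str], str],
                detect: Callable[[str], float],
                threshold: float = 0.5) -> float:
    """Detector recall on the raw AI texts minus recall after humanization."""
    before = detector_recall(ai_texts, detect, threshold)
    after = detector_recall([humanize(t) for t in ai_texts], detect, threshold)
    return before - after


# Toy stand-ins (hypothetical): a keyword "detector" and a one-rule "humanizer".
def toy_detect(text: str) -> float:
    return 1.0 if "delve" in text.lower() else 0.0


def toy_humanize(text: str) -> str:
    return text.replace("delve into", "look at")


samples = [
    "Let us delve into the topic.",
    "We delve into the results.",
    "Delve deeper here.",
    "A plain sentence.",
]
print(recall_drop(samples, toy_humanize, toy_detect))
```

All inputs are assumed to be AI-generated, so recall is simply the flagged fraction. A larger drop means the humanizer evades the detector more often; it says nothing about whether meaning was preserved, which needs a separate check.
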
The **AITDNA benchmark** [Paper](https://hf.co/papers/2606.04906) (Dycke et al., 2026, confidence: Medium) introduces human-machine co-constructed texts annotated with edit and AI-interaction histories, allowing fine-grained evaluation of how detectors handle hybrid texts. **RealBench** [Paper](https://hf.co/papers/2510.17489) (He et al., 2025, confidence: Medium) similarly captures a spectrum of human-AI collaboration processes. Both are crucial for evaluating whether humanization introduces detectable "boundary artifacts" between natural and generated text segments.

**Beemo** (Benchmark of Expert-edited Machine-generated Outputs) [Paper](https://hf.co/papers/2411.04032) (Artemova et al., 2024, confidence: Medium) includes 6.5K texts written by humans, generated by 10 instruction-fine-tuned LLMs, and edited by human experts. A key finding, that expert human editing successfully evades detection while LLM self-editing does not, defines the target quality bar for humanization: the model must produce text indistinguishable from expert-edited rather than LLM-edited text.

**APT-Eval** [Paper](https://hf.co/papers/2502.15666) (Saha & Feizi, 2025, confidence: Medium) provides 11.7K AI-polished text samples at varying levels of AI involvement, showing that current detectors misclassify even minimally polished text and struggle with degree-of-involvement granularity. The **CUDRT benchmark** [Paper](https://hf.co/papers/2406.09056) (Tao et al., 2024, confidence: Medium) provides a bilingual (Chinese-English) evaluation covering five LLM operations (Create, Update, Delete, Rewrite, Translate).

### Datasets of Human Imperfections (Grammar Errors, Typos, Style Inconsistencies)

The largest category of relevant resources comes from the grammatical error correction (GEC) literature, which can be **inverted** for humanization: instead of correcting errors, the model learns to introduce realistic, human-like imperfections.
The primary datasets include:

- **BEA-2019 Shared Task** dataset (Bryant et al., 2019), combining FCE (Cambridge learner essays), Lang-8 (learner writing), W&I+LOCNESS (native and learner academic writing), and NUCLE (learner essays), totaling over 2 million error-annotated sentences. These are accessible through the Hugging Face `datasets` library under `"jfleg"`, `"wi_locness"`, and `"lang8"` configurations, as well as the `"bea"` configuration in some distribution sets. (confidence: High, though not surfaced directly by HF repo search due to library-based loading rather than standalone dataset repos.)
- **JFLEG** (Napoles et al., 2017): 1,511 sentences with fluency-focused corrections. Useful for training "fluency-breaking" humanization where the model learns to degrade polished text rather than correct it.
- **ErAConD** [Paper](https://hf.co/papers/2112.08466) (Yuan et al., 2021, confidence: Medium): the first GEC dataset targeting conversational dialog rather than formal writing, making it especially relevant for humanizing chatbot-style outputs.

For style inconsistency specifically, the **e-GYAFC dataset** [Paper](https://hf.co/papers/2309.08583) (Saakyan & Muresan, 2023, confidence: Medium) provides 9,960 explainable formality style transfer instances, annotated with the linguistic operations that constitute the style change. Krishna et al.'s work [Paper](https://hf.co/papers/2010.05700) (2020) collected 15M sentences across 11 diverse styles, providing a massive resource for learning stylistic transformations (confidence: Medium).

For typographical and informal writing patterns, the broader landscape includes:

- **Enron Email Corpus**, used by ParaGuide [Paper](https://hf.co/papers/2308.15459) for style transfer training, contains real-world informal business communication with natural typos, sentence fragments, and inconsistent formatting (confidence: Medium).
- **Social media datasets** (Twitter, Reddit) contain abundant naturally occurring human imperfections, though systematic annotation of error types remains sparse. The RoFT dataset [Paper](https://hf.co/papers/2212.12672) (Dugan et al., 2022) provides over 21,000 human annotations of AI-generated text boundaries with error classifications (confidence: Medium).

### Existing Datasets Specifically for Training Text Humanization Models

**No dedicated, publicly released dataset exists whose explicit purpose is training a model to humanize AI-generated text.** This represents a clear gap in the literature. However, several approaches have constructed implicit or synthetic training data:

The **HIP (Humanization by Iterative Paraphrasing)** pipeline [Paper](https://hf.co/papers/2605.19516) (Xu et al., 2026, confidence: High) minimally fine-tunes a base model into a paraphraser and applies it iteratively. The training data for the paraphraser is constructed by generating paraphrases of diverse text and scoring them against detector outputs. This constitutes a synthetic-but-functional training dataset, though it has not been released as a standalone resource.

The **Adversarial Paraphrasing** framework [Paper](https://hf.co/papers/2506.07001) (Cheng et al., 2025, confidence: Medium) is a training-free method that uses an off-the-shelf instruction-following LLM, guided by an AI text detector, to produce adversarial examples. **AuthorMist** [Paper](https://hf.co/papers/2503.08716) (David & Gervais, 2025, confidence: Medium) uses reinforcement learning with detector APIs as reward signals, fine-tuning a 3B-parameter model with Group Relative Policy Optimization (GRPO) to paraphrase AI text into human-like form. Both approaches demonstrate that detector feedback can guide humanization without paired human-AI data, but neither yields a standardized, publicly released corpus.

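
The detector-as-reward idea behind RL approaches such as AuthorMist can be illustrated with the group-relative normalization that GRPO applies to a batch of candidates sampled for the same input. The sketch below is not AuthorMist's implementation: the reward mapping (one minus the detector's AI-probability) and the function names are assumptions, and the normalization uses the population standard deviation as one reasonable choice.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def detector_rewards(ai_probs: Sequence[float]) -> List[float]:
    """Assumed reward: higher when the detector thinks the text is less AI-like."""
    return [1.0 - p for p in ai_probs]


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: each reward standardized within its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical detector AI-probabilities for three paraphrases of one input.
probs = [0.9, 0.5, 0.1]
rewards = detector_rewards(probs)
advantages = group_relative_advantages(rewards)

for p, a in zip(probs, advantages):
    print(f"detector={p:.2f} advantage={a:+.3f}")
```

Candidates that fool the detector more than their siblings get positive advantages and the rest get negative ones, so no human-written target is needed. The same property explains the weakness of this setup: the policy is only pushed away from whatever the specific detectors penalize.
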
The **CoPA (Contrastive Paraphrase Attack)** method [Paper](https://hf.co/papers/2505.15337) (Fang et al., 2025, confidence: Medium) introduces a training-free approach that constructs an auxiliary machine-like word distribution and subtracts it from the LLM's human-like distribution during decoding. This approach is notable because it requires zero training data: the humanization is achieved purely at inference time via contrastive logit manipulation.

**RADAR**'s paraphraser-detector co-training [Paper](https://hf.co/papers/2307.03838) provides a training methodology where the paraphraser learns to evade detection through adversarial feedback. However, the training data for the paraphraser consists of machine-generated texts from 8 LLMs (Pythia, Dolly 2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, Vicuna) with the detector's feedback as the training signal, rather than explicit human-written targets (confidence: Medium).

The **Stylistic Fingerprints** work [Paper](https://hf.co/papers/2505.14608) (Soto et al., 2026, confidence: Medium) demonstrated that while attacks degrade standard detectors, few-shot detectors using stylistic features remain robust unless the humanization model is simultaneously optimized for undetectability and adherence to a specific human style. This introduces the concept of **author-specific humanization**, where training data would need to be keyed to individual human writing styles.

### Training Data Format for Humanization Models

The consensus format emerging from the literature is a **sequence-to-sequence, text-to-text paradigm**:

- **Input:** AI-generated text (e.g., from ChatGPT, GPT-4, Claude).
- **Target:** Semantically equivalent text with human-like stylistic properties.

The HC3 dataset naturally supports this: input = `chatgpt_answers`, target = `human_answers`, with `question` serving as optional conditioning context. The Defactify dataset maps `AI_generated_article` → `human_article`.
The AI Peer Review Detection dataset maps `AI_review` → `human_review` for the same paper.

For fine-tuning with diffusion language models such as ParaGuide [Paper](https://hf.co/papers/2308.15459), the data format extends to: **Input text** + **conditioning vector/description**. ParaGuide's architecture conditions diffusion on a paraphrase embedding and guides generation via gradient signals from style classifiers (formality, sentiment, authorship). This suggests that humanization training data should include style labels or style embedding vectors alongside the text pairs, for instance `(input="ChatGPT text", target="human text", style_condition="informal, contains typos, level:high-school")`.

For RL-based approaches like AuthorMist, the format is simpler: **text + detector score**. The model generates humanized text, submits it to detector APIs (GPTZero, WinstonAI, Originality.ai, etc.), and receives a score that serves as the reward. This "API-as-reward" methodology eliminates the need for explicit human-written targets, which is both a strength (scalability) and a weakness (the model may learn detector-specific shortcuts that fail against unseen detectors).

A hybrid, practical format that would serve well for diffusion-model humanization could be structured as:

```json
{
  "input": "ChatGPT-generated paragraph...",
  "target": "Human-written equivalent paragraph...",
  "metadata": {
    "domain": "academic|news|social_media|creative",
    "style_attributes": ["informal", "typos_present", "sentence_fragments"],
    "detector_scores": {"gptzero": 0.95, "originality": 0.87},
    "generator_model": "gpt-4o",
    "human_author_id": "anon_1234"
  }
}
```

This format combines the best aspects of HC3's paired structure, RAID's metadata richness, and the style-conditioning requirements of diffusion models. The `detector_scores` field enables filtering for "hard negatives" (text already classified as human by some detectors), while `human_author_id` supports author-specific humanization.
The key gap remains: **no dataset currently combines all these fields at scale**, and constructing such a dataset would itself be a significant research contribution to the text humanization field.
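
To make the hybrid format above concrete, the following sketch converts one HC3-style row (`question`, `human_answers`, `chatgpt_answers`) into records of that shape and filters for hard negatives. Everything beyond the field names already given in the text is an assumption: the helper names, the extra `question` key, pairing every ChatGPT answer with every human answer, and treating a detector score below 0.5 as "classified as human".

```python
import json
from itertools import product
from typing import Any, Dict, List


def hc3_row_to_records(row: Dict[str, Any], domain: str,
                       generator_model: str = "chatgpt") -> List[Dict[str, Any]]:
    """Pair each ChatGPT answer with each human answer to the same question."""
    records = []
    for ai_text, human_text in product(row["chatgpt_answers"], row["human_answers"]):
        records.append({
            "input": ai_text,
            "target": human_text,
            "metadata": {
                "domain": domain,
                "style_attributes": [],    # to be filled by a style annotator
                "detector_scores": {},     # to be filled by detector runs
                "generator_model": generator_model,
                "human_author_id": None,   # HC3 answers carry no author id here
                "question": row["question"],  # assumed extra conditioning field
            },
        })
    return records


def is_hard_negative(record: Dict[str, Any], threshold: float = 0.5) -> bool:
    """True if at least one detector already scores the input as human-like."""
    scores = record["metadata"]["detector_scores"]
    return any(score < threshold for score in scores.values())


# Hypothetical row in HC3's column layout.
row = {
    "question": "Why is the sky blue?",
    "human_answers": ["Short air molecules scatter blue light more.", "Rayleigh scattering, basically."],
    "chatgpt_answers": ["The sky appears blue due to Rayleigh scattering of sunlight."],
}
records = hc3_row_to_records(row, domain="social_media")
records[0]["metadata"]["detector_scores"] = {"gptzero": 0.95, "originality": 0.31}

print(json.dumps(records[0], indent=2))
print([is_hard_negative(r) for r in records])
```

Pairing answers by Cartesian product is the simplest choice but not necessarily the best one; a length- or similarity-based matching step would likely give cleaner style-transfer pairs, since HC3 answers to the same question are not guaranteed to be semantically equivalent.
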