---
dataset_info:
  features:
  - name: Text
    dtype: string
  - name: Label_A
    dtype: int64
  - name: Label_B
    dtype: string
  splits:
  - name: train
    num_bytes: 160650224
    num_examples: 51247
  - name: validation
    num_bytes: 34461756
    num_examples: 10983
  - name: test
    num_bytes: 36109695
    num_examples: 10963
  download_size: 128366343
  dataset_size: 231221675
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
task_categories:
- text-classification
language:
- en
---
# A Comprehensive Dataset for Human vs. AI Generated Text Detection

This dataset is associated with the paper [A Comprehensive Dataset for Human vs. AI Generated Text Detection](https://huggingface.co/papers/2510.22874).

## Dataset Summary

The dataset comprises 73,193 text samples (51,247 train, 10,983 validation, 10,963 test) for the detection and attribution of AI-generated text. It combines authentic New York Times articles with synthetic versions generated by several state-of-the-art Large Language Models (LLMs). Its goal is to catalyze the development of robust detection methods in the era of generative AI.

### Generative Models Included

The synthetic portion of the dataset was created using the following models:

- Gemma-2-9b
- Mistral-7B
- Qwen-2-72B
- LLaMA-8B
- Yi-Large
- GPT-4-o

## Tasks

The dataset supports two primary benchmarking tasks:

1. **Human vs. AI Detection**: Distinguishing between human-authored narratives and AI-generated text.
2. **Model Attribution**: Identifying which specific LLM generated a given piece of AI text.

## Data Structure

The dataset contains the following features:

- `Text`: The full narrative content (either human-authored or AI-generated).
- `Label_A`: Integer label for binary classification (Human vs. AI).
- `Label_B`: String label for model attribution (identifying the specific source model or "Human").
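
The card does not state which integer in `Label_A` denotes human text, so it may help to check the encoding against `Label_B` before training. The following is a minimal sketch on in-memory toy rows that follow the schema above. The texts, the `Label_A` values, and the exact model-name strings are invented for illustration, and the helper names `label_crosstab` and `attribution_subset` are hypothetical.

```python
from collections import Counter

# Toy rows that mimic the card's schema. The texts and the Label_A values
# are invented; the real integer encoding should be read off the data.
rows = [
    {"Text": "Example human-written paragraph.", "Label_A": 0, "Label_B": "Human"},
    {"Text": "Example generated paragraph.", "Label_A": 1, "Label_B": "Mistral-7B"},
    {"Text": "Another generated paragraph.", "Label_A": 1, "Label_B": "Yi-Large"},
]


def label_crosstab(rows):
    """Count (Label_A, is_human) pairs to reveal how Label_A is encoded."""
    return Counter((r["Label_A"], r["Label_B"] == "Human") for r in rows)


def attribution_subset(rows):
    """Keep only AI-generated rows, the inputs for model attribution."""
    return [r for r in rows if r["Label_B"] != "Human"]


crosstab = label_crosstab(rows)
ai_rows = attribution_subset(rows)

for (label_a, is_human), count in sorted(crosstab.items()):
    print(f"Label_A={label_a} is_human={is_human} count={count}")
```

On the real splits, each `Label_A` value should pair with only one value of `is_human`; a mixed pairing would suggest that the two label columns disagree for some rows.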

## Citation

```bibtex
@misc{roy2026comprehensivedatasethumanvs,
  title={A Comprehensive Dataset for Human vs. AI Generated Text Detection},
  author={Rajarshi Roy and Gurpreet Singh and Ashhar Aziz and Shashwat Bajpai and Nasrin Imanpour and Shwetangshu Biswas and Kapil Wanaskar and Parth Patwa and Subhankar Ghosh and Shreyas Dixit and Nilesh Ranjan Pal and Vipula Rawte and Ritvik Garimella and Gaytri Jena and Amitava Das and Amit Sheth and Vasu Sharma and Aishwarya Naresh Reganti and Vinija Jain and Aman Chadha},
  year={2026},
  eprint={2510.22874},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.22874},
}
```