---
license: mit
tags:
- multimodal
- classification
- content detection
---
# LanguageBind-MLP Model
## Model Description
This is a fine-tuned LanguageBind model for detecting machine-generated content across multiple modalities (text, image, and audio). The model is part of the **RU-AI** project, which introduces a large multimodal dataset for AI-generated content detection.
This model leverages LanguageBind's multi-modal semantic alignment capabilities to identify whether content is human-generated or machine-generated across different modalities.
## Model Details
- **Model Type:** Multi-modal classification model based on LanguageBind
- **Architecture:** LanguageBind encoder with an MLP classifier head (a rough sketch follows this list)
- **Paper:** [RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection](https://arxiv.org/abs/2406.04906)
- **GitHub Repository:** [ZhihaoZhang97/RU-AI](https://github.com/ZhihaoZhang97/RU-AI)
- **Accepted at:** WWW'25 Resource Track
- **Modalities Supported:** Text, Image, and Audio
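
To make the architecture concrete, the snippet below shows what an MLP classifier head over pooled LanguageBind embeddings could look like. This is a minimal sketch: the embedding dimension, hidden size, dropout rate, and layer count are illustrative assumptions rather than the exact configuration of the released checkpoint.

```python
import torch
import torch.nn as nn

class MLPClassifierHead(nn.Module):
    """Binary human-vs-machine classifier on top of LanguageBind embeddings (illustrative)."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 512, num_classes: int = 2):
        super().__init__()
        # embed_dim, hidden_dim and dropout are assumptions, not the released checkpoint's values
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, embed_dim) tensors produced by the LanguageBind encoder
        return self.mlp(embeddings)

# Example: score a batch of 4 precomputed embeddings
head = MLPClassifierHead()
logits = head(torch.randn(4, 768))  # shape (4, 2): human vs. machine
```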
## Intended Use
This model is designed for detecting AI-generated content in:
- **Text:** Identifying AI-written articles, essays, responses, and general text
- **Images:** Detecting images generated by models like Stable Diffusion, DALL-E, etc.
- **Audio:** Identifying synthetic speech from TTS models
### Use Cases
- Content moderation and authenticity verification
- Academic integrity checking
- Media forensics and fact-checking
- Research on AI-generated content detection
## Training Data
The model was trained on the **RU-AI dataset**, which includes:
- **245,895** real/human-generated samples
- **1,229,475** machine-generated samples (five machine-generated samples for every real sample)
- Source datasets: COCO, Flickr8k, and Places
- AI-generated content from various models:
- Images: Stable Diffusion (v1.5, v6.0, XL v3.0, AbsoluteReality, EpicRealism)
- Audio: EfficientSpeech, StyleTTS2, VITS, XTTS2, YourTTS
- Text: Various LLM-generated captions and descriptions
The dataset is publicly available on [Zenodo](https://zenodo.org/records/11406538).
## Requirements
### Hardware
- NVIDIA GPU with at least **16GB VRAM** (RTX 3090 24GB or higher recommended)
- At least **500GB** disk space for the full dataset
### Software
- Python >= 3.8
- PyTorch >= 1.13.1
- CUDA >= 11.6
## Installation
```bash
# Clone the repository
git clone https://github.com/ZhihaoZhang97/RU-AI.git
cd RU-AI
# Create virtual environment
conda create -n ruai python=3.8
conda activate ruai
# Install dependencies
pip3 install -r requirements.txt
```
## Usage
### Model Inference
```bash
# Run the inference script provided in the GitHub repository
python infer_languagebind_model.py
```
Before running inference, you need to:
1. Download the dataset or prepare your own data
2. Update the data paths in `infer_languagebind_model.py` (an illustrative example follows the list below):
- `image_data_paths`
- `audio_data_paths`
- `text_data`
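
As a hedged illustration, these variables are plain Python lists of file paths and text strings. The file names below are hypothetical placeholders; the authoritative structure is whatever `infer_languagebind_model.py` defines in the repository.

```python
# Hypothetical edits near the top of infer_languagebind_model.py
# (paths below are placeholders -- point them at your own data)
image_data_paths = [
    "./data/images/real_0001.jpg",   # human-captured photo
    "./data/images/sdxl_0001.png",   # Stable Diffusion XL output
]
audio_data_paths = [
    "./data/audio/human_0001.wav",   # recorded human speech
    "./data/audio/xtts2_0001.wav",   # XTTS2 synthetic speech
]
text_data = [
    "A man rides a bicycle down a tree-lined street.",               # human caption
    "The image depicts a cyclist travelling along a shaded avenue.", # LLM-generated caption
]
```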
### Quick Start with Sample Data
```bash
# Download Flickr8k sample data
python ./download_flickr.py
# Or download the full dataset (157GB compressed, 500GB uncompressed)
python ./download_all.py
```
## Model Performance
This model is designed to detect AI-generated content across multiple modalities simultaneously, leveraging LanguageBind's language-based semantic alignment to create unified representations.
For detailed performance metrics and evaluation results, please refer to the [paper](https://arxiv.org/abs/2406.04906).
## Limitations
- The model's performance depends on the quality and diversity of training data
- May not generalize well to AI models or techniques not represented in the training set
- Detection accuracy may vary across different modalities
- Requires significant computational resources for inference
## Ethical Considerations
This model is intended for research and legitimate content verification purposes. Users should:
- Consider privacy implications when analyzing user-generated content
- Be aware of potential biases in training data
- Use the model responsibly and not for censorship without human oversight
- Understand that detection is probabilistic and may produce false positives/negatives
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{huang2024ruai,
  title={RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection},
  author={Liting Huang and Zhihao Zhang and Yiran Zhang and Xiyue Zhou and Shoujin Wang},
  year={2024},
  eprint={2406.04906},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
## Acknowledgments
This work builds upon:
- [LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment](https://arxiv.org/abs/2310.01852)
- [ImageBind: One Embedding Space To Bind Them All](https://openaccess.thecvf.com/content/CVPR2023/papers/Girdhar_ImageBind_One_Embedding_Space_To_Bind_Them_All_CVPR_2023_paper.pdf)
We appreciate the open-source community for the datasets and models that made this work possible.
## License
Please refer to the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI) for license information.
## Contact
For questions and issues:
- Open an issue on the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI)
- Refer to the paper for the authors' contact information