---
license: mit
tags:
- multimodal
- classification
- content-detection
---

# LanguageBind-MLP Model

## Model Description

This is a fine-tuned LanguageBind model for detecting machine-generated content across multiple modalities (text, image, and audio). The model is part of the **RU-AI** project, which introduces a large multimodal dataset for AI-generated content detection.

This model leverages LanguageBind's multi-modal semantic alignment capabilities to identify whether content is human-generated or machine-generated across different modalities.

## Model Details

- **Model Type:** Multi-modal classification model based on LanguageBind
- **Architecture:** LanguageBind with MLP classifier head
- **Paper:** [RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection](https://arxiv.org/abs/2406.04906)
- **GitHub Repository:** [ZhihaoZhang97/RU-AI](https://github.com/ZhihaoZhang97/RU-AI)
- **Accepted at:** WWW'25 Resource Track
- **Modalities Supported:** Text, Image, and Audio
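The classifier head can be pictured as a small MLP on top of frozen LanguageBind embeddings. The sketch below is illustrative only: the repository's actual layer sizes, dropout, and embedding dimension (768 is assumed here) may differ.

```python
import torch
import torch.nn as nn

class MLPClassifierHead(nn.Module):
    """Binary human-vs-machine classifier over encoder embeddings.

    Hypothetical sketch: hidden size, dropout, and the 768-dim embedding
    assumption are illustrative, not the repository's exact configuration.
    """
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, embed_dim) from any LanguageBind modality encoder
        return self.net(embeddings)

head = MLPClassifierHead()
logits = head(torch.randn(4, 768))  # four dummy embeddings
print(logits.shape)  # torch.Size([4, 2])
```

Because LanguageBind aligns all modalities into one embedding space, a single head of this shape can in principle serve text, image, and audio inputs alike.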

## Intended Use

This model is designed for detecting AI-generated content in:
- **Text:** Identifying AI-written articles, essays, responses, and general text
- **Images:** Detecting images generated by models like Stable Diffusion, DALL-E, etc.
- **Audio:** Identifying synthetic speech from TTS models

### Use Cases
- Content moderation and authenticity verification
- Academic integrity checking
- Media forensics and fact-checking
- Research on AI-generated content detection

## Training Data

The model was trained on the **RU-AI dataset**, which includes:
- **245,895** real/human-generated samples
- **1,229,475** machine-generated samples
- Multiple data sources: COCO, Flickr8k, Places dataset
- AI-generated content from various models:
  - Images: Stable Diffusion (v1.5, v6.0, XL v3.0, AbsoluteReality, EpicRealism)
  - Audio: EfficientSpeech, StyleTTS2, VITS, XTTS2, YourTTS
  - Text: Various LLM-generated captions and descriptions

The dataset is publicly available on [Zenodo](https://zenodo.org/records/11406538).
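Note that the dataset is imbalanced exactly 5:1 in favor of machine-generated samples (1,229,475 vs. 245,895). One common mitigation for training on such data (an assumption here, not necessarily what the authors did) is inverse-frequency class weights in the loss:

```python
import torch
import torch.nn as nn

# Sample counts from the RU-AI dataset card
n_human = 245_895
n_machine = 1_229_475
total = n_human + n_machine

# Inverse-frequency weights, normalized so the two classes average to 1
counts = torch.tensor([n_human, n_machine], dtype=torch.float32)
weights = total / (2 * counts)
print(weights)  # tensor([3.0000, 0.6000])

# Weighted cross-entropy: errors on the minority (human) class cost 5x more
criterion = nn.CrossEntropyLoss(weight=weights)
```

With these counts the weights come out to exactly 3.0 and 0.6, preserving the 5:1 ratio between the two penalty terms.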

## Requirements

### Hardware
- NVIDIA GPU with at least **16GB VRAM** (RTX 3090 24GB or higher recommended)
- At least **500GB** disk space for the full dataset

### Software
- Python >= 3.8
- PyTorch >= 1.13.1
- CUDA >= 11.6

## Installation

```bash
# Clone the repository
git clone https://github.com/ZhihaoZhang97/RU-AI.git
cd RU-AI

# Create virtual environment
conda create -n ruai python=3.8
conda activate ruai

# Install dependencies
pip3 install -r requirements.txt
```

## Usage

### Model Inference

```python
# See infer_languagebind_model.py in the GitHub repository
python infer_languagebind_model.py
```

Before running inference, you need to:
1. Download the dataset or prepare your own data
2. Update the data paths in `infer_languagebind_model.py`:
   - `image_data_paths`
   - `audio_data_paths`
   - `text_data`

### Quick Start with Sample Data

```bash
# Download Flickr8k sample data
python ./download_flickr.py

# Or download the full dataset (157GB compressed, 500GB uncompressed)
python ./download_all.py
```

## Model Performance

This model is designed to detect AI-generated content across multiple modalities simultaneously, leveraging LanguageBind's language-based semantic alignment to create unified representations.

For detailed performance metrics and evaluation results, please refer to the [paper](https://arxiv.org/abs/2406.04906).

## Limitations

- The model's performance depends on the quality and diversity of training data
- May not generalize well to AI models or techniques not represented in the training set
- Detection accuracy may vary across different modalities
- Requires significant computational resources for inference

## Ethical Considerations

This model is intended for research and legitimate content verification purposes. Users should:
- Consider privacy implications when analyzing user-generated content
- Be aware of potential biases in training data
- Use the model responsibly and not for censorship without human oversight
- Understand that detection is probabilistic and may produce false positives/negatives

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{huang2024ruai,
  title={RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection},
  author={Liting Huang and Zhihao Zhang and Yiran Zhang and Xiyue Zhou and Shoujin Wang},
  year={2024},
  eprint={2406.04906},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

## Acknowledgments

This work builds upon:
- [LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment](https://arxiv.org/abs/2310.01852)
- [ImageBind: One Embedding Space To Bind Them All](https://openaccess.thecvf.com/content/CVPR2023/papers/Girdhar_ImageBind_One_Embedding_Space_To_Bind_Them_All_CVPR_2023_paper.pdf)

We appreciate the open-source community for the datasets and models that made this work possible.

## License

Please refer to the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI) for license information.

## Contact

For questions and issues:
- Open an issue on the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI)
- Refer to the paper for contact information of the authors