---
tags:
- multimodal
- classification
- content detection
---

# LanguageBind-MLP Model

## Model Description

This is a fine-tuned LanguageBind model for detecting machine-generated content across multiple modalities (text, image, and audio). The model is part of the **RU-AI** project, which introduces a large multimodal dataset for AI-generated content detection.

This model leverages LanguageBind's multi-modal semantic alignment capabilities to identify whether content is human-generated or machine-generated across different modalities.

## Model Details

- **Model Type:** Multi-modal classification model based on LanguageBind
- **Architecture:** LanguageBind with MLP classifier head
- **Paper:** [RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection](https://arxiv.org/abs/2406.04906)
- **GitHub Repository:** [ZhihaoZhang97/RU-AI](https://github.com/ZhihaoZhang97/RU-AI)
- **Accepted at:** WWW'25 Resource Track
- **Modalities Supported:** Text, Image, and Audio
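For intuition, the classifier head is a small feed-forward network on top of a pooled LanguageBind embedding. The sketch below is dependency-free pure Python with made-up layer sizes; the actual head is implemented in PyTorch, and its real dimensions are defined in the GitHub repository:

```python
import math
import random

def mlp_head(embedding, w1, b1, w2, b2):
    """Binary classifier head: Linear -> ReLU -> Linear -> sigmoid."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, embedding)) + b)
              for row, b in zip(w1, b1)]                 # Linear + ReLU
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2  # Linear -> scalar
    return 1.0 / (1.0 + math.exp(-logit))                # P(machine-generated)

# Toy dimensions (8-d embedding -> 4 hidden units -> 1 logit);
# real LanguageBind embeddings are much wider.
random.seed(0)
emb = [random.gauss(0, 1) for _ in range(8)]
w1 = [[random.gauss(0, 0.5) for _ in range(8)] for _ in range(4)]
b1 = [0.0] * 4
w2 = [random.gauss(0, 0.5) for _ in range(4)]
b2 = 0.0
p = mlp_head(emb, w1, b1, w2, b2)
assert 0.0 < p < 1.0  # a probability, not a hard label
```

The output is a probability rather than a hard label, which matters for the thresholding caveats discussed under Ethical Considerations below.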

## Intended Use

This model is designed for detecting AI-generated content in:

- **Text:** Identifying AI-written articles, essays, responses, and general text
- **Images:** Detecting images generated by models such as Stable Diffusion and DALL-E
- **Audio:** Identifying synthetic speech produced by TTS models

### Use Cases

- Content moderation and authenticity verification
- Academic integrity checking
- Media forensics and fact-checking
- Research on AI-generated content detection

## Training Data

The model was trained on the **RU-AI dataset**, which includes:

- **245,895** real/human-generated samples
- **1,229,475** machine-generated samples
- Multiple data sources: COCO, Flickr8k, and the Places dataset
- AI-generated content from various models:
  - Images: Stable Diffusion (v1.5, v6.0, XL v3.0, AbsoluteReality, EpicRealism)
  - Audio: EfficientSpeech, StyleTTS2, VITS, XTTS2, YourTTS
  - Text: Various LLM-generated captions and descriptions

The dataset is publicly available on [Zenodo](https://zenodo.org/records/11406538).
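Note that the counts above imply exactly five machine-generated samples per real one, so any retraining on this data needs to handle class imbalance. A quick check, with inverse-frequency class weights shown as one common option (the repository's actual training recipe may differ):

```python
real = 245_895
machine = 1_229_475
total = real + machine

ratio = machine / real        # 5.0: five machine samples per real one

# Inverse-frequency class weights -- a common choice for imbalanced
# binary classification, not necessarily what RU-AI used.
w_real = total / (2 * real)        # 3.0
w_machine = total / (2 * machine)  # 0.6
print(ratio, w_real, w_machine)
```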

## Requirements

### Hardware

- NVIDIA GPU with at least **16GB VRAM** (RTX 3090 24GB or higher recommended)
- At least **500GB** of disk space for the full dataset

### Software

- Python >= 3.8
- PyTorch >= 1.13.1
- CUDA >= 11.6
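A quick way to sanity-check these minimums is to compare dotted version strings numerically (plain string comparison gets `"1.9" > "1.13"` wrong). This helper is illustrative; the torch/CUDA strings below are placeholders to be replaced with `torch.__version__` and `torch.version.cuda` once PyTorch is installed:

```python
import sys

def meets_min(version: str, minimum: str) -> bool:
    """Compare dotted version strings numerically ('1.13.1' >= '1.13')."""
    def to_tuple(v):
        return tuple(int(part) for part in v.split("."))
    return to_tuple(version) >= to_tuple(minimum)

# The Python check uses the running interpreter; the other two strings
# are placeholders -- read them from torch after installation.
assert meets_min("%d.%d" % sys.version_info[:2], "3.8"), "Python >= 3.8 required"
assert meets_min("1.13.1", "1.13.1")  # e.g. torch.__version__
assert meets_min("11.6", "11.6")      # e.g. torch.version.cuda
```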

## Installation

```bash
# Clone the repository
git clone https://github.com/ZhihaoZhang97/RU-AI.git
cd RU-AI

# Create a virtual environment
conda create -n ruai python=3.8
conda activate ruai

# Install dependencies
pip3 install -r requirements.txt
```

## Usage

### Model Inference

```bash
# See infer_languagebind_model.py in the GitHub repository
python infer_languagebind_model.py
```

Before running inference, you need to:

1. Download the dataset or prepare your own data
2. Update the data paths in `infer_languagebind_model.py`:
   - `image_data_paths`
   - `audio_data_paths`
   - `text_data`
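The exact structure these variables expect is defined in `infer_languagebind_model.py` itself and should be checked there. Purely as an illustration (every path and string below is hypothetical), they might be populated along these lines:

```python
# Hypothetical example -- verify the real structure in
# infer_languagebind_model.py before editing your copy.
image_data_paths = [
    "./data/flickr8k/real/example_001.jpg",
    "./data/flickr8k/sd_v15/example_001.png",
]
audio_data_paths = [
    "./data/flickr8k/real_audio/example_001.wav",
    "./data/flickr8k/xtts2/example_001.wav",
]
text_data = [
    "A girl going into a wooden building.",
    "A young girl enters a rustic wooden cabin.",
]

# If the script pairs modalities by position, the lists must be
# index-aligned.
assert len(image_data_paths) == len(audio_data_paths) == len(text_data)
```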

### Quick Start with Sample Data

```bash
# Download Flickr8k sample data
python ./download_flickr.py

# Or download the full dataset (157GB compressed, 500GB uncompressed)
python ./download_all.py
```

## Model Performance

This model detects AI-generated content across multiple modalities simultaneously, leveraging LanguageBind's language-based semantic alignment to create unified representations.

For detailed performance metrics and evaluation results, please refer to the [paper](https://arxiv.org/abs/2406.04906).

## Limitations

- The model's performance depends on the quality and diversity of the training data
- It may not generalize well to AI models or techniques not represented in the training set
- Detection accuracy may vary across modalities
- Inference requires significant computational resources

## Ethical Considerations

This model is intended for research and legitimate content-verification purposes. Users should:

- Consider privacy implications when analyzing user-generated content
- Be aware of potential biases in the training data
- Use the model responsibly, and never for censorship without human oversight
- Understand that detection is probabilistic and may produce false positives and false negatives
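Because detection is probabilistic, downstream systems should expose the score and use tunable thresholds rather than a single hard verdict. One way to keep a human in the loop (the threshold values and action names here are illustrative, not part of this model):

```python
def decide(p_machine: float, flag_at: float = 0.5, review_at: float = 0.9) -> str:
    """Map a detector probability to an action with a human-review band.

    Scores between `flag_at` and `review_at` are routed to human review
    instead of being auto-actioned; both thresholds are illustrative and
    should be tuned against the false-positive cost of the deployment.
    """
    if p_machine < flag_at:
        return "pass"
    if p_machine < review_at:
        return "human_review"
    return "flag"

assert decide(0.12) == "pass"
assert decide(0.74) == "human_review"
assert decide(0.97) == "flag"
```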

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{huang2024ruai,
      title={RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection},
      author={Liting Huang and Zhihao Zhang and Yiran Zhang and Xiyue Zhou and Shoujin Wang},
      year={2024},
      eprint={2406.04906},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Acknowledgments

This work builds upon:

- [LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment](https://arxiv.org/abs/2310.01852)
- [ImageBind: One Embedding Space To Bind Them All](https://openaccess.thecvf.com/content/CVPR2023/papers/Girdhar_ImageBind_One_Embedding_Space_To_Bind_Them_All_CVPR_2023_paper.pdf)

We thank the open-source community for the datasets and models that made this work possible.

## License

Please refer to the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI) for license information.

## Contact

For questions and issues:

- Open an issue on the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI)
- Refer to the paper for the authors' contact information