---
tags:
- multimodal
- classification
- content detection
---

# LanguageBind-MLP Model

## Model Description

This is a fine-tuned LanguageBind model for detecting machine-generated content across multiple modalities (text, image, and audio). The model is part of the **RU-AI** project, which introduces a large multimodal dataset for AI-generated content detection.

This model leverages LanguageBind's multi-modal semantic alignment capabilities to identify whether content is human-generated or machine-generated across different modalities.

## Model Details

- **Model Type:** Multi-modal classification model based on LanguageBind
- **Architecture:** LanguageBind with MLP classifier head
- **Paper:** [RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection](https://arxiv.org/abs/2406.04906)
- **GitHub Repository:** [ZhihaoZhang97/RU-AI](https://github.com/ZhihaoZhang97/RU-AI)
- **Accepted at:** WWW'25 Resource Track
- **Modalities Supported:** Text, Image, and Audio
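For intuition, the classifier head is a small feed-forward network on top of a pooled LanguageBind embedding. The sketch below is dependency-free pure Python with made-up layer sizes; the actual head is implemented in PyTorch, and its real dimensions are defined in the GitHub repository:

```python
import math
import random

def mlp_head(embedding, w1, b1, w2, b2):
    """Binary classifier head: Linear -> ReLU -> Linear -> sigmoid."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, embedding)) + b)
              for row, b in zip(w1, b1)]                 # Linear + ReLU
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2  # Linear -> scalar
    return 1.0 / (1.0 + math.exp(-logit))                # P(machine-generated)

# Toy dimensions (8-d embedding -> 4 hidden units -> 1 logit);
# real LanguageBind embeddings are much wider.
random.seed(0)
emb = [random.gauss(0, 1) for _ in range(8)]
w1 = [[random.gauss(0, 0.5) for _ in range(8)] for _ in range(4)]
b1 = [0.0] * 4
w2 = [random.gauss(0, 0.5) for _ in range(4)]
b2 = 0.0
p = mlp_head(emb, w1, b1, w2, b2)
assert 0.0 < p < 1.0  # a probability, not a hard label
```

The output is a probability rather than a hard label, which matters for the thresholding caveats discussed under Ethical Considerations below.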

## Intended Use

This model is designed for detecting AI-generated content in:

- **Text:** Identifying AI-written articles, essays, responses, and general text
- **Images:** Detecting images generated by models such as Stable Diffusion and DALL-E
- **Audio:** Identifying synthetic speech produced by TTS models

### Use Cases

- Content moderation and authenticity verification
- Academic integrity checking
- Media forensics and fact-checking
- Research on AI-generated content detection

## Training Data

The model was trained on the **RU-AI dataset**, which includes:

- **245,895** real/human-generated samples
- **1,229,475** machine-generated samples
- Multiple data sources: COCO, Flickr8k, and the Places dataset
- AI-generated content from various models:
  - Images: Stable Diffusion (v1.5, v6.0, XL v3.0, AbsoluteReality, EpicRealism)
  - Audio: EfficientSpeech, StyleTTS2, VITS, XTTS2, YourTTS
  - Text: Various LLM-generated captions and descriptions

The dataset is publicly available on [Zenodo](https://zenodo.org/records/11406538).
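Note that the counts above imply exactly five machine-generated samples per real one, so any retraining on this data needs to handle class imbalance. A quick check, with inverse-frequency class weights shown as one common option (the repository's actual training recipe may differ):

```python
real = 245_895
machine = 1_229_475
total = real + machine

ratio = machine / real        # 5.0: five machine samples per real one

# Inverse-frequency class weights -- a common choice for imbalanced
# binary classification, not necessarily what RU-AI used.
w_real = total / (2 * real)        # 3.0
w_machine = total / (2 * machine)  # 0.6
print(ratio, w_real, w_machine)
```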

## Requirements

### Hardware

- NVIDIA GPU with at least **16GB VRAM** (RTX 3090 24GB or higher recommended)
- At least **500GB** of disk space for the full dataset

### Software

- Python >= 3.8
- PyTorch >= 1.13.1
- CUDA >= 11.6
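A quick way to sanity-check these minimums is to compare dotted version strings numerically (plain string comparison gets `"1.9" > "1.13"` wrong). This helper is illustrative; the torch/CUDA strings below are placeholders to be replaced with `torch.__version__` and `torch.version.cuda` once PyTorch is installed:

```python
import sys

def meets_min(version: str, minimum: str) -> bool:
    """Compare dotted version strings numerically ('1.13.1' >= '1.13')."""
    def to_tuple(v):
        return tuple(int(part) for part in v.split("."))
    return to_tuple(version) >= to_tuple(minimum)

# The Python check uses the running interpreter; the other two strings
# are placeholders -- read them from torch after installation.
assert meets_min("%d.%d" % sys.version_info[:2], "3.8"), "Python >= 3.8 required"
assert meets_min("1.13.1", "1.13.1")  # e.g. torch.__version__
assert meets_min("11.6", "11.6")      # e.g. torch.version.cuda
```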

## Installation

```bash
# Clone the repository
git clone https://github.com/ZhihaoZhang97/RU-AI.git
cd RU-AI

# Create a virtual environment
conda create -n ruai python=3.8
conda activate ruai

# Install dependencies
pip3 install -r requirements.txt
```

## Usage

### Model Inference

```bash
# See infer_languagebind_model.py in the GitHub repository
python infer_languagebind_model.py
```

Before running inference, you need to:

1. Download the dataset or prepare your own data
2. Update the data paths in `infer_languagebind_model.py`:
   - `image_data_paths`
   - `audio_data_paths`
   - `text_data`
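The exact structure these variables expect is defined in `infer_languagebind_model.py` itself and should be checked there. Purely as an illustration (every path and string below is hypothetical), they might be populated along these lines:

```python
# Hypothetical example -- verify the real structure in
# infer_languagebind_model.py before editing your copy.
image_data_paths = [
    "./data/flickr8k/real/example_001.jpg",
    "./data/flickr8k/sd_v15/example_001.png",
]
audio_data_paths = [
    "./data/flickr8k/real_audio/example_001.wav",
    "./data/flickr8k/xtts2/example_001.wav",
]
text_data = [
    "A girl going into a wooden building.",
    "A young girl enters a rustic wooden cabin.",
]

# If the script pairs modalities by position, the lists must be
# index-aligned.
assert len(image_data_paths) == len(audio_data_paths) == len(text_data)
```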

### Quick Start with Sample Data

```bash
# Download Flickr8k sample data
python ./download_flickr.py

# Or download the full dataset (157GB compressed, 500GB uncompressed)
python ./download_all.py
```

## Model Performance

This model detects AI-generated content across multiple modalities simultaneously, leveraging LanguageBind's language-based semantic alignment to create unified representations.

For detailed performance metrics and evaluation results, please refer to the [paper](https://arxiv.org/abs/2406.04906).

## Limitations

- The model's performance depends on the quality and diversity of the training data
- It may not generalize well to AI models or techniques not represented in the training set
- Detection accuracy may vary across modalities
- Inference requires significant computational resources

## Ethical Considerations

This model is intended for research and legitimate content-verification purposes. Users should:

- Consider privacy implications when analyzing user-generated content
- Be aware of potential biases in the training data
- Use the model responsibly, and never for censorship without human oversight
- Understand that detection is probabilistic and may produce false positives and false negatives
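Because detection is probabilistic, downstream systems should expose the score and use tunable thresholds rather than a single hard verdict. One way to keep a human in the loop (the threshold values and action names here are illustrative, not part of this model):

```python
def decide(p_machine: float, flag_at: float = 0.5, review_at: float = 0.9) -> str:
    """Map a detector probability to an action with a human-review band.

    Scores between `flag_at` and `review_at` are routed to human review
    instead of being auto-actioned; both thresholds are illustrative and
    should be tuned against the false-positive cost of the deployment.
    """
    if p_machine < flag_at:
        return "pass"
    if p_machine < review_at:
        return "human_review"
    return "flag"

assert decide(0.12) == "pass"
assert decide(0.74) == "human_review"
assert decide(0.97) == "flag"
```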

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{huang2024ruai,
      title={RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection},
      author={Liting Huang and Zhihao Zhang and Yiran Zhang and Xiyue Zhou and Shoujin Wang},
      year={2024},
      eprint={2406.04906},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Acknowledgments

This work builds upon:

- [LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment](https://arxiv.org/abs/2310.01852)
- [ImageBind: One Embedding Space To Bind Them All](https://openaccess.thecvf.com/content/CVPR2023/papers/Girdhar_ImageBind_One_Embedding_Space_To_Bind_Them_All_CVPR_2023_paper.pdf)

We thank the open-source community for the datasets and models that made this work possible.

## License

Please refer to the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI) for license information.

## Contact

For questions and issues:

- Open an issue on the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI)
- Refer to the paper for the authors' contact information