---
license: mit
tags:
- multimodal
- classification
- content detection
---
# LanguageBind-MLP Model
## Model Description
This is a fine-tuned LanguageBind model for detecting machine-generated content across multiple modalities (text, image, and audio). The model is part of the **RU-AI** project, which introduces a large multimodal dataset for AI-generated content detection.
The model leverages LanguageBind's language-based semantic alignment to map text, images, and audio into a shared embedding space, where a classifier head labels each input as human-generated or machine-generated.
## Model Details
- **Model Type:** Multi-modal classification model based on LanguageBind
- **Architecture:** LanguageBind with MLP classifier head
- **Paper:** [RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection](https://arxiv.org/abs/2406.04906)
- **GitHub Repository:** [ZhihaoZhang97/RU-AI](https://github.com/ZhihaoZhang97/RU-AI)
- **Accepted at:** WWW'25 Resource Track
- **Modalities Supported:** Text, Image, and Audio
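The architecture described above can be sketched as follows. This is a minimal, illustrative reconstruction only: the embedding size (768), hidden width (256), and random weights are assumptions, and the LanguageBind encoder is replaced by a fake embedding, not the project's actual code.

```python
import numpy as np

EMBED_DIM = 768    # assumed LanguageBind embedding size
HIDDEN_DIM = 256   # assumed MLP hidden width

rng = np.random.default_rng(0)
W1 = rng.standard_normal((HIDDEN_DIM, EMBED_DIM)) * 0.02
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.standard_normal((2, HIDDEN_DIM)) * 0.02
b2 = np.zeros(2)

def mlp_head(embedding: np.ndarray) -> np.ndarray:
    """MLP classifier head: embedding -> probabilities (human, machine)."""
    h = np.maximum(W1 @ embedding + b1, 0.0)  # ReLU hidden layer
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())         # numerically stable softmax
    return e / e.sum()

# Stand-in for a LanguageBind encoder output (any modality)
probs = mlp_head(rng.standard_normal(EMBED_DIM))
print(probs)
```

The same head is applied regardless of modality, because LanguageBind aligns all encoders into one embedding space.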
## Intended Use
This model is designed for detecting AI-generated content in:
- **Text:** Identifying AI-written articles, essays, responses, and general text
- **Images:** Detecting images generated by models like Stable Diffusion, DALL-E, etc.
- **Audio:** Identifying synthetic speech from TTS models
### Use Cases
- Content moderation and authenticity verification
- Academic integrity checking
- Media forensics and fact-checking
- Research on AI-generated content detection
## Training Data
The model was trained on the **RU-AI dataset**, which includes:
- **245,895** real/human-generated samples
- **1,229,475** machine-generated samples
- Source datasets: COCO, Flickr8k, and Places
- AI-generated content from various models:
- Images: Stable Diffusion (v1.5, v6.0, XL v3.0, AbsoluteReality, EpicRealism)
- Audio: EfficientSpeech, StyleTTS2, VITS, XTTS2, YourTTS
- Text: Various LLM-generated captions and descriptions
The dataset is publicly available at [Zenodo](https://zenodo.org/records/11406538).
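Note that the dataset has exactly five machine-generated samples per human-generated one (1,229,475 vs. 245,895). One common way to account for such imbalance during fine-tuning is inverse-frequency class weighting; the sketch below is illustrative only and is not necessarily the weighting scheme used by the RU-AI authors.

```python
# Inverse-frequency class weights for the 5:1 RU-AI class imbalance.
# Weight for class c = N / (num_classes * n_c).
counts = {"human": 245_895, "machine": 1_229_475}
total = sum(counts.values())
num_classes = len(counts)

weights = {label: total / (num_classes * n) for label, n in counts.items()}
print(weights)  # {'human': 3.0, 'machine': 0.6}
```

The minority (human) class gets a weight of 3.0 and the majority (machine) class 0.6, so the loss treats both classes as equally important in expectation.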
## Requirements
### Hardware
- NVIDIA GPU with at least **16GB VRAM** (RTX 3090 24GB or higher recommended)
- At least **500GB** disk space for the full dataset
### Software
- Python >= 3.8
- PyTorch >= 1.13.1
- CUDA >= 11.6
## Installation
```bash
# Clone the repository
git clone https://github.com/ZhihaoZhang97/RU-AI.git
cd RU-AI
# Create virtual environment
conda create -n ruai python=3.8
conda activate ruai
# Install dependencies
pip3 install -r requirements.txt
```
## Usage
### Model Inference
```bash
# Run the inference script from the GitHub repository
python infer_languagebind_model.py
```
Before running inference, you need to:
1. Download the dataset or prepare your own data
2. Update the data paths in `infer_languagebind_model.py`:
- `image_data_paths`
- `audio_data_paths`
- `text_data`
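The steps above can be sketched as a minimal inference loop. The variable names mirror the ones you edit in `infer_languagebind_model.py`, but everything else here is an assumption: the file paths are hypothetical, and both the encoder and the classifier head are random stand-ins rather than the repository's actual LanguageBind model.

```python
import numpy as np

# Hypothetical inputs, mirroring the paths edited in infer_languagebind_model.py
image_data_paths = ["samples/image_001.jpg"]
audio_data_paths = ["samples/audio_001.wav"]
text_data = ["An example caption to classify."]

EMBED_DIM = 768  # assumed LanguageBind embedding size
rng = np.random.default_rng(0)
W, b = rng.standard_normal((2, EMBED_DIM)) * 0.01, np.zeros(2)

def encode(item: str) -> np.ndarray:
    """Placeholder for the modality-specific LanguageBind encoder."""
    item_rng = np.random.default_rng(abs(hash(item)) % 2**32)
    return item_rng.standard_normal(EMBED_DIM)

def predict(embedding: np.ndarray) -> str:
    """Placeholder linear head over (human, machine) logits."""
    logits = W @ embedding + b
    return "machine" if logits[1] > logits[0] else "human"

for modality, items in [("image", image_data_paths),
                        ("audio", audio_data_paths),
                        ("text", text_data)]:
    for item in items:
        print(f"{modality}: {item} -> {predict(encode(item))}")
```

In the real script, `encode` would dispatch to LanguageBind's image, audio, or text encoder depending on the input type.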
### Quick Start with Sample Data
```bash
# Download Flickr8k sample data
python ./download_flickr.py
# Or download the full dataset (157GB compressed, 500GB uncompressed)
python ./download_all.py
```
## Model Performance
This model is designed to detect AI-generated content across multiple modalities simultaneously, leveraging LanguageBind's language-based semantic alignment to create unified representations.
For detailed performance metrics and evaluation results, please refer to the [paper](https://arxiv.org/abs/2406.04906).
## Limitations
- The model's performance depends on the quality and diversity of training data
- May not generalize well to AI models or techniques not represented in the training set
- Detection accuracy may vary across different modalities
- Requires significant computational resources for inference
## Ethical Considerations
This model is intended for research and legitimate content verification purposes. Users should:
- Consider privacy implications when analyzing user-generated content
- Be aware of potential biases in training data
- Use the model responsibly; do not base censorship or content-removal decisions on its output without human oversight
- Understand that detection is probabilistic and may produce false positives/negatives
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{huang2024ruai,
  title={RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection},
  author={Liting Huang and Zhihao Zhang and Yiran Zhang and Xiyue Zhou and Shoujin Wang},
  year={2024},
  eprint={2406.04906},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
## Acknowledgments
This work builds upon:
- [LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment](https://arxiv.org/abs/2310.01852)
- [ImageBind: One Embedding Space To Bind Them All](https://openaccess.thecvf.com/content/CVPR2023/papers/Girdhar_ImageBind_One_Embedding_Space_To_Bind_Them_All_CVPR_2023_paper.pdf)
We appreciate the open-source community for the datasets and models that made this work possible.
## License
Please refer to the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI) for license information.
## Contact
For questions and issues:
- Open an issue on the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI)
- Refer to the paper for contact information of the authors