---
license: gpl-3.0
tags:
- adversarial-training
- vision-language-models
- image-text-retrieval
- multimodal
- CLIP
- ALBEF
- BLIP
---

# [WACV'26] Multimodal Adversarial Training — Resources

This repository hosts model checkpoints and data resources for the paper:

> **Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships**
> Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
> *IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026*

For the **source code and training scripts**, please refer to the GitHub repository:
👉 https://github.com/CyberAgentAILab/multimodal-adversarial-training

## 📘 Overview

This work proposes **Multimodal Adversarial Training (MAT)** for Vision-Language Models (VLMs): a unified adversarial training pipeline for image-text retrieval models. The extended version, **MAT+**, additionally leverages one-to-many relationships in image-text pairs (e.g., multiple valid captions per image) to improve robustness.

### Highlights

- Unified MAT pipeline for image-text retrieval models (CLIP, ALBEF, BLIP).
- MAT+ leverages one-to-many relationships in image-text pairs.
- Reproducible results on the Flickr30k and COCO benchmarks.
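To make the one-to-many idea concrete, here is a toy sketch (not the paper's implementation; the similarity scores and the multi-positive InfoNCE-style loss below are illustrative assumptions) of how several captions of the same image can all be treated as positives in a retrieval loss:

```python
import math

def multi_positive_nce(sim_row, positive_idx, temperature=0.07):
    """Toy InfoNCE-style loss for one image against a batch of captions.

    sim_row      -- similarity scores between one image and every caption
    positive_idx -- indices of captions that describe this image; the
                    one-to-many case passes several indices here
    """
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    denom = sum(exps)
    # Average the standard single-positive loss over all positives.
    losses = [-math.log(exps[i] / denom) for i in positive_idx]
    return sum(losses) / len(losses)

# One image vs. four captions; captions 0 and 1 both describe the image.
sims = [0.9, 0.8, 0.1, -0.2]
loss_one_to_many = multi_positive_nce(sims, [0, 1])
loss_one_to_one = multi_positive_nce(sims, [0])
```

With several positives, the loss no longer pushes valid captions of the same image apart, which is roughly the intuition behind exploiting one-to-many relationships in MAT+.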
## 📘 Directory structure

```
resources/
├── checkpoints/                 # MAT/MAT+ model checkpoints
│   ├── ALBEF_flickr_MAT_HumanCaps.pth
│   ├── BLIP_flickr_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_base.pth
│   └── CLIP_B_flickr_MAT_HumanCaps.pth
└── augmentations/               # Data augmentations for MAT+
    ├── dataset_json.zip         # Text augmentation annotations
    └── flickr_SD_I2I_0.5.zip    # Image augmentations (SD img2img)
```

## 📘 Checkpoints

Adversarially trained model checkpoints for image-text retrieval:

| File | Model | Dataset | Variant |
|------|-------|---------|---------|
| `ALBEF_flickr_MAT_HumanCaps.pth` | ALBEF | Flickr30k | MAT + HumanCaps |
| `BLIP_flickr_MAT_HumanCaps.pth` | BLIP | Flickr30k | MAT + HumanCaps |
| `CLIP_B_coco_MAT_HumanCaps.pth` | CLIP ViT-B | COCO | MAT + HumanCaps |
| `CLIP_B_coco_MAT_base.pth` | CLIP ViT-B | COCO | MAT (base) |
| `CLIP_B_flickr_MAT_HumanCaps.pth` | CLIP ViT-B | Flickr30k | MAT + HumanCaps |

The base models used for training are:

- **ALBEF**: [salesforce/ALBEF](https://github.com/salesforce/ALBEF)
- **BLIP**: [salesforce/BLIP](https://github.com/salesforce/BLIP)
- **CLIP**: [openai/CLIP](https://github.com/openai/CLIP) (ViT-B/16)

## 📘 Augmentations

Data augmentations used to reproduce MAT+ results:

| File | Description |
|------|-------------|
| `dataset_json.zip` | Text augmentation data — augmented captions and annotations in JSON format |
| `flickr_SD_I2I_0.5.zip` | Image augmentation data — Flickr30k images augmented via Stable Diffusion image-to-image (strength 0.5) |

## 📘 Usage

1. Clone or download this repository:

   ```bash
   # Using the Hugging Face CLI
   hf download cyberagent/multimodal-adversarial-training --local-dir ./resources

   # Or using git with LFS
   git lfs install
   git clone https://huggingface.co/cyberagent/multimodal-adversarial-training
   ```

2. Clone the [code repository](https://github.com/CyberAgentAILab/multimodal-adversarial-training) and follow its setup instructions.
3.
   Update the checkpoint and data paths in `configs/` to point to the downloaded resources.

## 📘 Citation

If you find these resources useful, please cite:

```bibtex
@inproceedings{waseda2026multimodal,
  title={Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
  author={Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026}
}
```

## 📘 Acknowledgements

This work builds upon the following repositories:

- **Models**: [ALBEF](https://github.com/salesforce/ALBEF), [BLIP](https://github.com/salesforce/BLIP)
- **Attacks**: [Co-Attack](https://github.com/adversarial-for-goodness/Co-Attack), [SGA](https://github.com/Zoky-2020/SGA)

## 📘 License

This repository is licensed under the [GNU General Public License v3.0](https://www.gnu.org/licenses/gpl-3.0.en.html).
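As a quick check after following the Usage steps above, the downloaded layout can be verified with a short Python sketch. The file names are taken from the directory structure in this card, and the default `./resources` root matches the download command; adjust both if your paths differ.

```python
from pathlib import Path

# File names as listed in this card's directory structure.
EXPECTED_FILES = [
    "checkpoints/ALBEF_flickr_MAT_HumanCaps.pth",
    "checkpoints/BLIP_flickr_MAT_HumanCaps.pth",
    "checkpoints/CLIP_B_coco_MAT_HumanCaps.pth",
    "checkpoints/CLIP_B_coco_MAT_base.pth",
    "checkpoints/CLIP_B_flickr_MAT_HumanCaps.pth",
    "augmentations/dataset_json.zip",
    "augmentations/flickr_SD_I2I_0.5.zip",
]

def missing_resources(root="./resources"):
    """Return the expected files that are absent under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]

if __name__ == "__main__":
    missing = missing_resources()
    if missing:
        print("Missing files:", *missing, sep="\n  ")
    else:
        print("All expected resources are present.")
```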