---
license: gpl-3.0
tags:
- adversarial-training
- vision-language-models
- image-text-retrieval
- multimodal
- CLIP
- ALBEF
- BLIP
---

# [WACV'26] Multimodal Adversarial Training – Resources

This repository hosts model checkpoints and data resources for the paper:

> **Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships**
> Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
> *IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026*

For the **source code and training scripts**, please refer to the GitHub repository:
https://github.com/CyberAgentAILab/multimodal-adversarial-training

## Overview

This work proposes **Multimodal Adversarial Training (MAT)** for Vision-Language Models (VLMs). MAT is a unified adversarial training pipeline for image-text retrieval models. The extended version, **MAT+**, additionally leverages one-to-many relationships in image-text pairs to improve robustness.

### Highlights

- Unified MAT pipeline for image-text retrieval models (CLIP, ALBEF, BLIP).
- MAT+ leverages one-to-many relationships in image-text pairs.
- Reproducible results on Flickr30k and COCO benchmarks.

## Directory structure

```
resources/
├── checkpoints/                  # MAT/MAT+ model checkpoints
│   ├── ALBEF_flickr_MAT_HumanCaps.pth
│   ├── BLIP_flickr_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_base.pth
│   └── CLIP_B_flickr_MAT_HumanCaps.pth
└── augmentations/                # Data augmentations for MAT+
    ├── dataset_json.zip          # Text augmentation annotations
    └── flickr_SD_I2I_0.5.zip     # Image augmentations (SD img2img)
```

## Checkpoints

Adversarially trained model checkpoints for image-text retrieval:

| | File | Model | Dataset | Variant | |
| |------|-------|---------|---------| |
| | `ALBEF_flickr_MAT_HumanCaps.pth` | ALBEF | Flickr30k | MAT + HumanCaps | |
| | `BLIP_flickr_MAT_HumanCaps.pth` | BLIP | Flickr30k | MAT + HumanCaps | |
| | `CLIP_B_coco_MAT_HumanCaps.pth` | CLIP ViT-B | COCO | MAT + HumanCaps | |
| | `CLIP_B_coco_MAT_base.pth` | CLIP ViT-B | COCO | MAT (base) | |
| | `CLIP_B_flickr_MAT_HumanCaps.pth` | CLIP ViT-B | Flickr30k | MAT + HumanCaps | |

The base models used for training are:
- **ALBEF**: [salesforce/ALBEF](https://github.com/salesforce/ALBEF)
- **BLIP**: [salesforce/BLIP](https://github.com/salesforce/BLIP)
- **CLIP**: [openai/CLIP](https://github.com/openai/CLIP) (ViT-B/16)
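The released `.pth` files can normally be opened with `torch.load`, but the exact state-dict layout (for instance, whether the weights sit under a top-level `"model"` key) is defined by the training scripts in the code repository. The sketch below is an assumption-laden helper, not the repository's documented loading routine:

```python
import torch

def load_state_dict(path: str) -> dict:
    """Load a released .pth checkpoint and return its state dict.

    Assumption: the file holds either a raw state dict, or a wrapper
    dict with the weights under a "model" key. Inspect ckpt.keys()
    to confirm the actual layout of a given checkpoint.
    """
    ckpt = torch.load(path, map_location="cpu")
    if isinstance(ckpt, dict) and "model" in ckpt:
        return ckpt["model"]
    return ckpt
```

The returned dict can then be passed to `model.load_state_dict(...)` on the matching CLIP/ALBEF/BLIP architecture built by the code repository.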

## Augmentations

Data augmentations used to reproduce MAT+ results:

| | File | Description | |
| |------|-------------| |
| `dataset_json.zip` | Text augmentation data – augmented captions and annotations in JSON format |
| `flickr_SD_I2I_0.5.zip` | Image augmentation data – Flickr30k images augmented via Stable Diffusion image-to-image (strength 0.5) |
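Before training, the archives need to be unpacked. A minimal sketch, assuming they are standard zip archives (the helper name is ours, not from the code repository):

```python
import zipfile
from pathlib import Path

def extract_archive(archive: str, out_dir: str) -> list[str]:
    """Extract a zip archive into out_dir and return the extracted paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out)
        return [str(out / name) for name in zf.namelist()]
```

For example, `extract_archive("augmentations/dataset_json.zip", "data/annotations")` would place the JSON annotation files under `data/annotations/`.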

## Usage

1. Clone or download this repository:

   ```bash
   # Using the Hugging Face CLI
   hf download cyberagent/multimodal-adversarial-training --local-dir ./resources

   # Or using git with LFS
   git lfs install
   git clone https://huggingface.co/cyberagent/multimodal-adversarial-training
   ```

2. Clone the [code repository](https://github.com/CyberAgentAILab/multimodal-adversarial-training) and follow its setup instructions.

3. Update the checkpoint and data paths in `configs/` to point to the downloaded resources.
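The path update in step 3 can be scripted. This is a sketch under two assumptions we cannot verify from this card alone: that the configs are flat `key: value` YAML, and that keys such as `pretrained`, `image_root`, and `ann_root` hold paths (the real key names are defined in the code repository's `configs/`, so treat these as placeholders):

```python
from pathlib import Path

PATH_KEYS = {"pretrained", "image_root", "ann_root"}  # assumed key names

def point_config_at(config_text: str, resource_dir: str) -> str:
    """Rewrite path-valued keys in a flat 'key: value' config."""
    lines = []
    for line in config_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in PATH_KEYS and value.strip():
            # Keep only the file name and re-root it under resource_dir.
            lines.append(f"{key}: {resource_dir}/{Path(value.strip()).name}")
        else:
            lines.append(line)
    return "\n".join(lines)
```

Always diff the rewritten config against the original before training, since nested YAML or differently named keys would pass through unchanged.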

## Citation

If you find these resources useful, please cite:

```bibtex
@inproceedings{waseda2026multimodal,
  title={Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
  author={Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026}
}
```

## Acknowledgements

This work builds upon the following repositories:
- **Models**: [ALBEF](https://github.com/salesforce/ALBEF), [BLIP](https://github.com/salesforce/BLIP)
- **Attacks**: [Co-Attack](https://github.com/adversarial-for-goodness/Co-Attack), [SGA](https://github.com/Zoky-2020/SGA)

## License

This repository is licensed under the [GNU General Public License v3.0](https://www.gnu.org/licenses/gpl-3.0.en.html).