---
license: gpl-3.0
tags:
- adversarial-training
- vision-language-models
- image-text-retrieval
- multimodal
- CLIP
- ALBEF
- BLIP
---
# [WACV'26] Multimodal Adversarial Training: Resources
This repository hosts model checkpoints and data resources for the paper:
> **Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships**
> Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
> *IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026*
For the **source code and training scripts**, please refer to the GitHub repository:
👉 https://github.com/CyberAgentAILab/multimodal-adversarial-training
## 📘 Overview
This work proposes **Multimodal Adversarial Training (MAT)** for Vision-Language Models (VLMs). MAT is a unified adversarial training pipeline for image-text retrieval models. The extended version, **MAT+**, additionally leverages one-to-many relationships in image-text pairs to improve robustness.
### Highlights
- Unified MAT pipeline for image-text retrieval models (CLIP, ALBEF, BLIP).
- MAT+ leverages one-to-many relationships in image-text pairs.
- Reproducible results on Flickr30k and COCO benchmarks.
## 📘 Directory structure
```
resources/
├── checkpoints/                  # MAT/MAT+ model checkpoints
│   ├── ALBEF_flickr_MAT_HumanCaps.pth
│   ├── BLIP_flickr_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_base.pth
│   └── CLIP_B_flickr_MAT_HumanCaps.pth
└── augmentations/                # Data augmentations for MAT+
    ├── dataset_json.zip          # Text augmentation annotations
    └── flickr_SD_I2I_0.5.zip     # Image augmentations (SD img2img)
```
## 📘 Checkpoints
Adversarially trained model checkpoints for image-text retrieval:
| File | Model | Dataset | Variant |
|------|-------|---------|---------|
| `ALBEF_flickr_MAT_HumanCaps.pth` | ALBEF | Flickr30k | MAT + HumanCaps |
| `BLIP_flickr_MAT_HumanCaps.pth` | BLIP | Flickr30k | MAT + HumanCaps |
| `CLIP_B_coco_MAT_HumanCaps.pth` | CLIP ViT-B | COCO | MAT + HumanCaps |
| `CLIP_B_coco_MAT_base.pth` | CLIP ViT-B | COCO | MAT (base) |
| `CLIP_B_flickr_MAT_HumanCaps.pth` | CLIP ViT-B | Flickr30k | MAT + HumanCaps |
The base models used for training are:
- **ALBEF**: [salesforce/ALBEF](https://github.com/salesforce/ALBEF)
- **BLIP**: [salesforce/BLIP](https://github.com/salesforce/BLIP)
- **CLIP**: [openai/CLIP](https://github.com/openai/CLIP) (ViT-B/16)
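The checkpoint files follow a consistent naming scheme, `<MODEL>_<dataset>_MAT_<variant>.pth`. A small hypothetical helper (not part of the official code) that reconstructs a filename from the table above:

```python
def checkpoint_name(model: str, dataset: str, human_caps: bool = True) -> str:
    """Build a checkpoint filename following the naming scheme above.

    model:   "ALBEF", "BLIP", or "CLIP_B" (CLIP ViT-B/16)
    dataset: "flickr" (Flickr30k) or "coco" (COCO)
    """
    variant = "MAT_HumanCaps" if human_caps else "MAT_base"
    return f"{model}_{dataset}_{variant}.pth"

# checkpoint_name("CLIP_B", "coco", human_caps=False)
#   -> "CLIP_B_coco_MAT_base.pth"
```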
## 📘 Augmentations
Data augmentations used to reproduce MAT+ results:
| File | Description |
|------|-------------|
| `dataset_json.zip` | Text augmentation data: augmented captions and annotations in JSON format |
| `flickr_SD_I2I_0.5.zip` | Image augmentation data: Flickr30k images augmented via Stable Diffusion image-to-image (strength 0.5) |
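The JSON annotations can be inspected directly from the archive without a full extraction. A minimal sketch using the Python standard library (the member names inside `dataset_json.zip` are assumptions; check the actual archive layout):

```python
import json
import zipfile

def list_json_members(zip_path):
    """List all JSON annotation files contained in the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return [n for n in zf.namelist() if n.endswith(".json")]

def load_json_member(zip_path, member):
    """Load one JSON annotation file directly from the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as f:
            return json.load(f)
```

For example, `list_json_members("resources/augmentations/dataset_json.zip")` shows the available annotation files, after which `load_json_member` can read any of them into a Python object.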
## 📘 Usage
1. Download (or clone) this repository's resources:
```bash
# Using the Hugging Face CLI
hf download cyberagent/multimodal-adversarial-training --local-dir ./resources
# Or using git with LFS
git lfs install
git clone https://huggingface.co/cyberagent/multimodal-adversarial-training
```
2. Clone the [code repository](https://github.com/CyberAgentAILab/multimodal-adversarial-training) and follow its setup instructions.
3. Update the checkpoint and data paths in `configs/` to point to the downloaded resources.
## 📘 Citation
If you find these resources useful, please cite:
```bibtex
@inproceedings{waseda2026multimodal,
title={Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
author={Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2026}
}
```
## 📘 Acknowledgements
This work builds upon the following repositories:
- **Models**: [ALBEF](https://github.com/salesforce/ALBEF), [BLIP](https://github.com/salesforce/BLIP)
- **Attacks**: [Co-Attack](https://github.com/adversarial-for-goodness/Co-Attack), [SGA](https://github.com/Zoky-2020/SGA)
## 📘 License
This repository is licensed under the [GNU General Public License v3.0](https://www.gnu.org/licenses/gpl-3.0.en.html).