---
license: gpl-3.0
tags:
- adversarial-training
- vision-language-models
- image-text-retrieval
- multimodal
- CLIP
- ALBEF
- BLIP
---

# [WACV'26] Multimodal Adversarial Training — Resources

This repository hosts model checkpoints and data resources for the paper:

> **Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships**
> Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
> *IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026*

For the **source code and training scripts**, please refer to the GitHub repository:
👉 https://github.com/CyberAgentAILab/multimodal-adversarial-training

## 📘 Overview

This work proposes **Multimodal Adversarial Training (MAT)** for Vision-Language Models (VLMs): a unified adversarial training pipeline for image-text retrieval models. The extended version, **MAT+**, additionally leverages one-to-many relationships in image-text pairs (e.g., multiple valid captions per image) to improve robustness.

### Highlights

- Unified MAT pipeline for image-text retrieval models (CLIP, ALBEF, BLIP).
- MAT+ leverages one-to-many relationships in image-text pairs.
- Reproducible results on the Flickr30k and COCO benchmarks.
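To make the one-to-many idea concrete, here is a toy sketch (not the paper's implementation; the similarity scores and the multi-positive InfoNCE-style loss below are illustrative assumptions) of how several captions of the same image can all be treated as positives in a retrieval loss:

```python
import math

def multi_positive_nce(sim_row, positive_idx, temperature=0.07):
    """Toy InfoNCE-style loss for one image against a batch of captions.

    sim_row      -- similarity scores between one image and every caption
    positive_idx -- indices of captions that describe this image; the
                    one-to-many case passes several indices here
    """
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    denom = sum(exps)
    # Average the standard single-positive loss over all positives.
    losses = [-math.log(exps[i] / denom) for i in positive_idx]
    return sum(losses) / len(losses)

# One image vs. four captions; captions 0 and 1 both describe the image.
sims = [0.9, 0.8, 0.1, -0.2]
loss_one_to_many = multi_positive_nce(sims, [0, 1])
loss_one_to_one = multi_positive_nce(sims, [0])
```

With several positives, the loss no longer pushes valid captions of the same image apart, which is roughly the intuition behind exploiting one-to-many relationships in MAT+.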
## 📘 Directory structure

```
resources/
├── checkpoints/                 # MAT/MAT+ model checkpoints
│   ├── ALBEF_flickr_MAT_HumanCaps.pth
│   ├── BLIP_flickr_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_base.pth
│   └── CLIP_B_flickr_MAT_HumanCaps.pth
└── augmentations/               # Data augmentations for MAT+
    ├── dataset_json.zip         # Text augmentation annotations
    └── flickr_SD_I2I_0.5.zip    # Image augmentations (SD img2img)
```

## 📘 Checkpoints

Adversarially trained model checkpoints for image-text retrieval:

| File | Model | Dataset | Variant |
|------|-------|---------|---------|
| `ALBEF_flickr_MAT_HumanCaps.pth` | ALBEF | Flickr30k | MAT + HumanCaps |
| `BLIP_flickr_MAT_HumanCaps.pth` | BLIP | Flickr30k | MAT + HumanCaps |
| `CLIP_B_coco_MAT_HumanCaps.pth` | CLIP ViT-B | COCO | MAT + HumanCaps |
| `CLIP_B_coco_MAT_base.pth` | CLIP ViT-B | COCO | MAT (base) |
| `CLIP_B_flickr_MAT_HumanCaps.pth` | CLIP ViT-B | Flickr30k | MAT + HumanCaps |

The base models used for training are:

- **ALBEF**: [salesforce/ALBEF](https://github.com/salesforce/ALBEF)
- **BLIP**: [salesforce/BLIP](https://github.com/salesforce/BLIP)
- **CLIP**: [openai/CLIP](https://github.com/openai/CLIP) (ViT-B/16)

## 📘 Augmentations

Data augmentations used to reproduce MAT+ results:

| File | Description |
|------|-------------|
| `dataset_json.zip` | Text augmentation data — augmented captions and annotations in JSON format |
| `flickr_SD_I2I_0.5.zip` | Image augmentation data — Flickr30k images augmented via Stable Diffusion image-to-image (strength 0.5) |

## 📘 Usage

1. Clone or download this repository:

   ```bash
   # Using the Hugging Face CLI
   hf download cyberagent/multimodal-adversarial-training --local-dir ./resources

   # Or using git with LFS
   git lfs install
   git clone https://huggingface.co/cyberagent/multimodal-adversarial-training
   ```

2. Clone the [code repository](https://github.com/CyberAgentAILab/multimodal-adversarial-training) and follow its setup instructions.
3.
   Update the checkpoint and data paths in `configs/` to point to the downloaded resources.

## 📘 Citation

If you find these resources useful, please cite:

```bibtex
@inproceedings{waseda2026multimodal,
  title={Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
  author={Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026}
}
```

## 📘 Acknowledgements

This work builds upon the following repositories:

- **Models**: [ALBEF](https://github.com/salesforce/ALBEF), [BLIP](https://github.com/salesforce/BLIP)
- **Attacks**: [Co-Attack](https://github.com/adversarial-for-goodness/Co-Attack), [SGA](https://github.com/Zoky-2020/SGA)

## 📘 License

This repository is licensed under the [GNU General Public License v3.0](https://www.gnu.org/licenses/gpl-3.0.en.html).
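As a quick check after following the Usage steps above, the downloaded layout can be verified with a short Python sketch. The file names are taken from the directory structure in this card, and the default `./resources` root matches the download command; adjust both if your paths differ.

```python
from pathlib import Path

# File names as listed in this card's directory structure.
EXPECTED_FILES = [
    "checkpoints/ALBEF_flickr_MAT_HumanCaps.pth",
    "checkpoints/BLIP_flickr_MAT_HumanCaps.pth",
    "checkpoints/CLIP_B_coco_MAT_HumanCaps.pth",
    "checkpoints/CLIP_B_coco_MAT_base.pth",
    "checkpoints/CLIP_B_flickr_MAT_HumanCaps.pth",
    "augmentations/dataset_json.zip",
    "augmentations/flickr_SD_I2I_0.5.zip",
]

def missing_resources(root="./resources"):
    """Return the expected files that are absent under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]

if __name__ == "__main__":
    missing = missing_resources()
    if missing:
        print("Missing files:", *missing, sep="\n  ")
    else:
        print("All expected resources are present.")
```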