---
license: gpl-3.0
tags:
- adversarial-training
- vision-language-models
- image-text-retrieval
- multimodal
- CLIP
- ALBEF
- BLIP
---

# [WACV'26] Multimodal Adversarial Training – Resources

This repository hosts model checkpoints and data resources for the paper:

> **Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships**
> Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
> *IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026*

For the **source code and training scripts**, please refer to the GitHub repository:
https://github.com/CyberAgentAILab/multimodal-adversarial-training

## Overview

This work proposes **Multimodal Adversarial Training (MAT)** for Vision-Language Models (VLMs). MAT is a unified adversarial training pipeline for image-text retrieval models. The extended version, **MAT+**, additionally leverages one-to-many relationships in image-text pairs to improve robustness.

### Highlights

- Unified MAT pipeline for image-text retrieval models (CLIP, ALBEF, BLIP).
- MAT+ leverages one-to-many relationships in image-text pairs.
- Reproducible results on Flickr30k and COCO benchmarks.

## Directory structure

```
resources/
├── checkpoints/                  # MAT/MAT+ model checkpoints
│   ├── ALBEF_flickr_MAT_HumanCaps.pth
│   ├── BLIP_flickr_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_base.pth
│   └── CLIP_B_flickr_MAT_HumanCaps.pth
└── augmentations/                # Data augmentations for MAT+
    ├── dataset_json.zip          # Text augmentation annotations
    └── flickr_SD_I2I_0.5.zip     # Image augmentations (SD img2img)
```

## Checkpoints

Adversarially trained model checkpoints for image-text retrieval:

| | File | Model | Dataset | Variant | |
| |------|-------|---------|---------| |
| | `ALBEF_flickr_MAT_HumanCaps.pth` | ALBEF | Flickr30k | MAT + HumanCaps | |
| | `BLIP_flickr_MAT_HumanCaps.pth` | BLIP | Flickr30k | MAT + HumanCaps | |
| | `CLIP_B_coco_MAT_HumanCaps.pth` | CLIP ViT-B | COCO | MAT + HumanCaps | |
| | `CLIP_B_coco_MAT_base.pth` | CLIP ViT-B | COCO | MAT (base) | |
| | `CLIP_B_flickr_MAT_HumanCaps.pth` | CLIP ViT-B | Flickr30k | MAT + HumanCaps | |

The base models used for training are:
- **ALBEF**: [salesforce/ALBEF](https://github.com/salesforce/ALBEF)
- **BLIP**: [salesforce/BLIP](https://github.com/salesforce/BLIP)
- **CLIP**: [openai/CLIP](https://github.com/openai/CLIP) (ViT-B/16)
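The released `.pth` files can normally be opened with `torch.load`, but the exact state-dict layout (for instance, whether the weights sit under a top-level `"model"` key) is defined by the training scripts in the code repository. The sketch below is an assumption-laden helper, not the repository's documented loading routine:

```python
import torch

def load_state_dict(path: str) -> dict:
    """Load a released .pth checkpoint and return its state dict.

    Assumption: the file holds either a raw state dict, or a wrapper
    dict with the weights under a "model" key. Inspect ckpt.keys()
    to confirm the actual layout of a given checkpoint.
    """
    ckpt = torch.load(path, map_location="cpu")
    if isinstance(ckpt, dict) and "model" in ckpt:
        return ckpt["model"]
    return ckpt
```

The returned dict can then be passed to `model.load_state_dict(...)` on the matching CLIP/ALBEF/BLIP architecture built by the code repository.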

## Augmentations

Data augmentations used to reproduce MAT+ results:

| | File | Description | |
| |------|-------------| |
| `dataset_json.zip` | Text augmentation data – augmented captions and annotations in JSON format |
| `flickr_SD_I2I_0.5.zip` | Image augmentation data – Flickr30k images augmented via Stable Diffusion image-to-image (strength 0.5) |
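Before training, the archives need to be unpacked. A minimal sketch, assuming they are standard zip archives (the helper name is ours, not from the code repository):

```python
import zipfile
from pathlib import Path

def extract_archive(archive: str, out_dir: str) -> list[str]:
    """Extract a zip archive into out_dir and return the extracted paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out)
        return [str(out / name) for name in zf.namelist()]
```

For example, `extract_archive("augmentations/dataset_json.zip", "data/annotations")` would place the JSON annotation files under `data/annotations/`.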

## Usage

1. Clone or download this repository:

   ```bash
   # Using the Hugging Face CLI
   hf download cyberagent/multimodal-adversarial-training --local-dir ./resources

   # Or using git with LFS
   git lfs install
   git clone https://huggingface.co/cyberagent/multimodal-adversarial-training
   ```

2. Clone the [code repository](https://github.com/CyberAgentAILab/multimodal-adversarial-training) and follow its setup instructions.

3. Update the checkpoint and data paths in `configs/` to point to the downloaded resources.
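The path update in step 3 can be scripted. This is a sketch under two assumptions we cannot verify from this card alone: that the configs are flat `key: value` YAML, and that keys such as `pretrained`, `image_root`, and `ann_root` hold paths (the real key names are defined in the code repository's `configs/`, so treat these as placeholders):

```python
from pathlib import Path

PATH_KEYS = {"pretrained", "image_root", "ann_root"}  # assumed key names

def point_config_at(config_text: str, resource_dir: str) -> str:
    """Rewrite path-valued keys in a flat 'key: value' config."""
    lines = []
    for line in config_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in PATH_KEYS and value.strip():
            # Keep only the file name and re-root it under resource_dir.
            lines.append(f"{key}: {resource_dir}/{Path(value.strip()).name}")
        else:
            lines.append(line)
    return "\n".join(lines)
```

Always diff the rewritten config against the original before training, since nested YAML or differently named keys would pass through unchanged.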

## Citation

If you find these resources useful, please cite:

```bibtex
@inproceedings{waseda2026multimodal,
  title={Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
  author={Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026}
}
```

## Acknowledgements

This work builds upon the following repositories:
- **Models**: [ALBEF](https://github.com/salesforce/ALBEF), [BLIP](https://github.com/salesforce/BLIP)
- **Attacks**: [Co-Attack](https://github.com/adversarial-for-goodness/Co-Attack), [SGA](https://github.com/Zoky-2020/SGA)

## License

This repository is licensed under the [GNU General Public License v3.0](https://www.gnu.org/licenses/gpl-3.0.en.html).