Initial upload: model checkpoints and augmentation data
- README.md +112 -0
- resources/augmentations/dataset_json.zip +3 -0
- resources/augmentations/flickr_SD_I2I_0.5.zip +3 -0
- resources/checkpoints/ALBEF_flickr_MAT_HumanCaps.pth +3 -0
- resources/checkpoints/BLIP_flickr_MAT_HumanCaps.pth +3 -0
- resources/checkpoints/CLIP_B_coco_MAT_HumanCaps.pth +3 -0
- resources/checkpoints/CLIP_B_coco_MAT_base.pth +3 -0
- resources/checkpoints/CLIP_B_flickr_MAT_HumanCaps.pth +3 -0
README.md
ADDED
@@ -0,0 +1,112 @@
---
license: gpl-3.0
tags:
- adversarial-training
- vision-language-models
- image-text-retrieval
- multimodal
- CLIP
- ALBEF
- BLIP
---

# [WACV'26] Multimodal Adversarial Training: Resources

This repository hosts model checkpoints and data resources for the paper:

> **Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships**
> Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
> *IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026*

For the **source code and training scripts**, please refer to the GitHub repository:
https://github.com/CyberAgentAILab/multimodal-adversarial-training

## Overview

This work proposes **Multimodal Adversarial Training (MAT)** for Vision-Language Models (VLMs). MAT is a unified adversarial training pipeline for image-text retrieval models. The extended version, **MAT+**, additionally leverages one-to-many relationships in image-text pairs to improve robustness.

### Highlights

- Unified MAT pipeline for image-text retrieval models (CLIP, ALBEF, BLIP).
- MAT+ leverages one-to-many relationships in image-text pairs.
- Reproducible results on Flickr30k and COCO benchmarks.

## Directory structure

```
resources/
├── checkpoints/                          # MAT/MAT+ model checkpoints
│   ├── ALBEF_flickr_MAT_HumanCaps.pth
│   ├── BLIP_flickr_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_base.pth
│   └── CLIP_B_flickr_MAT_HumanCaps.pth
└── augmentations/                        # Data augmentations for MAT+
    ├── dataset_json.zip                  # Text augmentation annotations
    └── flickr_SD_I2I_0.5.zip             # Image augmentations (SD img2img)
```

## Checkpoints

Adversarially trained model checkpoints for image-text retrieval:

| File | Model | Dataset | Variant |
|------|-------|---------|---------|
| `ALBEF_flickr_MAT_HumanCaps.pth` | ALBEF | Flickr30k | MAT + HumanCaps |
| `BLIP_flickr_MAT_HumanCaps.pth` | BLIP | Flickr30k | MAT + HumanCaps |
| `CLIP_B_coco_MAT_HumanCaps.pth` | CLIP ViT-B | COCO | MAT + HumanCaps |
| `CLIP_B_coco_MAT_base.pth` | CLIP ViT-B | COCO | MAT (base) |
| `CLIP_B_flickr_MAT_HumanCaps.pth` | CLIP ViT-B | Flickr30k | MAT + HumanCaps |

The base models used for training are:
- **ALBEF**: [salesforce/ALBEF](https://github.com/salesforce/ALBEF)
- **BLIP**: [salesforce/BLIP](https://github.com/salesforce/BLIP)
- **CLIP**: [openai/CLIP](https://github.com/openai/CLIP) (ViT-B/16)

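The checkpoint filenames in the table appear to follow a `<model>_<dataset>_MAT_<variant>.pth` pattern. The following is a minimal sketch, assuming that pattern holds for every file listed; `CheckpointInfo` and `parse_checkpoint_name` are hypothetical helper names and are not part of the code repository.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Dataset tags as they appear in the filenames, mapped to the names used in the table.
DATASETS = {"flickr": "Flickr30k", "coco": "COCO"}


@dataclass(frozen=True)
class CheckpointInfo:
    model: str    # e.g. "ALBEF", "BLIP", "CLIP_B"
    dataset: str  # "Flickr30k" or "COCO"
    variant: str  # e.g. "HumanCaps" or "base"


def parse_checkpoint_name(path: str) -> CheckpointInfo:
    """Split '<model>_<dataset>_MAT_<variant>.pth' into its parts (assumed pattern)."""
    p = PurePosixPath(path)
    if p.suffix != ".pth":
        raise ValueError(f"not a .pth checkpoint: {path}")
    # Everything after the last '_MAT_' is the variant.
    head, sep, variant = p.stem.rpartition("_MAT_")
    if not sep or not variant:
        raise ValueError(f"missing '_MAT_<variant>' in: {path}")
    # The last underscore-separated token before '_MAT_' is the dataset tag.
    model, _, dataset_tag = head.rpartition("_")
    if not model or dataset_tag not in DATASETS:
        raise ValueError(f"unrecognized model/dataset in: {path}")
    return CheckpointInfo(model=model, dataset=DATASETS[dataset_tag], variant=variant)
```

A helper like this could be used to select the matching config for a downloaded file, for example `parse_checkpoint_name("resources/checkpoints/CLIP_B_coco_MAT_base.pth").dataset`.
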
## Augmentations

Data augmentations used to reproduce MAT+ results:

| File | Description |
|------|-------------|
| `dataset_json.zip` | Text augmentation data: augmented captions and annotations in JSON format |
| `flickr_SD_I2I_0.5.zip` | Image augmentation data: Flickr30k images augmented via Stable Diffusion image-to-image (strength 0.5) |

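Both augmentation files are zip archives, so they need to be unpacked before the training code can read them. The sketch below uses only the Python standard library; the internal layout of the archives is not documented here, so `top_level_entries` is included to inspect an archive before extracting it. Both function names are hypothetical.

```python
import zipfile
from pathlib import Path


def top_level_entries(zip_path):
    """Return the sorted top-level names inside the archive, without extracting it."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({name.split("/", 1)[0] for name in zf.namelist() if name})


def extract_archive(zip_path, dest_dir):
    """Extract every member of the archive into dest_dir and return that directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return dest
```

For example, `top_level_entries("resources/augmentations/dataset_json.zip")` lists what the archive contains, and `extract_archive(..., "data/")` unpacks it; the target directory is an assumption and should match whatever the configs in the code repository expect.
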
## Usage

1. Clone or download this repository:
```bash
# Using the Hugging Face CLI
hf download cyberagent/multimodal-adversarial-training --local-dir ./resources

# Or using git with LFS
git lfs install
git clone https://huggingface.co/cyberagent/multimodal-adversarial-training
```

2. Clone the [code repository](https://github.com/CyberAgentAILab/multimodal-adversarial-training) and follow its setup instructions.

3. Update the checkpoint and data paths in `configs/` to point to the downloaded resources.

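When only one checkpoint or archive is needed, a single file can be fetched from Python instead of downloading the whole repository. This is a minimal sketch, assuming the `huggingface_hub` package is installed and that the repository id matches the one in the `hf download` command in step 1; `repo_file` and `fetch` are hypothetical helper names.

```python
from pathlib import Path

# Repository id taken from the `hf download` command in the Usage steps.
REPO_ID = "cyberagent/multimodal-adversarial-training"


def repo_file(kind: str, filename: str) -> str:
    """Build the in-repo path, e.g. 'resources/checkpoints/<filename>'."""
    if kind not in ("checkpoints", "augmentations"):
        raise ValueError(f"unknown resource kind: {kind}")
    return f"resources/{kind}/{filename}"


def fetch(kind: str, filename: str, local_dir: str = ".") -> Path:
    """Download one file from the Hub and return its local path."""
    # Imported lazily so repo_file() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return Path(
        hf_hub_download(
            repo_id=REPO_ID,
            filename=repo_file(kind, filename),
            local_dir=local_dir,
        )
    )
```

Calling `fetch("checkpoints", "CLIP_B_coco_MAT_base.pth", local_dir="./mat")` would place the file under `./mat/resources/checkpoints/`, because `local_dir` mirrors the repository layout. The same applies to the CLI command in step 1: with `--local-dir ./resources`, the files end up under `./resources/resources/`, which matters when updating the paths in `configs/`.
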
## Citation

If you find these resources useful, please cite:

```bibtex
@inproceedings{waseda2026multimodal,
  title={Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
  author={Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026}
}
```

## Acknowledgements

This work builds upon the following repositories:
- **Models**: [ALBEF](https://github.com/salesforce/ALBEF), [BLIP](https://github.com/salesforce/BLIP)
- **Attacks**: [Co-Attack](https://github.com/adversarial-for-goodness/Co-Attack), [SGA](https://github.com/Zoky-2020/SGA)

## License

This repository is licensed under the [GNU General Public License v3.0](https://www.gnu.org/licenses/gpl-3.0.en.html).
resources/augmentations/dataset_json.zip
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8b3c5d63f670cac425ab01e310fac34e2e37aa36375703ce10a00bb77d9ce2b2
size 109146522
resources/augmentations/flickr_SD_I2I_0.5.zip
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:81be9290abdb27e7d4700b2943f43eb599c90beeec48480d7c0b6545e0af3c19
size 5505190190
resources/checkpoints/ALBEF_flickr_MAT_HumanCaps.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bec8a06b18c0ad43fe054b8da186c8b79146b41c1ff2e84c581ca3a88ef3a924
size 3483811865
resources/checkpoints/BLIP_flickr_MAT_HumanCaps.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:91185975ce808cb888fc3af456af2ee4af7613392f38c4797c84e566a5bea767
size 3694541098
resources/checkpoints/CLIP_B_coco_MAT_HumanCaps.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d58946da7494442c8b7fc0b7502bbc4ad9a5d77d5d6d1abdddceceb96f6300d6
size 1197209274
resources/checkpoints/CLIP_B_coco_MAT_base.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e12ddfd09a88fba3d9dde5546d30f416743a9f81da18a7b0abb9711c9c4fbddb
size 1197209274
resources/checkpoints/CLIP_B_flickr_MAT_HumanCaps.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4253a0f1f847cd970f70cd28dec255f7a7796031d1c809a6a44438688c22d805
size 1197209274
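
The pointer files above record the SHA-256 digest and byte size of each real file, which makes it possible to check a multi-gigabyte download for truncation or corruption. The following is a minimal sketch using only the standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into (sha256_hex, size_bytes)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return digest, int(fields["size"])


def matches_pointer(path, pointer_text, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and SHA-256 named by the pointer."""
    digest, size = parse_lfs_pointer(pointer_text)
    p = Path(path)
    # Cheap check first: a size mismatch avoids hashing a large file.
    if p.stat().st_size != size:
        return False
    h = hashlib.sha256()
    with p.open("rb") as f:
        # Read in chunks so large checkpoints are never fully loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == digest
```

For instance, passing the three pointer lines listed for `dataset_json.zip` together with the path of the downloaded archive should return `True` for an intact download.
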