Commit b3deeeb (verified) by antonio-t · parent: 80fa911

Initial upload: model checkpoints and augmentation data

README.md ADDED
---
license: gpl-3.0
tags:
- adversarial-training
- vision-language-models
- image-text-retrieval
- multimodal
- CLIP
- ALBEF
- BLIP
---

# [WACV'26] Multimodal Adversarial Training – Resources

This repository hosts model checkpoints and data resources for the paper:

> **Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships**
> Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
> *IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026*

For the **source code and training scripts**, please refer to the GitHub repository:
👉 https://github.com/CyberAgentAILab/multimodal-adversarial-training

## 📘 Overview

This work proposes **Multimodal Adversarial Training (MAT)** for Vision-Language Models (VLMs). MAT is a unified adversarial training pipeline for image-text retrieval models. The extended version, **MAT+**, additionally leverages one-to-many relationships in image-text pairs to improve robustness.

### Highlights

- Unified MAT pipeline for image-text retrieval models (CLIP, ALBEF, BLIP).
- MAT+ leverages one-to-many relationships in image-text pairs.
- Reproducible results on Flickr30k and COCO benchmarks.

## 📘 Directory structure

```
resources/
├── checkpoints/                  # MAT/MAT+ model checkpoints
│   ├── ALBEF_flickr_MAT_HumanCaps.pth
│   ├── BLIP_flickr_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_base.pth
│   └── CLIP_B_flickr_MAT_HumanCaps.pth
└── augmentations/                # Data augmentations for MAT+
    ├── dataset_json.zip          # Text augmentation annotations
    └── flickr_SD_I2I_0.5.zip     # Image augmentations (SD img2img)
```
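The augmentation archives are plain zip files. As a minimal sketch (assuming they have already been downloaded into `resources/augmentations/`), they can be unpacked in place with the standard library:

```python
import zipfile
from pathlib import Path


def extract_augmentations(aug_dir: str = "resources/augmentations") -> list[str]:
    """Extract every .zip archive in aug_dir next to itself and
    return the names of the archives that were unpacked."""
    extracted = []
    for archive in sorted(Path(aug_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Members are written relative to the archive's directory,
            # e.g. dataset_json.zip -> resources/augmentations/...
            zf.extractall(archive.parent)
        extracted.append(archive.name)
    return extracted
```

The exact internal layout of each archive is not documented here, so check the extracted paths before wiring them into the training configs.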

## 📘 Checkpoints

Adversarially trained model checkpoints for image-text retrieval:

| File | Model | Dataset | Variant |
|------|-------|---------|---------|
| `ALBEF_flickr_MAT_HumanCaps.pth` | ALBEF | Flickr30k | MAT + HumanCaps |
| `BLIP_flickr_MAT_HumanCaps.pth` | BLIP | Flickr30k | MAT + HumanCaps |
| `CLIP_B_coco_MAT_HumanCaps.pth` | CLIP ViT-B | COCO | MAT + HumanCaps |
| `CLIP_B_coco_MAT_base.pth` | CLIP ViT-B | COCO | MAT (base) |
| `CLIP_B_flickr_MAT_HumanCaps.pth` | CLIP ViT-B | Flickr30k | MAT + HumanCaps |

The base models used for training are:
- **ALBEF**: [salesforce/ALBEF](https://github.com/salesforce/ALBEF)
- **BLIP**: [salesforce/BLIP](https://github.com/salesforce/BLIP)
- **CLIP**: [openai/CLIP](https://github.com/openai/CLIP) (ViT-B/16)
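The checkpoint filenames follow a `<model>_<dataset>_<variant>.pth` convention. As a small illustrative sketch (the lookup tables below are inferred from the table above, not taken from the code repository), the columns can be recovered from a filename like so:

```python
from pathlib import Path

# Inferred naming convention: <model>_<dataset>_MAT[_<augmentation>].pth
MODELS = {"ALBEF": "ALBEF", "BLIP": "BLIP", "CLIP_B": "CLIP ViT-B"}
DATASETS = {"flickr": "Flickr30k", "coco": "COCO"}


def parse_checkpoint_name(filename: str) -> dict:
    """Split a checkpoint filename into model / dataset / variant fields."""
    stem = Path(filename).stem  # drop the .pth suffix
    for prefix, model in MODELS.items():
        if stem.startswith(prefix + "_"):
            rest = stem[len(prefix) + 1:]           # e.g. "flickr_MAT_HumanCaps"
            dataset_key, variant = rest.split("_", 1)
            return {
                "model": model,
                "dataset": DATASETS[dataset_key],
                "variant": variant.replace("_", " + ").replace("MAT + base", "MAT (base)"),
            }
    raise ValueError(f"Unrecognized checkpoint name: {filename}")
```

This is only a convenience for organizing downloads; the training code itself selects checkpoints via its config files.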

## 📘 Augmentations

Data augmentations used to reproduce MAT+ results:

| File | Description |
|------|-------------|
| `dataset_json.zip` | Text augmentation data: augmented captions and annotations in JSON format |
| `flickr_SD_I2I_0.5.zip` | Image augmentation data: Flickr30k images augmented via Stable Diffusion image-to-image (strength 0.5) |

## 📘 Usage

1. Clone or download this repository:
   ```bash
   # Using the Hugging Face CLI
   hf download cyberagent/multimodal-adversarial-training --local-dir ./resources

   # Or using git with LFS
   git lfs install
   git clone https://huggingface.co/cyberagent/multimodal-adversarial-training
   ```

2. Clone the [code repository](https://github.com/CyberAgentAILab/multimodal-adversarial-training) and follow its setup instructions.

3. Update the checkpoint and data paths in `configs/` to point to the downloaded resources.
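All large files in this repository are tracked with Git LFS, whose object IDs are plain sha256 digests of the file contents. A downloaded file can therefore be sanity-checked against the oid in its LFS pointer; a minimal stdlib sketch (take the expected hash from the actual pointer file, e.g. via `git show HEAD:<path>` in a non-LFS clone):

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through sha256 (LFS object IDs are sha256 digests)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path: str, expected_oid: str) -> bool:
    """Compare a downloaded file against the sha256 oid from its LFS pointer."""
    return sha256_of(path) == expected_oid
```

Streaming in 1 MiB chunks keeps memory flat even for the multi-gigabyte checkpoints.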

## 📘 Citation

If you find these resources useful, please cite:

```bibtex
@inproceedings{waseda2026multimodal,
  title={Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
  author={Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026}
}
```

## 📘 Acknowledgements

This work builds upon the following repositories:
- **Models**: [ALBEF](https://github.com/salesforce/ALBEF), [BLIP](https://github.com/salesforce/BLIP)
- **Attacks**: [Co-Attack](https://github.com/adversarial-for-goodness/Co-Attack), [SGA](https://github.com/Zoky-2020/SGA)

## 📘 License

This repository is licensed under the [GNU General Public License v3.0](https://www.gnu.org/licenses/gpl-3.0.en.html).
The commit also adds the following Git LFS pointer files:

| File | Size (bytes) | sha256 oid |
|------|--------------|------------|
| `resources/augmentations/dataset_json.zip` | 109146522 | `8b3c5d63f670cac425ab01e310fac34e2e37aa36375703ce10a00bb77d9ce2b2` |
| `resources/augmentations/flickr_SD_I2I_0.5.zip` | 5505190190 | `81be9290abdb27e7d4700b2943f43eb599c90beeec48480d7c0b6545e0af3c19` |
| `resources/checkpoints/ALBEF_flickr_MAT_HumanCaps.pth` | 3483811865 | `bec8a06b18c0ad43fe054b8da186c8b79146b41c1ff2e84c581ca3a88ef3a924` |
| `resources/checkpoints/BLIP_flickr_MAT_HumanCaps.pth` | 3694541098 | `91185975ce808cb888fc3af456af2ee4af7613392f38c4797c84e566a5bea767` |
| `resources/checkpoints/CLIP_B_coco_MAT_HumanCaps.pth` | 1197209274 | `d58946da7494442c8b7fc0b7502bbc4ad9a5d77d5d6d1abdddceceb96f6300d6` |
| `resources/checkpoints/CLIP_B_coco_MAT_base.pth` | 1197209274 | `e12ddfd09a88fba3d9dde5546d30f416743a9f81da18a7b0abb9711c9c4fbddb` |
| `resources/checkpoints/CLIP_B_flickr_MAT_HumanCaps.pth` | 1197209274 | `4253a0f1f847cd970f70cd28dec255f7a7796031d1c809a6a44438688c22d805` |