Commit b3deeeb (verified) by antonio-t · parent: 80fa911

Initial upload: model checkpoints and augmentation data

README.md ADDED
---
license: gpl-3.0
tags:
- adversarial-training
- vision-language-models
- image-text-retrieval
- multimodal
- CLIP
- ALBEF
- BLIP
---

# [WACV'26] Multimodal Adversarial Training – Resources

This repository hosts model checkpoints and data resources for the paper:

> **Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships**
> Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
> *IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026*

For the **source code and training scripts**, please refer to the GitHub repository:
👉 https://github.com/CyberAgentAILab/multimodal-adversarial-training

## 📘 Overview

This work proposes **Multimodal Adversarial Training (MAT)** for Vision-Language Models (VLMs). MAT is a unified adversarial training pipeline for image-text retrieval models. The extended version, **MAT+**, additionally leverages one-to-many relationships in image-text pairs to improve robustness.

### Highlights

- Unified MAT pipeline for image-text retrieval models (CLIP, ALBEF, BLIP).
- MAT+ leverages one-to-many relationships in image-text pairs.
- Reproducible results on Flickr30k and COCO benchmarks.

## 📘 Directory structure

```
resources/
├── checkpoints/                  # MAT/MAT+ model checkpoints
│   ├── ALBEF_flickr_MAT_HumanCaps.pth
│   ├── BLIP_flickr_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_base.pth
│   └── CLIP_B_flickr_MAT_HumanCaps.pth
└── augmentations/                # Data augmentations for MAT+
    ├── dataset_json.zip          # Text augmentation annotations
    └── flickr_SD_I2I_0.5.zip     # Image augmentations (SD img2img)
```
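The augmentation archives are plain zip files. As a minimal sketch (assuming they have already been downloaded into `resources/augmentations/`), they can be unpacked in place with the standard library:

```python
import zipfile
from pathlib import Path


def extract_augmentations(aug_dir: str = "resources/augmentations") -> list[str]:
    """Extract every .zip archive in aug_dir next to itself and
    return the names of the archives that were unpacked."""
    extracted = []
    for archive in sorted(Path(aug_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Members are written relative to the archive's directory,
            # e.g. dataset_json.zip -> resources/augmentations/...
            zf.extractall(archive.parent)
        extracted.append(archive.name)
    return extracted
```

The exact internal layout of each archive is not documented here, so check the extracted paths before wiring them into the training configs.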

## 📘 Checkpoints

Adversarially trained model checkpoints for image-text retrieval:

| File | Model | Dataset | Variant |
|------|-------|---------|---------|
| `ALBEF_flickr_MAT_HumanCaps.pth` | ALBEF | Flickr30k | MAT + HumanCaps |
| `BLIP_flickr_MAT_HumanCaps.pth` | BLIP | Flickr30k | MAT + HumanCaps |
| `CLIP_B_coco_MAT_HumanCaps.pth` | CLIP ViT-B | COCO | MAT + HumanCaps |
| `CLIP_B_coco_MAT_base.pth` | CLIP ViT-B | COCO | MAT (base) |
| `CLIP_B_flickr_MAT_HumanCaps.pth` | CLIP ViT-B | Flickr30k | MAT + HumanCaps |

The base models used for training are:
- **ALBEF**: [salesforce/ALBEF](https://github.com/salesforce/ALBEF)
- **BLIP**: [salesforce/BLIP](https://github.com/salesforce/BLIP)
- **CLIP**: [openai/CLIP](https://github.com/openai/CLIP) (ViT-B/16)
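The checkpoint filenames follow a `<model>_<dataset>_<variant>.pth` convention. As a small illustrative sketch (the lookup tables below are inferred from the table above, not taken from the code repository), the columns can be recovered from a filename like so:

```python
from pathlib import Path

# Inferred naming convention: <model>_<dataset>_MAT[_<augmentation>].pth
MODELS = {"ALBEF": "ALBEF", "BLIP": "BLIP", "CLIP_B": "CLIP ViT-B"}
DATASETS = {"flickr": "Flickr30k", "coco": "COCO"}


def parse_checkpoint_name(filename: str) -> dict:
    """Split a checkpoint filename into model / dataset / variant fields."""
    stem = Path(filename).stem  # drop the .pth suffix
    for prefix, model in MODELS.items():
        if stem.startswith(prefix + "_"):
            rest = stem[len(prefix) + 1:]           # e.g. "flickr_MAT_HumanCaps"
            dataset_key, variant = rest.split("_", 1)
            return {
                "model": model,
                "dataset": DATASETS[dataset_key],
                "variant": variant.replace("_", " + ").replace("MAT + base", "MAT (base)"),
            }
    raise ValueError(f"Unrecognized checkpoint name: {filename}")
```

This is only a convenience for organizing downloads; the training code itself selects checkpoints via its config files.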

## 📘 Augmentations

Data augmentations used to reproduce MAT+ results:

| File | Description |
|------|-------------|
| `dataset_json.zip` | Text augmentation data: augmented captions and annotations in JSON format |
| `flickr_SD_I2I_0.5.zip` | Image augmentation data: Flickr30k images augmented via Stable Diffusion image-to-image (strength 0.5) |

## 📘 Usage

1. Clone or download this repository:
   ```bash
   # Using the Hugging Face CLI
   hf download cyberagent/multimodal-adversarial-training --local-dir ./resources

   # Or using git with LFS
   git lfs install
   git clone https://huggingface.co/cyberagent/multimodal-adversarial-training
   ```

2. Clone the [code repository](https://github.com/CyberAgentAILab/multimodal-adversarial-training) and follow its setup instructions.

3. Update the checkpoint and data paths in `configs/` to point to the downloaded resources.
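All large files in this repository are tracked with Git LFS, whose object IDs are plain sha256 digests of the file contents. A downloaded file can therefore be sanity-checked against the oid in its LFS pointer; a minimal stdlib sketch (take the expected hash from the actual pointer file, e.g. via `git show HEAD:<path>` in a non-LFS clone):

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through sha256 (LFS object IDs are sha256 digests)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path: str, expected_oid: str) -> bool:
    """Compare a downloaded file against the sha256 oid from its LFS pointer."""
    return sha256_of(path) == expected_oid
```

Streaming in 1 MiB chunks keeps memory flat even for the multi-gigabyte checkpoints.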

## 📘 Citation

If you find these resources useful, please cite:

```bibtex
@inproceedings{waseda2026multimodal,
  title={Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
  author={Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026}
}
```

## 📘 Acknowledgements

This work builds upon the following repositories:
- **Models**: [ALBEF](https://github.com/salesforce/ALBEF), [BLIP](https://github.com/salesforce/BLIP)
- **Attacks**: [Co-Attack](https://github.com/adversarial-for-goodness/Co-Attack), [SGA](https://github.com/Zoky-2020/SGA)

## 📘 License

This repository is licensed under the [GNU General Public License v3.0](https://www.gnu.org/licenses/gpl-3.0.en.html).
The commit also adds the following Git LFS pointer files:

| File | Size (bytes) | sha256 oid |
|------|--------------|------------|
| `resources/augmentations/dataset_json.zip` | 109146522 | `8b3c5d63f670cac425ab01e310fac34e2e37aa36375703ce10a00bb77d9ce2b2` |
| `resources/augmentations/flickr_SD_I2I_0.5.zip` | 5505190190 | `81be9290abdb27e7d4700b2943f43eb599c90beeec48480d7c0b6545e0af3c19` |
| `resources/checkpoints/ALBEF_flickr_MAT_HumanCaps.pth` | 3483811865 | `bec8a06b18c0ad43fe054b8da186c8b79146b41c1ff2e84c581ca3a88ef3a924` |
| `resources/checkpoints/BLIP_flickr_MAT_HumanCaps.pth` | 3694541098 | `91185975ce808cb888fc3af456af2ee4af7613392f38c4797c84e566a5bea767` |
| `resources/checkpoints/CLIP_B_coco_MAT_HumanCaps.pth` | 1197209274 | `d58946da7494442c8b7fc0b7502bbc4ad9a5d77d5d6d1abdddceceb96f6300d6` |
| `resources/checkpoints/CLIP_B_coco_MAT_base.pth` | 1197209274 | `e12ddfd09a88fba3d9dde5546d30f416743a9f81da18a7b0abb9711c9c4fbddb` |
| `resources/checkpoints/CLIP_B_flickr_MAT_HumanCaps.pth` | 1197209274 | `4253a0f1f847cd970f70cd28dec255f7a7796031d1c809a6a44438688c22d805` |