# Robustness of Multi-Modal Foundational Models
Research code for evaluating the robustness of multi-modal foundational models (MMFMs) against adversarial attacks. This repository contains implementations for testing vision-language models like OpenFlamingo against sparse and non-sparse adversarial perturbations, as well as fine-tuning CLIP models on adversarial examples and COCO counterfactuals.
**Code adapted from:** [RobustVLM](https://github.com/chs20/RobustVLM)
## Table of Contents
- [Robustness of Multi-Modal Foundational Models](#robustness-of-multi-modal-foundational-models)
- [Table of Contents](#table-of-contents)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Dataset Setup](#dataset-setup)
- [VLM Evaluation Datasets](#vlm-evaluation-datasets)
- [1. VizWiz Dataset](#1-vizwiz-dataset)
- [2. OK-VQA Dataset](#2-ok-vqa-dataset)
- [3. Flickr30k Dataset](#3-flickr30k-dataset)
- [4. COCO Dataset (2014)](#4-coco-dataset-2014)
- [CLIP Fine-tuning Datasets](#clip-fine-tuning-datasets)
- [1. COCO Counterfactuals (COCO-CFs)](#1-coco-counterfactuals-coco-cfs)
- [2. APGD Adversarial Images](#2-apgd-adversarial-images)
- [3. COCO 2017 Validation Set](#3-coco-2017-validation-set)
- [4. COCO Captions and Classification Datasets](#4-coco-captions-and-classification-datasets)
- [Usage](#usage)
- [Sparse vs Non-Sparse Attacks Evaluation](#sparse-vs-non-sparse-attacks-evaluation)
- [Configuration Options](#configuration-options)
- [Running the Scripts](#running-the-scripts)
- [Fine-tuning CLIP Models](#fine-tuning-clip-models)
- [Parameters](#parameters)
- [Running the Scripts](#running-the-scripts-1)
- [Zero-Shot Image Classification](#zero-shot-image-classification)
- [Parameters](#parameters-1)
- [Running the Scripts](#running-the-scripts-2)
- [Image-Text Retrieval](#image-text-retrieval)
- [Parameters](#parameters-2)
- [License](#license)
- [Acknowledgments](#acknowledgments)
## Prerequisites
- **Python version:** 3.11.x
- **Java:** JDK 1.8.0_202 (required for CIDEr score computation)
- **CUDA-compatible GPU** (for model training and inference)
## Installation
1. Clone the repository and navigate to the project directory:
```bash
git clone <REPOSITORY_URL>   # replace with the URL of this repository
cd Robust_mmfm
```
2. Install required Python packages:
```bash
pip install -r requirements.txt
```
3. Download the OpenFlamingo 9B model from [HuggingFace](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b). After downloading, it should be located in `$HOME/.cache/huggingface/hub/` with the name `models--openflamingo--OpenFlamingo-9B-vitl-mpt7b`.
4. Install [JDK 1.8.0_202](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx) and add it to your PATH:
```bash
# Add to ~/.bashrc or ~/.zshrc
export PATH=$PATH:/path/to/jdk1.8.0_202/bin
export LANG=en_US.UTF-8
```
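After these steps, the setup can be sanity-checked as below. The download command is one possible way to fetch the checkpoint into the expected cache location, assuming `huggingface_hub` is available in the environment:
```bash
# Optional sanity check: Python 3.11.x and JDK 1.8.0_202 should be reported
python --version
java -version

# One way to download the OpenFlamingo 9B checkpoint into
# ~/.cache/huggingface/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='openflamingo/OpenFlamingo-9B-vitl-mpt7b')"
```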
## Dataset Setup
### VLM Evaluation Datasets
#### 1. VizWiz Dataset
- Download the [VizWiz VQA dataset](https://vizwiz.org/tasks-and-datasets/vqa/) (train and validation sets)
- Annotation files are included in the repository, but can be re-downloaded if corrupted
- Place images in:
- `./open_flamingo_datasets/VizWiz/train`
- `./open_flamingo_datasets/VizWiz/val`
#### 2. OK-VQA Dataset
- Download the [OK-VQA dataset](https://okvqa.allenai.org/download.html) (training and testing images)
- Annotation files are included in the repository
- Place all images in: `./open_flamingo_datasets/OKVQA`
#### 3. Flickr30k Dataset
- Download using instructions from [awsaf49/flickr-dataset](https://github.com/awsaf49/flickr-dataset)
- Annotation files (`karpathy_flickr30k.json`, `dataset_flickr30k_coco_style.json`) are included
- Alternative annotation download: [Tübingen ML Cloud](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
- Place images in: `./open_flamingo_datasets/Flickr30k/Images`
#### 4. COCO Dataset (2014)
- Download [COCO 2014](https://cocodataset.org/#download) train and validation sets
- Annotation files are included in the repository
- Alternative annotation downloads:
- [karpathy_coco.json](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
- [captions_val2014.json](https://github.com/tylin/coco-caption/blob/master/annotations/captions_val2014.json)
- Place images in:
- `./open_flamingo_datasets/COCO/train2014`
- `./open_flamingo_datasets/COCO/val2014`
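With the four datasets in place, `open_flamingo_datasets/` should look roughly as follows (annotation JSONs shipped with the repository are omitted):
```
open_flamingo_datasets/
├── VizWiz/
│   ├── train/          # VizWiz training images
│   └── val/            # VizWiz validation images
├── OKVQA/              # OK-VQA images (train + test) and annotation files
├── Flickr30k/
│   └── Images/         # Flickr30k images
├── COCO/
│   ├── train2014/      # COCO 2014 training images
│   └── val2014/        # COCO 2014 validation images
└── COCO_CF/
    └── images/         # populated in the CLIP fine-tuning section below
```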
### CLIP Fine-tuning Datasets
#### 1. COCO Counterfactuals (COCO-CFs)
- Download `images.zip` from [HuggingFace COCO-Counterfactuals](https://huggingface.co/datasets/Intel/COCO-Counterfactuals/tree/main/data)
- Unzip and place images in:
- `./open_flamingo_datasets/COCO_CF/images`
- `./clip_train_datasets/MS_COCO_COCO_CF/images`
- Copy the original (non-counterfactual) images, whose filenames end with `_0.jpg`, into the APGD training folders:
```bash
cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_4/images
cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_1/images
```
#### 2. APGD Adversarial Images
- Download from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
- `apgd_1_images.zip` → `./clip_train_datasets/MS_COCO_APGD_1/images`
- `apgd_4_images.zip` → `./clip_train_datasets/MS_COCO_APGD_4/images`
#### 3. COCO 2017 Validation Set
- Download from [COCO website](https://cocodataset.org/#download)
- Copy images to all CLIP training dataset folders:
- `./clip_train_datasets/MS_COCO/images`
- `./clip_train_datasets/MS_COCO_APGD_4/images`
- `./clip_train_datasets/MS_COCO_APGD_1/images`
- `./clip_train_datasets/MS_COCO_COCO_CF/images`
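Assuming the COCO 2017 validation images were unzipped into a local `val2017/` directory (the source path is an assumption; adjust it to your download location), a short loop copies them into each folder:
```bash
# Copy COCO 2017 val images into every CLIP training dataset folder.
# SRC is a placeholder for wherever val2017 was extracted.
SRC=./val2017
for d in MS_COCO MS_COCO_APGD_4 MS_COCO_APGD_1 MS_COCO_COCO_CF; do
  cp "$SRC"/*.jpg "./clip_train_datasets/$d/images/"
done
```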
#### 4. COCO Captions and Classification Datasets
- Download `ms_coco_captions.json` from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx)
- Place in: `./clip_train_datasets/MS_COCO`
- Download classification datasets from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
- `Caltech101.zip` → unzip in `./image_classification_datasets`
- `Caltech256.zip` → unzip in `./image_classification_datasets`
- For ImageNet: download it separately and set its path in `vlm_eval/clip_classification.py` (line 52)
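After these downloads, the CLIP fine-tuning and classification folders should look roughly like this (exact file counts will differ):
```
clip_train_datasets/
├── MS_COCO/
│   ├── images/              # COCO 2017 val images
│   └── ms_coco_captions.json
├── MS_COCO_APGD_1/
│   └── images/              # COCO 2017 val + *_0.jpg originals + apgd_1_images
├── MS_COCO_APGD_4/
│   └── images/              # COCO 2017 val + *_0.jpg originals + apgd_4_images
└── MS_COCO_COCO_CF/
    └── images/              # COCO 2017 val + COCO counterfactual images

image_classification_datasets/
├── Caltech101/
└── Caltech256/
```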
---
## Usage
### Sparse vs Non-Sparse Attacks Evaluation
Evaluate vision-language models against adversarial attacks. The following command demonstrates the evaluation setup (available in `bash/run_script.sh` and `bash/run_script_slurm.sh`):
```bash
python -m vlm_eval.run_evaluation \
--eval_flickr30 \
--dont_save_adv \
--verbose \
--attack saif --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
--pert_factor_graph 0 \
--itr 0 \
--itr_clip 0 \
--itr_dataset base \
--itr_method APGD_1 \
--vision_encoder_pretrained openai \
--num_samples 8 \
--trial_seeds 42 \
--num_trials 1 \
--shots 0 \
--batch_size 1 \
--results_file res9B \
--model open_flamingo \
--out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
--vision_encoder_path ViT-L-14 \
--checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
--lm_path anas-awadalla/mpt-7b \
--lm_tokenizer_path anas-awadalla/mpt-7b \
--precision float16 \
--cross_attn_every_n_layers 4 \
--coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
--coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
--coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
--flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
--flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
--flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
--vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
--vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
--vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
--vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
--vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
--vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
--vqav2_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--vqav2_train_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
--vqav2_train_annotations_json_path /PATH/TO/vqav2/v2_mscoco_train2014_annotations.json \
--vqav2_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--vqav2_test_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
--vqav2_test_annotations_json_path /PATH/TO/vqav2/v2_mscoco_val2014_annotations.json \
--textvqa_image_dir_path /PATH/TO/textvqa/train_images \
--textvqa_train_questions_json_path /PATH/TO/textvqa/train_questions_vqa_format.json \
--textvqa_train_annotations_json_path /PATH/TO/textvqa/train_annotations_vqa_format.json \
--textvqa_test_questions_json_path /PATH/TO/textvqa/val_questions_vqa_format.json \
--textvqa_test_annotations_json_path /PATH/TO/textvqa/val_annotations_vqa_format.json \
--ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
--ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
--ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
--ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```
#### Configuration Options
**Attack Types:**
- APGD attack: `--attack apgd --eps <epsilon>`
- SAIF attack: `--attack saif --eps <epsilon> --k <k_value>`
- No attack (clean): `--attack none`
- Targeted attack (COCO only): `--targeted --target_str "TARGET_STRING"`
**Shot Settings:**
- 0-shot: `--shots 0`
- 4-shot: `--shots 4`
- Query mode (attack the query image only): `--mask_out context`
- All mode (attack all images): `--mask_out none`
**Evaluation Tasks:**
- Image Captioning:
- COCO: `--eval_coco`
- Flickr30k: `--eval_flickr30`
- Visual Question Answering:
- VizWiz: `--eval_vizwiz`
- OK-VQA: `--eval_ok_vqa`
**Other Options:**
- Save adversarial samples as `.pt` files: remove `--dont_save_adv`
- Generate perturbation factor graph (0-shot only): `--pert_factor_graph 1`
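As a concrete illustration of how these options combine, the sketch below lists the flag substitutions (relative to the full command above) for a hypothetical 4-shot APGD run on COCO captioning that attacks only the query image; all path, model, and precision flags stay unchanged:
```bash
# Hypothetical flag substitutions in bash/run_script.sh (values illustrative):
#   --eval_flickr30                                   ->  --eval_coco
#   --attack saif --eps 255 ... --lam 0.005 --k 1000  ->  --attack apgd --eps <epsilon>
#   --shots 0                                         ->  --shots 4
#   --mask_out none                                   ->  --mask_out context
```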
#### Running the Scripts
```bash
# Make scripts executable
chmod +x ./bash/run_script.sh
chmod +x ./bash/run_script_slurm.sh
# Run locally or remotely
./bash/run_script.sh
# Run on SLURM cluster
sbatch ./bash/run_script_slurm.sh
```
### Fine-tuning CLIP Models
Fine-tune CLIP models on adversarial examples (APGD) and COCO counterfactuals. Example command (available in `bash/clip_train.sh` and `bash/clip_train_slurm.sh`):
```bash
python vlm_eval/clip_train.py \
--num_epochs 20 \
--data_seeds 112 113 114 115 \
--data_name base \
--method APGD_4 \
--batch_size 128 \
--learning_rate 5e-7 \
--save_model \
--save_model_path ./fine_tuned_clip_models/APGD_4/
```
This fine-tunes CLIP for 20 epochs on the `base` dataset with APGD attack (ε=4/255).
#### Parameters
- `--data_name`: Dataset size variant
- `MS_COCO`: Standard MS COCO (see thesis appendix)
- `base`: Base subset
- `medium`: Medium subset
- `all`: Complete dataset
- `--method`: Training method
- `APGD_4`: APGD with ε=4/255
- `APGD_1`: APGD with ε=1/255
- `COCO_CF`: COCO Counterfactuals
- `NONE`: Clean MS COCO (no perturbations)
- `--data_seeds`: Random seeds for dataset sampling (e.g., `112 113 114 115`)
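For example, fine-tuning on COCO counterfactuals with the `medium` subset uses the same interface; the command below is illustrative, with seeds and hyperparameters copied from the example above:
```bash
python vlm_eval/clip_train.py \
--num_epochs 20 \
--data_seeds 112 113 114 115 \
--data_name medium \
--method COCO_CF \
--batch_size 128 \
--learning_rate 5e-7 \
--save_model \
--save_model_path ./fine_tuned_clip_models/COCO_CF/
```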
#### Running the Scripts
```bash
# Make scripts executable
chmod +x ./bash/clip_train.sh
chmod +x ./bash/clip_train_slurm.sh
# Run locally or remotely
./bash/clip_train.sh
# Run on SLURM cluster
sbatch ./bash/clip_train_slurm.sh
```
### Zero-Shot Image Classification
Evaluate fine-tuned CLIP models on image classification tasks. Example command (available in `bash/clip_classification.sh` and `bash/clip_classification_slurm.sh`):
```bash
python vlm_eval/clip_classification.py \
--data base \
--method COCO_CF \
--dataset Caltech101
```
This performs zero-shot classification on Caltech101 using a CLIP model fine-tuned on the `base` COCO counterfactuals dataset.
#### Parameters
- `--data`: Dataset variant
- `MS_COCO`, `base`, `medium`, `all`: Fine-tuned models
- `non_fine_tuned`: Pre-trained CLIP only (no fine-tuning)
- `--method`: `APGD_4`, `APGD_1`, `COCO_CF`, `NONE`
- `--dataset`: Classification dataset
- `Food101`, `CIFAR10`, `CIFAR100`, `ImageNet`, `Caltech101`, `Caltech256`
**Note:** Evaluation is hardcoded to 20 epochs.
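For a non-fine-tuned baseline, the same script can be pointed at the original pre-trained CLIP weights; whether `--method` is still consulted in this case is an assumption, so `NONE` is passed for clarity:
```bash
# Baseline: zero-shot CIFAR100 with the pre-trained (not fine-tuned) CLIP
python vlm_eval/clip_classification.py \
--data non_fine_tuned \
--method NONE \
--dataset CIFAR100
```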
#### Running the Scripts
```bash
chmod +x ./bash/clip_classification.sh
chmod +x ./bash/clip_classification_slurm.sh
# Run locally or remotely
./bash/clip_classification.sh
# Run on SLURM cluster
sbatch ./bash/clip_classification_slurm.sh
```
---
### Image-Text Retrieval
Perform image-to-text (i2t) and text-to-image (t2i) retrieval tasks:
```bash
python -m vlm_eval.run_evaluation \
--eval_flickr30 \
--dont_save_adv \
--verbose \
--attack none --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
--pert_factor_graph 0 \
--itr 1 \
--itr_clip 0 \
--itr_dataset base \
--itr_method APGD_1 \
--vision_encoder_pretrained openai \
--num_samples 1000 \
--trial_seeds 42 \
--num_trials 1 \
--shots 0 \
--batch_size 1 \
--results_file res9B \
--model open_flamingo \
--out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
--vision_encoder_path ViT-L-14 \
--checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
--lm_path anas-awadalla/mpt-7b \
--lm_tokenizer_path anas-awadalla/mpt-7b \
--precision float16 \
--cross_attn_every_n_layers 4 \
--coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
--coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
--coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
--flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
--flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
--flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
--vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
--vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
--vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
--vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
--vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
--vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
--vqav2_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--vqav2_train_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
--vqav2_train_annotations_json_path /PATH/TO/vqav2/v2_mscoco_train2014_annotations.json \
--vqav2_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--vqav2_test_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
--vqav2_test_annotations_json_path /PATH/TO/vqav2/v2_mscoco_val2014_annotations.json \
--textvqa_image_dir_path /PATH/TO/textvqa/train_images \
--textvqa_train_questions_json_path /PATH/TO/textvqa/train_questions_vqa_format.json \
--textvqa_train_annotations_json_path /PATH/TO/textvqa/train_annotations_vqa_format.json \
--textvqa_test_questions_json_path /PATH/TO/textvqa/val_questions_vqa_format.json \
--textvqa_test_annotations_json_path /PATH/TO/textvqa/val_annotations_vqa_format.json \
--ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
--ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
--ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
--ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```
This evaluates i2t and t2i retrieval on the Flickr30k 1K test set (1,000 samples) using a CLIP model fine-tuned on the `base` APGD dataset (ε=1/255).
#### Parameters
- `--itr_dataset`: Dataset for fine-tuned CLIP model
- `MS_COCO`, `base`, `medium`, `all`: Fine-tuned variants
- `non_fine_tuned`: Pre-trained CLIP only
**Note:** Image-text retrieval does not support targeted attacks or 4-shot settings.
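For quick reference, these are the flags that distinguish the retrieval run from the attack-evaluation command in the previous section; the reading of `--itr` and `--itr_method` follows from the flag names and the two example commands, so treat it as an assumption rather than documented behavior:
```bash
--itr 1                # presumably toggles image-text retrieval mode (0 in the attack run)
--attack none          # retrieval here is evaluated on clean images
--num_samples 1000     # Flickr30k 1K test split
--itr_dataset base     # dataset variant of the fine-tuned CLIP (see Parameters above)
--itr_method APGD_1    # presumably the fine-tuning method, analogous to --method
```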
---
## License
Please refer to the original [RobustVLM repository](https://github.com/chs20/RobustVLM) for licensing information.
## Acknowledgments
This code is adapted from the [RobustVLM](https://github.com/chs20/RobustVLM) repository. We thank the original authors for their foundational work.