# Robustness of Multi-Modal Foundational Models

Research code for evaluating the robustness of multi-modal foundational models (MMFMs) against adversarial attacks. This repository contains implementations for testing vision-language models such as OpenFlamingo against sparse and non-sparse adversarial perturbations, as well as for fine-tuning CLIP models on adversarial examples and COCO counterfactuals.

**Code adapted from:** [RobustVLM](https://github.com/chs20/RobustVLM)

## Table of Contents

- [Robustness of Multi-Modal Foundational Models](#robustness-of-multi-modal-foundational-models)
  - [Table of Contents](#table-of-contents)
  - [Prerequisites](#prerequisites)
  - [Installation](#installation)
  - [Dataset Setup](#dataset-setup)
    - [VLM Evaluation Datasets](#vlm-evaluation-datasets)
      - [1. VizWiz Dataset](#1-vizwiz-dataset)
      - [2. OK-VQA Dataset](#2-ok-vqa-dataset)
      - [3. Flickr30k Dataset](#3-flickr30k-dataset)
      - [4. COCO Dataset (2014)](#4-coco-dataset-2014)
    - [CLIP Fine-tuning Datasets](#clip-fine-tuning-datasets)
      - [1. COCO Counterfactuals (COCO-CFs)](#1-coco-counterfactuals-coco-cfs)
      - [2. APGD Adversarial Images](#2-apgd-adversarial-images)
      - [3. COCO 2017 Validation Set](#3-coco-2017-validation-set)
      - [4. COCO Captions and Classification Datasets](#4-coco-captions-and-classification-datasets)
  - [Usage](#usage)
    - [Sparse vs Non-Sparse Attacks Evaluation](#sparse-vs-non-sparse-attacks-evaluation)
      - [Configuration Options](#configuration-options)
      - [Running the Scripts](#running-the-scripts)
    - [Fine-tuning CLIP Models](#fine-tuning-clip-models)
      - [Parameters](#parameters)
      - [Running the Scripts](#running-the-scripts-1)
    - [Zero-Shot Image Classification](#zero-shot-image-classification)
      - [Parameters](#parameters-1)
      - [Running the Scripts](#running-the-scripts-2)
    - [Image-Text Retrieval](#image-text-retrieval)
      - [Parameters](#parameters-2)
  - [License](#license)
  - [Acknowledgments](#acknowledgments)

## Prerequisites

- **Python version:** 3.11.x
- **Java:** JDK 1.8.0_202 (required for CIDEr score computation)
- **CUDA-compatible GPU** (for model training and inference)

## Installation

1. Clone the repository and navigate to the project directory:

   ```bash
   cd Robust_mmfm
   ```

2. Install the required Python packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Download the OpenFlamingo 9B model from [HuggingFace](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b). After downloading, it should be located in `$HOME/.cache/huggingface/hub/` under the name `models--openflamingo--OpenFlamingo-9B-vitl-mpt7b`.

4. Install [JDK 1.8.0_202](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx) and add it to your PATH:

   ```bash
   # Add to ~/.bashrc or ~/.zshrc
   export PATH=$PATH:/path/to/jdk1.8.0_202/bin
   export LANG=en_US.UTF-8
   ```

## Dataset Setup

### VLM Evaluation Datasets

#### 1. VizWiz Dataset

- Download the [VizWiz VQA dataset](https://vizwiz.org/tasks-and-datasets/vqa/) (train and validation sets)
- Annotation files are included in the repository, but can be re-downloaded if corrupted
- Place images in:
  - `./open_flamingo_datasets/VizWiz/train`
  - `./open_flamingo_datasets/VizWiz/val`

#### 2. OK-VQA Dataset

- Download the [OK-VQA dataset](https://okvqa.allenai.org/download.html) (training and testing images)
- Annotation files are included in the repository
- Place all images in: `./open_flamingo_datasets/OKVQA`

#### 3. Flickr30k Dataset

- Download using the instructions from [awsaf49/flickr-dataset](https://github.com/awsaf49/flickr-dataset)
- Annotation files (`karpathy_flickr30k.json`, `dataset_flickr30k_coco_style.json`) are included
- Alternative annotation download: [Tübingen ML Cloud](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
- Place images in: `./open_flamingo_datasets/Flickr30k/Images`

#### 4. COCO Dataset (2014)

- Download the [COCO 2014](https://cocodataset.org/#download) train and validation sets
- Annotation files are included in the repository
- Alternative annotation downloads:
  - [karpathy_coco.json](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
  - [captions_val2014.json](https://github.com/tylin/coco-caption/blob/master/annotations/captions_val2014.json)
- Place images in:
  - `./open_flamingo_datasets/COCO/train2014`
  - `./open_flamingo_datasets/COCO/val2014`

### CLIP Fine-tuning Datasets

#### 1. COCO Counterfactuals (COCO-CFs)

- Download `images.zip` from [HuggingFace COCO-Counterfactuals](https://huggingface.co/datasets/Intel/COCO-Counterfactuals/tree/main/data)
- Unzip and place the images in:
  - `./open_flamingo_datasets/COCO_CF/images`
  - `./clip_train_datasets/MS_COCO_COCO_CF/images`
- Copy the original images (ending with `_0.jpg`) to the APGD folders:

  ```bash
  cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_4/images
  cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_1/images
  ```

#### 2. APGD Adversarial Images

- Download from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
  - `apgd_1_images.zip` → `./clip_train_datasets/MS_COCO_APGD_1/images`
  - `apgd_4_images.zip` → `./clip_train_datasets/MS_COCO_APGD_4/images`

#### 3. COCO 2017 Validation Set

- Download from the [COCO website](https://cocodataset.org/#download)
- Copy the images to all CLIP training dataset folders:
  - `./clip_train_datasets/MS_COCO/images`
  - `./clip_train_datasets/MS_COCO_APGD_4/images`
  - `./clip_train_datasets/MS_COCO_APGD_1/images`
  - `./clip_train_datasets/MS_COCO_COCO_CF/images`

#### 4. COCO Captions and Classification Datasets

- Download `ms_coco_captions.json` from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx)
- Place it in: `./clip_train_datasets/MS_COCO`
- Download the classification datasets from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
  - `Caltech101.zip` → unzip in `./image_classification_datasets`
  - `Caltech256.zip` → unzip in `./image_classification_datasets`
- For ImageNet: download it externally and set the path in `vlm_eval/clip_classification.py`, line 52

---

## Usage

### Sparse vs Non-Sparse Attacks Evaluation

Evaluate vision-language models against adversarial attacks. The following command demonstrates the evaluation setup (available in `bash/run_script.sh` and `bash/run_script_slurm.sh`):

```bash
python -m vlm_eval.run_evaluation \
    --eval_flickr30 \
    --dont_save_adv \
    --verbose \
    --attack saif --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
    --pert_factor_graph 0 \
    --itr 0 \
    --itr_clip 0 \
    --itr_dataset base \
    --itr_method APGD_1 \
    --vision_encoder_pretrained openai \
    --num_samples 8 \
    --trial_seeds 42 \
    --num_trials 1 \
    --shots 0 \
    --batch_size 1 \
    --results_file res9B \
    --model open_flamingo \
    --out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
    --vision_encoder_path ViT-L-14 \
    --checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
    --lm_path anas-awadalla/mpt-7b \
    --lm_tokenizer_path anas-awadalla/mpt-7b \
    --precision float16 \
    --cross_attn_every_n_layers 4 \
    --coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
    --coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
    --coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
    --flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
    --flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
    --flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
    --vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
    --vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
    --vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
    --vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
    --vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
    --vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
    --vqav2_train_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/train2014 \
    --vqav2_train_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
    --vqav2_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_train2014_annotations.json \
    --vqav2_test_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/val2014 \
    --vqav2_test_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
    --vqav2_test_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_val2014_annotations.json \
    --textvqa_image_dir_path /mnt/datasets/textvqa/train_images \
    --textvqa_train_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_questions_vqa_format.json \
    --textvqa_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_annotations_vqa_format.json \
    --textvqa_test_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/val_questions_vqa_format.json \
    --textvqa_test_annotations_json_path /home/htc/kchitranshi/RobustVLM/textvqa/val_annotations_vqa_format.json \
    --ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
    --ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
    --ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
    --ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```

#### Configuration Options

**Attack types:**

- APGD attack: `--attack apgd --eps <eps>`
- SAIF attack: `--attack saif --eps <eps> --k <k>`
- No attack (clean evaluation): `--attack none`
- Targeted attack (COCO only): `--targeted --target_str "TARGET_STRING"`

**Shot settings:**

- 0-shot: `--shots 0`
- 4-shot: `--shots 4`
- Query mode: `--mask_out context`
- All mode: `--mask_out none`

**Evaluation tasks:**

- Image captioning:
  - COCO: `--eval_coco`
  - Flickr30k: `--eval_flickr30`
- Visual question answering:
  - VizWiz: `--eval_vizwiz`
  - OK-VQA: `--eval_ok_vqa`

**Other options:**

- Save adversarial samples as `.pt` files: remove `--dont_save_adv`
- Generate a perturbation factor graph (0-shot only): `--pert_factor_graph 1`

#### Running the Scripts

```bash
# Make the scripts executable
chmod +x ./bash/run_script.sh
chmod +x ./bash/run_script_slurm.sh

# Run locally or remotely
./bash/run_script.sh

# Run on a SLURM cluster
sbatch ./bash/run_script_slurm.sh
```

### Fine-tuning CLIP Models

Fine-tune CLIP models on adversarial examples (APGD) and COCO counterfactuals. Example command (available in `bash/train_clip.sh` and `bash/train_clip_slurm.sh`):

```bash
python vlm_eval/clip_train.py \
    --num_epochs 20 \
    --data_seeds 112 113 114 115 \
    --data_name base \
    --method APGD_4 \
    --batch_size 128 \
    --learning_rate 5e-7 \
    --save_model \
    --save_model_path ./fine_tuned_clip_models/APGD_4/
```

This fine-tunes CLIP for 20 epochs on the `base` dataset with the APGD attack (ε = 4/255).

#### Parameters

- `--data_name`: Dataset size variant
  - `MS_COCO`: Standard MS COCO (see thesis appendix)
  - `base`: Base subset
  - `medium`: Medium subset
  - `all`: Complete dataset
- `--method`: Training method
  - `APGD_4`: APGD with ε = 4/255
  - `APGD_1`: APGD with ε = 1/255
  - `COCO_CF`: COCO Counterfactuals
  - `NONE`: Clean MS COCO (no perturbations)
- `--data_seeds`: Random seeds for dataset sampling (e.g., `112 113 114 115`)

#### Running the Scripts

```bash
# Make the scripts executable
chmod +x ./bash/clip_train.sh
chmod +x ./bash/clip_train_slurm.sh

# Run locally or remotely
./bash/clip_train.sh

# Run on a SLURM cluster
sbatch ./bash/clip_train_slurm.sh
```

### Zero-Shot Image Classification

Evaluate fine-tuned CLIP models on image classification tasks. Example command (available in `bash/clip_classification.sh` and `bash/clip_classification_slurm.sh`):

```bash
python vlm_eval/clip_classification.py \
    --data base \
    --method COCO_CF \
    --dataset Caltech101
```

This performs zero-shot classification on Caltech101 using a CLIP model fine-tuned on the `base` COCO counterfactuals dataset.
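To compare several fine-tuned variants in one go, the single command above can be wrapped in a loop. The sketch below is illustrative and not part of the repository's scripts; the flag values come from the parameter lists in this README, and the `echo` makes it a dry run that only prints each command (drop the `echo` to actually execute):

```shell
# Illustrative sweep (not a repo script): print the zero-shot
# classification command for every model variant and a few datasets.
# Remove the leading `echo` to actually run the evaluations.
for data in MS_COCO base medium all non_fine_tuned; do
  for dataset in Caltech101 Caltech256 CIFAR10; do
    echo python vlm_eval/clip_classification.py \
      --data "$data" --method COCO_CF --dataset "$dataset"
  done
done
```

The same pattern applies to the other `--method` values (`APGD_4`, `APGD_1`, `NONE`) if you want a full grid.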
#### Parameters

- `--data`: Dataset variant
  - `MS_COCO`, `base`, `medium`, `all`: Fine-tuned models
  - `non_fine_tuned`: Pre-trained CLIP only (no fine-tuning)
- `--method`: `APGD_4`, `APGD_1`, `COCO_CF`, `NONE`
- `--dataset`: Classification dataset
  - `Food101`, `CIFAR10`, `CIFAR100`, `ImageNet`, `Caltech101`, `Caltech256`

**Note:** Evaluation is hardcoded to 20 epochs.

#### Running the Scripts

```bash
# Make the scripts executable
chmod +x ./bash/clip_classification.sh
chmod +x ./bash/clip_classification_slurm.sh

# Run locally or remotely
./bash/clip_classification.sh

# Run on a SLURM cluster
sbatch ./bash/clip_classification_slurm.sh
```

---

### Image-Text Retrieval

Perform image-to-text (i2t) and text-to-image (t2i) retrieval tasks:

```bash
python -m vlm_eval.run_evaluation \
    --eval_flickr30 \
    --dont_save_adv \
    --verbose \
    --attack none --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
    --pert_factor_graph 0 \
    --itr 1 \
    --itr_clip 0 \
    --itr_dataset base \
    --itr_method APGD_1 \
    --vision_encoder_pretrained openai \
    --num_samples 1000 \
    --trial_seeds 42 \
    --num_trials 1 \
    --shots 0 \
    --batch_size 1 \
    --results_file res9B \
    --model open_flamingo \
    --out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
    --vision_encoder_path ViT-L-14 \
    --checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
    --lm_path anas-awadalla/mpt-7b \
    --lm_tokenizer_path anas-awadalla/mpt-7b \
    --precision float16 \
    --cross_attn_every_n_layers 4 \
    --coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
    --coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
    --coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
    --flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
    --flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
    --flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
    --vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
    --vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
    --vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
    --vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
    --vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
    --vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
    --vqav2_train_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/train2014 \
    --vqav2_train_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
    --vqav2_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_train2014_annotations.json \
    --vqav2_test_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/val2014 \
    --vqav2_test_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
    --vqav2_test_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_val2014_annotations.json \
    --textvqa_image_dir_path /mnt/datasets/textvqa/train_images \
    --textvqa_train_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_questions_vqa_format.json \
    --textvqa_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_annotations_vqa_format.json \
    --textvqa_test_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/val_questions_vqa_format.json \
    --textvqa_test_annotations_json_path /home/htc/kchitranshi/RobustVLM/textvqa/val_annotations_vqa_format.json \
    --ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
    --ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
    --ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
    --ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```

This evaluates i2t and t2i retrieval on the Flickr30k 1K test set (1000 samples) using a CLIP model fine-tuned on the `base` APGD dataset (ε = 1/255).

#### Parameters

- `--itr_dataset`: Dataset used for the fine-tuned CLIP model
  - `MS_COCO`, `base`, `medium`, `all`: Fine-tuned variants
  - `non_fine_tuned`: Pre-trained CLIP only

**Note:** Image-text retrieval does not support targeted attacks or 4-shot settings.

---

## License

Please refer to the original [RobustVLM repository](https://github.com/chs20/RobustVLM) for licensing information.

## Acknowledgments

This code is adapted from the [RobustVLM](https://github.com/chs20/RobustVLM) repository. We thank the original authors for their foundational work.