# Robustness of Multi-Modal Foundational Models

Research code for evaluating the robustness of multi-modal foundational models (MMFMs) against adversarial attacks. This repository contains implementations for testing vision-language models such as OpenFlamingo against sparse and non-sparse adversarial perturbations, as well as for fine-tuning CLIP models on adversarial examples and COCO counterfactuals.

**Code adapted from:** [RobustVLM](https://github.com/chs20/RobustVLM)

## Table of Contents

- [Robustness of Multi-Modal Foundational Models](#robustness-of-multi-modal-foundational-models)
  - [Table of Contents](#table-of-contents)
  - [Prerequisites](#prerequisites)
  - [Installation](#installation)
  - [Dataset Setup](#dataset-setup)
    - [VLM Evaluation Datasets](#vlm-evaluation-datasets)
      - [1. VizWiz Dataset](#1-vizwiz-dataset)
      - [2. OK-VQA Dataset](#2-ok-vqa-dataset)
      - [3. Flickr30k Dataset](#3-flickr30k-dataset)
      - [4. COCO Dataset (2014)](#4-coco-dataset-2014)
    - [CLIP Fine-tuning Datasets](#clip-fine-tuning-datasets)
      - [1. COCO Counterfactuals (COCO-CFs)](#1-coco-counterfactuals-coco-cfs)
      - [2. APGD Adversarial Images](#2-apgd-adversarial-images)
      - [3. COCO 2017 Validation Set](#3-coco-2017-validation-set)
      - [4. COCO Captions and Classification Datasets](#4-coco-captions-and-classification-datasets)
  - [Usage](#usage)
    - [Sparse vs Non-Sparse Attacks Evaluation](#sparse-vs-non-sparse-attacks-evaluation)
      - [Configuration Options](#configuration-options)
      - [Running the Scripts](#running-the-scripts)
    - [Fine-tuning CLIP Models](#fine-tuning-clip-models)
      - [Parameters](#parameters)
      - [Running the Scripts](#running-the-scripts-1)
    - [Zero-Shot Image Classification](#zero-shot-image-classification)
      - [Parameters](#parameters-1)
      - [Running the Scripts](#running-the-scripts-2)
    - [Image-Text Retrieval](#image-text-retrieval)
      - [Parameters](#parameters-2)
  - [License](#license)
  - [Acknowledgments](#acknowledgments)

## Prerequisites

- **Python version:** 3.11.x
- **Java:** JDK 1.8.0_202 (required for CIDEr score computation)
- **CUDA-compatible GPU** (for model training and inference)

## Installation

1. Clone the repository and navigate to the project directory:

   ```bash
   cd Robust_mmfm
   ```

2. Install the required Python packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Download the OpenFlamingo 9B model from [HuggingFace](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b). After downloading, it should be located in `$HOME/.cache/huggingface/hub/` under the name `models--openflamingo--OpenFlamingo-9B-vitl-mpt7b`.

4. Install [JDK 1.8.0_202](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx) and add it to your PATH:

   ```bash
   # Add to ~/.bashrc or ~/.zshrc
   export PATH=$PATH:/path/to/jdk1.8.0_202/bin
   export LANG=en_US.UTF-8
   ```

## Dataset Setup

### VLM Evaluation Datasets

#### 1. VizWiz Dataset

- Download the [VizWiz VQA dataset](https://vizwiz.org/tasks-and-datasets/vqa/) (train and validation sets)
- Annotation files are included in the repository, but can be re-downloaded if corrupted
- Place images in:
  - `./open_flamingo_datasets/VizWiz/train`
  - `./open_flamingo_datasets/VizWiz/val`

#### 2. OK-VQA Dataset

- Download the [OK-VQA dataset](https://okvqa.allenai.org/download.html) (training and testing images)
- Annotation files are included in the repository
- Place all images in: `./open_flamingo_datasets/OKVQA`

#### 3. Flickr30k Dataset

- Download using the instructions from [awsaf49/flickr-dataset](https://github.com/awsaf49/flickr-dataset)
- Annotation files (`karpathy_flickr30k.json`, `dataset_flickr30k_coco_style.json`) are included
- Alternative annotation download: [Tübingen ML Cloud](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
- Place images in: `./open_flamingo_datasets/Flickr30k/Images`

#### 4. COCO Dataset (2014)

- Download the [COCO 2014](https://cocodataset.org/#download) train and validation sets
- Annotation files are included in the repository
- Alternative annotation downloads:
  - [karpathy_coco.json](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
  - [captions_val2014.json](https://github.com/tylin/coco-caption/blob/master/annotations/captions_val2014.json)
- Place images in:
  - `./open_flamingo_datasets/COCO/train2014`
  - `./open_flamingo_datasets/COCO/val2014`

### CLIP Fine-tuning Datasets

#### 1. COCO Counterfactuals (COCO-CFs)

- Download `images.zip` from [HuggingFace COCO-Counterfactuals](https://huggingface.co/datasets/Intel/COCO-Counterfactuals/tree/main/data)
- Unzip and place the images in:
  - `./open_flamingo_datasets/COCO_CF/images`
  - `./clip_train_datasets/MS_COCO_COCO_CF/images`
- Copy the original images (ending with `_0.jpg`) to the APGD folders:

  ```bash
  cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_4/images
  cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_1/images
  ```

#### 2. APGD Adversarial Images

- Download from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
  - `apgd_1_images.zip` → `./clip_train_datasets/MS_COCO_APGD_1/images`
  - `apgd_4_images.zip` → `./clip_train_datasets/MS_COCO_APGD_4/images`

#### 3. COCO 2017 Validation Set

- Download from the [COCO website](https://cocodataset.org/#download)
- Copy the images to all CLIP training dataset folders:
  - `./clip_train_datasets/MS_COCO/images`
  - `./clip_train_datasets/MS_COCO_APGD_4/images`
  - `./clip_train_datasets/MS_COCO_APGD_1/images`
  - `./clip_train_datasets/MS_COCO_COCO_CF/images`

#### 4. COCO Captions and Classification Datasets

- Download `ms_coco_captions.json` from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx)
- Place it in: `./clip_train_datasets/MS_COCO`
- Download the classification datasets from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
  - `Caltech101.zip` → unzip in `./image_classification_datasets`
  - `Caltech256.zip` → unzip in `./image_classification_datasets`
- For ImageNet: download it externally and set the path in `vlm_eval/clip_classification.py`, line 52

---

## Usage

### Sparse vs Non-Sparse Attacks Evaluation

Evaluate vision-language models against adversarial attacks. The following command demonstrates the evaluation setup (available in `bash/run_script.sh` and `bash/run_script_slurm.sh`):

```bash
python -m vlm_eval.run_evaluation \
    --eval_flickr30 \
    --dont_save_adv \
    --verbose \
    --attack saif --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
    --pert_factor_graph 0 \
    --itr 0 \
    --itr_clip 0 \
    --itr_dataset base \
    --itr_method APGD_1 \
    --vision_encoder_pretrained openai \
    --num_samples 8 \
    --trial_seeds 42 \
    --num_trials 1 \
    --shots 0 \
    --batch_size 1 \
    --results_file res9B \
    --model open_flamingo \
    --out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
    --vision_encoder_path ViT-L-14 \
    --checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
    --lm_path anas-awadalla/mpt-7b \
    --lm_tokenizer_path anas-awadalla/mpt-7b \
    --precision float16 \
    --cross_attn_every_n_layers 4 \
    --coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
    --coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
    --coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
    --flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
    --flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
    --flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
    --vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
    --vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
    --vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
    --vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
    --vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
    --vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
    --vqav2_train_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/train2014 \
    --vqav2_train_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
    --vqav2_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_train2014_annotations.json \
    --vqav2_test_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/val2014 \
    --vqav2_test_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
    --vqav2_test_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_val2014_annotations.json \
    --textvqa_image_dir_path /mnt/datasets/textvqa/train_images \
    --textvqa_train_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_questions_vqa_format.json \
    --textvqa_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_annotations_vqa_format.json \
    --textvqa_test_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/val_questions_vqa_format.json \
    --textvqa_test_annotations_json_path /home/htc/kchitranshi/RobustVLM/textvqa/val_annotations_vqa_format.json \
    --ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
    --ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
    --ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
    --ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```

#### Configuration Options

**Attack types:**

- APGD attack: `--attack apgd --eps <eps>`
- SAIF attack: `--attack saif --eps <eps> --k <k>`
- No attack (clean evaluation): `--attack none`
- Targeted attack (COCO only): `--targeted --target_str "TARGET_STRING"`

**Shot settings:**

- 0-shot: `--shots 0`
- 4-shot: `--shots 4`
- Query mode: `--mask_out context`
- All mode: `--mask_out none`

**Evaluation tasks:**

- Image captioning:
  - COCO: `--eval_coco`
  - Flickr30k: `--eval_flickr30`
- Visual question answering:
  - VizWiz: `--eval_vizwiz`
  - OK-VQA: `--eval_ok_vqa`

**Other options:**

- Save adversarial samples as `.pt` files: remove `--dont_save_adv`
- Generate a perturbation factor graph (0-shot only): `--pert_factor_graph 1`

#### Running the Scripts

```bash
# Make the scripts executable
chmod +x ./bash/run_script.sh
chmod +x ./bash/run_script_slurm.sh

# Run locally or remotely
./bash/run_script.sh

# Run on a SLURM cluster
sbatch ./bash/run_script_slurm.sh
```

### Fine-tuning CLIP Models

Fine-tune CLIP models on adversarial examples (APGD) and COCO counterfactuals. Example command (available in `bash/train_clip.sh` and `bash/train_clip_slurm.sh`):

```bash
python vlm_eval/clip_train.py \
    --num_epochs 20 \
    --data_seeds 112 113 114 115 \
    --data_name base \
    --method APGD_4 \
    --batch_size 128 \
    --learning_rate 5e-7 \
    --save_model \
    --save_model_path ./fine_tuned_clip_models/APGD_4/
```

This fine-tunes CLIP for 20 epochs on the `base` dataset with the APGD attack (ε = 4/255).

#### Parameters

- `--data_name`: Dataset size variant
  - `MS_COCO`: Standard MS COCO (see thesis appendix)
  - `base`: Base subset
  - `medium`: Medium subset
  - `all`: Complete dataset
- `--method`: Training method
  - `APGD_4`: APGD with ε = 4/255
  - `APGD_1`: APGD with ε = 1/255
  - `COCO_CF`: COCO Counterfactuals
  - `NONE`: Clean MS COCO (no perturbations)
- `--data_seeds`: Random seeds for dataset sampling (e.g., `112 113 114 115`)

#### Running the Scripts

```bash
# Make the scripts executable
chmod +x ./bash/clip_train.sh
chmod +x ./bash/clip_train_slurm.sh

# Run locally or remotely
./bash/clip_train.sh

# Run on a SLURM cluster
sbatch ./bash/clip_train_slurm.sh
```

### Zero-Shot Image Classification

Evaluate fine-tuned CLIP models on image classification tasks. Example command (available in `bash/clip_classification.sh` and `bash/clip_classification_slurm.sh`):

```bash
python vlm_eval/clip_classification.py \
    --data base \
    --method COCO_CF \
    --dataset Caltech101
```

This performs zero-shot classification on Caltech101 using a CLIP model fine-tuned on the `base` COCO counterfactuals dataset.
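To compare several fine-tuned variants in one go, the single command above can be wrapped in a loop. The sketch below is illustrative and not part of the repository's scripts; the flag values come from the parameter lists in this README, and the `echo` makes it a dry run that only prints each command (drop the `echo` to actually execute):

```shell
# Illustrative sweep (not a repo script): print the zero-shot
# classification command for every model variant and a few datasets.
# Remove the leading `echo` to actually run the evaluations.
for data in MS_COCO base medium all non_fine_tuned; do
  for dataset in Caltech101 Caltech256 CIFAR10; do
    echo python vlm_eval/clip_classification.py \
      --data "$data" --method COCO_CF --dataset "$dataset"
  done
done
```

The same pattern applies to the other `--method` values (`APGD_4`, `APGD_1`, `NONE`) if you want a full grid.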
#### Parameters

- `--data`: Dataset variant
  - `MS_COCO`, `base`, `medium`, `all`: Fine-tuned models
  - `non_fine_tuned`: Pre-trained CLIP only (no fine-tuning)
- `--method`: `APGD_4`, `APGD_1`, `COCO_CF`, `NONE`
- `--dataset`: Classification dataset
  - `Food101`, `CIFAR10`, `CIFAR100`, `ImageNet`, `Caltech101`, `Caltech256`

**Note:** Evaluation is hardcoded to 20 epochs.

#### Running the Scripts

```bash
# Make the scripts executable
chmod +x ./bash/clip_classification.sh
chmod +x ./bash/clip_classification_slurm.sh

# Run locally or remotely
./bash/clip_classification.sh

# Run on a SLURM cluster
sbatch ./bash/clip_classification_slurm.sh
```

---

### Image-Text Retrieval

Perform image-to-text (i2t) and text-to-image (t2i) retrieval tasks:

```bash
python -m vlm_eval.run_evaluation \
    --eval_flickr30 \
    --dont_save_adv \
    --verbose \
    --attack none --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
    --pert_factor_graph 0 \
    --itr 1 \
    --itr_clip 0 \
    --itr_dataset base \
    --itr_method APGD_1 \
    --vision_encoder_pretrained openai \
    --num_samples 1000 \
    --trial_seeds 42 \
    --num_trials 1 \
    --shots 0 \
    --batch_size 1 \
    --results_file res9B \
    --model open_flamingo \
    --out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
    --vision_encoder_path ViT-L-14 \
    --checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
    --lm_path anas-awadalla/mpt-7b \
    --lm_tokenizer_path anas-awadalla/mpt-7b \
    --precision float16 \
    --cross_attn_every_n_layers 4 \
    --coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
    --coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
    --coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
    --flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
    --flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
    --flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
    --vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
    --vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
    --vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
    --vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
    --vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
    --vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
    --vqav2_train_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/train2014 \
    --vqav2_train_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
    --vqav2_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_train2014_annotations.json \
    --vqav2_test_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/val2014 \
    --vqav2_test_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
    --vqav2_test_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_val2014_annotations.json \
    --textvqa_image_dir_path /mnt/datasets/textvqa/train_images \
    --textvqa_train_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_questions_vqa_format.json \
    --textvqa_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_annotations_vqa_format.json \
    --textvqa_test_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/val_questions_vqa_format.json \
    --textvqa_test_annotations_json_path /home/htc/kchitranshi/RobustVLM/textvqa/val_annotations_vqa_format.json \
    --ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
    --ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
    --ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
    --ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```

This evaluates i2t and t2i retrieval on the Flickr30k 1K test set (1000 samples) using a CLIP model fine-tuned on the `base` APGD dataset (ε = 1/255).

#### Parameters

- `--itr_dataset`: Dataset used for the fine-tuned CLIP model
  - `MS_COCO`, `base`, `medium`, `all`: Fine-tuned variants
  - `non_fine_tuned`: Pre-trained CLIP only

**Note:** Image-text retrieval does not support targeted attacks or 4-shot settings.

---

## License

Please refer to the original [RobustVLM repository](https://github.com/chs20/RobustVLM) for licensing information.

## Acknowledgments

This code is adapted from the [RobustVLM](https://github.com/chs20/RobustVLM) repository. We thank the original authors for their foundational work.