# Robustness of Multi-Modal Foundational Models
Research code for evaluating the robustness of multi-modal foundational models (MMFMs) against adversarial attacks. This repository contains implementations for testing vision-language models like OpenFlamingo against sparse and non-sparse adversarial perturbations, as well as fine-tuning CLIP models on adversarial examples and COCO counterfactuals.
**Code adapted from:** [RobustVLM](https://github.com/chs20/RobustVLM)
## Table of Contents
- [Robustness of Multi-Modal Foundational Models](#robustness-of-multi-modal-foundational-models)
- [Table of Contents](#table-of-contents)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Dataset Setup](#dataset-setup)
- [VLM Evaluation Datasets](#vlm-evaluation-datasets)
- [1. VizWiz Dataset](#1-vizwiz-dataset)
- [2. OK-VQA Dataset](#2-ok-vqa-dataset)
- [3. Flickr30k Dataset](#3-flickr30k-dataset)
- [4. COCO Dataset (2014)](#4-coco-dataset-2014)
- [CLIP Fine-tuning Datasets](#clip-fine-tuning-datasets)
- [1. COCO Counterfactuals (COCO-CFs)](#1-coco-counterfactuals-coco-cfs)
- [2. APGD Adversarial Images](#2-apgd-adversarial-images)
- [3. COCO 2017 Validation Set](#3-coco-2017-validation-set)
- [4. COCO Captions and Classification Datasets](#4-coco-captions-and-classification-datasets)
- [Usage](#usage)
- [Sparse vs Non-Sparse Attacks Evaluation](#sparse-vs-non-sparse-attacks-evaluation)
- [Configuration Options](#configuration-options)
- [Running the Scripts](#running-the-scripts)
- [Fine-tuning CLIP Models](#fine-tuning-clip-models)
- [Parameters](#parameters)
- [Running the Scripts](#running-the-scripts-1)
- [Zero-Shot Image Classification](#zero-shot-image-classification)
- [Parameters](#parameters-1)
- [Running the Scripts](#running-the-scripts-2)
- [Image-Text Retrieval](#image-text-retrieval)
- [Parameters](#parameters-2)
- [License](#license)
- [Acknowledgments](#acknowledgments)
## Prerequisites
- **Python version:** 3.11.x
- **Java:** JDK 1.8.0_202 (required for CIDEr score computation)
- **CUDA-compatible GPU** (for model training and inference)
## Installation
1. Clone the repository and navigate to the project directory:
```bash
git clone <REPOSITORY_URL>   # replace with the URL of this repository
cd Robust_mmfm
```
2. Install required Python packages:
```bash
pip install -r requirements.txt
```
3. Download the OpenFlamingo 9B model from [HuggingFace](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b). After downloading, it should be located in `$HOME/.cache/huggingface/hub/` with the name `models--openflamingo--OpenFlamingo-9B-vitl-mpt7b`.
4. Install [JDK 1.8.0_202](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx) and add it to your PATH:
```bash
# Add to ~/.bashrc or ~/.zshrc
export PATH=$PATH:/path/to/jdk1.8.0_202/bin
export LANG=en_US.UTF-8
```
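After these steps, the setup can be sanity-checked as below. The download command is one possible way to fetch the checkpoint into the expected cache location, assuming `huggingface_hub` is available in the environment:
```bash
# Optional sanity check: Python 3.11.x and JDK 1.8.0_202 should be reported
python --version
java -version

# One way to download the OpenFlamingo 9B checkpoint into
# ~/.cache/huggingface/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='openflamingo/OpenFlamingo-9B-vitl-mpt7b')"
```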
## Dataset Setup
### VLM Evaluation Datasets
#### 1. VizWiz Dataset
- Download the [VizWiz VQA dataset](https://vizwiz.org/tasks-and-datasets/vqa/) (train and validation sets)
- Annotation files are included in the repository, but can be re-downloaded if corrupted
- Place images in:
- `./open_flamingo_datasets/VizWiz/train`
- `./open_flamingo_datasets/VizWiz/val`
#### 2. OK-VQA Dataset
- Download the [OK-VQA dataset](https://okvqa.allenai.org/download.html) (training and testing images)
- Annotation files are included in the repository
- Place all images in: `./open_flamingo_datasets/OKVQA`
#### 3. Flickr30k Dataset
- Download using instructions from [awsaf49/flickr-dataset](https://github.com/awsaf49/flickr-dataset)
- Annotation files (`karpathy_flickr30k.json`, `dataset_flickr30k_coco_style.json`) are included
- Alternative annotation download: [Tübingen ML Cloud](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
- Place images in: `./open_flamingo_datasets/Flickr30k/Images`
#### 4. COCO Dataset (2014)
- Download [COCO 2014](https://cocodataset.org/#download) train and validation sets
- Annotation files are included in the repository
- Alternative annotation downloads:
- [karpathy_coco.json](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
- [captions_val2014.json](https://github.com/tylin/coco-caption/blob/master/annotations/captions_val2014.json)
- Place images in:
- `./open_flamingo_datasets/COCO/train2014`
- `./open_flamingo_datasets/COCO/val2014`
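With the four datasets in place, `open_flamingo_datasets/` should look roughly as follows (annotation JSONs shipped with the repository are omitted):
```
open_flamingo_datasets/
├── VizWiz/
│   ├── train/          # VizWiz training images
│   └── val/            # VizWiz validation images
├── OKVQA/              # OK-VQA images (train + test) and annotation files
├── Flickr30k/
│   └── Images/         # Flickr30k images
├── COCO/
│   ├── train2014/      # COCO 2014 training images
│   └── val2014/        # COCO 2014 validation images
└── COCO_CF/
    └── images/         # populated in the CLIP fine-tuning section below
```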
### CLIP Fine-tuning Datasets
#### 1. COCO Counterfactuals (COCO-CFs)
- Download `images.zip` from [HuggingFace COCO-Counterfactuals](https://huggingface.co/datasets/Intel/COCO-Counterfactuals/tree/main/data)
- Unzip and place images in:
- `./open_flamingo_datasets/COCO_CF/images`
- `./clip_train_datasets/MS_COCO_COCO_CF/images`
- Copy the original (non-counterfactual) images, whose filenames end with `_0.jpg`, into the APGD training folders:
```bash
cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_4/images
cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_1/images
```
#### 2. APGD Adversarial Images
- Download from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
- `apgd_1_images.zip` → `./clip_train_datasets/MS_COCO_APGD_1/images`
- `apgd_4_images.zip` → `./clip_train_datasets/MS_COCO_APGD_4/images`
#### 3. COCO 2017 Validation Set
- Download from [COCO website](https://cocodataset.org/#download)
- Copy images to all CLIP training dataset folders:
- `./clip_train_datasets/MS_COCO/images`
- `./clip_train_datasets/MS_COCO_APGD_4/images`
- `./clip_train_datasets/MS_COCO_APGD_1/images`
- `./clip_train_datasets/MS_COCO_COCO_CF/images`
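Assuming the COCO 2017 validation images were unzipped into a local `val2017/` directory (the source path is an assumption; adjust it to your download location), a short loop copies them into each folder:
```bash
# Copy COCO 2017 val images into every CLIP training dataset folder.
# SRC is a placeholder for wherever val2017 was extracted.
SRC=./val2017
for d in MS_COCO MS_COCO_APGD_4 MS_COCO_APGD_1 MS_COCO_COCO_CF; do
  cp "$SRC"/*.jpg "./clip_train_datasets/$d/images/"
done
```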
#### 4. COCO Captions and Classification Datasets
- Download `ms_coco_captions.json` from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx)
- Place in: `./clip_train_datasets/MS_COCO`
- Download classification datasets from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
- `Caltech101.zip` → unzip in `./image_classification_datasets`
- `Caltech256.zip` → unzip in `./image_classification_datasets`
- For ImageNet: download it separately and set its path in `vlm_eval/clip_classification.py` (line 52)
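After these downloads, the CLIP fine-tuning and classification folders should look roughly like this (exact file counts will differ):
```
clip_train_datasets/
├── MS_COCO/
│   ├── images/              # COCO 2017 val images
│   └── ms_coco_captions.json
├── MS_COCO_APGD_1/
│   └── images/              # COCO 2017 val + *_0.jpg originals + apgd_1_images
├── MS_COCO_APGD_4/
│   └── images/              # COCO 2017 val + *_0.jpg originals + apgd_4_images
└── MS_COCO_COCO_CF/
    └── images/              # COCO 2017 val + COCO counterfactual images

image_classification_datasets/
├── Caltech101/
└── Caltech256/
```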
---
## Usage
### Sparse vs Non-Sparse Attacks Evaluation
Evaluate vision-language models against adversarial attacks. The following command demonstrates the evaluation setup (available in `bash/run_script.sh` and `bash/run_script_slurm.sh`):
```bash
python -m vlm_eval.run_evaluation \
--eval_flickr30 \
--dont_save_adv \
--verbose \
--attack saif --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
--pert_factor_graph 0 \
--itr 0 \
--itr_clip 0 \
--itr_dataset base \
--itr_method APGD_1 \
--vision_encoder_pretrained openai \
--num_samples 8 \
--trial_seeds 42 \
--num_trials 1 \
--shots 0 \
--batch_size 1 \
--results_file res9B \
--model open_flamingo \
--out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
--vision_encoder_path ViT-L-14 \
--checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
--lm_path anas-awadalla/mpt-7b \
--lm_tokenizer_path anas-awadalla/mpt-7b \
--precision float16 \
--cross_attn_every_n_layers 4 \
--coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
--coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
--coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
--flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
--flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
--flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
--vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
--vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
--vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
--vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
--vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
--vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
--vqav2_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--vqav2_train_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
--vqav2_train_annotations_json_path /PATH/TO/vqav2/v2_mscoco_train2014_annotations.json \
--vqav2_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--vqav2_test_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
--vqav2_test_annotations_json_path /PATH/TO/vqav2/v2_mscoco_val2014_annotations.json \
--textvqa_image_dir_path /PATH/TO/textvqa/train_images \
--textvqa_train_questions_json_path /PATH/TO/textvqa/train_questions_vqa_format.json \
--textvqa_train_annotations_json_path /PATH/TO/textvqa/train_annotations_vqa_format.json \
--textvqa_test_questions_json_path /PATH/TO/textvqa/val_questions_vqa_format.json \
--textvqa_test_annotations_json_path /PATH/TO/textvqa/val_annotations_vqa_format.json \
--ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
--ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
--ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
--ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```
#### Configuration Options
**Attack Types:**
- APGD attack: `--attack apgd --eps <epsilon>`
- SAIF attack: `--attack saif --eps <epsilon> --k <k_value>`
- No attack (clean): `--attack none`
- Targeted attack (COCO only): `--targeted --target_str "TARGET_STRING"`
**Shot Settings:**
- 0-shot: `--shots 0`
- 4-shot: `--shots 4`
- Query mode (attack the query image only): `--mask_out context`
- All mode (attack all images): `--mask_out none`
**Evaluation Tasks:**
- Image Captioning:
- COCO: `--eval_coco`
- Flickr30k: `--eval_flickr30`
- Visual Question Answering:
- VizWiz: `--eval_vizwiz`
- OK-VQA: `--eval_ok_vqa`
**Other Options:**
- Save adversarial samples as `.pt` files: remove `--dont_save_adv`
- Generate perturbation factor graph (0-shot only): `--pert_factor_graph 1`
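As a concrete illustration of how these options combine, the sketch below lists the flag substitutions (relative to the full command above) for a hypothetical 4-shot APGD run on COCO captioning that attacks only the query image; all path, model, and precision flags stay unchanged:
```bash
# Hypothetical flag substitutions in bash/run_script.sh (values illustrative):
#   --eval_flickr30                                   ->  --eval_coco
#   --attack saif --eps 255 ... --lam 0.005 --k 1000  ->  --attack apgd --eps <epsilon>
#   --shots 0                                         ->  --shots 4
#   --mask_out none                                   ->  --mask_out context
```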
#### Running the Scripts
```bash
# Make scripts executable
chmod +x ./bash/run_script.sh
chmod +x ./bash/run_script_slurm.sh
# Run locally or remotely
./bash/run_script.sh
# Run on SLURM cluster
sbatch ./bash/run_script_slurm.sh
```
### Fine-tuning CLIP Models
Fine-tune CLIP models on adversarial examples (APGD) and COCO counterfactuals. Example command (available in `bash/clip_train.sh` and `bash/clip_train_slurm.sh`):
```bash
python vlm_eval/clip_train.py \
--num_epochs 20 \
--data_seeds 112 113 114 115 \
--data_name base \
--method APGD_4 \
--batch_size 128 \
--learning_rate 5e-7 \
--save_model \
--save_model_path ./fine_tuned_clip_models/APGD_4/
```
This fine-tunes CLIP for 20 epochs on the `base` dataset with APGD attack (ε=4/255).
#### Parameters
- `--data_name`: Dataset size variant
- `MS_COCO`: Standard MS COCO (see thesis appendix)
- `base`: Base subset
- `medium`: Medium subset
- `all`: Complete dataset
- `--method`: Training method
- `APGD_4`: APGD with ε=4/255
- `APGD_1`: APGD with ε=1/255
- `COCO_CF`: COCO Counterfactuals
- `NONE`: Clean MS COCO (no perturbations)
- `--data_seeds`: Random seeds for dataset sampling (e.g., `112 113 114 115`)
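For example, fine-tuning on COCO counterfactuals with the `medium` subset uses the same interface; the command below is illustrative, with seeds and hyperparameters copied from the example above:
```bash
python vlm_eval/clip_train.py \
--num_epochs 20 \
--data_seeds 112 113 114 115 \
--data_name medium \
--method COCO_CF \
--batch_size 128 \
--learning_rate 5e-7 \
--save_model \
--save_model_path ./fine_tuned_clip_models/COCO_CF/
```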
#### Running the Scripts
```bash
# Make scripts executable
chmod +x ./bash/clip_train.sh
chmod +x ./bash/clip_train_slurm.sh
# Run locally or remotely
./bash/clip_train.sh
# Run on SLURM cluster
sbatch ./bash/clip_train_slurm.sh
```
### Zero-Shot Image Classification
Evaluate fine-tuned CLIP models on image classification tasks. Example command (available in `bash/clip_classification.sh` and `bash/clip_classification_slurm.sh`):
```bash
python vlm_eval/clip_classification.py \
--data base \
--method COCO_CF \
--dataset Caltech101
```
This performs zero-shot classification on Caltech101 using a CLIP model fine-tuned on the `base` COCO counterfactuals dataset.
#### Parameters
- `--data`: Dataset variant
- `MS_COCO`, `base`, `medium`, `all`: Fine-tuned models
- `non_fine_tuned`: Pre-trained CLIP only (no fine-tuning)
- `--method`: `APGD_4`, `APGD_1`, `COCO_CF`, `NONE`
- `--dataset`: Classification dataset
- `Food101`, `CIFAR10`, `CIFAR100`, `ImageNet`, `Caltech101`, `Caltech256`
**Note:** Evaluation is hardcoded to 20 epochs.
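For a non-fine-tuned baseline, the same script can be pointed at the original pre-trained CLIP weights; whether `--method` is still consulted in this case is an assumption, so `NONE` is passed for clarity:
```bash
# Baseline: zero-shot CIFAR100 with the pre-trained (not fine-tuned) CLIP
python vlm_eval/clip_classification.py \
--data non_fine_tuned \
--method NONE \
--dataset CIFAR100
```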
#### Running the Scripts
```bash
chmod +x ./bash/clip_classification.sh
chmod +x ./bash/clip_classification_slurm.sh
# Run locally or remotely
./bash/clip_classification.sh
# Run on SLURM cluster
sbatch ./bash/clip_classification_slurm.sh
```
---
### Image-Text Retrieval
Perform image-to-text (i2t) and text-to-image (t2i) retrieval tasks:
```bash
python -m vlm_eval.run_evaluation \
--eval_flickr30 \
--dont_save_adv \
--verbose \
--attack none --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
--pert_factor_graph 0 \
--itr 1 \
--itr_clip 0 \
--itr_dataset base \
--itr_method APGD_1 \
--vision_encoder_pretrained openai \
--num_samples 1000 \
--trial_seeds 42 \
--num_trials 1 \
--shots 0 \
--batch_size 1 \
--results_file res9B \
--model open_flamingo \
--out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
--vision_encoder_path ViT-L-14 \
--checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
--lm_path anas-awadalla/mpt-7b \
--lm_tokenizer_path anas-awadalla/mpt-7b \
--precision float16 \
--cross_attn_every_n_layers 4 \
--coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
--coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
--coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
--flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
--flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
--flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
--vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
--vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
--vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
--vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
--vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
--vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
--vqav2_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--vqav2_train_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
--vqav2_train_annotations_json_path /PATH/TO/vqav2/v2_mscoco_train2014_annotations.json \
--vqav2_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--vqav2_test_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
--vqav2_test_annotations_json_path /PATH/TO/vqav2/v2_mscoco_val2014_annotations.json \
--textvqa_image_dir_path /PATH/TO/textvqa/train_images \
--textvqa_train_questions_json_path /PATH/TO/textvqa/train_questions_vqa_format.json \
--textvqa_train_annotations_json_path /PATH/TO/textvqa/train_annotations_vqa_format.json \
--textvqa_test_questions_json_path /PATH/TO/textvqa/val_questions_vqa_format.json \
--textvqa_test_annotations_json_path /PATH/TO/textvqa/val_annotations_vqa_format.json \
--ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
--ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
--ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
--ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```
This evaluates i2t and t2i retrieval on the Flickr30k 1K test set (1,000 samples) using a CLIP model fine-tuned on the `base` APGD dataset (ε=1/255).
#### Parameters
- `--itr_dataset`: Dataset for fine-tuned CLIP model
- `MS_COCO`, `base`, `medium`, `all`: Fine-tuned variants
- `non_fine_tuned`: Pre-trained CLIP only
**Note:** Image-text retrieval does not support targeted attacks or 4-shot settings.
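For quick reference, these are the flags that distinguish the retrieval run from the attack-evaluation command in the previous section; the reading of `--itr` and `--itr_method` follows from the flag names and the two example commands, so treat it as an assumption rather than documented behavior:
```bash
--itr 1                # presumably toggles image-text retrieval mode (0 in the attack run)
--attack none          # retrieval here is evaluated on clean images
--num_samples 1000     # Flickr30k 1K test split
--itr_dataset base     # dataset variant of the fine-tuned CLIP (see Parameters above)
--itr_method APGD_1    # presumably the fine-tuning method, analogous to --method
```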
---
## License
Please refer to the original [RobustVLM repository](https://github.com/chs20/RobustVLM) for licensing information.
## Acknowledgments
This code is adapted from the [RobustVLM](https://github.com/chs20/RobustVLM) repository. We thank the original authors for their foundational work.