
Robustness of Multi-Modal Foundational Models

Research code for evaluating the robustness of multi-modal foundational models (MMFMs) against adversarial attacks. This repository contains implementations for testing vision-language models like OpenFlamingo against sparse and non-sparse adversarial perturbations, as well as fine-tuning CLIP models on adversarial examples and COCO counterfactuals.

Code adapted from: RobustVLM

Table of Contents

  • Prerequisites
  • Installation
  • Dataset Setup
  • Usage
    • Sparse vs Non-Sparse Attacks Evaluation
    • Fine-tuning CLIP Models
    • Zero-Shot Image Classification
    • Image-Text Retrieval
  • License
  • Acknowledgments

Prerequisites

  • Python version: 3.11.x
  • Java: JDK 1.8.0_202 (required for CIDEr score computation)
  • CUDA-compatible GPU (for model training and inference)

Installation

  1. Clone the repository and navigate to the project directory:

    cd Robust_mmfm
    
  2. Install required Python packages:

    pip install -r requirements.txt
    
  3. Download the OpenFlamingo 9B model from HuggingFace. After downloading, it should be located in $HOME/.cache/huggingface/hub/ with the name models--openflamingo--OpenFlamingo-9B-vitl-mpt7b.
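     One way to do this (a sketch, assuming a recent huggingface_hub release that provides the download subcommand) is via the HuggingFace CLI, which writes to the default cache location:

    # Downloads into $HOME/.cache/huggingface/hub/ by default
    huggingface-cli download openflamingo/OpenFlamingo-9B-vitl-mpt7b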

  4. Install JDK 1.8.0_202 and add it to your PATH:

    # Add to ~/.bashrc or ~/.zshrc
    export PATH=$PATH:/path/to/jdk1.8.0_202/bin
    export LANG=en_US.UTF-8
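     After updating your PATH, you can confirm the JDK is picked up (it should report version 1.8.0_202):

    java -version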
    

Dataset Setup

VLM Evaluation Datasets

1. VizWiz Dataset

  • Download the VizWiz VQA dataset (train and validation sets)
  • Annotation files are included in the repository, but can be re-downloaded if corrupted
  • Place images in:
    • ./open_flamingo_datasets/VizWiz/train
    • ./open_flamingo_datasets/VizWiz/val
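  • For example, assuming the images arrive as train.zip and val.zip with the files at the archive root (names and structure may differ depending on the download source):

    unzip train.zip -d ./open_flamingo_datasets/VizWiz/train
    unzip val.zip -d ./open_flamingo_datasets/VizWiz/val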

2. OK-VQA Dataset

  • Download the OK-VQA dataset (training and testing images)
  • Annotation files are included in the repository
  • Place all images in: ./open_flamingo_datasets/OKVQA

3. Flickr30k Dataset

  • Download using instructions from awsaf49/flickr-dataset
  • Annotation files (karpathy_flickr30k.json, dataset_flickr30k_coco_style.json) are included
  • Alternative annotation download: TU Berlin Cloud
  • Place images in: ./open_flamingo_datasets/Flickr30k/Images

4. COCO Dataset (2014)

  • Download COCO 2014 train and validation sets
  • Annotation files are included in the repository
  • Alternative annotation downloads:
  • Place images in:
    • ./open_flamingo_datasets/COCO/train2014
    • ./open_flamingo_datasets/COCO/val2014
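  • For example, using the official COCO 2014 image archives (each zip already contains a train2014/ or val2014/ folder):

    wget http://images.cocodataset.org/zips/train2014.zip
    wget http://images.cocodataset.org/zips/val2014.zip
    unzip train2014.zip -d ./open_flamingo_datasets/COCO
    unzip val2014.zip -d ./open_flamingo_datasets/COCO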

CLIP Fine-tuning Datasets

1. COCO Counterfactuals (COCO-CFs)

  • Download images.zip from HuggingFace COCO-Counterfactuals
  • Unzip and place images in:
    • ./open_flamingo_datasets/COCO_CF/images
    • ./clip_train_datasets/MS_COCO_COCO_CF/images
  • Copy original images (ending with _0.jpg) to:
    cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_4/images
    cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_1/images
    

2. APGD Adversarial Images

  • Download from TU Berlin Cloud:
    • apgd_1_images.zip → unzip in ./clip_train_datasets/MS_COCO_APGD_1/images
    • apgd_4_images.zip → unzip in ./clip_train_datasets/MS_COCO_APGD_4/images
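  • For example (assuming the archives contain the image files at the archive root):

    unzip apgd_1_images.zip -d ./clip_train_datasets/MS_COCO_APGD_1/images
    unzip apgd_4_images.zip -d ./clip_train_datasets/MS_COCO_APGD_4/images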

3. COCO 2017 Validation Set

  • Download from COCO website
  • Copy images to all CLIP training dataset folders:
    • ./clip_train_datasets/MS_COCO/images
    • ./clip_train_datasets/MS_COCO_APGD_4/images
    • ./clip_train_datasets/MS_COCO_APGD_1/images
    • ./clip_train_datasets/MS_COCO_COCO_CF/images
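  • For example, downloading the official val2017 archive and copying the extracted images into each folder:

    wget http://images.cocodataset.org/zips/val2017.zip
    unzip val2017.zip
    for d in MS_COCO MS_COCO_APGD_4 MS_COCO_APGD_1 MS_COCO_COCO_CF; do
        cp val2017/*.jpg ./clip_train_datasets/$d/images/
    done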

4. COCO Captions and Classification Datasets

  • Download ms_coco_captions.json from TU Berlin Cloud
  • Place in: ./clip_train_datasets/MS_COCO
  • Download classification datasets from TU Berlin Cloud:
    • Caltech101.zip → unzip in ./image_classification_datasets
    • Caltech256.zip → unzip in ./image_classification_datasets
  • For ImageNet: Download externally and set path in vlm_eval/clip_classification.py line 52
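  • For example, placing the captions file and unpacking the classification archives (a sketch):

    mv ms_coco_captions.json ./clip_train_datasets/MS_COCO/
    unzip Caltech101.zip -d ./image_classification_datasets
    unzip Caltech256.zip -d ./image_classification_datasets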

Usage

Sparse vs Non-Sparse Attacks Evaluation

Evaluate vision-language models against adversarial attacks. The following command demonstrates the evaluation setup (available in bash/run_script.sh and bash/run_script_slurm.sh):

python -m vlm_eval.run_evaluation \
--eval_flickr30 \
--dont_save_adv \
--verbose \
--attack saif --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
--pert_factor_graph 0 \
--itr 0 \
--itr_clip 0 \
--itr_dataset base \
--itr_method APGD_1 \
--vision_encoder_pretrained openai \
--num_samples 8 \
--trial_seeds 42 \
--num_trials 1 \
--shots 0 \
--batch_size 1 \
--results_file res9B \
--model open_flamingo \
--out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
--vision_encoder_path ViT-L-14 \
--checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
--lm_path anas-awadalla/mpt-7b \
--lm_tokenizer_path anas-awadalla/mpt-7b \
--precision float16 \
--cross_attn_every_n_layers 4 \
--coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
--coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
--coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
--flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
--flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
--flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
--vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
--vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
--vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
--vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
--vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
--vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
--vqav2_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--vqav2_train_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
--vqav2_train_annotations_json_path /PATH/TO/vqav2/v2_mscoco_train2014_annotations.json \
--vqav2_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--vqav2_test_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
--vqav2_test_annotations_json_path /PATH/TO/vqav2/v2_mscoco_val2014_annotations.json \
--textvqa_image_dir_path /PATH/TO/textvqa/train_images \
--textvqa_train_questions_json_path /PATH/TO/textvqa/train_questions_vqa_format.json \
--textvqa_train_annotations_json_path /PATH/TO/textvqa/train_annotations_vqa_format.json \
--textvqa_test_questions_json_path /PATH/TO/textvqa/val_questions_vqa_format.json \
--textvqa_test_annotations_json_path /PATH/TO/textvqa/val_annotations_vqa_format.json \
--ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
--ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
--ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
--ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json

Configuration Options

Attack Types:

  • APGD attack: --attack apgd --eps <epsilon>
  • SAIF attack: --attack saif --eps <epsilon> --k <k_value>
  • No attack (clean): --attack none
  • Targeted attack (COCO only): --targeted --target_str "TARGET_STRING"

Shot Settings:

  • 0-shot: --shots 0
  • 4-shot: --shots 4
    • Query mode: --mask_out context
    • All mode: --mask_out none

Evaluation Tasks:

  • Image Captioning:
    • COCO: --eval_coco
    • Flickr30k: --eval_flickr30
  • Visual Question Answering:
    • VizWiz: --eval_vizwiz
    • OK-VQA: --eval_ok_vqa

Other Options:

  • Save adversarial samples as .pt files: remove --dont_save_adv
  • Generate perturbation factor graph (0-shot only): --pert_factor_graph 1
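For example, to rerun the evaluation command above with a different configuration, only the corresponding flags need to change (a sketch; --eps is assumed here to be on the 0-255 scale, as in the SAIF example above):

# APGD attack at eps = 4/255 instead of SAIF
--attack apgd --eps 4 --steps 100 --mask_out none

# Clean (no attack) baseline
--attack none

# 4-shot evaluation in query mode
--shots 4 --mask_out context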

Running the Scripts

# Make scripts executable
chmod +x ./bash/run_script.sh
chmod +x ./bash/run_script_slurm.sh

# Run locally or remotely
./bash/run_script.sh

# Run on SLURM cluster
sbatch ./bash/run_script_slurm.sh

Fine-tuning CLIP Models

Fine-tune CLIP models on adversarial examples (APGD) and COCO counterfactuals. Example command (available in bash/clip_train.sh and bash/clip_train_slurm.sh):

python vlm_eval/clip_train.py \
    --num_epochs 20 \
    --data_seeds 112 113 114 115 \
    --data_name base \
    --method APGD_4 \
    --batch_size 128 \
    --learning_rate 5e-7 \
    --save_model \
    --save_model_path ./fine_tuned_clip_models/APGD_4/

This fine-tunes CLIP for 20 epochs on the base dataset with APGD attack (ε=4/255).

Parameters

  • --data_name: Dataset size variant

    • MS_COCO: Standard MS COCO (see thesis appendix)
    • base: Base subset
    • medium: Medium subset
    • all: Complete dataset
  • --method: Training method

    • APGD_4: APGD with ε=4/255
    • APGD_1: APGD with ε=1/255
    • COCO_CF: COCO Counterfactuals
    • NONE: Clean MS COCO (no perturbations)
  • --data_seeds: Random seeds for dataset sampling (e.g., 112 113 114 115)
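For example, to fine-tune on COCO counterfactuals instead of APGD adversarial examples, change --method (the save path below is chosen by analogy and can be any directory):

python vlm_eval/clip_train.py \
    --num_epochs 20 \
    --data_seeds 112 113 114 115 \
    --data_name base \
    --method COCO_CF \
    --batch_size 128 \
    --learning_rate 5e-7 \
    --save_model \
    --save_model_path ./fine_tuned_clip_models/COCO_CF/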

Running the Scripts

# Make scripts executable
chmod +x ./bash/clip_train.sh
chmod +x ./bash/clip_train_slurm.sh

# Run locally or remotely
./bash/clip_train.sh

# Run on SLURM cluster
sbatch ./bash/clip_train_slurm.sh

Zero-Shot Image Classification

Evaluate fine-tuned CLIP models on image classification tasks. Example command (available in bash/clip_classification.sh and bash/clip_classification_slurm.sh):

python vlm_eval/clip_classification.py \
    --data base \
    --method COCO_CF \
    --dataset Caltech101

This performs zero-shot classification on Caltech101 using a CLIP model fine-tuned on the base COCO counterfactuals dataset.

Parameters

  • --data: Dataset variant

    • MS_COCO, base, medium, all: Fine-tuned models
    • non_fine_tuned: Pre-trained CLIP only (no fine-tuning)
  • --method: APGD_4, APGD_1, COCO_CF, NONE

  • --dataset: Classification dataset

    • Food101, CIFAR10, CIFAR100, ImageNet, Caltech101, Caltech256

Note: The evaluation expects models fine-tuned for 20 epochs; this value is hardcoded in the script.
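For example, to evaluate the pre-trained (non-fine-tuned) CLIP baseline on CIFAR10 (a sketch; whether --method has any effect in this case is determined by the script):

python vlm_eval/clip_classification.py \
    --data non_fine_tuned \
    --method NONE \
    --dataset CIFAR10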

Running the Scripts

chmod +x ./bash/clip_classification.sh
chmod +x ./bash/clip_classification_slurm.sh

# Run locally or remotely
./bash/clip_classification.sh

# Run on SLURM cluster
sbatch ./bash/clip_classification_slurm.sh

Image-Text Retrieval

Perform image-to-text (i2t) and text-to-image (t2i) retrieval tasks:

python -m vlm_eval.run_evaluation \
--eval_flickr30 \
--dont_save_adv \
--verbose \
--attack none --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
--pert_factor_graph 0 \
--itr 1 \
--itr_clip 0 \
--itr_dataset base \
--itr_method APGD_1 \
--vision_encoder_pretrained openai \
--num_samples 1000 \
--trial_seeds 42 \
--num_trials 1 \
--shots 0 \
--batch_size 1 \
--results_file res9B \
--model open_flamingo \
--out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
--vision_encoder_path ViT-L-14 \
--checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
--lm_path anas-awadalla/mpt-7b \
--lm_tokenizer_path anas-awadalla/mpt-7b \
--precision float16 \
--cross_attn_every_n_layers 4 \
--coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
--coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
--coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
--flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
--flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
--flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
--vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
--vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
--vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
--vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
--vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
--vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
--vqav2_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--vqav2_train_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
--vqav2_train_annotations_json_path /PATH/TO/vqav2/v2_mscoco_train2014_annotations.json \
--vqav2_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--vqav2_test_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
--vqav2_test_annotations_json_path /PATH/TO/vqav2/v2_mscoco_val2014_annotations.json \
--textvqa_image_dir_path /PATH/TO/textvqa/train_images \
--textvqa_train_questions_json_path /PATH/TO/textvqa/train_questions_vqa_format.json \
--textvqa_train_annotations_json_path /PATH/TO/textvqa/train_annotations_vqa_format.json \
--textvqa_test_questions_json_path /PATH/TO/textvqa/val_questions_vqa_format.json \
--textvqa_test_annotations_json_path /PATH/TO/textvqa/val_annotations_vqa_format.json \
--ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
--ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
--ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
--ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json

This evaluates i2t and t2i on the Flickr30k 1K test set (1000 samples) using a CLIP model fine-tuned on the base APGD dataset (ε=1/255).

Parameters

  • --itr_dataset: Dataset for fine-tuned CLIP model
    • MS_COCO, base, medium, all: Fine-tuned variants
    • non_fine_tuned: Pre-trained CLIP only

Note: Image-text retrieval does not support targeted attacks or 4-shot settings.
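For example, to run the same retrieval evaluation with the pre-trained (non-fine-tuned) CLIP model, swap only the dataset flag:

# replace in the command above:
#   --itr_dataset base
# with:
--itr_dataset non_fine_tuned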


License

Please refer to the original RobustVLM repository for licensing information.

Acknowledgments

This code is adapted from the RobustVLM repository. We thank the original authors for their foundational work.