
Robustness of Multi-Modal Foundational Models

Research code for evaluating the robustness of multi-modal foundational models (MMFMs) against adversarial attacks. This repository contains implementations for testing vision-language models like OpenFlamingo against sparse and non-sparse adversarial perturbations, as well as fine-tuning CLIP models on adversarial examples and COCO counterfactuals.

Code adapted from: RobustVLM

Table of Contents

  • Prerequisites
  • Installation
  • Dataset Setup
  • Usage
    • Sparse vs Non-Sparse Attacks Evaluation
    • Fine-tuning CLIP Models
    • Zero-Shot Image Classification
    • Image-Text Retrieval
  • License
  • Acknowledgments

Prerequisites

  • Python version: 3.11.x
  • Java: JDK 1.8.0_202 (required for CIDEr score computation)
  • CUDA-compatible GPU (for model training and inference)

Installation

  1. Clone the repository and navigate to the project directory:

    cd Robust_mmfm
    
  2. Install required Python packages:

    pip install -r requirements.txt
    
  3. Download the OpenFlamingo 9B model from HuggingFace. After downloading, it should be located in $HOME/.cache/huggingface/hub/ with the name models--openflamingo--OpenFlamingo-9B-vitl-mpt7b.
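     One way to do this (a sketch, assuming a recent huggingface_hub release that provides the download subcommand) is via the HuggingFace CLI, which writes to the default cache location:

    # Downloads into $HOME/.cache/huggingface/hub/ by default
    huggingface-cli download openflamingo/OpenFlamingo-9B-vitl-mpt7b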

  4. Install JDK 1.8.0_202 and add it to your PATH:

    # Add to ~/.bashrc or ~/.zshrc
    export PATH=$PATH:/path/to/jdk1.8.0_202/bin
    export LANG=en_US.UTF-8
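     After updating your PATH, you can confirm the JDK is picked up (it should report version 1.8.0_202):

    java -version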
    

Dataset Setup

VLM Evaluation Datasets

1. VizWiz Dataset

  • Download the VizWiz VQA dataset (train and validation sets)
  • Annotation files are included in the repository, but can be re-downloaded if corrupted
  • Place images in:
    • ./open_flamingo_datasets/VizWiz/train
    • ./open_flamingo_datasets/VizWiz/val
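  • For example, assuming the images arrive as train.zip and val.zip with the files at the archive root (names and structure may differ depending on the download source):

    unzip train.zip -d ./open_flamingo_datasets/VizWiz/train
    unzip val.zip -d ./open_flamingo_datasets/VizWiz/val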

2. OK-VQA Dataset

  • Download the OK-VQA dataset (training and testing images)
  • Annotation files are included in the repository
  • Place all images in: ./open_flamingo_datasets/OKVQA

3. Flickr30k Dataset

  • Download using instructions from awsaf49/flickr-dataset
  • Annotation files (karpathy_flickr30k.json, dataset_flickr30k_coco_style.json) are included
  • Alternative annotation download: TU Berlin Cloud
  • Place images in: ./open_flamingo_datasets/Flickr30k/Images

4. COCO Dataset (2014)

  • Download COCO 2014 train and validation sets
  • Annotation files are included in the repository
  • Alternative annotation downloads:
  • Place images in:
    • ./open_flamingo_datasets/COCO/train2014
    • ./open_flamingo_datasets/COCO/val2014
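  • For example, using the official COCO 2014 image archives (each zip already contains a train2014/ or val2014/ folder):

    wget http://images.cocodataset.org/zips/train2014.zip
    wget http://images.cocodataset.org/zips/val2014.zip
    unzip train2014.zip -d ./open_flamingo_datasets/COCO
    unzip val2014.zip -d ./open_flamingo_datasets/COCO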

CLIP Fine-tuning Datasets

1. COCO Counterfactuals (COCO-CFs)

  • Download images.zip from HuggingFace COCO-Counterfactuals
  • Unzip and place images in:
    • ./open_flamingo_datasets/COCO_CF/images
    • ./clip_train_datasets/MS_COCO_COCO_CF/images
  • Copy original images (ending with _0.jpg) to:
    cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_4/images
    cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_1/images
    

2. APGD Adversarial Images

  • Download from TU Berlin Cloud:
    • apgd_1_images.zip → unzip in ./clip_train_datasets/MS_COCO_APGD_1/images
    • apgd_4_images.zip → unzip in ./clip_train_datasets/MS_COCO_APGD_4/images
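  • For example (assuming the archives contain the image files at the archive root):

    unzip apgd_1_images.zip -d ./clip_train_datasets/MS_COCO_APGD_1/images
    unzip apgd_4_images.zip -d ./clip_train_datasets/MS_COCO_APGD_4/images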

3. COCO 2017 Validation Set

  • Download from COCO website
  • Copy images to all CLIP training dataset folders:
    • ./clip_train_datasets/MS_COCO/images
    • ./clip_train_datasets/MS_COCO_APGD_4/images
    • ./clip_train_datasets/MS_COCO_APGD_1/images
    • ./clip_train_datasets/MS_COCO_COCO_CF/images
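  • For example, downloading the official val2017 archive and copying the extracted images into each folder:

    wget http://images.cocodataset.org/zips/val2017.zip
    unzip val2017.zip
    for d in MS_COCO MS_COCO_APGD_4 MS_COCO_APGD_1 MS_COCO_COCO_CF; do
        cp val2017/*.jpg ./clip_train_datasets/$d/images/
    done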

4. COCO Captions and Classification Datasets

  • Download ms_coco_captions.json from TU Berlin Cloud
  • Place in: ./clip_train_datasets/MS_COCO
  • Download classification datasets from TU Berlin Cloud:
    • Caltech101.zip → unzip in ./image_classification_datasets
    • Caltech256.zip → unzip in ./image_classification_datasets
  • For ImageNet: Download externally and set path in vlm_eval/clip_classification.py line 52
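  • For example, placing the captions file and unpacking the classification archives (a sketch):

    mv ms_coco_captions.json ./clip_train_datasets/MS_COCO/
    unzip Caltech101.zip -d ./image_classification_datasets
    unzip Caltech256.zip -d ./image_classification_datasets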

Usage

Sparse vs Non-Sparse Attacks Evaluation

Evaluate vision-language models against adversarial attacks. The following command demonstrates the evaluation setup (available in bash/run_script.sh and bash/run_script_slurm.sh):

python -m vlm_eval.run_evaluation \
--eval_flickr30 \
--dont_save_adv \
--verbose \
--attack saif --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
--pert_factor_graph 0 \
--itr 0 \
--itr_clip 0 \
--itr_dataset base \
--itr_method APGD_1 \
--vision_encoder_pretrained openai \
--num_samples 8 \
--trial_seeds 42 \
--num_trials 1 \
--shots 0 \
--batch_size 1 \
--results_file res9B \
--model open_flamingo \
--out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
--vision_encoder_path ViT-L-14 \
--checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
--lm_path anas-awadalla/mpt-7b \
--lm_tokenizer_path anas-awadalla/mpt-7b \
--precision float16 \
--cross_attn_every_n_layers 4 \
--coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
--coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
--coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
--flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
--flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
--flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
--vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
--vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
--vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
--vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
--vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
--vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
--vqav2_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--vqav2_train_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
--vqav2_train_annotations_json_path /PATH/TO/vqav2/v2_mscoco_train2014_annotations.json \
--vqav2_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--vqav2_test_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
--vqav2_test_annotations_json_path /PATH/TO/vqav2/v2_mscoco_val2014_annotations.json \
--textvqa_image_dir_path /PATH/TO/textvqa/train_images \
--textvqa_train_questions_json_path /PATH/TO/textvqa/train_questions_vqa_format.json \
--textvqa_train_annotations_json_path /PATH/TO/textvqa/train_annotations_vqa_format.json \
--textvqa_test_questions_json_path /PATH/TO/textvqa/val_questions_vqa_format.json \
--textvqa_test_annotations_json_path /PATH/TO/textvqa/val_annotations_vqa_format.json \
--ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
--ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
--ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
--ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json

Configuration Options

Attack Types:

  • APGD attack: --attack apgd --eps <epsilon>
  • SAIF attack: --attack saif --eps <epsilon> --k <k_value>
  • No attack (clean): --attack none
  • Targeted attack (COCO only): --targeted --target_str "TARGET_STRING"

Shot Settings:

  • 0-shot: --shots 0
  • 4-shot: --shots 4
    • Query mode: --mask_out context
    • All mode: --mask_out none

Evaluation Tasks:

  • Image Captioning:
    • COCO: --eval_coco
    • Flickr30k: --eval_flickr30
  • Visual Question Answering:
    • VizWiz: --eval_vizwiz
    • OK-VQA: --eval_ok_vqa

Other Options:

  • Save adversarial samples as .pt files: remove --dont_save_adv
  • Generate perturbation factor graph (0-shot only): --pert_factor_graph 1
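For example, to rerun the evaluation command above with a different configuration, only the corresponding flags need to change (a sketch; --eps is assumed here to be on the 0-255 scale, as in the SAIF example above):

# APGD attack at eps = 4/255 instead of SAIF
--attack apgd --eps 4 --steps 100 --mask_out none

# Clean (no attack) baseline
--attack none

# 4-shot evaluation in query mode
--shots 4 --mask_out context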

Running the Scripts

# Make scripts executable
chmod +x ./bash/run_script.sh
chmod +x ./bash/run_script_slurm.sh

# Run locally or remotely
./bash/run_script.sh

# Run on SLURM cluster
sbatch ./bash/run_script_slurm.sh

Fine-tuning CLIP Models

Fine-tune CLIP models on adversarial examples (APGD) and COCO counterfactuals. Example command (available in bash/clip_train.sh and bash/clip_train_slurm.sh):

python vlm_eval/clip_train.py \
    --num_epochs 20 \
    --data_seeds 112 113 114 115 \
    --data_name base \
    --method APGD_4 \
    --batch_size 128 \
    --learning_rate 5e-7 \
    --save_model \
    --save_model_path ./fine_tuned_clip_models/APGD_4/

This fine-tunes CLIP for 20 epochs on the base dataset with APGD attack (ε=4/255).

Parameters

  • --data_name: Dataset size variant

    • MS_COCO: Standard MS COCO (see thesis appendix)
    • base: Base subset
    • medium: Medium subset
    • all: Complete dataset
  • --method: Training method

    • APGD_4: APGD with ε=4/255
    • APGD_1: APGD with ε=1/255
    • COCO_CF: COCO Counterfactuals
    • NONE: Clean MS COCO (no perturbations)
  • --data_seeds: Random seeds for dataset sampling (e.g., 112 113 114 115)
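For example, to fine-tune on COCO counterfactuals instead of APGD adversarial examples, change --method (the save path below is chosen by analogy and can be any directory):

python vlm_eval/clip_train.py \
    --num_epochs 20 \
    --data_seeds 112 113 114 115 \
    --data_name base \
    --method COCO_CF \
    --batch_size 128 \
    --learning_rate 5e-7 \
    --save_model \
    --save_model_path ./fine_tuned_clip_models/COCO_CF/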

Running the Scripts

# Make scripts executable
chmod +x ./bash/clip_train.sh
chmod +x ./bash/clip_train_slurm.sh

# Run locally or remotely
./bash/clip_train.sh

# Run on SLURM cluster
sbatch ./bash/clip_train_slurm.sh

Zero-Shot Image Classification

Evaluate fine-tuned CLIP models on image classification tasks. Example command (available in bash/clip_classification.sh and bash/clip_classification_slurm.sh):

python vlm_eval/clip_classification.py \
    --data base \
    --method COCO_CF \
    --dataset Caltech101

This performs zero-shot classification on Caltech101 using a CLIP model fine-tuned on the base COCO counterfactuals dataset.

Parameters

  • --data: Dataset variant

    • MS_COCO, base, medium, all: Fine-tuned models
    • non_fine_tuned: Pre-trained CLIP only (no fine-tuning)
  • --method: APGD_4, APGD_1, COCO_CF, NONE

  • --dataset: Classification dataset

    • Food101, CIFAR10, CIFAR100, ImageNet, Caltech101, Caltech256

Note: The evaluation expects models fine-tuned for 20 epochs; this value is hardcoded in the script.
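For example, to evaluate the pre-trained (non-fine-tuned) CLIP baseline on CIFAR10 (a sketch; whether --method has any effect in this case is determined by the script):

python vlm_eval/clip_classification.py \
    --data non_fine_tuned \
    --method NONE \
    --dataset CIFAR10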

Running the Scripts

chmod +x ./bash/clip_classification.sh
chmod +x ./bash/clip_classification_slurm.sh

# Run locally or remotely
./bash/clip_classification.sh

# Run on SLURM cluster
sbatch ./bash/clip_classification_slurm.sh

Image-Text Retrieval

Perform image-to-text (i2t) and text-to-image (t2i) retrieval tasks:

python -m vlm_eval.run_evaluation \
--eval_flickr30 \
--dont_save_adv \
--verbose \
--attack none --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
--pert_factor_graph 0 \
--itr 1 \
--itr_clip 0 \
--itr_dataset base \
--itr_method APGD_1 \
--vision_encoder_pretrained openai \
--num_samples 1000 \
--trial_seeds 42 \
--num_trials 1 \
--shots 0 \
--batch_size 1 \
--results_file res9B \
--model open_flamingo \
--out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
--vision_encoder_path ViT-L-14 \
--checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
--lm_path anas-awadalla/mpt-7b \
--lm_tokenizer_path anas-awadalla/mpt-7b \
--precision float16 \
--cross_attn_every_n_layers 4 \
--coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
--coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
--coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
--flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
--flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
--flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
--vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
--vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
--vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
--vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
--vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
--vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
--vqav2_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--vqav2_train_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
--vqav2_train_annotations_json_path /PATH/TO/vqav2/v2_mscoco_train2014_annotations.json \
--vqav2_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--vqav2_test_questions_json_path /PATH/TO/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
--vqav2_test_annotations_json_path /PATH/TO/vqav2/v2_mscoco_val2014_annotations.json \
--textvqa_image_dir_path /PATH/TO/textvqa/train_images \
--textvqa_train_questions_json_path /PATH/TO/textvqa/train_questions_vqa_format.json \
--textvqa_train_annotations_json_path /PATH/TO/textvqa/train_annotations_vqa_format.json \
--textvqa_test_questions_json_path /PATH/TO/textvqa/val_questions_vqa_format.json \
--textvqa_test_annotations_json_path /PATH/TO/textvqa/val_annotations_vqa_format.json \
--ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
--ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
--ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
--ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json

This evaluates i2t and t2i on the Flickr30k 1K test set (1000 samples) using a CLIP model fine-tuned on the base APGD dataset (ε=1/255).

Parameters

  • --itr_dataset: Dataset for fine-tuned CLIP model
    • MS_COCO, base, medium, all: Fine-tuned variants
    • non_fine_tuned: Pre-trained CLIP only

Note: Image-text retrieval does not support targeted attacks or 4-shot settings.
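For example, to run the same retrieval evaluation with the pre-trained (non-fine-tuned) CLIP model, swap only the dataset flag:

# replace in the command above:
#   --itr_dataset base
# with:
--itr_dataset non_fine_tuned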


License

Please refer to the original RobustVLM repository for licensing information.

Acknowledgments

This code is adapted from the RobustVLM repository. We thank the original authors for their foundational work.