KC123hello committed on
Commit f0d69f7 · verified · 1 parent: 9634425

Upload 3 files

Files changed (4)
  1. .gitattributes +1 -0
  2. MastersThesis_475703.pdf +3 -0
  3. README.md +405 -10
  4. requirements.txt +164 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+MastersThesis_475703.pdf filter=lfs diff=lfs merge=lfs -text
MastersThesis_475703.pdf ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fcecb1b0a417a603e5848ccabbd8f5b65a27a0a3f2a0ff6c9969e1a99ba3c394
+size 7791434
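The PDF is stored via Git LFS, so the repository holds only the small pointer file shown above, not the raw bytes. For illustration, such a pointer can be parsed with a few lines of Python (a sketch; the `version`/`oid`/`size` field names follow the LFS pointer shown here):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")  # each line is "key value"
        fields[key] = value
    return fields

pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:fcecb1b0a417a603e5848ccabbd8f5b65a27a0a3f2a0ff6c9969e1a99ba3c394\n"
    "size 7791434\n"
)
info = parse_lfs_pointer(pointer)
print(info["oid"], info["size"])
```

The `size` field (7 791 434 bytes ≈ 7.4 MB) is what the diff summary `+3 -0` refers to: three pointer lines, not the document itself.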
README.md CHANGED
@@ -1,13 +1,408 @@
 ---
-title: RobustMMFM
-emoji: 🚀
-colorFrom: blue
-colorTo: gray
-sdk: gradio
-sdk_version: 6.0.2
-app_file: app.py
-pinned: false
-short_description: 'Interactive demo for the robustness evaluation of MMFM '
 ---

-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Robustness of Multi-Modal Foundational Models

Research code for evaluating the robustness of multi-modal foundational models (MMFMs) against adversarial attacks. This repository contains implementations for testing vision-language models such as OpenFlamingo against sparse and non-sparse adversarial perturbations, as well as for fine-tuning CLIP models on adversarial examples and COCO counterfactuals.

**Code adapted from:** [RobustVLM](https://github.com/chs20/RobustVLM)
## Table of Contents

- [Robustness of Multi-Modal Foundational Models](#robustness-of-multi-modal-foundational-models)
  - [Table of Contents](#table-of-contents)
  - [Prerequisites](#prerequisites)
  - [Installation](#installation)
  - [Dataset Setup](#dataset-setup)
    - [VLM Evaluation Datasets](#vlm-evaluation-datasets)
      - [1. VizWiz Dataset](#1-vizwiz-dataset)
      - [2. OK-VQA Dataset](#2-ok-vqa-dataset)
      - [3. Flickr30k Dataset](#3-flickr30k-dataset)
      - [4. COCO Dataset (2014)](#4-coco-dataset-2014)
    - [CLIP Fine-tuning Datasets](#clip-fine-tuning-datasets)
      - [1. COCO Counterfactuals (COCO-CFs)](#1-coco-counterfactuals-coco-cfs)
      - [2. APGD Adversarial Images](#2-apgd-adversarial-images)
      - [3. COCO 2017 Validation Set](#3-coco-2017-validation-set)
      - [4. COCO Captions and Classification Datasets](#4-coco-captions-and-classification-datasets)
  - [Usage](#usage)
    - [Sparse vs Non-Sparse Attacks Evaluation](#sparse-vs-non-sparse-attacks-evaluation)
      - [Configuration Options](#configuration-options)
      - [Running the Scripts](#running-the-scripts)
    - [Fine-tuning CLIP Models](#fine-tuning-clip-models)
      - [Parameters](#parameters)
      - [Running the Scripts](#running-the-scripts-1)
    - [Zero-Shot Image Classification](#zero-shot-image-classification)
      - [Parameters](#parameters-1)
      - [Running the Scripts](#running-the-scripts-2)
    - [Image-Text Retrieval](#image-text-retrieval)
      - [Parameters](#parameters-2)
  - [License](#license)
  - [Acknowledgments](#acknowledgments)
## Prerequisites

- **Python version:** 3.11.x
- **Java:** JDK 1.8.0_202 (required for CIDEr score computation)
- **CUDA-compatible GPU** (for model training and inference)

## Installation

1. Clone the repository and navigate to the project directory:
   ```bash
   cd Robust_mmfm
   ```

2. Install the required Python packages:
   ```bash
   pip install -r requirements.txt
   ```

3. Download the OpenFlamingo 9B model from [HuggingFace](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b). After downloading, it should be located in `$HOME/.cache/huggingface/hub/` under the name `models--openflamingo--OpenFlamingo-9B-vitl-mpt7b`.

4. Install [JDK 1.8.0_202](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx) and add it to your PATH:
   ```bash
   # Add to ~/.bashrc or ~/.zshrc
   export PATH=$PATH:/path/to/jdk1.8.0_202/bin
   export LANG=en_US.UTF-8
   ```
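To sanity-check step 3, the expected cache location can be probed with a short Python snippet (a sketch; the `models--{org}--{name}/snapshots/...` layout is the standard Hugging Face hub cache convention referenced above):

```python
from pathlib import Path
from typing import Optional

def find_checkpoint(cache_root: Path) -> Optional[Path]:
    """Return the OpenFlamingo snapshot directory under a HF hub cache, if present."""
    model_dir = cache_root / "models--openflamingo--OpenFlamingo-9B-vitl-mpt7b"
    # Each downloaded revision lives under snapshots/<commit-hash>/
    snapshots = sorted((model_dir / "snapshots").glob("*")) if model_dir.is_dir() else []
    return snapshots[0] if snapshots else None

hub_cache = Path.home() / ".cache" / "huggingface" / "hub"
print(find_checkpoint(hub_cache) or "model not downloaded yet")
```

The printed snapshot path is what the `--checkpoint_path` flag in the evaluation commands below expects (with `checkpoint.pt` appended).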
## Dataset Setup

### VLM Evaluation Datasets

#### 1. VizWiz Dataset
- Download the [VizWiz VQA dataset](https://vizwiz.org/tasks-and-datasets/vqa/) (train and validation sets)
- Annotation files are included in the repository, but can be re-downloaded if corrupted
- Place images in:
  - `./open_flamingo_datasets/VizWiz/train`
  - `./open_flamingo_datasets/VizWiz/val`

#### 2. OK-VQA Dataset
- Download the [OK-VQA dataset](https://okvqa.allenai.org/download.html) (training and testing images)
- Annotation files are included in the repository
- Place all images in: `./open_flamingo_datasets/OKVQA`

#### 3. Flickr30k Dataset
- Download using the instructions from [awsaf49/flickr-dataset](https://github.com/awsaf49/flickr-dataset)
- Annotation files (`karpathy_flickr30k.json`, `dataset_flickr30k_coco_style.json`) are included
- Alternative annotation download: [Tübingen ML Cloud](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
- Place images in: `./open_flamingo_datasets/Flickr30k/Images`

#### 4. COCO Dataset (2014)
- Download the [COCO 2014](https://cocodataset.org/#download) train and validation sets
- Annotation files are included in the repository
- Alternative annotation downloads:
  - [karpathy_coco.json](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
  - [captions_val2014.json](https://github.com/tylin/coco-caption/blob/master/annotations/captions_val2014.json)
- Place images in:
  - `./open_flamingo_datasets/COCO/train2014`
  - `./open_flamingo_datasets/COCO/val2014`
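The expected layout for these evaluation datasets can be created up front, so each download is unpacked straight into place. The paths are copied from the list above; `ROOT` is a placeholder for the repository root:

```shell
ROOT="${ROOT:-.}"  # repository root; override before running if needed
mkdir -p \
  "$ROOT/open_flamingo_datasets/VizWiz/train" \
  "$ROOT/open_flamingo_datasets/VizWiz/val" \
  "$ROOT/open_flamingo_datasets/OKVQA" \
  "$ROOT/open_flamingo_datasets/Flickr30k/Images" \
  "$ROOT/open_flamingo_datasets/COCO/train2014" \
  "$ROOT/open_flamingo_datasets/COCO/val2014"
```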
### CLIP Fine-tuning Datasets

#### 1. COCO Counterfactuals (COCO-CFs)
- Download `images.zip` from [HuggingFace COCO-Counterfactuals](https://huggingface.co/datasets/Intel/COCO-Counterfactuals/tree/main/data)
- Unzip and place the images in:
  - `./open_flamingo_datasets/COCO_CF/images`
  - `./clip_train_datasets/MS_COCO_COCO_CF/images`
- Copy the original images (ending with `_0.jpg`) to the APGD dataset folders:
  ```bash
  cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_4/images
  cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_1/images
  ```

#### 2. APGD Adversarial Images
- Download from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
  - `apgd_1_images.zip` → `./clip_train_datasets/MS_COCO_APGD_1/images`
  - `apgd_4_images.zip` → `./clip_train_datasets/MS_COCO_APGD_4/images`

#### 3. COCO 2017 Validation Set
- Download from the [COCO website](https://cocodataset.org/#download)
- Copy the images into all CLIP training dataset folders:
  - `./clip_train_datasets/MS_COCO/images`
  - `./clip_train_datasets/MS_COCO_APGD_4/images`
  - `./clip_train_datasets/MS_COCO_APGD_1/images`
  - `./clip_train_datasets/MS_COCO_COCO_CF/images`

#### 4. COCO Captions and Classification Datasets
- Download `ms_coco_captions.json` from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx)
- Place it in: `./clip_train_datasets/MS_COCO`
- Download the classification datasets from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
  - `Caltech101.zip` → unzip in `./image_classification_datasets`
  - `Caltech256.zip` → unzip in `./image_classification_datasets`
- For ImageNet: download it externally and set the path in `vlm_eval/clip_classification.py`, line 52

---
## Usage

### Sparse vs Non-Sparse Attacks Evaluation

Evaluate vision-language models against adversarial attacks. The following command demonstrates the evaluation setup (available in `bash/run_script.sh` and `bash/run_script_slurm.sh`):

```bash
python -m vlm_eval.run_evaluation \
    --eval_flickr30 \
    --dont_save_adv \
    --verbose \
    --attack saif --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
    --pert_factor_graph 0 \
    --itr 0 \
    --itr_clip 0 \
    --itr_dataset base \
    --itr_method APGD_1 \
    --vision_encoder_pretrained openai \
    --num_samples 8 \
    --trial_seeds 42 \
    --num_trials 1 \
    --shots 0 \
    --batch_size 1 \
    --results_file res9B \
    --model open_flamingo \
    --out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
    --vision_encoder_path ViT-L-14 \
    --checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
    --lm_path anas-awadalla/mpt-7b \
    --lm_tokenizer_path anas-awadalla/mpt-7b \
    --precision float16 \
    --cross_attn_every_n_layers 4 \
    --coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
    --coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
    --coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
    --flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
    --flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
    --flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
    --vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
    --vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
    --vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
    --vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
    --vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
    --vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
    --vqav2_train_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/train2014 \
    --vqav2_train_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
    --vqav2_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_train2014_annotations.json \
    --vqav2_test_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/val2014 \
    --vqav2_test_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
    --vqav2_test_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_val2014_annotations.json \
    --textvqa_image_dir_path /mnt/datasets/textvqa/train_images \
    --textvqa_train_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_questions_vqa_format.json \
    --textvqa_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_annotations_vqa_format.json \
    --textvqa_test_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/val_questions_vqa_format.json \
    --textvqa_test_annotations_json_path /home/htc/kchitranshi/RobustVLM/textvqa/val_annotations_vqa_format.json \
    --ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
    --ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
    --ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
    --ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```

#### Configuration Options

**Attack Types:**
- APGD attack: `--attack apgd --eps <epsilon>`
- SAIF attack: `--attack saif --eps <epsilon> --k <k_value>`
- No attack (clean evaluation): `--attack none`
- Targeted attack (COCO only): `--targeted --target_str "TARGET_STRING"`

**Shot Settings:**
- 0-shot: `--shots 0`
- 4-shot: `--shots 4`
- Query mode: `--mask_out context`
- All mode: `--mask_out none`

**Evaluation Tasks:**
- Image captioning:
  - COCO: `--eval_coco`
  - Flickr30k: `--eval_flickr30`
- Visual question answering:
  - VizWiz: `--eval_vizwiz`
  - OK-VQA: `--eval_ok_vqa`

**Other Options:**
- Save adversarial samples as `.pt` files: remove `--dont_save_adv`
- Generate the perturbation factor graph (0-shot only): `--pert_factor_graph 1`
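For intuition, the two threat models behind these flags differ in where the perturbation budget is spent: APGD-style attacks perturb every pixel within an L∞ ball, while SAIF-style attacks perturb only a small set of pixels. A toy NumPy sketch of that distinction (illustrative only, not the repository's APGD/SAIF implementations; `eps` and `k` play the same roles as `--eps` and `--k`):

```python
import numpy as np

def linf_step(x, grad, eps):
    """Non-sparse: one signed-gradient ascent step, bounded by eps per pixel."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

def sparse_step(x, grad, eps, k):
    """Sparse: perturb only the k pixels with the largest gradient magnitude."""
    flat = np.abs(grad).ravel()
    mask = np.zeros_like(flat)
    mask[np.argsort(flat)[-k:]] = 1.0  # keep the top-k coordinates
    delta = eps * np.sign(grad) * mask.reshape(grad.shape)
    return np.clip(x + delta, 0.0, 1.0)

rng = np.random.default_rng(0)
x = rng.random((8, 8))                 # toy "image" in [0, 1]
g = rng.standard_normal((8, 8))        # stand-in for a loss gradient
x_dense = linf_step(x, g, eps=4 / 255)
x_sparse = sparse_step(x, g, eps=16 / 255, k=5)
```

In the dense case every pixel moves by at most ε; in the sparse case at most `k` pixels move at all, which is why the SAIF configuration above pairs a large `--eps` with a `--k` sparsity budget.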
#### Running the Scripts

```bash
# Make the scripts executable
chmod +x ./bash/run_script.sh
chmod +x ./bash/run_script_slurm.sh

# Run locally
./bash/run_script.sh

# Run on a SLURM cluster
sbatch ./bash/run_script_slurm.sh
```
### Fine-tuning CLIP Models

Fine-tune CLIP models on adversarial examples (APGD) and COCO counterfactuals. Example command (available in `bash/train_clip.sh` and `bash/train_clip_slurm.sh`):

```bash
python vlm_eval/clip_train.py \
    --num_epochs 20 \
    --data_seeds 112 113 114 115 \
    --data_name base \
    --method APGD_4 \
    --batch_size 128 \
    --learning_rate 5e-7 \
    --save_model \
    --save_model_path ./fine_tuned_clip_models/APGD_4/
```

This fine-tunes CLIP for 20 epochs on the `base` dataset with the APGD attack (ε = 4/255).

#### Parameters

- `--data_name`: Dataset size variant
  - `MS_COCO`: Standard MS COCO (see the thesis appendix)
  - `base`: Base subset
  - `medium`: Medium subset
  - `all`: Complete dataset

- `--method`: Training method
  - `APGD_4`: APGD with ε = 4/255
  - `APGD_1`: APGD with ε = 1/255
  - `COCO_CF`: COCO counterfactuals
  - `NONE`: Clean MS COCO (no perturbations)

- `--data_seeds`: Random seeds for dataset sampling (e.g., `112 113 114 115`)
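The objective behind this fine-tuning is CLIP's symmetric contrastive (InfoNCE) loss over image–text pairs: matching pairs sit on the diagonal of the similarity matrix and are pushed above all mismatched pairs in both directions. A minimal NumPy rendering of that loss (an illustration of the objective, not the `clip_train.py` implementation; the temperature value is an assumption):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each matrix is one image/text pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) cosine similarities
    n = len(logits)

    def xent(l):
        # cross-entropy with the matching pair (the diagonal) as the label
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # average of image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Fine-tuning on APGD images or COCO counterfactuals changes which image tensors feed this loss, not the loss itself.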
#### Running the Scripts

```bash
# Make the scripts executable
chmod +x ./bash/clip_train.sh
chmod +x ./bash/clip_train_slurm.sh

# Run locally
./bash/clip_train.sh

# Run on a SLURM cluster
sbatch ./bash/clip_train_slurm.sh
```
### Zero-Shot Image Classification

Evaluate fine-tuned CLIP models on image classification tasks. Example command (available in `bash/clip_classification.sh` and `bash/clip_classification_slurm.sh`):

```bash
python vlm_eval/clip_classification.py \
    --data base \
    --method COCO_CF \
    --dataset Caltech101
```

This performs zero-shot classification on Caltech101 using a CLIP model fine-tuned on the `base` COCO counterfactuals dataset.

#### Parameters

- `--data`: Dataset variant
  - `MS_COCO`, `base`, `medium`, `all`: Fine-tuned models
  - `non_fine_tuned`: Pre-trained CLIP only (no fine-tuning)

- `--method`: `APGD_4`, `APGD_1`, `COCO_CF`, `NONE`

- `--dataset`: Classification dataset
  - `Food101`, `CIFAR10`, `CIFAR100`, `ImageNet`, `Caltech101`, `Caltech256`

**Note:** Evaluation is hardcoded to 20 epochs.
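Zero-shot classification with CLIP reduces to cosine similarity between an image embedding and one text-prompt embedding per class: the predicted class is the nearest text prototype. A toy NumPy sketch of that decision rule (the actual model, prompts, and dataset handling live in `vlm_eval/clip_classification.py`):

```python
import numpy as np

def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is most similar."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))  # cosine similarity, since both are unit-norm

# Hypothetical 2-D embeddings: the image aligns with class 1's text prototype.
image = np.array([1.0, 0.0])
prototypes = np.array([[0.0, 1.0],
                       [1.0, 0.1]])
print(zero_shot_predict(image, prototypes))
```

No classifier head is trained: swapping in a different fine-tuned CLIP checkpoint (the `--data`/`--method` combinations above) only changes the embeddings.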
#### Running the Scripts

```bash
# Make the scripts executable
chmod +x ./bash/clip_classification.sh
chmod +x ./bash/clip_classification_slurm.sh

# Run locally
./bash/clip_classification.sh

# Run on a SLURM cluster
sbatch ./bash/clip_classification_slurm.sh
```
---

### Image-Text Retrieval

Perform image-to-text (i2t) and text-to-image (t2i) retrieval tasks:

```bash
python -m vlm_eval.run_evaluation \
    --eval_flickr30 \
    --dont_save_adv \
    --verbose \
    --attack none --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
    --pert_factor_graph 0 \
    --itr 1 \
    --itr_clip 0 \
    --itr_dataset base \
    --itr_method APGD_1 \
    --vision_encoder_pretrained openai \
    --num_samples 1000 \
    --trial_seeds 42 \
    --num_trials 1 \
    --shots 0 \
    --batch_size 1 \
    --results_file res9B \
    --model open_flamingo \
    --out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
    --vision_encoder_path ViT-L-14 \
    --checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
    --lm_path anas-awadalla/mpt-7b \
    --lm_tokenizer_path anas-awadalla/mpt-7b \
    --precision float16 \
    --cross_attn_every_n_layers 4 \
    --coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
    --coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
    --coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
    --flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
    --flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
    --flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
    --vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
    --vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
    --vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
    --vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
    --vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
    --vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
    --vqav2_train_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/train2014 \
    --vqav2_train_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
    --vqav2_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_train2014_annotations.json \
    --vqav2_test_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/val2014 \
    --vqav2_test_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
    --vqav2_test_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_val2014_annotations.json \
    --textvqa_image_dir_path /mnt/datasets/textvqa/train_images \
    --textvqa_train_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_questions_vqa_format.json \
    --textvqa_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_annotations_vqa_format.json \
    --textvqa_test_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/val_questions_vqa_format.json \
    --textvqa_test_annotations_json_path /home/htc/kchitranshi/RobustVLM/textvqa/val_annotations_vqa_format.json \
    --ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
    --ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
    --ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
    --ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```

This evaluates i2t and t2i retrieval on the Flickr30k 1K test set (1000 samples) using a CLIP model fine-tuned on the `base` APGD dataset (ε = 1/255).

#### Parameters

- `--itr_dataset`: Dataset for the fine-tuned CLIP model
  - `MS_COCO`, `base`, `medium`, `all`: Fine-tuned variants
  - `non_fine_tuned`: Pre-trained CLIP only

**Note:** Image-text retrieval does not support targeted attacks or 4-shot settings.
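Retrieval quality in this setting is typically reported as Recall@k over the query–gallery similarity matrix: a query counts as a hit if its ground-truth match appears among its top-k ranked items. A small NumPy sketch of the metric (illustrative; the flags above control what the repository actually computes):

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j] = similarity of query i to gallery item j; truth is j == i."""
    ranks = np.argsort(-sim, axis=1)  # indices sorted best-match first
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

# i2t scores images against texts; t2i simply uses the transposed matrix.
```

For the Flickr30k 1K split, `sim` would be the 1000 × 1000 matrix of CLIP image/text similarities.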
---

## License

Please refer to the original [RobustVLM repository](https://github.com/chs20/RobustVLM) for licensing information.

## Acknowledgments

This code is adapted from the [RobustVLM](https://github.com/chs20/RobustVLM) repository. We thank the original authors for their foundational work.
requirements.txt ADDED
@@ -0,0 +1,164 @@
+accelerate==0.24.0
+aiofiles==22.1.0
+aiohttp==3.8.4
+aiosignal==1.3.1
+aiosqlite==0.19.0
+anyio==3.6.2
+appdirs==1.4.4
+argon2-cffi==21.3.0
+argon2-cffi-bindings==21.2.0
+arrow==1.2.3
+asttokens==2.2.1
+async-timeout==4.0.2
+attrs==23.1.0
+Babel==2.12.1
+backcall==0.2.0
+beautifulsoup4==4.12.2
+bleach==6.0.0
+braceexpand==0.1.7
+certifi==2023.5.7
+cffi==1.15.1
+chardet==4.0.0
+charset-normalizer==3.1.0
+click==8.1.3
+cmake==3.26.3
+comm==0.1.3
+contourpy==1.0.7
+cycler==0.11.0
+datasets==2.12.0
+debugpy==1.6.7
+decorator==5.1.1
+defusedxml==0.7.1
+dill==0.3.6
+docker-pycreds==0.4.0
+einops==0.6.1
+einops-exts==0.0.4
+executing==1.2.0
+fastjsonschema==2.16.3
+filelock==3.12.0
+fonttools==4.39.3
+fqdn==1.5.1
+frozenlist==1.3.3
+fsspec==2023.5.0
+ftfy==6.1.1
+geotorch==0.3.0
+gitdb==4.0.10
+GitPython==3.1.31
+huggingface-hub==0.14.1
+idna==2.10
+inflection==0.5.1
+ipykernel==6.23.0
+ipython==8.13.2
+ipython-genutils==0.2.0
+isoduration==20.11.0
+jedi==0.18.2
+Jinja2==3.1.2
+joblib==1.2.0
+json5==0.9.11
+jsonpointer==2.3
+jsonschema==4.17.3
+kiwisolver==1.4.4
+lit==16.0.3
+MarkupSafe==2.1.2
+matplotlib==3.7.1
+matplotlib-inline==0.1.6
+mistune==2.0.5
+more-itertools==9.1.0
+mpmath==1.3.0
+multidict==6.0.4
+multiprocess==0.70.14
+nbclassic==1.0.0
+nbclient==0.7.4
+nbconvert==7.4.0
+nbformat==5.8.0
+nest-asyncio==1.5.6
+networkx==3.1
+nltk==3.8.1
+notebook==6.5.4
+notebook_shim==0.2.3
+numpy==1.24.2
+nvidia-cublas-cu11==11.10.3.66
+nvidia-cuda-cupti-cu11==11.7.101
+nvidia-cuda-nvrtc-cu11==11.7.99
+nvidia-cuda-runtime-cu11==11.7.99
+nvidia-cudnn-cu11==8.5.0.96
+nvidia-cufft-cu11==10.9.0.58
+nvidia-curand-cu11==10.2.10.91
+nvidia-cusolver-cu11==11.4.0.1
+nvidia-cusparse-cu11==11.7.4.91
+nvidia-nccl-cu11==2.14.3
+nvidia-nvtx-cu11==11.7.91
+open-clip-torch==2.19.0
+overrides==7.4.0
+packaging==23.1
+pandas==1.3.5
+pandocfilters==1.5.0
+parso==0.8.3
+pathtools==0.1.2
+pexpect==4.8.0
+pickleshare==0.7.5
+Pillow==9.5.0
+platformdirs==3.5.0
+prometheus-client==0.16.0
+prompt-toolkit==3.0.38
+protobuf==3.20.3
+psutil==5.9.5
+ptyprocess==0.7.0
+pure-eval==0.2.2
+pyarrow==12.0.0
+pycocoevalcap==1.2
+pycocotools==2.0.6
+pycparser==2.21
+Pygments==2.15.1
+pyparsing==3.0.9
+pyrsistent==0.19.3
+python-dateutil==2.8.2
+python-json-logger==2.0.7
+pytz==2023.3
+PyYAML==6.0
+pyzmq==25.0.2
+regex==2023.5.5
+requests==2.25.1
+responses==0.18.0
+rfc3339-validator==0.1.4
+rfc3986-validator==0.1.1
+robustbench @ git+https://github.com/RobustBench/robustbench.git@e67e4225facde47be6a41ed78b576076e8b90cc5
+scikit-learn==1.3.2
+scipy==1.10.1
+Send2Trash==1.8.2
+sentencepiece==0.1.98
+sentry-sdk==1.22.2
+setproctitle==1.3.2
+shortuuid==1.0.11
+six==1.16.0
+smmap==5.0.0
+sniffio==1.3.0
+soupsieve==2.4.1
+stack-data==0.6.2
+sympy==1.11.1
+terminado==0.17.1
+timm==0.6.13
+tinycss2==1.2.1
+tokenizers==0.13.3
+torch==2.0.1
+torchdiffeq==0.2.3
+torchvision==0.15.2
+tornado==6.3.1
+tqdm==4.65.0
+traitlets==5.9.0
+transformers @ git+https://github.com/huggingface/transformers@d3cbc997a231098cca81ac27fd3028a5536abe67
+triton==2.0.0
+typing_extensions==4.5.0
+tzdata==2023.3
+uri-template==1.2.0
+urllib3==1.26.15
+wandb==0.15.2
+wcwidth==0.2.6
+webcolors==1.13
+webdataset==0.2.48
+webencodings==0.5.1
+websocket-client==1.5.1
+xxhash==3.2.0
+y-py==0.5.9
+yarl==1.9.2
+ypy-websocket==0.8.2