KC123hello committed on
Commit f0d69f7 · verified · 1 parent: 9634425

Upload 3 files

Files changed (4)
  1. .gitattributes +1 -0
  2. MastersThesis_475703.pdf +3 -0
  3. README.md +405 -10
  4. requirements.txt +164 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+MastersThesis_475703.pdf filter=lfs diff=lfs merge=lfs -text
MastersThesis_475703.pdf ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fcecb1b0a417a603e5848ccabbd8f5b65a27a0a3f2a0ff6c9969e1a99ba3c394
+size 7791434
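The PDF is stored via Git LFS, so the repository holds only the small pointer file shown above, not the raw bytes. For illustration, such a pointer can be parsed with a few lines of Python (a sketch; the `version`/`oid`/`size` field names follow the LFS pointer shown here):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")  # each line is "key value"
        fields[key] = value
    return fields

pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:fcecb1b0a417a603e5848ccabbd8f5b65a27a0a3f2a0ff6c9969e1a99ba3c394\n"
    "size 7791434\n"
)
info = parse_lfs_pointer(pointer)
print(info["oid"], info["size"])
```

The `size` field (7 791 434 bytes ≈ 7.4 MB) is what the diff summary `+3 -0` refers to: three pointer lines, not the document itself.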
README.md CHANGED
@@ -1,13 +1,408 @@
 ---
-title: RobustMMFM
-emoji: 🚀
-colorFrom: blue
-colorTo: gray
-sdk: gradio
-sdk_version: 6.0.2
-app_file: app.py
-pinned: false
-short_description: 'Interactive demo for the robustness evaluation of MMFM '
 ---

-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Robustness of Multi-Modal Foundational Models

Research code for evaluating the robustness of multi-modal foundational models (MMFMs) against adversarial attacks. This repository contains implementations for testing vision-language models such as OpenFlamingo against sparse and non-sparse adversarial perturbations, as well as for fine-tuning CLIP models on adversarial examples and COCO counterfactuals.

**Code adapted from:** [RobustVLM](https://github.com/chs20/RobustVLM)
## Table of Contents

- [Robustness of Multi-Modal Foundational Models](#robustness-of-multi-modal-foundational-models)
  - [Table of Contents](#table-of-contents)
  - [Prerequisites](#prerequisites)
  - [Installation](#installation)
  - [Dataset Setup](#dataset-setup)
    - [VLM Evaluation Datasets](#vlm-evaluation-datasets)
      - [1. VizWiz Dataset](#1-vizwiz-dataset)
      - [2. OK-VQA Dataset](#2-ok-vqa-dataset)
      - [3. Flickr30k Dataset](#3-flickr30k-dataset)
      - [4. COCO Dataset (2014)](#4-coco-dataset-2014)
    - [CLIP Fine-tuning Datasets](#clip-fine-tuning-datasets)
      - [1. COCO Counterfactuals (COCO-CFs)](#1-coco-counterfactuals-coco-cfs)
      - [2. APGD Adversarial Images](#2-apgd-adversarial-images)
      - [3. COCO 2017 Validation Set](#3-coco-2017-validation-set)
      - [4. COCO Captions and Classification Datasets](#4-coco-captions-and-classification-datasets)
  - [Usage](#usage)
    - [Sparse vs Non-Sparse Attacks Evaluation](#sparse-vs-non-sparse-attacks-evaluation)
      - [Configuration Options](#configuration-options)
      - [Running the Scripts](#running-the-scripts)
    - [Fine-tuning CLIP Models](#fine-tuning-clip-models)
      - [Parameters](#parameters)
      - [Running the Scripts](#running-the-scripts-1)
    - [Zero-Shot Image Classification](#zero-shot-image-classification)
      - [Parameters](#parameters-1)
      - [Running the Scripts](#running-the-scripts-2)
    - [Image-Text Retrieval](#image-text-retrieval)
      - [Parameters](#parameters-2)
  - [License](#license)
  - [Acknowledgments](#acknowledgments)
## Prerequisites

- **Python version:** 3.11.x
- **Java:** JDK 1.8.0_202 (required for CIDEr score computation)
- **CUDA-compatible GPU** (for model training and inference)

## Installation

1. Clone the repository and navigate to the project directory:
   ```bash
   cd Robust_mmfm
   ```

2. Install the required Python packages:
   ```bash
   pip install -r requirements.txt
   ```

3. Download the OpenFlamingo 9B model from [HuggingFace](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b). After downloading, it should be located in `$HOME/.cache/huggingface/hub/` under the name `models--openflamingo--OpenFlamingo-9B-vitl-mpt7b`.

4. Install [JDK 1.8.0_202](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx) and add it to your PATH:
   ```bash
   # Add to ~/.bashrc or ~/.zshrc
   export PATH=$PATH:/path/to/jdk1.8.0_202/bin
   export LANG=en_US.UTF-8
   ```
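To sanity-check step 3, the expected cache location can be probed with a short Python snippet (a sketch; the `models--{org}--{name}/snapshots/...` layout is the standard Hugging Face hub cache convention referenced above):

```python
from pathlib import Path
from typing import Optional

def find_checkpoint(cache_root: Path) -> Optional[Path]:
    """Return the OpenFlamingo snapshot directory under a HF hub cache, if present."""
    model_dir = cache_root / "models--openflamingo--OpenFlamingo-9B-vitl-mpt7b"
    # Each downloaded revision lives under snapshots/<commit-hash>/
    snapshots = sorted((model_dir / "snapshots").glob("*")) if model_dir.is_dir() else []
    return snapshots[0] if snapshots else None

hub_cache = Path.home() / ".cache" / "huggingface" / "hub"
print(find_checkpoint(hub_cache) or "model not downloaded yet")
```

The printed snapshot path is what the `--checkpoint_path` flag in the evaluation commands below expects (with `checkpoint.pt` appended).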
## Dataset Setup

### VLM Evaluation Datasets

#### 1. VizWiz Dataset
- Download the [VizWiz VQA dataset](https://vizwiz.org/tasks-and-datasets/vqa/) (train and validation sets)
- Annotation files are included in the repository, but can be re-downloaded if corrupted
- Place images in:
  - `./open_flamingo_datasets/VizWiz/train`
  - `./open_flamingo_datasets/VizWiz/val`

#### 2. OK-VQA Dataset
- Download the [OK-VQA dataset](https://okvqa.allenai.org/download.html) (training and testing images)
- Annotation files are included in the repository
- Place all images in: `./open_flamingo_datasets/OKVQA`

#### 3. Flickr30k Dataset
- Download using the instructions from [awsaf49/flickr-dataset](https://github.com/awsaf49/flickr-dataset)
- Annotation files (`karpathy_flickr30k.json`, `dataset_flickr30k_coco_style.json`) are included
- Alternative annotation download: [Tübingen ML Cloud](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
- Place images in: `./open_flamingo_datasets/Flickr30k/Images`

#### 4. COCO Dataset (2014)
- Download the [COCO 2014](https://cocodataset.org/#download) train and validation sets
- Annotation files are included in the repository
- Alternative annotation downloads:
  - [karpathy_coco.json](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
  - [captions_val2014.json](https://github.com/tylin/coco-caption/blob/master/annotations/captions_val2014.json)
- Place images in:
  - `./open_flamingo_datasets/COCO/train2014`
  - `./open_flamingo_datasets/COCO/val2014`
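The expected layout for these evaluation datasets can be created up front, so each download is unpacked straight into place. The paths are copied from the list above; `ROOT` is a placeholder for the repository root:

```shell
ROOT="${ROOT:-.}"  # repository root; override before running if needed
mkdir -p \
  "$ROOT/open_flamingo_datasets/VizWiz/train" \
  "$ROOT/open_flamingo_datasets/VizWiz/val" \
  "$ROOT/open_flamingo_datasets/OKVQA" \
  "$ROOT/open_flamingo_datasets/Flickr30k/Images" \
  "$ROOT/open_flamingo_datasets/COCO/train2014" \
  "$ROOT/open_flamingo_datasets/COCO/val2014"
```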
### CLIP Fine-tuning Datasets

#### 1. COCO Counterfactuals (COCO-CFs)
- Download `images.zip` from [HuggingFace COCO-Counterfactuals](https://huggingface.co/datasets/Intel/COCO-Counterfactuals/tree/main/data)
- Unzip and place the images in:
  - `./open_flamingo_datasets/COCO_CF/images`
  - `./clip_train_datasets/MS_COCO_COCO_CF/images`
- Copy the original images (ending with `_0.jpg`) to the APGD dataset folders:
  ```bash
  cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_4/images
  cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_1/images
  ```

#### 2. APGD Adversarial Images
- Download from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
  - `apgd_1_images.zip` → `./clip_train_datasets/MS_COCO_APGD_1/images`
  - `apgd_4_images.zip` → `./clip_train_datasets/MS_COCO_APGD_4/images`

#### 3. COCO 2017 Validation Set
- Download from the [COCO website](https://cocodataset.org/#download)
- Copy the images into all CLIP training dataset folders:
  - `./clip_train_datasets/MS_COCO/images`
  - `./clip_train_datasets/MS_COCO_APGD_4/images`
  - `./clip_train_datasets/MS_COCO_APGD_1/images`
  - `./clip_train_datasets/MS_COCO_COCO_CF/images`

#### 4. COCO Captions and Classification Datasets
- Download `ms_coco_captions.json` from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx)
- Place it in: `./clip_train_datasets/MS_COCO`
- Download the classification datasets from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
  - `Caltech101.zip` → unzip in `./image_classification_datasets`
  - `Caltech256.zip` → unzip in `./image_classification_datasets`
- For ImageNet: download it externally and set the path in `vlm_eval/clip_classification.py`, line 52

---
## Usage

### Sparse vs Non-Sparse Attacks Evaluation

Evaluate vision-language models against adversarial attacks. The following command demonstrates the evaluation setup (available in `bash/run_script.sh` and `bash/run_script_slurm.sh`):

```bash
python -m vlm_eval.run_evaluation \
    --eval_flickr30 \
    --dont_save_adv \
    --verbose \
    --attack saif --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
    --pert_factor_graph 0 \
    --itr 0 \
    --itr_clip 0 \
    --itr_dataset base \
    --itr_method APGD_1 \
    --vision_encoder_pretrained openai \
    --num_samples 8 \
    --trial_seeds 42 \
    --num_trials 1 \
    --shots 0 \
    --batch_size 1 \
    --results_file res9B \
    --model open_flamingo \
    --out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
    --vision_encoder_path ViT-L-14 \
    --checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
    --lm_path anas-awadalla/mpt-7b \
    --lm_tokenizer_path anas-awadalla/mpt-7b \
    --precision float16 \
    --cross_attn_every_n_layers 4 \
    --coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
    --coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
    --coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
    --flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
    --flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
    --flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
    --vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
    --vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
    --vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
    --vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
    --vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
    --vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
    --vqav2_train_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/train2014 \
    --vqav2_train_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
    --vqav2_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_train2014_annotations.json \
    --vqav2_test_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/val2014 \
    --vqav2_test_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
    --vqav2_test_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_val2014_annotations.json \
    --textvqa_image_dir_path /mnt/datasets/textvqa/train_images \
    --textvqa_train_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_questions_vqa_format.json \
    --textvqa_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_annotations_vqa_format.json \
    --textvqa_test_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/val_questions_vqa_format.json \
    --textvqa_test_annotations_json_path /home/htc/kchitranshi/RobustVLM/textvqa/val_annotations_vqa_format.json \
    --ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
    --ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
    --ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
    --ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```

#### Configuration Options

**Attack Types:**
- APGD attack: `--attack apgd --eps <epsilon>`
- SAIF attack: `--attack saif --eps <epsilon> --k <k_value>`
- No attack (clean evaluation): `--attack none`
- Targeted attack (COCO only): `--targeted --target_str "TARGET_STRING"`

**Shot Settings:**
- 0-shot: `--shots 0`
- 4-shot: `--shots 4`
- Query mode: `--mask_out context`
- All mode: `--mask_out none`

**Evaluation Tasks:**
- Image captioning:
  - COCO: `--eval_coco`
  - Flickr30k: `--eval_flickr30`
- Visual question answering:
  - VizWiz: `--eval_vizwiz`
  - OK-VQA: `--eval_ok_vqa`

**Other Options:**
- Save adversarial samples as `.pt` files: remove `--dont_save_adv`
- Generate the perturbation factor graph (0-shot only): `--pert_factor_graph 1`
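For intuition, the two threat models behind these flags differ in where the perturbation budget is spent: APGD-style attacks perturb every pixel within an L∞ ball, while SAIF-style attacks perturb only a small set of pixels. A toy NumPy sketch of that distinction (illustrative only, not the repository's APGD/SAIF implementations; `eps` and `k` play the same roles as `--eps` and `--k`):

```python
import numpy as np

def linf_step(x, grad, eps):
    """Non-sparse: one signed-gradient ascent step, bounded by eps per pixel."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

def sparse_step(x, grad, eps, k):
    """Sparse: perturb only the k pixels with the largest gradient magnitude."""
    flat = np.abs(grad).ravel()
    mask = np.zeros_like(flat)
    mask[np.argsort(flat)[-k:]] = 1.0  # keep the top-k coordinates
    delta = eps * np.sign(grad) * mask.reshape(grad.shape)
    return np.clip(x + delta, 0.0, 1.0)

rng = np.random.default_rng(0)
x = rng.random((8, 8))                 # toy "image" in [0, 1]
g = rng.standard_normal((8, 8))        # stand-in for a loss gradient
x_dense = linf_step(x, g, eps=4 / 255)
x_sparse = sparse_step(x, g, eps=16 / 255, k=5)
```

In the dense case every pixel moves by at most ε; in the sparse case at most `k` pixels move at all, which is why the SAIF configuration above pairs a large `--eps` with a `--k` sparsity budget.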
#### Running the Scripts

```bash
# Make the scripts executable
chmod +x ./bash/run_script.sh
chmod +x ./bash/run_script_slurm.sh

# Run locally
./bash/run_script.sh

# Run on a SLURM cluster
sbatch ./bash/run_script_slurm.sh
```
### Fine-tuning CLIP Models

Fine-tune CLIP models on adversarial examples (APGD) and COCO counterfactuals. Example command (available in `bash/train_clip.sh` and `bash/train_clip_slurm.sh`):

```bash
python vlm_eval/clip_train.py \
    --num_epochs 20 \
    --data_seeds 112 113 114 115 \
    --data_name base \
    --method APGD_4 \
    --batch_size 128 \
    --learning_rate 5e-7 \
    --save_model \
    --save_model_path ./fine_tuned_clip_models/APGD_4/
```

This fine-tunes CLIP for 20 epochs on the `base` dataset with the APGD attack (ε = 4/255).

#### Parameters

- `--data_name`: Dataset size variant
  - `MS_COCO`: Standard MS COCO (see the thesis appendix)
  - `base`: Base subset
  - `medium`: Medium subset
  - `all`: Complete dataset

- `--method`: Training method
  - `APGD_4`: APGD with ε = 4/255
  - `APGD_1`: APGD with ε = 1/255
  - `COCO_CF`: COCO counterfactuals
  - `NONE`: Clean MS COCO (no perturbations)

- `--data_seeds`: Random seeds for dataset sampling (e.g., `112 113 114 115`)
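The objective behind this fine-tuning is CLIP's symmetric contrastive (InfoNCE) loss over image–text pairs: matching pairs sit on the diagonal of the similarity matrix and are pushed above all mismatched pairs in both directions. A minimal NumPy rendering of that loss (an illustration of the objective, not the `clip_train.py` implementation; the temperature value is an assumption):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each matrix is one image/text pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) cosine similarities
    n = len(logits)

    def xent(l):
        # cross-entropy with the matching pair (the diagonal) as the label
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # average of image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Fine-tuning on APGD images or COCO counterfactuals changes which image tensors feed this loss, not the loss itself.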
#### Running the Scripts

```bash
# Make the scripts executable
chmod +x ./bash/clip_train.sh
chmod +x ./bash/clip_train_slurm.sh

# Run locally
./bash/clip_train.sh

# Run on a SLURM cluster
sbatch ./bash/clip_train_slurm.sh
```
### Zero-Shot Image Classification

Evaluate fine-tuned CLIP models on image classification tasks. Example command (available in `bash/clip_classification.sh` and `bash/clip_classification_slurm.sh`):

```bash
python vlm_eval/clip_classification.py \
    --data base \
    --method COCO_CF \
    --dataset Caltech101
```

This performs zero-shot classification on Caltech101 using a CLIP model fine-tuned on the `base` COCO counterfactuals dataset.

#### Parameters

- `--data`: Dataset variant
  - `MS_COCO`, `base`, `medium`, `all`: Fine-tuned models
  - `non_fine_tuned`: Pre-trained CLIP only (no fine-tuning)

- `--method`: `APGD_4`, `APGD_1`, `COCO_CF`, `NONE`

- `--dataset`: Classification dataset
  - `Food101`, `CIFAR10`, `CIFAR100`, `ImageNet`, `Caltech101`, `Caltech256`

**Note:** Evaluation is hardcoded to 20 epochs.
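Zero-shot classification with CLIP reduces to cosine similarity between an image embedding and one text-prompt embedding per class: the predicted class is the nearest text prototype. A toy NumPy sketch of that decision rule (the actual model, prompts, and dataset handling live in `vlm_eval/clip_classification.py`):

```python
import numpy as np

def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is most similar."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))  # cosine similarity, since both are unit-norm

# Hypothetical 2-D embeddings: the image aligns with class 1's text prototype.
image = np.array([1.0, 0.0])
prototypes = np.array([[0.0, 1.0],
                       [1.0, 0.1]])
print(zero_shot_predict(image, prototypes))
```

No classifier head is trained: swapping in a different fine-tuned CLIP checkpoint (the `--data`/`--method` combinations above) only changes the embeddings.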
#### Running the Scripts

```bash
# Make the scripts executable
chmod +x ./bash/clip_classification.sh
chmod +x ./bash/clip_classification_slurm.sh

# Run locally
./bash/clip_classification.sh

# Run on a SLURM cluster
sbatch ./bash/clip_classification_slurm.sh
```
---

### Image-Text Retrieval

Perform image-to-text (i2t) and text-to-image (t2i) retrieval tasks:

```bash
python -m vlm_eval.run_evaluation \
    --eval_flickr30 \
    --dont_save_adv \
    --verbose \
    --attack none --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
    --pert_factor_graph 0 \
    --itr 1 \
    --itr_clip 0 \
    --itr_dataset base \
    --itr_method APGD_1 \
    --vision_encoder_pretrained openai \
    --num_samples 1000 \
    --trial_seeds 42 \
    --num_trials 1 \
    --shots 0 \
    --batch_size 1 \
    --results_file res9B \
    --model open_flamingo \
    --out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
    --vision_encoder_path ViT-L-14 \
    --checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
    --lm_path anas-awadalla/mpt-7b \
    --lm_tokenizer_path anas-awadalla/mpt-7b \
    --precision float16 \
    --cross_attn_every_n_layers 4 \
    --coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
    --coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
    --coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
    --flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
    --flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
    --flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
    --vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
    --vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
    --vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
    --vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
    --vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
    --vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
    --vqav2_train_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/train2014 \
    --vqav2_train_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
    --vqav2_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_train2014_annotations.json \
    --vqav2_test_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/val2014 \
    --vqav2_test_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
    --vqav2_test_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_val2014_annotations.json \
    --textvqa_image_dir_path /mnt/datasets/textvqa/train_images \
    --textvqa_train_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_questions_vqa_format.json \
    --textvqa_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_annotations_vqa_format.json \
    --textvqa_test_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/val_questions_vqa_format.json \
    --textvqa_test_annotations_json_path /home/htc/kchitranshi/RobustVLM/textvqa/val_annotations_vqa_format.json \
    --ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
    --ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
    --ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
    --ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
    --ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
    --ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```

This evaluates i2t and t2i retrieval on the Flickr30k 1K test set (1000 samples) using a CLIP model fine-tuned on the `base` APGD dataset (ε = 1/255).

#### Parameters

- `--itr_dataset`: Dataset for the fine-tuned CLIP model
  - `MS_COCO`, `base`, `medium`, `all`: Fine-tuned variants
  - `non_fine_tuned`: Pre-trained CLIP only

**Note:** Image-text retrieval does not support targeted attacks or 4-shot settings.
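Retrieval quality in this setting is typically reported as Recall@k over the query–gallery similarity matrix: a query counts as a hit if its ground-truth match appears among its top-k ranked items. A small NumPy sketch of the metric (illustrative; the flags above control what the repository actually computes):

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j] = similarity of query i to gallery item j; truth is j == i."""
    ranks = np.argsort(-sim, axis=1)  # indices sorted best-match first
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

# i2t scores images against texts; t2i simply uses the transposed matrix.
```

For the Flickr30k 1K split, `sim` would be the 1000 × 1000 matrix of CLIP image/text similarities.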
---

## License

Please refer to the original [RobustVLM repository](https://github.com/chs20/RobustVLM) for licensing information.

## Acknowledgments

This code is adapted from the [RobustVLM](https://github.com/chs20/RobustVLM) repository. We thank the original authors for their foundational work.
requirements.txt ADDED
@@ -0,0 +1,164 @@
+accelerate==0.24.0
+aiofiles==22.1.0
+aiohttp==3.8.4
+aiosignal==1.3.1
+aiosqlite==0.19.0
+anyio==3.6.2
+appdirs==1.4.4
+argon2-cffi==21.3.0
+argon2-cffi-bindings==21.2.0
+arrow==1.2.3
+asttokens==2.2.1
+async-timeout==4.0.2
+attrs==23.1.0
+Babel==2.12.1
+backcall==0.2.0
+beautifulsoup4==4.12.2
+bleach==6.0.0
+braceexpand==0.1.7
+certifi==2023.5.7
+cffi==1.15.1
+chardet==4.0.0
+charset-normalizer==3.1.0
+click==8.1.3
+cmake==3.26.3
+comm==0.1.3
+contourpy==1.0.7
+cycler==0.11.0
+datasets==2.12.0
+debugpy==1.6.7
+decorator==5.1.1
+defusedxml==0.7.1
+dill==0.3.6
+docker-pycreds==0.4.0
+einops==0.6.1
+einops-exts==0.0.4
+executing==1.2.0
+fastjsonschema==2.16.3
+filelock==3.12.0
+fonttools==4.39.3
+fqdn==1.5.1
+frozenlist==1.3.3
+fsspec==2023.5.0
+ftfy==6.1.1
+geotorch==0.3.0
+gitdb==4.0.10
+GitPython==3.1.31
+huggingface-hub==0.14.1
+idna==2.10
+inflection==0.5.1
+ipykernel==6.23.0
+ipython==8.13.2
+ipython-genutils==0.2.0
+isoduration==20.11.0
+jedi==0.18.2
+Jinja2==3.1.2
+joblib==1.2.0
+json5==0.9.11
+jsonpointer==2.3
+jsonschema==4.17.3
+kiwisolver==1.4.4
+lit==16.0.3
+MarkupSafe==2.1.2
+matplotlib==3.7.1
+matplotlib-inline==0.1.6
+mistune==2.0.5
+more-itertools==9.1.0
+mpmath==1.3.0
+multidict==6.0.4
+multiprocess==0.70.14
+nbclassic==1.0.0
+nbclient==0.7.4
+nbconvert==7.4.0
+nbformat==5.8.0
+nest-asyncio==1.5.6
+networkx==3.1
+nltk==3.8.1
+notebook==6.5.4
+notebook_shim==0.2.3
+numpy==1.24.2
+nvidia-cublas-cu11==11.10.3.66
+nvidia-cuda-cupti-cu11==11.7.101
+nvidia-cuda-nvrtc-cu11==11.7.99
+nvidia-cuda-runtime-cu11==11.7.99
+nvidia-cudnn-cu11==8.5.0.96
+nvidia-cufft-cu11==10.9.0.58
+nvidia-curand-cu11==10.2.10.91
+nvidia-cusolver-cu11==11.4.0.1
+nvidia-cusparse-cu11==11.7.4.91
+nvidia-nccl-cu11==2.14.3
+nvidia-nvtx-cu11==11.7.91
+open-clip-torch==2.19.0
+overrides==7.4.0
+packaging==23.1
+pandas==1.3.5
+pandocfilters==1.5.0
+parso==0.8.3
+pathtools==0.1.2
+pexpect==4.8.0
+pickleshare==0.7.5
+Pillow==9.5.0
+platformdirs==3.5.0
+prometheus-client==0.16.0
+prompt-toolkit==3.0.38
+protobuf==3.20.3
+psutil==5.9.5
+ptyprocess==0.7.0
+pure-eval==0.2.2
+pyarrow==12.0.0
+pycocoevalcap==1.2
+pycocotools==2.0.6
+pycparser==2.21
+Pygments==2.15.1
+pyparsing==3.0.9
+pyrsistent==0.19.3
+python-dateutil==2.8.2
+python-json-logger==2.0.7
+pytz==2023.3
+PyYAML==6.0
+pyzmq==25.0.2
+regex==2023.5.5
+requests==2.25.1
+responses==0.18.0
+rfc3339-validator==0.1.4
+rfc3986-validator==0.1.1
+robustbench @ git+https://github.com/RobustBench/robustbench.git@e67e4225facde47be6a41ed78b576076e8b90cc5
+scikit-learn==1.3.2
+scipy==1.10.1
+Send2Trash==1.8.2
+sentencepiece==0.1.98
+sentry-sdk==1.22.2
+setproctitle==1.3.2
+shortuuid==1.0.11
+six==1.16.0
+smmap==5.0.0
+sniffio==1.3.0
+soupsieve==2.4.1
+stack-data==0.6.2
+sympy==1.11.1
+terminado==0.17.1
+timm==0.6.13
+tinycss2==1.2.1
+tokenizers==0.13.3
+torch==2.0.1
+torchdiffeq==0.2.3
+torchvision==0.15.2
+tornado==6.3.1
+tqdm==4.65.0
+traitlets==5.9.0
+transformers @ git+https://github.com/huggingface/transformers@d3cbc997a231098cca81ac27fd3028a5536abe67
+triton==2.0.0
+typing_extensions==4.5.0
+tzdata==2023.3
+uri-template==1.2.0
+urllib3==1.26.15
+wandb==0.15.2
+wcwidth==0.2.6
+webcolors==1.13
+webdataset==0.2.48
+webencodings==0.5.1
+websocket-client==1.5.1
+xxhash==3.2.0
+y-py==0.5.9
+yarl==1.9.2
+ypy-websocket==0.8.2