# Robustness of Multi-Modal Foundational Models

Research code for evaluating the robustness of multi-modal foundational models (MMFMs) against adversarial attacks. This repository contains implementations for testing vision-language models such as OpenFlamingo against sparse and non-sparse adversarial perturbations, and for fine-tuning CLIP models on adversarial examples and COCO counterfactuals.

**Code adapted from:** [RobustVLM](https://github.com/chs20/RobustVLM)

## Table of Contents

- [Robustness of Multi-Modal Foundational Models](#robustness-of-multi-modal-foundational-models)
  - [Table of Contents](#table-of-contents)
  - [Prerequisites](#prerequisites)
  - [Installation](#installation)
  - [Dataset Setup](#dataset-setup)
    - [VLM Evaluation Datasets](#vlm-evaluation-datasets)
      - [1. VizWiz Dataset](#1-vizwiz-dataset)
      - [2. OK-VQA Dataset](#2-ok-vqa-dataset)
      - [3. Flickr30k Dataset](#3-flickr30k-dataset)
      - [4. COCO Dataset (2014)](#4-coco-dataset-2014)
    - [CLIP Fine-tuning Datasets](#clip-fine-tuning-datasets)
      - [1. COCO Counterfactuals (COCO-CFs)](#1-coco-counterfactuals-coco-cfs)
      - [2. APGD Adversarial Images](#2-apgd-adversarial-images)
      - [3. COCO 2017 Validation Set](#3-coco-2017-validation-set)
      - [4. COCO Captions and Classification Datasets](#4-coco-captions-and-classification-datasets)
  - [Usage](#usage)
    - [Sparse vs Non-Sparse Attacks Evaluation](#sparse-vs-non-sparse-attacks-evaluation)
      - [Configuration Options](#configuration-options)
      - [Running the Scripts](#running-the-scripts)
    - [Fine-tuning CLIP Models](#fine-tuning-clip-models)
      - [Parameters](#parameters)
      - [Running the Scripts](#running-the-scripts-1)
    - [Zero-Shot Image Classification](#zero-shot-image-classification)
      - [Parameters](#parameters-1)
      - [Running the Scripts](#running-the-scripts-2)
    - [Image-Text Retrieval](#image-text-retrieval)
      - [Parameters](#parameters-2)
  - [License](#license)
  - [Acknowledgments](#acknowledgments)

## Prerequisites

- **Python version:** 3.11.x
- **Java:** JDK 1.8.0_202 (required for CIDEr score computation)
- **CUDA-compatible GPU** (for model training and inference)

## Installation

1. Clone the repository and navigate to the project directory:
   ```bash
   cd Robust_mmfm
   ```

2. Install required Python packages:
   ```bash
   pip install -r requirements.txt
   ```

3. Download the OpenFlamingo 9B model from [HuggingFace](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b). After downloading, it should be located in `$HOME/.cache/huggingface/hub/` with the name `models--openflamingo--OpenFlamingo-9B-vitl-mpt7b`.

4. Install [JDK 1.8.0_202](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx) and add it to your PATH:
   ```bash
   # Add to ~/.bashrc or ~/.zshrc
   export PATH=$PATH:/path/to/jdk1.8.0_202/bin
   export LANG=en_US.UTF-8
   ```
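Before moving on to the datasets, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, and it assumes the checkpoint sits in the default HuggingFace cache location named in step 3.

```python
import shutil
import sys
from pathlib import Path
from typing import Dict, Optional

# Cache folder name taken from step 3 above.
MODEL_DIR_NAME = "models--openflamingo--OpenFlamingo-9B-vitl-mpt7b"


def check_prerequisites(home: Optional[Path] = None) -> Dict[str, bool]:
    """Return a mapping from requirement name to whether it looks satisfied."""
    home = home or Path.home()
    # Assumption: the default HuggingFace cache layout under $HOME is used.
    cache_dir = home / ".cache" / "huggingface" / "hub" / MODEL_DIR_NAME
    return {
        "python_3_11": sys.version_info[:2] == (3, 11),
        "java_on_path": shutil.which("java") is not None,
        "openflamingo_checkpoint": cache_dir.is_dir(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

The check only confirms that a `java` executable is on the PATH; it does not verify that the version is JDK 1.8.0_202.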

## Dataset Setup

### VLM Evaluation Datasets

#### 1. VizWiz Dataset
- Download the [VizWiz VQA dataset](https://vizwiz.org/tasks-and-datasets/vqa/) (train and validation sets)
- Annotation files are included in the repository, but can be re-downloaded if corrupted
- Place images in:
  - `./open_flamingo_datasets/VizWiz/train`
  - `./open_flamingo_datasets/VizWiz/val`

#### 2. OK-VQA Dataset
- Download the [OK-VQA dataset](https://okvqa.allenai.org/download.html) (training and testing images)
- Annotation files are included in the repository
- Place all images in: `./open_flamingo_datasets/OKVQA`

#### 3. Flickr30k Dataset
- Download using instructions from [awsaf49/flickr-dataset](https://github.com/awsaf49/flickr-dataset)
- Annotation files (`karpathy_flickr30k.json`, `dataset_flickr30k_coco_style.json`) are included
- Alternative annotation download: [University of Tübingen ML Cloud](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
- Place images in: `./open_flamingo_datasets/Flickr30k/Images`

#### 4. COCO Dataset (2014)
- Download [COCO 2014](https://cocodataset.org/#download) train and validation sets
- Annotation files are included in the repository
- Alternative annotation downloads:
  - [karpathy_coco.json](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX)
  - [captions_val2014.json](https://github.com/tylin/coco-caption/blob/master/annotations/captions_val2014.json)
- Place images in:
  - `./open_flamingo_datasets/COCO/train2014`
  - `./open_flamingo_datasets/COCO/val2014`

### CLIP Fine-tuning Datasets

#### 1. COCO Counterfactuals (COCO-CFs)
- Download `images.zip` from [HuggingFace COCO-Counterfactuals](https://huggingface.co/datasets/Intel/COCO-Counterfactuals/tree/main/data)
- Unzip and place images in:
  - `./open_flamingo_datasets/COCO_CF/images`
  - `./clip_train_datasets/MS_COCO_COCO_CF/images`
- Copy original images (ending with `_0.jpg`) to:
  ```bash
  cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_4/images
  cp ./open_flamingo_datasets/COCO_CF/images/*_0.jpg ./clip_train_datasets/MS_COCO_APGD_1/images
  ```

#### 2. APGD Adversarial Images
- Download from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
  - `apgd_1_images.zip` → `./clip_train_datasets/MS_COCO_APGD_1/images`
  - `apgd_4_images.zip` → `./clip_train_datasets/MS_COCO_APGD_4/images`

#### 3. COCO 2017 Validation Set
- Download from [COCO website](https://cocodataset.org/#download)
- Copy images to all CLIP training dataset folders:
  - `./clip_train_datasets/MS_COCO/images`
  - `./clip_train_datasets/MS_COCO_APGD_4/images`
  - `./clip_train_datasets/MS_COCO_APGD_1/images`
  - `./clip_train_datasets/MS_COCO_COCO_CF/images`
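Copying the same images into several folders by hand is error-prone, so the step above could be scripted roughly as follows. This is a sketch under stated assumptions: `copy_images` is a hypothetical helper, and the source folder you pass in is wherever you unpacked the COCO 2017 validation images.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def copy_images(src_dir: Path, dest_dirs: Iterable[Path], pattern: str = "*.jpg") -> int:
    """Copy every file in src_dir matching pattern into each destination folder.

    Returns the total number of files written across all destinations.
    """
    sources: List[Path] = sorted(Path(src_dir).glob(pattern))
    written = 0
    for dest in dest_dirs:
        dest = Path(dest)
        # Create the destination folder if it does not exist yet.
        dest.mkdir(parents=True, exist_ok=True)
        for src in sources:
            shutil.copy2(src, dest / src.name)
            written += 1
    return written
```

For the four folders listed above, `dest_dirs` would be the `images` subfolder of `MS_COCO`, `MS_COCO_APGD_4`, `MS_COCO_APGD_1`, and `MS_COCO_COCO_CF` under `./clip_train_datasets`. The same helper could cover the `_0.jpg` originals from the COCO-CFs step by passing `pattern="*_0.jpg"`.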

#### 4. COCO Captions and Classification Datasets
- Download `ms_coco_captions.json` from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx)
- Place in: `./clip_train_datasets/MS_COCO`
- Download classification datasets from [TU Berlin Cloud](https://tubcloud.tu-berlin.de/s/YdRcyp888N5qwkx):
  - `Caltech101.zip` → unzip in `./image_classification_datasets`
  - `Caltech256.zip` → unzip in `./image_classification_datasets`
- For ImageNet: Download externally and set path in `vlm_eval/clip_classification.py` line 52
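Since the setup touches many folders, a quick layout check can catch a missing or empty directory before a long run fails. The sketch below lists the directories named in the steps above; the function name `missing_dirs` is hypothetical, and treating an empty folder as "missing" is an assumption of this example.

```python
from pathlib import Path
from typing import List

# Directories named in the dataset setup steps, relative to the repository root.
EXPECTED_DIRS = [
    "open_flamingo_datasets/VizWiz/train",
    "open_flamingo_datasets/VizWiz/val",
    "open_flamingo_datasets/OKVQA",
    "open_flamingo_datasets/Flickr30k/Images",
    "open_flamingo_datasets/COCO/train2014",
    "open_flamingo_datasets/COCO/val2014",
    "open_flamingo_datasets/COCO_CF/images",
    "clip_train_datasets/MS_COCO/images",
    "clip_train_datasets/MS_COCO_APGD_1/images",
    "clip_train_datasets/MS_COCO_APGD_4/images",
    "clip_train_datasets/MS_COCO_COCO_CF/images",
    "image_classification_datasets",
]


def missing_dirs(repo_root: Path) -> List[str]:
    """Return the expected directories that are absent or contain no entries."""
    missing = []
    for rel in EXPECTED_DIRS:
        path = Path(repo_root) / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_dirs(Path(".")):
        print("missing or empty:", rel)
```

Run from the repository root, the script prints nothing when every listed folder exists and is non-empty.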

---

## Usage

### Sparse vs Non-Sparse Attacks Evaluation

Evaluate vision-language models against adversarial attacks. The following command demonstrates the evaluation setup (available in `bash/run_script.sh` and `bash/run_script_slurm.sh`):
```bash
python -m vlm_eval.run_evaluation \
--eval_flickr30 \
--dont_save_adv \
--verbose \
--attack saif --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
--pert_factor_graph 0 \
--itr 0 \
--itr_clip 0 \
--itr_dataset base \
--itr_method APGD_1 \
--vision_encoder_pretrained openai \
--num_samples 8 \
--trial_seeds 42 \
--num_trials 1 \
--shots 0 \
--batch_size 1 \
--results_file res9B \
--model open_flamingo \
--out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
--vision_encoder_path ViT-L-14 \
--checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
--lm_path anas-awadalla/mpt-7b \
--lm_tokenizer_path anas-awadalla/mpt-7b \
--precision float16 \
--cross_attn_every_n_layers 4 \
--coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
--coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
--coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
--flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
--flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
--flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
--vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
--vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
--vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
--vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
--vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
--vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
--vqav2_train_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/train2014 \
--vqav2_train_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
--vqav2_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_train2014_annotations.json \
--vqav2_test_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/val2014 \
--vqav2_test_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
--vqav2_test_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_val2014_annotations.json \
--textvqa_image_dir_path /mnt/datasets/textvqa/train_images \
--textvqa_train_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_questions_vqa_format.json \
--textvqa_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_annotations_vqa_format.json \
--textvqa_test_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/val_questions_vqa_format.json \
--textvqa_test_annotations_json_path /home/htc/kchitranshi/RobustVLM/textvqa/val_annotations_vqa_format.json \
--ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
--ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
--ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
--ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```

#### Configuration Options

**Attack Types:**
- APGD attack: `--attack apgd --eps <epsilon>`
- SAIF attack: `--attack saif --eps <epsilon> --k <k_value>`
- No attack (clean): `--attack none`
- Targeted attack (COCO only): `--targeted --target_str "TARGET_STRING"`

**Shot Settings:**
- 0-shot: `--shots 0`
- 4-shot: `--shots 4`
  - Query mode: `--mask_out context`
  - All mode: `--mask_out none`

**Evaluation Tasks:**
- Image Captioning:
  - COCO: `--eval_coco`
  - Flickr30k: `--eval_flickr30`
- Visual Question Answering:
  - VizWiz: `--eval_vizwiz`
  - OK-VQA: `--eval_ok_vqa`

**Other Options:**
- Save adversarial samples as `.pt` files: remove `--dont_save_adv`
- Generate perturbation factor graph (0-shot only): `--pert_factor_graph 1`

#### Running the Scripts

```bash
# Make scripts executable
chmod +x ./bash/run_script.sh
chmod +x ./bash/run_script_slurm.sh

# Run locally or remotely
./bash/run_script.sh

# Run on SLURM cluster
sbatch ./bash/run_script_slurm.sh
```

### Fine-tuning CLIP Models

Fine-tune CLIP models on adversarial examples (APGD) and COCO counterfactuals. Example command (available in `bash/clip_train.sh` and `bash/clip_train_slurm.sh`):

```bash
python vlm_eval/clip_train.py \
    --num_epochs 20 \
    --data_seeds 112 113 114 115 \
    --data_name base \
    --method APGD_4 \
    --batch_size 128 \
    --learning_rate 5e-7 \
    --save_model \
    --save_model_path ./fine_tuned_clip_models/APGD_4/
```

This fine-tunes CLIP for 20 epochs on the `base` dataset using APGD adversarial images (ε=4/255).

#### Parameters

- `--data_name`: Dataset size variant
  - `MS_COCO`: Standard MS COCO (see thesis appendix)
  - `base`: Base subset
  - `medium`: Medium subset
  - `all`: Complete dataset

- `--method`: Training method
  - `APGD_4`: APGD with ε=4/255
  - `APGD_1`: APGD with ε=1/255
  - `COCO_CF`: COCO Counterfactuals
  - `NONE`: Clean MS COCO (no perturbations)

- `--data_seeds`: Random seeds for dataset sampling (e.g., `112 113 114 115`)
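To run several parameter combinations, the training command can be generated rather than edited by hand. The sketch below validates `--data_name` and `--method` against the values listed above and prints one command per combination. `train_command` is a hypothetical helper, the per-method output folder simply mirrors the example path, and whether every dataset/method pairing is meaningful is not something this example checks.

```python
import shlex
from itertools import product
from typing import List, Sequence

DATA_NAMES = ("MS_COCO", "base", "medium", "all")
METHODS = ("APGD_4", "APGD_1", "COCO_CF", "NONE")


def train_command(data_name: str, method: str,
                  seeds: Sequence[int] = (112, 113, 114, 115)) -> List[str]:
    """Return the argv list for one fine-tuning run."""
    if data_name not in DATA_NAMES:
        raise ValueError(f"unknown data_name: {data_name}")
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    return [
        "python", "vlm_eval/clip_train.py",
        "--num_epochs", "20",
        "--data_seeds", *map(str, seeds),
        "--data_name", data_name,
        "--method", method,
        "--batch_size", "128",
        "--learning_rate", "5e-7",
        "--save_model",
        # Assumption: one output folder per method, as in the example above.
        "--save_model_path", f"./fine_tuned_clip_models/{method}/",
    ]


if __name__ == "__main__":
    for data_name, method in product(("base", "medium"), ("APGD_1", "APGD_4")):
        print(shlex.join(train_command(data_name, method)))
```

Each printed line can be pasted into a shell or written into a job script.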

#### Running the Scripts

```bash
# Make scripts executable
chmod +x ./bash/clip_train.sh
chmod +x ./bash/clip_train_slurm.sh

# Run locally or remotely
./bash/clip_train.sh

# Run on SLURM cluster
sbatch ./bash/clip_train_slurm.sh
```

### Zero-Shot Image Classification

Evaluate fine-tuned CLIP models on image classification tasks. Example command (available in `bash/clip_classification.sh` and `bash/clip_classification_slurm.sh`):

```bash
python vlm_eval/clip_classification.py \
    --data base \
    --method COCO_CF \
    --dataset Caltech101
```

This performs zero-shot classification on Caltech101 using a CLIP model fine-tuned on the `base` COCO counterfactuals dataset.
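For readers new to the idea, zero-shot classification with a CLIP-style model amounts to comparing an image embedding against one text embedding per class and picking the closest. The toy sketch below illustrates that decision rule with hand-written 3-dimensional vectors. It does not call CLIP, and it is not the implementation in `vlm_eval/clip_classification.py`; the function names and vectors are made up for illustration.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def zero_shot_predict(image_embedding: Sequence[float],
                      class_embeddings: Dict[str, Sequence[float]]) -> str:
    """Return the class whose text embedding is most similar to the image."""
    return max(class_embeddings,
               key=lambda name: cosine(image_embedding, class_embeddings[name]))


if __name__ == "__main__":
    # Toy vectors standing in for text embeddings of one prompt per class.
    classes = {
        "airplane": [0.9, 0.1, 0.0],
        "camera": [0.1, 0.9, 0.1],
        "chair": [0.0, 0.2, 0.9],
    }
    image = [0.2, 0.8, 0.1]  # stand-in for an image embedding
    print(zero_shot_predict(image, classes))
```

No classifier is trained on the target dataset; only the class names are needed, which is why the same fine-tuned model can be evaluated on several datasets.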

#### Parameters

- `--data`: Dataset variant
  - `MS_COCO`, `base`, `medium`, `all`: Fine-tuned models
  - `non_fine_tuned`: Pre-trained CLIP only (no fine-tuning)

- `--method`: `APGD_4`, `APGD_1`, `COCO_CF`, `NONE`

- `--dataset`: Classification dataset
  - `Food101`, `CIFAR10`, `CIFAR100`, `ImageNet`, `Caltech101`, `Caltech256`

**Note:** Evaluation is hardcoded to 20 epochs.

#### Running the Scripts

```bash
chmod +x ./bash/clip_classification.sh
chmod +x ./bash/clip_classification_slurm.sh

# Run locally or remotely
./bash/clip_classification.sh

# Run on SLURM cluster
sbatch ./bash/clip_classification_slurm.sh
```

---

### Image-Text Retrieval

Perform image-to-text (i2t) and text-to-image (t2i) retrieval tasks:
```bash
python -m vlm_eval.run_evaluation \
--eval_flickr30 \
--dont_save_adv \
--verbose \
--attack none --eps 255 --steps 100 --mask_out none --mu 1.5 --search_steps 2 --lam 0.005 --k 1000 \
--pert_factor_graph 0 \
--itr 1 \
--itr_clip 0 \
--itr_dataset base \
--itr_method APGD_1 \
--vision_encoder_pretrained openai \
--num_samples 1000 \
--trial_seeds 42 \
--num_trials 1 \
--shots 0 \
--batch_size 1 \
--results_file res9B \
--model open_flamingo \
--out_base_path /PATH/TO/Robust_mmfm/Results/open_flamingo \
--vision_encoder_path ViT-L-14 \
--checkpoint_path /PATH/TO/HUGGINGFACE/hub/models--openflamingo--OpenFlamingo-9B-vitl-mpt7b/snapshots/7e36809c73d038829ad5fba9d0cc949b4e180562/checkpoint.pt \
--lm_path anas-awadalla/mpt-7b \
--lm_tokenizer_path anas-awadalla/mpt-7b \
--precision float16 \
--cross_attn_every_n_layers 4 \
--coco_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--coco_val_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--coco_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/karpathy_coco.json \
--coco_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/captions_val2014.json \
--coco_cf_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO_CF \
--flickr_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/Images \
--flickr_karpathy_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/karpathy_flickr30k.json \
--flickr_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/Flickr30k/dataset_flickr30k_coco_style.json \
--vizwiz_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train \
--vizwiz_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val \
--vizwiz_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_questions_vqa_format.json \
--vizwiz_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/train_annotations_vqa_format.json \
--vizwiz_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_questions_vqa_format.json \
--vizwiz_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/VizWiz/val_annotations_vqa_format.json \
--vqav2_train_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/train2014 \
--vqav2_train_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_train2014_questions.json \
--vqav2_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_train2014_annotations.json \
--vqav2_test_image_dir_path /home/htc/kchitranshi/SCRATCH/COCO/val2014 \
--vqav2_test_questions_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_OpenEnded_mscoco_val2014_questions.json \
--vqav2_test_annotations_json_path /home/htc/kchitranshi/SCRATCH/vqav2/v2_mscoco_val2014_annotations.json \
--textvqa_image_dir_path /mnt/datasets/textvqa/train_images \
--textvqa_train_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_questions_vqa_format.json \
--textvqa_train_annotations_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/train_annotations_vqa_format.json \
--textvqa_test_questions_json_path /home/htc/kchitranshi/SCRATCH/RobustVLM/textvqa/val_questions_vqa_format.json \
--textvqa_test_annotations_json_path /home/htc/kchitranshi/RobustVLM/textvqa/val_annotations_vqa_format.json \
--ok_vqa_train_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/train2014 \
--ok_vqa_train_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_train2014_questions.json \
--ok_vqa_train_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_train2014_annotations.json \
--ok_vqa_test_image_dir_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/COCO/val2014 \
--ok_vqa_test_questions_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/OpenEnded_mscoco_val2014_questions.json \
--ok_vqa_test_annotations_json_path /PATH/TO/Robust_mmfm/open_flamingo_datasets/OKVQA/mscoco_val2014_annotations.json
```

This evaluates i2t and t2i on the Flickr30k 1K test set (1000 samples) using a CLIP model fine-tuned on the `base` APGD dataset (ε=1/255).

#### Parameters

- `--itr_dataset`: Dataset for fine-tuned CLIP model
  - `MS_COCO`, `base`, `medium`, `all`: Fine-tuned variants
  - `non_fine_tuned`: Pre-trained CLIP only

**Note:** Image-text retrieval does not support targeted attacks or 4-shot settings.
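As a rough illustration of how i2t and t2i relate, both can be scored from a single image-by-caption similarity matrix: i2t ranks captions for each image (rows), and t2i ranks images for each caption (columns). The sketch below computes Recall@K on a toy matrix. Recall@K is a commonly used retrieval metric, but whether this repository reports it is an assumption, as is the simplification of one matching caption per image; the function names are hypothetical.

```python
from typing import List, Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose matching item (same index) ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def transpose(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap rows and columns so captions become the queries."""
    return [list(col) for col in zip(*matrix)]


if __name__ == "__main__":
    # Toy matrix: row i is image i, column j is caption j.
    sim = [
        [0.9, 0.2, 0.1],
        [0.3, 0.4, 0.8],
        [0.2, 0.1, 0.7],
    ]
    print("i2t R@1:", recall_at_k(sim, 1))
    print("t2i R@1:", recall_at_k(transpose(sim), 1))
```

In this toy matrix, two of the three queries retrieve their match first in both directions.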

---



## License

Please refer to the original [RobustVLM repository](https://github.com/chs20/RobustVLM) for licensing information.

## Acknowledgments

This code is adapted from the [RobustVLM](https://github.com/chs20/RobustVLM) repository. We thank the original authors for their foundational work.