The reconstruction results after slot attention, together with the checkpoints, are stored in `./output_slots/ViT-L-14_openai_imagenet_l2_imagenet_SLOTS_xxxxx`.
## Stage 2: Training and Evaluation with Object-centric Representations
- SlotVLM<sup>4</sup>

```shell
python -m train.adversarial_training_clip_with_object_token --clip_model_name ViT-L-14 --slots_ckp ./ckps/model_slots_step_300000.pt --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize False --steps 20000 --warmup 1400 --batch_size 128 --loss l2 --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir ./output --experiment_name with_OT --log_freq 10 --eval_freq 10
```

Set `--eps 2` to obtain SlotVLM<sup>2</sup> models.
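The `--attack pgd --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1` flags configure the inner adversary. As a rough sketch of what such an l∞ PGD loop does (a toy illustration, not the repo's implementation; it assumes, as is common, that `eps` and `stepsize` are given in units of 1/255):

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=4 / 255, step=1 / 255, iters=10):
    """Projected gradient ascent inside an l-infinity ball of radius eps around x."""
    delta = np.zeros_like(x)
    for _ in range(iters):
        g = grad_fn(x + delta)             # gradient of the loss w.r.t. the input
        delta = delta + step * np.sign(g)  # signed-gradient ascent step
        delta = np.clip(delta, -eps, eps)  # project back into the eps-ball
    return np.clip(x + delta, 0.0, 1.0)    # keep the result a valid image in [0, 1]

# Toy usage: push a "pixel vector" away from a target under an l2 loss.
x = np.full(8, 0.5)
target = np.zeros(8)
x_adv = pgd_linf(x, grad_fn=lambda v: 2 * (v - target))  # gradient of ||v - target||^2
```

In the actual training command, the `--inner_loss l2` embedding loss plays the role of the toy l2 loss here.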

## Evaluation

Make sure the files in the `bash` directory are executable: `chmod +x bash/*`

### CLIP ImageNet

```shell
python -m CLIP_eval.clip_robustbench --clip_model_name ViT-L-14 --pretrained /path/to/ckpt.pt --dataset imagenet --imagenet_root /path/to/imagenet --wandb False --norm linf --eps 2
```

Note the `--pretrained` and `--eps` arguments: point `--pretrained` to the checkpoint you want to evaluate, and use `--eps 2` for SlotVLM<sup>2</sup> models or `--eps 4` for SlotVLM<sup>4</sup> models.

### CLIP Zero-Shot

List the models to be evaluated in `CLIP_benchmark/benchmark/models.txt` and the datasets in `CLIP_benchmark/benchmark/datasets.txt` (the datasets are downloaded from HuggingFace). Then run

```shell
cd CLIP_benchmark
./bash/run_benchmark_adv.sh
```
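Each file lists one entry per line. The entries below are hypothetical placeholders, shown only to illustrate the shape of the files; the exact model and dataset naming scheme is defined by the repo's `CLIP_benchmark` setup:

```
# CLIP_benchmark/benchmark/models.txt -- one model per line (hypothetical entries)
ViT-L-14,openai
ViT-L-14,/path/to/ckpt.pt

# CLIP_benchmark/benchmark/datasets.txt -- one dataset per line (hypothetical entries)
cifar10
imagenet1k
```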

### VLM Captioning and VQA

#### LLaVA

In `/bash/llava_eval.sh`, supply the paths for the datasets. The required annotation files can be obtained from this [HuggingFace repository](https://huggingface.co/datasets/openflamingo/eval_benchmark/tree/main).
Set `--vision_encoder_pretrained` to `openai` or supply the path to a fine-tuned CLIP model checkpoint. Then run

```shell
./bash/llava_eval.sh
```

The LLaVA model is downloaded automatically from HuggingFace.

#### OpenFlamingo

Download the OpenFlamingo 9B [model](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b/tree/main), supply the paths in `/bash/of_eval_9B.sh`, and run

```shell
./bash/of_eval_9B.sh
```

Some non-standard annotation files are supplied [here](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX) and [here](https://github.com/mlfoundations/open_flamingo/tree/main/open_flamingo/eval/data).

### VLM Stealthy Targeted Attacks

For targeted attacks on COCO, run

```shell
./bash/llava_eval_targeted.sh
```

For targeted attacks on self-selected images, set the images and target captions in `vlm_eval/run_evaluation_qualitative.py`, then run

```shell
python -m vlm_eval.run_evaluation_qualitative --precision float32 --attack apgd --eps 2 --steps 10000 --vlm_model_name llava --vision_encoder_pretrained openai --verbose
```

With 10,000 iterations, the attack takes about 2 hours per image on an A100 GPU.
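Conceptually, what gets set in `vlm_eval/run_evaluation_qualitative.py` is a pair of aligned lists, one image per target caption. The variable names and paths below are hypothetical placeholders; check the script itself for the actual ones:

```python
# Hypothetical illustration -- names and paths are placeholders, not the script's API.
images = ["assets/dog.jpg", "assets/street.jpg"]
target_captions = [
    "A cat sleeping on a sofa",
    "An empty road at night",
]
assert len(images) == len(target_captions)  # one target caption per image
```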

### POPE

```shell
./bash/eval_pope.sh openai   # clean model evaluation
./bash/eval_pope.sh          # robust model evaluation; add path_to_ckpt in the bash file
```

### SQA

```shell
./bash/eval_scienceqa.sh openai   # clean model evaluation
./bash/eval_scienceqa.sh          # robust model evaluation; add path_to_ckpt in the bash file
```