# RobustVLM (Foundation Models) via Object-centric Learning
## Table of Contents
- [Installation](#installation)
- [Stage1: Get Object-centric Models](#stage1-get-object-centric-models)
- [Dataset](#dataset)
- [Training](#training)
- [Stage2: Training and Evaluation with Object-centric Representations](#stage2-training-and-evaluation-with-object-centric-representations)
- [Evaluation](#evaluation)
## Installation
Create and activate an Anaconda environment:
```shell
conda create -n robustclip python==3.11
```
```shell
conda activate robustclip
```
The code is tested with Python 3.11. To install the required packages, run:
```shell
pip install -r requirements.txt
```
To install open_clip_torch locally, run:
```shell
cd ./open_clip_torch
```
```shell
python setup.py develop
```
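As an optional sanity check, you can confirm that Python resolves the locally installed (editable) copy rather than a release from PyPI; since `setup.py develop` installs in editable mode, the printed path should point into `./open_clip_torch`:
```shell
# Should print a path inside ./open_clip_torch, confirming the editable install is picked up
python -c "import open_clip; print(open_clip.__file__)"
```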
## Stage1: Get Object-centric Models
### Dataset
Prepare the ImageNet dataset in a torchvision `ImageFolder`-style layout:
```
dataset_path
└─imagenet
    └─train
        └─n01440764
            xxxxxx.JPEG
            .....
        └─......
    └─val
        └─n04254680
            xxxxxx.JPEG
            .....
        └─......
```
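As a rough check of the layout (assuming the standard ImageNet-1k split with 1000 synsets), the `train` and `val` directories should each contain one subfolder per class:
```shell
# Both counts should be 1000 for a full ImageNet-1k setup
ls dataset_path/imagenet/train | wc -l
ls dataset_path/imagenet/val | wc -l
```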
### Training
- Slot-Attention on 4 GPUs
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m train.training_clip_slots --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /.../.../dataset_path/imagenet --template std --output_normalize False --steps 1000000 --warmup 10000 --batch_size 128 --loss l2 --opt adamw --lr 5e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir ./output_slots --experiment_name SLOTS --log_freq 1000 --eval_freq 1000
```
The slot-attention reconstruction results and checkpoints are stored in `./output_slots/ViT-L-14_openai_imagenet_l2_imagenet_SLOTS_xxxxx`.
## Stage2: Training and Evaluation with Object-centric Representations
- SlotVLM<sup>4</sup>
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m train.adversarial_training_clip_with_object_token --clip_model_name ViT-L-14 --slots_ckp ./ckps/model_slots_step_300000.pt --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize False --steps 20000 --warmup 1400 --batch_size 128 --loss l2 --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir ./output --experiment_name with_OT --log_freq 10 --eval_freq 10
```
Set `--eps 2` to obtain SlotVLM<sup>2</sup> models.

To resume training, add parameters such as `--optimizer_state /xxx/checkpoints/fallback_80000_opt.pt --start_step 80000 --pretrained none`, for example:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m train.adversarial_training_clip_with_object_token --clip_model_name ViT-L-14 --slots_ckp ./ckps/model_slots_step_300000.pt --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize False --steps 20000 --warmup 1400 --batch_size 128 --loss l2 --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir ./output --experiment_name with_OT --log_freq 10 --eval_freq 10 --optimizer_state /home/xxx/RobustVLM/output/ViT-L-14_openai_imagenet_l2_imagenet_with_Object_Token_xxxxx/checkpoints/fallback_80000_opt.pt --start_step 80000 --pretrained none
```
## Evaluation
Make sure the files in the `bash` directory are executable: `chmod +x bash/*`
### CLIP ImageNet
```shell
python -m CLIP_eval.clip_robustbench --clip_model_name ViT-L-14 --pretrained /path/to/ckpt.pt --dataset imagenet --imagenet_root /path/to/imagenet --wandb False --norm linf --eps 2
```
Note that `--pretrained` must point to the checkpoint you want to evaluate, and `--eps` should be set to 2 or 4 for SlotVLM<sup>2</sup> and SlotVLM<sup>4</sup> models, respectively.
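For instance, a SlotVLM<sup>4</sup> checkpoint is evaluated at its training radius by changing only the `--eps` value (the checkpoint path below is a placeholder):
```shell
# Placeholder checkpoint path; substitute your SlotVLM^4 checkpoint
python -m CLIP_eval.clip_robustbench --clip_model_name ViT-L-14 --pretrained /path/to/slotvlm4_ckpt.pt --dataset imagenet --imagenet_root /path/to/imagenet --wandb False --norm linf --eps 4
```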
### CLIP Zero-Shot
Set the models to be evaluated in `CLIP_benchmark/benchmark/models.txt` and the datasets in `CLIP_benchmark/benchmark/datasets.txt`
(the datasets are downloaded from HuggingFace); an illustrative sketch of both files is given after the command below. Then run
```shell
cd CLIP_benchmark
./bash/run_benchmark_adv.sh
```
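The exact per-line format expected by the benchmark scripts should be checked against the entries already present in those files; as a purely hypothetical illustration, `models.txt` pairs a CLIP architecture with either `openai` or a checkpoint path, and `datasets.txt` lists one dataset per line:
```shell
# Hypothetical entries -- verify against the format already used in these files
printf 'ViT-L-14,openai\nViT-L-14,/path/to/finetuned_clip.pt\n' > CLIP_benchmark/benchmark/models.txt
printf 'wds/imagenet1k\nwds/vtab/cifar10\n' > CLIP_benchmark/benchmark/datasets.txt
```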
### VLM Captioning and VQA
#### LLaVA
In `bash/llava_eval.sh`, supply the paths for the datasets. The required annotation files for the datasets can be obtained from this [HuggingFace repository](https://huggingface.co/datasets/openflamingo/eval_benchmark/tree/main).
Set `--vision_encoder_pretrained` to `openai` or supply the path to a fine-tuned CLIP model checkpoint.
Then run
```shell
./bash/llava_eval.sh
```
The LLaVA model will be automatically downloaded from HuggingFace.
#### OpenFlamingo
Download the OpenFlamingo 9B [model](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b/tree/main), supply the paths in `bash/of_eval_9B.sh`, and run
```shell
./bash/of_eval_9B.sh
```
Some non-standard annotation files are supplied [here](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX) and [here](https://github.com/mlfoundations/open_flamingo/tree/main/open_flamingo/eval/data).
### VLM Stealthy Targeted Attacks
For targeted attacks on COCO, run
```shell
./bash/llava_eval_targeted.sh
```
For targeted attacks on self-selected images, set the images and target captions in `vlm_eval/run_evaluation_qualitative.py` and run
```shell
python -m vlm_eval.run_evaluation_qualitative --precision float32 --attack apgd --eps 2 --steps 10000 --vlm_model_name llava --vision_encoder_pretrained openai --verbose
```
With 10,000 iterations it takes about 2 hours per image on an A100 GPU.
### POPE
```shell
./bash/eval_pope.sh openai # for clean model evaluation
./bash/eval_pope.sh # for robust model evaluation - add path_to_ckpt in bash file
```
### SQA
```shell
./bash/eval_scienceqa.sh openai # for clean model evaluation
./bash/eval_scienceqa.sh # for robust model evaluation - add path_to_ckpt in bash file
```