| <!--- | |
| Copyright 2022 The HuggingFace Team. All rights reserved. | |
| Licensed under the Apache License, Version 2.0 (the "License"); | |
| you may not use this file except in compliance with the License. | |
| You may obtain a copy of the License at | |
| http://www.apache.org/licenses/LICENSE-2.0 | |
| Unless required by applicable law or agreed to in writing, software | |
| distributed under the License is distributed on an "AS IS" BASIS, | |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
| See the License for the specific language governing permissions and | |
| limitations under the License. | |
| --> | |
| # VisionTextDualEncoder and CLIP model training examples | |
| The following example showcases how to train a CLIP-like vision-text dual encoder model | |
| using a pre-trained vision and text encoder. | |
| Such a model can be used for natural language image search and potentially zero-shot image classification. | |
| The model is inspired by [CLIP](https://openai.com/blog/clip/), introduced by Alec Radford et al. | |
| The idea is to train a vision encoder and a text encoder jointly to project the representation of images and their | |
| captions into the same embedding space, such that the caption embeddings are located near the embeddings | |
| of the images they describe. | |
| ### Download COCO dataset (2017) | |
| This example uses COCO dataset (2017) through a custom dataset script, which requires users to manually download the | |
| COCO dataset before training. | |
| ```bash | |
| mkdir data | |
| cd data | |
| wget http://images.cocodataset.org/zips/train2017.zip | |
| wget http://images.cocodataset.org/zips/val2017.zip | |
| wget http://images.cocodataset.org/zips/test2017.zip | |
| wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip | |
| wget http://images.cocodataset.org/annotations/image_info_test2017.zip | |
| cd .. | |
| ``` | |
| Having downloaded COCO dataset manually you should be able to load with the `ydshieh/coc_dataset_script` dataset loading script: | |
| ```py | |
| import os | |
| import datasets | |
| COCO_DIR = os.path.join(os.getcwd(), "data") | |
| ds = datasets.load_dataset("ydshieh/coco_dataset_script", "2017", data_dir=COCO_DIR) | |
| ``` | |
| ### Create a model from a vision encoder model and a text encoder model | |
| Next, we create a [VisionTextDualEncoderModel](https://huggingface.co/docs/transformers/model_doc/vision-text-dual-encoder#visiontextdualencoder). | |
| The `VisionTextDualEncoderModel` class lets you load any vision and text encoder model to create a dual encoder. | |
| Here is an example of how to load the model using pre-trained vision and text models. | |
| ```python3 | |
| from transformers import ( | |
| VisionTextDualEncoderModel, | |
| VisionTextDualEncoderProcessor, | |
| AutoTokenizer, | |
| AutoImageProcessor | |
| ) | |
| model = VisionTextDualEncoderModel.from_vision_text_pretrained( | |
| "openai/clip-vit-base-patch32", "FacebookAI/roberta-base" | |
| ) | |
| tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base") | |
| image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32") | |
| processor = VisionTextDualEncoderProcessor(image_processor, tokenizer) | |
| # save the model and processor | |
| model.save_pretrained("clip-roberta") | |
| processor.save_pretrained("clip-roberta") | |
| ``` | |
| This loads both the text and vision encoders using pre-trained weights, the projection layers are randomly | |
| initialized except for CLIP's vision model. If you use CLIP to initialize the vision model then the vision projection weights are also | |
| loaded using the pre-trained weights. | |
| ### Train the model | |
| Finally, we can run the example script to train the model: | |
| ```bash | |
| python examples/pytorch/contrastive-image-text/run_clip.py \ | |
| --output_dir ./clip-roberta-finetuned \ | |
| --model_name_or_path ./clip-roberta \ | |
| --data_dir $PWD/data \ | |
| --dataset_name ydshieh/coco_dataset_script \ | |
| --dataset_config_name=2017 \ | |
| --image_column image_path \ | |
| --caption_column caption \ | |
| --remove_unused_columns=False \ | |
| --do_train --do_eval \ | |
| --per_device_train_batch_size="64" \ | |
| --per_device_eval_batch_size="64" \ | |
| --learning_rate="5e-5" --warmup_steps="0" --weight_decay 0.1 \ | |
| --overwrite_output_dir \ | |
| --push_to_hub | |
| ``` | |