# Model Card for vit-gpt2-image-captioning

## Model Details
This model is a VisionEncoderDecoderModel that pairs a ViT image encoder with a GPT-2 text decoder to generate image captions. It was fine-tuned with additional context information supplied alongside each image to help it produce more meaningful captions.
- **Base Model**: nlpconnect/vit-gpt2-image-captioning
- **Processor**: ViTImageProcessor
- **Tokenizer**: GPT-2 Tokenizer
## Intended Use

This model is intended for generating captions for stock-related images. An initial textual context is supplied at generation time to steer the model toward more accurate descriptions.
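
The card does not spell out how the context is supplied. A minimal sketch of one plausible scheme follows, assuming the context is tokenized and passed as `decoder_input_ids` so the GPT-2 decoder continues the caption from it; the repository id, image path, and context string are all placeholders:

```python
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

repo = "your_username/your_model_name"  # placeholder repository id
model = VisionEncoderDecoderModel.from_pretrained(repo)
processor = ViTImageProcessor.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Assumption: the context is prepended as a decoder prompt, so generation
# continues from it; the actual fine-tuning scheme may differ.
context_ids = tokenizer("A stock photo of", return_tensors="pt").input_ids

output_ids = model.generate(pixel_values, decoder_input_ids=context_ids, max_length=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```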
## Limitations

- The model might generate incorrect or biased descriptions depending on the input image or context.
- It requires specific context inputs for the best performance.
## How to Use
```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# "your_username/your_model_name" is a placeholder; substitute the actual Hub repo id.
model = VisionEncoderDecoderModel.from_pretrained("your_username/your_model_name")
processor = ViTImageProcessor.from_pretrained("your_username/your_model_name")
tokenizer = AutoTokenizer.from_pretrained("your_username/your_model_name")
```
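
Once the model, processor, and tokenizer are loaded, plain captioning (without a context prompt) might look like the following; the image path and generation settings are illustrative:

```python
from PIL import Image

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Illustrative generation settings; tune max_length / num_beams as needed.
output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
caption = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(caption)
```

To supply a context, pass the tokenized context text as `decoder_input_ids`, as sketched in the Intended Use section above.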
## License

This model is licensed under the same terms as the original nlpconnect/vit-gpt2-image-captioning.