```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForImageTextToText

tokenizer = AutoTokenizer.from_pretrained("tarekziade/test-push")
model = AutoModelForImageTextToText.from_pretrained("tarekziade/test-push")
```
# distilvit

This model is a work in progress. It is a fine-tuned combination of these base models:

- a ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
- a distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
This model was trained on:

- Flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k
- COCO 2017: https://cocodataset.org
You can get that checkpoint at commit `3083a3cef6e3c8dd90df3f088074bbe836b0f403`.

It was then further fine-tuned on:

- Flickr30k debiased: https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions
- DocOrNot: https://huggingface.co/datasets/Mozilla/docornot
You can find the code used to create the model here: https://github.com/mozilla/distilvit
## Framework versions

- Transformers 4.40.2
- PyTorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
## Evaluation results

Self-reported on nlphuji/flickr30k:

| Metric     | Value  |
|------------|--------|
| ROUGE-1    | 43.006 |
| ROUGE-2    | 16.994 |
| ROUGE-L    | 38.892 |
| ROUGE-LSUM | 38.888 |
| loss       | 0.199  |
| gen_len    | 11.327 |
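For context on the scores above, here is a minimal sketch of how a ROUGE-L F-measure can be computed. ROUGE-L scores the longest common subsequence (LCS) between a generated caption and a reference caption; this toy implementation over whitespace tokens (with beta = 1, no stemming) is an illustration only, not the exact scorer used to produce the numbers reported here.

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    # F-measure form of ROUGE-L over whitespace tokens (simplified: beta = 1).
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: LCS is "a dog in park" (length 4) out of 6 and 7 tokens.
score = rouge_l("a dog runs in the park", "a dog is running in a park")
print(round(score, 3))  # → 0.615
```

Production evaluations typically use a library such as `rouge_score` with stemming and sentence splitting (for ROUGE-LSUM), so exact values will differ from this sketch.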
```python
# Use a pipeline as a high-level helper
# Warning: Pipeline type "image-to-text" is no longer supported in transformers v5.
# You must load the model directly (see the snippet above) or downgrade to v4.x with:
#   pip install "transformers<5.0.0"
from transformers import pipeline

pipe = pipeline("image-to-text", model="tarekziade/test-push")
```