```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForImageTextToText

tokenizer = AutoTokenizer.from_pretrained("tarekziade/test-push")
model = AutoModelForImageTextToText.from_pretrained("tarekziade/test-push")
```
# distilvit

This model is a work in progress. It is a fine-tuned combination of these base models:

- a ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
- a distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
This model was trained on:

- Flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k
- COCO 2017: https://cocodataset.org
You can get that checkpoint at commit `3083a3cef6e3c8dd90df3f088074bbe836b0f403`.

It was then further fine-tuned on:

- Flickr30k debiased: https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions
- DocOrNot: https://huggingface.co/datasets/Mozilla/docornot
You can find the code used to create the model here: https://github.com/mozilla/distilvit
## Framework versions

- Transformers 4.40.2
- PyTorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
## Evaluation results

Self-reported on nlphuji/flickr30k:

| Metric     | Value  |
|------------|--------|
| ROUGE-1    | 43.006 |
| ROUGE-2    | 16.994 |
| ROUGE-L    | 38.892 |
| ROUGE-LSUM | 38.888 |
| loss       | 0.199  |
| gen_len    | 11.327 |
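For context on the scores above, here is a minimal sketch of how a ROUGE-L F-measure can be computed. ROUGE-L scores the longest common subsequence (LCS) between a generated caption and a reference caption; this toy implementation over whitespace tokens (with beta = 1, no stemming) is an illustration only, not the exact scorer used to produce the numbers reported here.

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    # F-measure form of ROUGE-L over whitespace tokens (simplified: beta = 1).
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: LCS is "a dog in park" (length 4) out of 6 and 7 tokens.
score = rouge_l("a dog runs in the park", "a dog is running in a park")
print(round(score, 3))  # → 0.615
```

Production evaluations typically use a library such as `rouge_score` with stemming and sentence splitting (for ROUGE-LSUM), so exact values will differ from this sketch.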
```python
# Use a pipeline as a high-level helper
# Warning: Pipeline type "image-to-text" is no longer supported in transformers v5.
# You must load the model directly (see the snippet above) or downgrade to v4.x with:
#   pip install "transformers<5.0.0"
from transformers import pipeline

pipe = pipeline("image-to-text", model="tarekziade/test-push")
```