Mozilla/distilvit

@@ -1,66 +1,82 @@
----
-license: apache-2.0
-base_model: mozilla/distilvit
-tags:
-- generated_from_trainer
-metrics:
-- rouge
-model-index:
-- name: distilvit
-  results: []
----
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# distilvit
-This model is a fine-tuned version of [mozilla/distilvit](https://huggingface.co/mozilla/distilvit) on an unknown dataset.
-It achieves the following results on the evaluation set:
-- Gen Len: 10.6487
-- Loss: 0.1739
-- Meteor: 0.4120
-- Rouge1: 50.0916
-- Rouge2: 24.7223
-- Rougel: 46.9416
-- Rougelsum: 46.9372
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 5e-05
-- train_batch_size: 100
-- eval_batch_size: 100
-- seed: 42
-- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
-- lr_scheduler_type: linear
-- num_epochs: 1
-### Training results
-| Training Loss | Epoch  | Step | Gen Len | Validation Loss | Meteor | Rouge1  | Rouge2  | Rougel  | Rougelsum |
-|:-------------:|:------:|:----:|:-------:|:---------------:|:------:|:-------:|:-------:|:-------:|:---------:|
-| No log        | 0.3891 | 100  | 10.4163 | 0.1764          | 0.4117 | 50.0198 | 24.6331 | 46.9071 | 46.8907   |
-| No log        | 0.7782 | 200  | 10.6487 | 0.1739          | 0.4120 | 50.0916 | 24.7223 | 46.9416 | 46.9372   |
-### Framework versions
-- Transformers 4.40.2
-- Pytorch 2.3.0+cu121
-- Datasets 2.19.1
-- Tokenizers 0.19.1

+---
+tags:
+- image-to-text
+- image-captioning
+license: apache-2.0
+metrics:
+- rouge
+datasets:
+- nlphuji/flickr30k
+widget:
+- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
+  example_title: Savanna
+- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
+  example_title: Football Match
+- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
+  example_title: Airport
+base_model:
+- google/vit-base-patch16-224-in21k
+model-index:
+- name: mozilla/distilvit
+  results:
+  - task:
+      type: image-to-text
+      name: Image To Text
+    dataset:
+      name: nlphuji/flickr30k
+      type: nlphuji/flickr30k
+    metrics:
+    - name: ROUGE-1
+      type: rouge
+      value: 43.006
+      verified: true
+    - name: ROUGE-2
+      type: rouge
+      value: 16.9939
+      verified: true
+    - name: ROUGE-L
+      type: rouge
+      value: 38.8923
+      verified: true
+    - name: ROUGE-LSUM
+      type: rouge
+      value: 38.8877
+      verified: true
+    - name: loss
+      type: loss
+      value: 0.19939416646957397
+    - name: gen_len
+      type: gen_len
+      value: 11.327256736227712
+      verified: true
+---
+# distilvit
+This model is a work in progress. It is a fine-tuned version of the following base models:
+- a ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
+- a distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
+This model was trained on:
+- Flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k
+- COCO 2017: https://cocodataset.org
+That checkpoint is available at commit 3083a3cef6e3c8dd90df3f088074bbe836b0f403.
+It was then further fine-tuned on:
+- Flickr30k debiased: https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions
+- DocOrNot: https://huggingface.co/datasets/Mozilla/docornot
+You can find the code used to create the model here: https://github.com/mozilla/distilvit
+### Framework versions
+- Transformers 4.40.2
+- Pytorch 2.3.0+cu121
+- Datasets 2.19.1
+- Tokenizers 0.19.1
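
The updated card names a specific commit for the earlier Flickr30k + COCO checkpoint but does not show how to load it. The following is a minimal sketch, not part of the diff, assuming the model is published under the repo id `mozilla/distilvit` (taken from the `model-index` name) and that it works with the standard `image-to-text` pipeline in Transformers. The helper names `pipeline_kwargs` and `caption` are hypothetical and exist only for illustration.

```python
from typing import Optional

# Assumption: repo id taken from the model-index name in the card.
MODEL_ID = "mozilla/distilvit"

# Commit quoted in the card for the checkpoint trained on Flickr30k and COCO,
# before the further fine-tuning on the debiased captions and DocOrNot.
BASE_CHECKPOINT_REVISION = "3083a3cef6e3c8dd90df3f088074bbe836b0f403"


def pipeline_kwargs(revision: Optional[str] = None) -> dict:
    """Build the keyword arguments passed to transformers.pipeline()."""
    kwargs = {"task": "image-to-text", "model": MODEL_ID}
    if revision is not None:
        # Pin the Hub download to a specific commit, branch, or tag.
        kwargs["revision"] = revision
    return kwargs


def caption(image: str, revision: Optional[str] = None) -> str:
    """Caption one image (local path or URL) and return the generated text."""
    # Imported lazily so pipeline_kwargs() can be used without transformers installed.
    from transformers import pipeline

    captioner = pipeline(**pipeline_kwargs(revision))
    outputs = captioner(image)
    return outputs[0]["generated_text"]
```

Calling `caption(url)` with one of the widget image URLs from the card should use the latest weights, while `caption(url, revision=BASE_CHECKPOINT_REVISION)` should pin the download to the earlier checkpoint. Both calls fetch the model from the Hub, so they need network access.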