ayushman72 committed on
Commit 2f9fa75 · verified · 1 Parent(s): 7657166

Push model using huggingface_hub.

Files changed (3):
  1. README.md +6 -64
  2. config.json +12 -0
  3. model.safetensors +3 -0
README.md CHANGED
@@ -1,68 +1,10 @@
  ---
- language:
- - en
- metrics:
- - bleu
- - meteor
- base_model:
- - openai-community/gpt2
- - google/vit-base-patch16-224
  tags:
- - image captioning
- - vit
- - gpt
- - gpt2
- - torch
- datasets:
- - nlphuji/flickr30k
  ---
- # Image Captioning using the ViT and GPT2 architectures
-
- This is my attempt to build a transformer model that takes an image as input and generates a caption for it.
-
- ## Model Architecture
- The model comprises 12 ViT encoder layers and 12 GPT2 decoder layers.
-
- ![Model Architecture](images/model.png)
-
- ## Training
- The model was trained on the Flickr30k dataset, which contains 30k images with 5 captions each.
- It was trained for 8 epochs, which took 10 hours on Kaggle's P100 GPU.
-
- ## Results
- The model achieved a BLEU-4 score of 0.2115, a CIDEr score of 0.4, a METEOR score of 0.25, and a SPICE score of 0.19 on the Flickr8k dataset.
-
- These are the loss curves:
-
- ![Loss graph](images/loss.png)
- ![Perplexity graph](images/perplexity.png)
-
- ## Predictions
- To caption your own images, download models.py, predict.py, and requirements.txt, then run the following commands:
-
- `pip install -r requirements.txt`
-
- `python predict.py`
-
- *The first prediction takes a while because the model weights (1GB) have to be downloaded.*
-
- Here are a few example predictions on the validation dataset:
-
- ![Test 1](images/test1.png)
- ![Test 2](images/test2.png)
- ![Test 3](images/test3.png)
- ![Test 4](images/test4.png)
- ![Test 5](images/test5.png)
- ![Test 6](images/test6.png)
- ![Test 7](images/test7.png)
- ![Test 8](images/test8.png)
- ![Test 9](images/test9.png)
-
- As we can see, these are not the most amazing predictions. Performance could be improved by training the model further and by using an even bigger dataset such as MS COCO (500k captioned images).
-
- ## FAQ
-
- Check the [full notebook](./imagecaptioning.ipynb) or [Kaggle](https://www.kaggle.com/code/ayushman72/imagecaptioning)
-
- Download the [weights](https://drive.google.com/file/d/1X51wAI7Bsnrhd2Pa4WUoHIXvvhIcRH7Y/view?usp=drive_link) of the model
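The ViT-encoder / GPT2-decoder wiring described in the removed README can be sketched as follows. The layer count and width (12 decoder layers, embed dim 768, 12 heads) follow the README and config, but the class and argument names are illustrative, not this repository's actual code, and the ViT encoder output is faked here with a random tensor.

```python
import torch
import torch.nn as nn

class CaptionerSketch(nn.Module):
    """Illustrative encoder-decoder captioner; NOT this repo's code."""

    def __init__(self, embed_dim=768, depth=12, num_heads=12, vocab_size=50257):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerDecoderLayer(
            embed_dim, num_heads, dim_feedforward=4 * embed_dim, batch_first=True
        )
        # The decoder cross-attends caption tokens to image patch features
        # (the ViT output) at every layer.
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, patch_feats, caption_ids):
        # patch_feats: (B, num_patches, embed_dim), e.g. 196 patches for a
        # 224x224 image with 16x16 patches; caption_ids: (B, seq_len).
        x = self.decoder(self.tok_emb(caption_ids), memory=patch_feats)
        return self.head(x)  # (B, seq_len, vocab_size) logits

patches = torch.randn(1, 196, 768)           # stand-in for ViT encoder output
tokens = torch.zeros(1, 5, dtype=torch.long)
logits = CaptionerSketch()(patches, tokens)  # shape: (1, 5, 50257)
```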
 
  ---
+ pipeline_tag: text-to-image
  tags:
+ - model_hub_mixin
+ - pytorch_model_hub_mixin
  ---

+ This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+ - Library: https://huggingface.co/ayushman72/ImageCaptioning
+ - Docs: [More Information Needed]
config.json ADDED
@@ -0,0 +1,12 @@
+ {
+ "attention_dropout": 0.1,
+ "depth": 12,
+ "emb_dropout": 0.1,
+ "embed_dim": 768,
+ "mlp_dropout": 0.1,
+ "mlp_ratio": 4,
+ "num_heads": 12,
+ "residual_dropout": 0.1,
+ "seq_len": 1024,
+ "vocab_size": 50257
+ }
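A quick sanity check on these hyperparameters, using standard transformer arithmetic (the derived names head_dim and mlp_hidden are mine, not fields of this config). Note that vocab_size 50257 and seq_len 1024 match GPT2's tokenizer and context length, and embed_dim 768 is the hidden size of both GPT2 and ViT-Base, consistent with the base models listed in the old README.

```python
config = {
    "embed_dim": 768, "num_heads": 12, "mlp_ratio": 4,
    "depth": 12, "seq_len": 1024, "vocab_size": 50257,
}

head_dim = config["embed_dim"] // config["num_heads"]   # width of each attention head
mlp_hidden = config["embed_dim"] * config["mlp_ratio"]  # inner width of each MLP block

print(head_dim, mlp_hidden)  # 64 3072
```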
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dfe0f8b507dd74c679453125c45fd63dd5fb4c2d563f374b603dd6d6939b9a4c
+ size 1004789512