ayushman72 committed on
Commit 2f9fa75 · verified · 1 Parent(s): 7657166

Push model using huggingface_hub.

Files changed (3):
  1. README.md +6 -64
  2. config.json +12 -0
  3. model.safetensors +3 -0
README.md CHANGED
@@ -1,68 +1,10 @@
  ---
- language:
- - en
- metrics:
- - bleu
- - meteor
- base_model:
- - openai-community/gpt2
- - google/vit-base-patch16-224
  tags:
- - image captioning
- - vit
- - gpt
- - gpt2
- - torch
- datasets:
- - nlphuji/flickr30k
  ---
- # Image Captioning using the ViT and GPT2 architectures
-
- This is my attempt to build a transformer model that takes an image as input and generates a caption for it.
-
- ## Model Architecture
- The model comprises 12 ViT encoder layers and 12 GPT2 decoder layers.
-
- ![Model Architecture](images/model.png)
-
- ## Training
- The model was trained on the Flickr30k dataset, which contains 30k images with 5 captions each.
- It was trained for 8 epochs, which took 10 hours on Kaggle's P100 GPU.
-
- ## Results
- The model achieved a BLEU-4 score of 0.2115, a CIDEr score of 0.4, a METEOR score of 0.25, and a SPICE score of 0.19 on the Flickr8k dataset.
-
- These are the loss curves:
-
- ![Loss graph](images/loss.png)
- ![Perplexity graph](images/perplexity.png)
-
- ## Predictions
- To caption your own images, download models.py, predict.py, and requirements.txt, then run the following commands:
-
- `pip install -r requirements.txt`
-
- `python predict.py`
-
- *The first prediction takes a while because the model weights (1GB) have to be downloaded.*
-
- Here are a few example predictions on the validation dataset:
-
- ![Test 1](images/test1.png)
- ![Test 2](images/test2.png)
- ![Test 3](images/test3.png)
- ![Test 4](images/test4.png)
- ![Test 5](images/test5.png)
- ![Test 6](images/test6.png)
- ![Test 7](images/test7.png)
- ![Test 8](images/test8.png)
- ![Test 9](images/test9.png)
-
- As we can see, these are not the most amazing predictions. Performance could be improved by training the model further and by using an even bigger dataset such as MS COCO (500k captioned images).
-
- ## FAQ
-
- Check the [full notebook](./imagecaptioning.ipynb) or [Kaggle](https://www.kaggle.com/code/ayushman72/imagecaptioning)
-
- Download the [weights](https://drive.google.com/file/d/1X51wAI7Bsnrhd2Pa4WUoHIXvvhIcRH7Y/view?usp=drive_link) of the model
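The ViT-encoder / GPT2-decoder wiring described in the removed README can be sketched as follows. The layer count and width (12 decoder layers, embed dim 768, 12 heads) follow the README and config, but the class and argument names are illustrative, not this repository's actual code, and the ViT encoder output is faked here with a random tensor.

```python
import torch
import torch.nn as nn

class CaptionerSketch(nn.Module):
    """Illustrative encoder-decoder captioner; NOT this repo's code."""

    def __init__(self, embed_dim=768, depth=12, num_heads=12, vocab_size=50257):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerDecoderLayer(
            embed_dim, num_heads, dim_feedforward=4 * embed_dim, batch_first=True
        )
        # The decoder cross-attends caption tokens to image patch features
        # (the ViT output) at every layer.
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, patch_feats, caption_ids):
        # patch_feats: (B, num_patches, embed_dim), e.g. 196 patches for a
        # 224x224 image with 16x16 patches; caption_ids: (B, seq_len).
        x = self.decoder(self.tok_emb(caption_ids), memory=patch_feats)
        return self.head(x)  # (B, seq_len, vocab_size) logits

patches = torch.randn(1, 196, 768)           # stand-in for ViT encoder output
tokens = torch.zeros(1, 5, dtype=torch.long)
logits = CaptionerSketch()(patches, tokens)  # shape: (1, 5, 50257)
```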
 
  ---
+ pipeline_tag: text-to-image
  tags:
+ - model_hub_mixin
+ - pytorch_model_hub_mixin
  ---

+ This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+ - Library: https://huggingface.co/ayushman72/ImageCaptioning
+ - Docs: [More Information Needed]
config.json ADDED
@@ -0,0 +1,12 @@
+ {
+ "attention_dropout": 0.1,
+ "depth": 12,
+ "emb_dropout": 0.1,
+ "embed_dim": 768,
+ "mlp_dropout": 0.1,
+ "mlp_ratio": 4,
+ "num_heads": 12,
+ "residual_dropout": 0.1,
+ "seq_len": 1024,
+ "vocab_size": 50257
+ }
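A quick sanity check on these hyperparameters, using standard transformer arithmetic (the derived names head_dim and mlp_hidden are mine, not fields of this config). Note that vocab_size 50257 and seq_len 1024 match GPT2's tokenizer and context length, and embed_dim 768 is the hidden size of both GPT2 and ViT-Base, consistent with the base models listed in the old README.

```python
config = {
    "embed_dim": 768, "num_heads": 12, "mlp_ratio": 4,
    "depth": 12, "seq_len": 1024, "vocab_size": 50257,
}

head_dim = config["embed_dim"] // config["num_heads"]   # width of each attention head
mlp_hidden = config["embed_dim"] * config["mlp_ratio"]  # inner width of each MLP block

print(head_dim, mlp_hidden)  # 64 3072
```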
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dfe0f8b507dd74c679453125c45fd63dd5fb4c2d563f374b603dd6d6939b9a4c
+ size 1004789512