---
license: mit
tags:
  - image-captioning
  - clip
  - gpt-2
  - computer-vision
  - nlp
  - clipcap
---

# CLIP Prefix Caption - Conceptual Captions Model

An image captioning model based on CLIP and GPT-2, trained on the Conceptual Captions dataset.

## Model Details

- **Model Type:** CLIP Prefix Captioning (ClipCap)
- **Architecture:** CLIP vision encoder + MLP mapping network + GPT-2 text decoder
- **Dataset:** Conceptual Captions
- **Prefix Length:** 10 tokens
- **CLIP Model:** ViT-B/32
- **GPT-2 Model:** gpt2
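The MLP mapping step above projects a single CLIP image embedding into a sequence of prefix embeddings in GPT-2's input space. A minimal sketch of such a mapper, assuming the 512-dimensional ViT-B/32 image embedding and the 768-dimensional gpt2 hidden size (the two-layer shape mirrors the ClipCap MLP variant, but exact layer sizes here are illustrative):

```python
import torch
from torch import nn

class MLPMapper(nn.Module):
    """Maps one CLIP image embedding to `prefix_length` GPT-2 prefix embeddings.

    Dimensions are assumptions: 512 for ViT-B/32 image features,
    768 for the gpt2 hidden size, prefix length 10 as in this model card.
    """

    def __init__(self, clip_dim: int = 512, gpt_dim: int = 768, prefix_length: int = 10):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt_dim = gpt_dim
        hidden = (gpt_dim * prefix_length) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, gpt_dim * prefix_length),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_length, gpt_dim)
        out = self.mlp(clip_embedding)
        return out.view(-1, self.prefix_length, self.gpt_dim)

mapper = MLPMapper()
prefix = mapper(torch.randn(2, 512))
print(tuple(prefix.shape))  # (2, 10, 768)
```

The resulting `(prefix_length, gpt_dim)` sequence is prepended to the caption token embeddings, so GPT-2 conditions on the image without any change to its own weights.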

## Usage

See the test notebook for usage examples.
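As a rough orientation until you open the notebook, inference follows the ClipCap pattern: encode the image with CLIP, project the embedding to a 10-token prefix, then decode greedily with GPT-2. The sketch below is hypothetical and untested against this checkpoint; it assumes the `clip` package (openai/CLIP), `transformers`, Pillow, and a `ClipCaptionModel`-style wrapper with `clip_project` and `gpt` attributes as in the ClipCap repository:

```python
import torch

# Hypothetical setup, assuming the ClipCap model class from the original repo:
# model = ClipCaptionModel(prefix_length=10)
# model.load_state_dict(torch.load("model.pt", map_location="cpu"))
# model.eval()

def generate_caption(image_path, model, device="cpu", prefix_length=10, max_tokens=30):
    """Greedy caption generation sketch; attribute names (`clip_project`,
    `gpt`) follow the ClipCap reference implementation and may differ here."""
    import clip
    from PIL import Image
    from transformers import GPT2Tokenizer

    clip_model, preprocess = clip.load("ViT-B/32", device=device)
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

    with torch.no_grad():
        clip_embed = clip_model.encode_image(image).float()
        # Map the image embedding to a prefix in GPT-2's embedding space.
        generated = model.clip_project(clip_embed).view(1, prefix_length, -1)
        tokens = None
        for _ in range(max_tokens):
            logits = model.gpt(inputs_embeds=generated).logits
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            tokens = next_token if tokens is None else torch.cat((tokens, next_token), dim=1)
            if next_token.item() == tokenizer.eos_token_id:
                break
            # Feed the chosen token's embedding back in and continue.
            next_embed = model.gpt.transformer.wte(next_token)
            generated = torch.cat((generated, next_embed), dim=1)
    return tokenizer.decode(tokens.squeeze(0).tolist())
```

Decoding here is plain greedy argmax for brevity; the notebook may use beam search or nucleus sampling instead.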

## Files

- **model.pt:** Model checkpoint (a PyTorch `state_dict`, not a pickled module)

## Citation

If you use this model, please cite:

```bibtex
@article{mokady2021clipcap,
  title={ClipCap: CLIP Prefix for Image Captioning},
  author={Mokady, Ron and Hertz, Amir and Bermano, Amit H},
  journal={arXiv preprint arXiv:2111.09734},
  year={2021}
}
```