---
license: mit
tags:
- image-captioning
- clip
- gpt2
- vision-language
---

# CLIP Prefix Caption Model - COCO

This model generates captions for images by feeding CLIP image embeddings as a prefix to a GPT-2 language model.

## Model Details

- **Model Type**: CLIP Prefix Caption
- **Dataset**: COCO
- **Prefix Length**: 10
- **CLIP Model**: ViT-B/32
- **Language Model**: GPT-2

## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# Download the trained checkpoint from the Hub
checkpoint_path = hf_hub_download(
    repo_id="Hamza66628/clip-prefix-caption-coco",
    filename="model.pt"
)
checkpoint = torch.load(checkpoint_path, map_location="cpu")

# Initialize the model with the same architecture used for training
# (a ClipCaptionModel sketch is provided below)
model = ClipCaptionModel(prefix_length=10)
model.load_state_dict(checkpoint, strict=False)
model.eval()

# Generate a caption
# (see full usage in the notebook)
```

## Citation

If you use this model, please cite the original CLIP Prefix Caption paper.
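## Model Architecture Sketch

The checkpoint does not ship with a `ClipCaptionModel` definition. The sketch below is a minimal, hypothetical implementation assuming the common ClipCap-style design: an MLP maps the 512-dimensional CLIP ViT-B/32 image embedding to a length-10 sequence of GPT-2 token embeddings. If the checkpoint was trained with a different mapping network (e.g., a transformer mapper), `strict=False` will silently skip mismatched weights, so verify against the training notebook.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class ClipCaptionModel(nn.Module):
    """Minimal ClipCap-style model: an MLP projects a CLIP image
    embedding to `prefix_length` GPT-2 embeddings used as a prefix."""

    def __init__(self, prefix_length: int = 10, clip_dim: int = 512):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt = GPT2LMHeadModel.from_pretrained("gpt2")
        gpt_dim = self.gpt.transformer.wte.weight.shape[1]  # 768 for gpt2
        # Hypothetical MLP mapping network; the actual checkpoint may differ.
        self.clip_project = nn.Sequential(
            nn.Linear(clip_dim, (gpt_dim * prefix_length) // 2),
            nn.Tanh(),
            nn.Linear((gpt_dim * prefix_length) // 2, gpt_dim * prefix_length),
        )

    def forward(self, clip_embed: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_length, gpt_dim)
        batch_size = clip_embed.shape[0]
        gpt_dim = self.gpt.transformer.wte.weight.shape[1]
        return self.clip_project(clip_embed).view(
            batch_size, self.prefix_length, gpt_dim
        )
```

With that definition loaded, captions can be generated by encoding an image with CLIP and greedily decoding from the prefix. The file name `example.jpg` and the 40-token limit below are placeholders:

```python
import clip
from PIL import Image
from transformers import GPT2Tokenizer

device = "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    clip_embed = clip_model.encode_image(image).float()
    embeds = model(clip_embed)  # (1, prefix_length, gpt_dim) prefix

    tokens = []
    for _ in range(40):  # cap the caption length
        logits = model.gpt(inputs_embeds=embeds).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens.append(next_token.item())
        if tokens[-1] == tokenizer.encode(".")[0]:  # stop at a period
            break
        # Append the new token's embedding and continue decoding
        embeds = torch.cat(
            [embeds, model.gpt.transformer.wte(next_token)], dim=1
        )

print(tokenizer.decode(tokens))
```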