---
license: mit
tags:
- image-captioning
- clip
- gpt2
- vision-language
---

# CLIP Prefix Caption Model - COCO

This model generates captions for images by combining CLIP image embeddings with a GPT-2 language model.

## Model Details

- **Model Type**: CLIP Prefix Caption
- **Dataset**: COCO
- **Prefix Length**: 10
- **CLIP Model**: ViT-B/32
- **Language Model**: GPT-2

## Usage

```python
from huggingface_hub import hf_hub_download
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import clip

# Download the checkpoint from the Hub
checkpoint_path = hf_hub_download(
    repo_id="Hamza66628/clip-prefix-caption-coco",
    filename="model.pt"
)
checkpoint = torch.load(checkpoint_path, map_location="cpu")

# Initialize the model with the same architecture used during training
# (a sketch of ClipCaptionModel is given below)
model = ClipCaptionModel(prefix_length=10)
model.load_state_dict(checkpoint, strict=False)
model.eval()

# Generate a caption
# (See full usage in the notebook)
```
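
The snippet above assumes a `ClipCaptionModel` class matching the training architecture. A minimal sketch following the ClipCap MLP-mapping design is shown below; the attribute names (`clip_project`, `gpt`), the layer sizes, and `clip_dim=512` (the ViT-B/32 embedding width) are assumptions that must match the training code for `load_state_dict` to restore all weights:

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class ClipCaptionModel(nn.Module):
    """Maps a CLIP image embedding to a prefix of GPT-2 token embeddings."""

    def __init__(self, prefix_length: int = 10, clip_dim: int = 512):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt = GPT2LMHeadModel.from_pretrained("gpt2")
        self.gpt_embedding_size = self.gpt.transformer.wte.weight.shape[1]
        # MLP that projects one CLIP vector to prefix_length GPT-2 embeddings
        hidden = (self.gpt_embedding_size * prefix_length) // 2
        self.clip_project = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, self.gpt_embedding_size * prefix_length),
        )

    def forward(self, tokens, prefix, mask=None):
        # Embed the caption tokens and prepend the projected CLIP prefix
        embedding_text = self.gpt.transformer.wte(tokens)
        prefix_projections = self.clip_project(prefix).view(
            -1, self.prefix_length, self.gpt_embedding_size
        )
        embedding_cat = torch.cat((prefix_projections, embedding_text), dim=1)
        return self.gpt(inputs_embeds=embedding_cat, attention_mask=mask)
```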
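
To actually produce a caption, the image is encoded with CLIP, projected into prefix embeddings, and decoded token by token with GPT-2. The following is a rough greedy-decoding sketch that continues from the loading snippet above (it reuses `model`; `example.jpg` is a placeholder path, and the hosted notebook remains the authoritative reference):

```python
from PIL import Image
import clip
import torch
from transformers import GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = model.to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    # Project the CLIP image embedding into the GPT-2 prefix
    prefix = clip_model.encode_image(image).float()
    embed = model.clip_project(prefix).view(1, model.prefix_length, -1)

    # Greedy decoding: repeatedly append the most likely next token
    generated = []
    for _ in range(40):
        logits = model.gpt(inputs_embeds=embed).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_token.item() == tokenizer.eos_token_id:
            break
        generated.append(next_token.item())
        embed = torch.cat((embed, model.gpt.transformer.wte(next_token)), dim=1)

print(tokenizer.decode(generated))
```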

## Citation

If you use this model, please cite the original CLIP Prefix Caption paper (ClipCap: CLIP Prefix for Image Captioning, Mokady et al., 2021, arXiv:2111.09734).