---
license: mit
tags:
- image-captioning
- clip
- gpt2
- vision-language
---

# CLIP Prefix Caption Model - COCO

This model generates captions for images by combining CLIP image embeddings with a GPT-2 language model.

## Model Details

- **Model Type**: CLIP Prefix Caption
- **Dataset**: COCO
- **Prefix Length**: 10
- **CLIP Model**: ViT-B/32
- **Language Model**: GPT-2

## Usage

```python
from huggingface_hub import hf_hub_download
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import clip

# Download the checkpoint from the Hub
checkpoint_path = hf_hub_download(
    repo_id="Hamza66628/clip-prefix-caption-coco",
    filename="model.pt"
)
checkpoint = torch.load(checkpoint_path, map_location="cpu")

# Initialize the model with the same architecture used during training
# (a sketch of ClipCaptionModel is given below)
model = ClipCaptionModel(prefix_length=10)
model.load_state_dict(checkpoint, strict=False)
model.eval()

# Generate a caption
# (See full usage in the notebook)
```
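
The snippet above assumes a `ClipCaptionModel` class matching the training architecture. A minimal sketch following the ClipCap MLP-mapping design is shown below; the attribute names (`clip_project`, `gpt`), the layer sizes, and `clip_dim=512` (the ViT-B/32 embedding width) are assumptions that must match the training code for `load_state_dict` to restore all weights:

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class ClipCaptionModel(nn.Module):
    """Maps a CLIP image embedding to a prefix of GPT-2 token embeddings."""

    def __init__(self, prefix_length: int = 10, clip_dim: int = 512):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt = GPT2LMHeadModel.from_pretrained("gpt2")
        self.gpt_embedding_size = self.gpt.transformer.wte.weight.shape[1]
        # MLP that projects one CLIP vector to prefix_length GPT-2 embeddings
        hidden = (self.gpt_embedding_size * prefix_length) // 2
        self.clip_project = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, self.gpt_embedding_size * prefix_length),
        )

    def forward(self, tokens, prefix, mask=None):
        # Embed the caption tokens and prepend the projected CLIP prefix
        embedding_text = self.gpt.transformer.wte(tokens)
        prefix_projections = self.clip_project(prefix).view(
            -1, self.prefix_length, self.gpt_embedding_size
        )
        embedding_cat = torch.cat((prefix_projections, embedding_text), dim=1)
        return self.gpt(inputs_embeds=embedding_cat, attention_mask=mask)
```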
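
To actually produce a caption, the image is encoded with CLIP, projected into prefix embeddings, and decoded token by token with GPT-2. The following is a rough greedy-decoding sketch that continues from the loading snippet above (it reuses `model`; `example.jpg` is a placeholder path, and the hosted notebook remains the authoritative reference):

```python
from PIL import Image
import clip
import torch
from transformers import GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = model.to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    # Project the CLIP image embedding into the GPT-2 prefix
    prefix = clip_model.encode_image(image).float()
    embed = model.clip_project(prefix).view(1, model.prefix_length, -1)

    # Greedy decoding: repeatedly append the most likely next token
    generated = []
    for _ in range(40):
        logits = model.gpt(inputs_embeds=embed).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_token.item() == tokenizer.eos_token_id:
            break
        generated.append(next_token.item())
        embed = torch.cat((embed, model.gpt.transformer.wte(next_token)), dim=1)

print(tokenizer.decode(generated))
```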

## Citation

If you use this model, please cite the original CLIP Prefix Caption paper (ClipCap: CLIP Prefix for Image Captioning, Mokady et al., 2021, arXiv:2111.09734).