---
language:
- en
license: apache-2.0
tags:
- image-captioning
- multimodal
- vision-language
- qwen2
- siglip
datasets:
- merve/coco
pipeline_tag: image-to-text
---

# Capri

Capri is a compact image captioning model designed for high-throughput, plain-language descriptions.
It supports two inference paths: direct image input or precomputed SigLIP2 pooled embeddings.

The project started from a practical pipeline constraint: existing captioning models were either too slow or too weak for reliable image understanding. That constraint sparked the idea for Capri: since SigLIP embeddings were already computed upstream, why not pair them with a small LLM decoder and get both strong visual representations and fast text generation?

The name comes from the small Italian island of Capri and also hints at the goal of the project: a small CAPtioner with Rapid Inference.

## Model Architecture

- Vision encoder: `google/siglip2-base-patch16-224` (pooled embeddings)
- Projector: MLP `768 -> 3072 -> 896`
- Decoder: `Qwen/Qwen2.5-0.5B`
- Adaptation: LoRA on `q_proj` and `v_proj`

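The projector is a small MLP bridging the two embedding spaces. A minimal sketch of its shape, assuming a GELU activation between the linear layers (the activation and layer structure here are illustrative, not the released weights):

```python
import torch
import torch.nn as nn

# Illustrative sketch of the projector described above: it maps a SigLIP2
# pooled embedding (768-d) into the Qwen2.5-0.5B hidden size (896-d)
# through a 3072-d intermediate layer. The GELU activation is an assumption.
projector = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 896),
)

pooled = torch.randn(2, 768)   # a batch of two pooled embeddings
prefix = projector(pooled)     # shape (2, 896), consumed by the decoder
```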
## Load Modes

Embedding-only mode keeps SigLIP out of downloads and VRAM:

```python
from transformers import AutoModel, AutoProcessor
import torch

processor = AutoProcessor.from_pretrained("Ligul/capri", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Ligul/capri",
    trust_remote_code=True,
    load_vision_tower=False,
    torch_dtype=torch.bfloat16,
)

inputs = processor(
    pooled_embeddings=torch.randn(2, 768),
    return_tensors="pt",
)
captions = model.generate_captions(
    pooled_embeddings=inputs["pooled_embeddings"],
    processor=processor,
    max_new_tokens=32,
    decode_batch_size=2048,
)
```

Image mode loads SigLIP lazily:

```python
from PIL import Image
from transformers import AutoModel, AutoProcessor
import torch

processor = AutoProcessor.from_pretrained("Ligul/capri", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Ligul/capri",
    trust_remote_code=True,
    load_vision_tower=True,
    torch_dtype=torch.bfloat16,
)

image = Image.open("example.jpg").convert("RGB")
captions = model.generate_captions(
    images=[image],
    processor=processor,
    max_new_tokens=32,
    vision_batch_size=64,
    decode_batch_size=2048,
)
```

The standard `generate()` method is still available if you need raw token ids rather than decoded captions.

## Batch Guidance

Use different knobs for the two stages:

- `vision_batch_size`: keep moderate; image preprocessing plus the SigLIP forward pass is the expensive vision stage
- `decode_batch_size`: set much larger; pooled embeddings are tiny, and Qwen generation batches well

Reasonable defaults:

- `vision_batch_size=64`
- `decode_batch_size=1024`

On larger GPUs, decoding often scales to `2048+`.

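The effect of the two knobs is simply different chunk sizes over the same inputs: the vision stage walks the images in small chunks while decoding sweeps the pooled embeddings in one large pass. A plain-Python sketch (the `batched` helper is illustrative, not part of the Capri API):

```python
def batched(items, batch_size):
    """Yield successive chunks of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 200 inputs: four vision passes at batch 64, but a single decode pass at 1024
inputs = list(range(200))
vision_batches = list(batched(inputs, 64))    # chunk lengths 64, 64, 64, 8
decode_batches = list(batched(inputs, 1024))  # one chunk of 200
```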
## Attribution

Trained on captions from the [COCO 2017](https://cocodataset.org/) dataset.

- Annotations © COCO Consortium, licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- Images sourced from Flickr under their respective licenses; the dataset as a whole is not cleared for unrestricted commercial use

> Lin, T.-Y., et al. "Microsoft COCO: Common Objects in Context." ECCV 2014. [arXiv:1405.0312](https://arxiv.org/abs/1405.0312)

Built on top of:

- [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) - Apache 2.0
- [google/siglip2-base-patch16-224](https://huggingface.co/google/siglip2-base-patch16-224) - Apache 2.0