---
language:
- en
license: apache-2.0
tags:
- image-captioning
- multimodal
- vision-language
- qwen2
- siglip
datasets:
- merve/coco
pipeline_tag: image-to-text
---

# Capri

Capri is a compact image captioning model designed for high-throughput, plain-language descriptions.
It supports two inference paths: direct image input or precomputed SigLIP2 pooled embeddings.

The project started from a practical pipeline constraint: existing captioning models were either too slow or too weak for reliable image understanding. That constraint sparked the idea for Capri: since SigLIP embeddings were already computed upstream, why not pair them with a small LLM decoder and get both strong visual representations and fast text generation?

The name comes from the small Italian island of Capri and also hints at the goal of the project: a small CAPtioner with Rapid Inference.

## Model Architecture

- Vision encoder: `google/siglip2-base-patch16-224` (pooled embeddings)
- Projector: MLP `768 -> 3072 -> 896`
- Decoder: `Qwen/Qwen2.5-0.5B`
- Adaptation: LoRA on `q_proj` and `v_proj`

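The projector is a small MLP bridging the two embedding spaces. A minimal sketch of its shape, assuming a GELU activation between the linear layers (the activation and layer structure here are illustrative, not the released weights):

```python
import torch
import torch.nn as nn

# Illustrative sketch of the projector described above: it maps a SigLIP2
# pooled embedding (768-d) into the Qwen2.5-0.5B hidden size (896-d)
# through a 3072-d intermediate layer. The GELU activation is an assumption.
projector = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 896),
)

pooled = torch.randn(2, 768)   # a batch of two pooled embeddings
prefix = projector(pooled)     # shape (2, 896), consumed by the decoder
```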
## Load Modes

Embedding-only mode keeps SigLIP out of downloads and VRAM:

```python
from transformers import AutoModel, AutoProcessor
import torch

processor = AutoProcessor.from_pretrained("Ligul/capri", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Ligul/capri",
    trust_remote_code=True,
    load_vision_tower=False,
    torch_dtype=torch.bfloat16,
)

inputs = processor(
    pooled_embeddings=torch.randn(2, 768),
    return_tensors="pt",
)
captions = model.generate_captions(
    pooled_embeddings=inputs["pooled_embeddings"],
    processor=processor,
    max_new_tokens=32,
    decode_batch_size=2048,
)
```

Image mode loads SigLIP lazily:

```python
from PIL import Image
from transformers import AutoModel, AutoProcessor
import torch

processor = AutoProcessor.from_pretrained("Ligul/capri", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Ligul/capri",
    trust_remote_code=True,
    load_vision_tower=True,
    torch_dtype=torch.bfloat16,
)

image = Image.open("example.jpg").convert("RGB")
captions = model.generate_captions(
    images=[image],
    processor=processor,
    max_new_tokens=32,
    vision_batch_size=64,
    decode_batch_size=2048,
)
```

The standard `generate()` method is still available if you need raw token ids rather than decoded captions.

## Batch Guidance

Use different knobs for the two stages:

- `vision_batch_size`: keep moderate; image preprocessing plus the SigLIP forward pass is the expensive vision stage
- `decode_batch_size`: set much larger; pooled embeddings are tiny, and Qwen generation batches well

Reasonable defaults:

- `vision_batch_size=64`
- `decode_batch_size=1024`

On larger GPUs, decoding often scales to `2048+`.

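The effect of the two knobs is simply different chunk sizes over the same inputs: the vision stage walks the images in small chunks while decoding sweeps the pooled embeddings in one large pass. A plain-Python sketch (the `batched` helper is illustrative, not part of the Capri API):

```python
def batched(items, batch_size):
    """Yield successive chunks of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 200 inputs: four vision passes at batch 64, but a single decode pass at 1024
inputs = list(range(200))
vision_batches = list(batched(inputs, 64))    # chunk lengths 64, 64, 64, 8
decode_batches = list(batched(inputs, 1024))  # one chunk of 200
```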
## Attribution

Trained on captions from the [COCO 2017](https://cocodataset.org/) dataset.

- Annotations © COCO Consortium, licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- Images sourced from Flickr under their respective licenses; the dataset as a whole is not cleared for unrestricted commercial use

> Lin, T.-Y., et al. "Microsoft COCO: Common Objects in Context." ECCV 2014. [arXiv:1405.0312](https://arxiv.org/abs/1405.0312)

Built on top of:

- [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) - Apache 2.0
- [google/siglip2-base-patch16-224](https://huggingface.co/google/siglip2-base-patch16-224) - Apache 2.0