---
language: en
license: mit
tags:
- clip
- vision-language
- image-text
- zero-shot
- retrieval
pipeline_tag: zero-shot-image-classification
---
# LongCLIP: Unlocking the Long-Text Capability of CLIP
[Paper (arXiv:2403.15378)](https://arxiv.org/abs/2403.15378) · [ECCV 2024](https://eccv2024.ecva.net/) · [Code](https://github.com/creative-graphic-design/longclip-transformers)
## Model Description
LongCLIP is an enhanced version of OpenAI's CLIP that extends the maximum input text length from **77 to 248 tokens**, enabling better understanding of detailed, long-form text descriptions. This model maintains CLIP's zero-shot capabilities while significantly improving performance on long-caption retrieval tasks.
### Key Features
- 🔥 **Extended Context Length**: 248 tokens (3.2× longer than original CLIP)
- 🔥 **Strong Performance**: +20% R@5 on long-caption retrieval, +6% on standard retrieval
- 🔥 **Plug-and-Play**: Drop-in replacement for CLIP in existing workflows
- 🔥 **Two Model Sizes**: Base (LongCLIP-B) and Large (LongCLIP-L)
### Model Variants
| Model | Text Encoder | Vision Encoder | Params | Projection Dim |
| -------------- | --------------- | ---------------- | ------ | -------------- |
| **LongCLIP-B** | 12 layers, 512d | 12 layers, 768d | ~150M | 512 |
| **LongCLIP-L** | 12 layers, 768d | 24 layers, 1024d | ~430M | 768 |
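If the Large variant is published under an analogous repo id, it should load the same way. A minimal sketch; the `creative-graphic-design/LongCLIP-L` repo id is assumed by analogy with the Base model, so verify it on the organization page:
```python
from transformers import AutoModel, AutoProcessor

# NOTE: the LongCLIP-L repo id below is assumed by analogy with LongCLIP-B.
model = AutoModel.from_pretrained(
    "creative-graphic-design/LongCLIP-L",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "creative-graphic-design/LongCLIP-L",
    trust_remote_code=True
)
```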
## Uses
### Direct Use
LongCLIP can be used for:
- **Zero-shot image classification** with detailed text descriptions
- **Image-text retrieval** with long, descriptive captions
- **Text-to-image generation** (e.g., Stable Diffusion XL integration)
- **Visual question answering** with complex queries
### Downstream Use
LongCLIP serves as a backbone for:
- Vision-language models requiring long text understanding
- Multimodal retrieval systems (see the retrieval sketch under Advanced Usage below)
- Content-based image search engines
- Automated image captioning evaluation
## How to Use
### Installation
```bash
pip install "transformers[torch,torch-vision]"
```
### Quick Start
```python
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = AutoModel.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "creative-graphic-design/LongCLIP-B",
    trust_remote_code=True
)

# Prepare inputs (captions beyond 248 tokens are truncated)
image = Image.open("your_image.jpg")
texts = [
    "A man is crossing the street with a red car parked nearby.",
    "A man is driving a car in an urban scene."
]
inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    max_length=248,
    padding="max_length",
    truncation=True
)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=-1)
print("Probabilities:", probs)
```
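Here `padding="max_length"` pads every caption to the full 248-token context, and `truncation=True` (both standard `transformers` tokenizer arguments) guards against captions that exceed it.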
### Advanced Usage: Feature Extraction
```python
# Extract text and image features separately
text_inputs = processor(
    text=texts,
    return_tensors="pt",
    max_length=248,
    padding="max_length",
    truncation=True
)
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)
    image_features = model.get_image_features(**image_inputs)

# Normalize, then compute scaled cosine similarity (as in original CLIP)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
logits = 100.0 * image_features @ text_features.T  # 100 ≈ CLIP's learned logit scale
probs = logits.softmax(dim=-1)
```
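The extracted features make long-caption retrieval straightforward: embed a gallery of images once, cache the normalized features, then rank them against detailed queries. A minimal sketch building on the code above; the image paths and query text are placeholders:
```python
import torch
from PIL import Image

# Placeholder gallery; in practice these come from your image collection.
image_paths = ["beach.jpg", "street.jpg", "kitchen.jpg"]
images = [Image.open(p) for p in image_paths]

# Embed the gallery once and cache the L2-normalized features.
gallery_inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    gallery = model.get_image_features(**gallery_inputs)
gallery = gallery / gallery.norm(dim=-1, keepdim=True)

# Embed a long, detailed query caption (up to 248 tokens).
query = (
    "A busy city street at dusk with a red car parked by the curb, "
    "pedestrians crossing at the light, and neon shop signs reflecting "
    "off the wet pavement after rain."
)
query_inputs = processor(
    text=[query], return_tensors="pt",
    max_length=248, padding="max_length", truncation=True
)
with torch.no_grad():
    query_features = model.get_text_features(**query_inputs)
query_features = query_features / query_features.norm(dim=-1, keepdim=True)

# Rank gallery images by cosine similarity to the query.
scores = (query_features @ gallery.T).squeeze(0)
for rank, idx in enumerate(scores.argsort(descending=True).tolist()):
    print(f"{rank + 1}. {image_paths[idx]} (similarity={scores[idx].item():.3f})")
```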
### Comparison with Original CLIP
```python
# Original CLIP: max 77 tokens
clip_text = "A cat"
# LongCLIP: up to 248 tokens
longclip_text = "A fluffy orange tabby cat with green eyes is sitting on a wooden table near a window, with sunlight streaming through the curtains in the background, creating a warm and cozy atmosphere in a modern living room."
# LongCLIP can handle both short and long texts effectively!
```
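To see the difference concretely, you can count tokens for both captions. A small sketch, assuming the processor exposes its underlying tokenizer as `processor.tokenizer`, as standard `transformers` processors do:
```python
# Token counts include the start- and end-of-text special tokens.
short_len = len(processor.tokenizer(clip_text).input_ids)
long_len = len(processor.tokenizer(longclip_text).input_ids)
print(f"Short caption: {short_len} tokens")
print(f"Long caption:  {long_len} tokens (LongCLIP accepts up to 248)")
```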
## Citation
If you use LongCLIP in your research, please cite:
```bibtex
@inproceedings{zhang2024longclip,
title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
author={Zhang, Beichen and Zhang, Pan and Dong, Xiaoyi and Zang, Yuhang and Wang, Jiaqi},
booktitle={European Conference on Computer Vision (ECCV)},
year={2024}
}
```
## License
This model is released under the MIT License, consistent with the original CLIP model.
## Acknowledgments
- **OpenAI CLIP**: Foundation model and architecture
- **Original Authors**: Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang
## Model Card Contact
For questions and feedback, please open an issue on the [GitHub repository](https://github.com/creative-graphic-design/longclip-transformers).