---
pipeline_tag: image-text-to-text
library_name: transformers
license: mit
---

# DiffCLIP: Differential Attention Meets CLIP

This repository contains the DiffCLIP model presented in the paper [DiffCLIP: Differential Attention Meets CLIP](https://huggingface.co/papers/2503.06626).

Project Page: https://hammoudhasan.github.io/DiffCLIP

Code: https://github.com/hammoudhasan/DiffCLIP

## How to Use

### Installation

```bash
# Clone the repository
git clone https://github.com/hammoudhasan/DiffCLIP.git
cd DiffCLIP

# Install dependencies
pip install -r requirements.txt
```

### Basic Usage

```python
import torch
from diff_clip import DiffCLIP_VITB16

# Create model
model = DiffCLIP_VITB16()

# Process image and text (dummy inputs shown here)
image = torch.randn(1, 3, 224, 224)
text = torch.randint(0, 49408, (1, 77))  # Tokenized text (77-token context, vocab size 49408)

# Get embeddings
with torch.no_grad():
    outputs = model(image, text)

print(outputs["image_embed"].shape)  # Should be [1, 512]
print(outputs["text_embed"].shape)   # Should be [1, 512]
```
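
As a follow-up to the snippet above, the two embeddings can be compared directly. This is a minimal sketch, assuming the returned embeddings are not already unit-normalized (normalizing an already-normalized vector is harmless):

```python
import torch.nn.functional as F

# Normalize the embeddings from the previous snippet
# (assumption: they may not come back unit-normalized)
image_embed = F.normalize(outputs["image_embed"], dim=-1)  # [1, 512]
text_embed = F.normalize(outputs["text_embed"], dim=-1)    # [1, 512]

# Cosine similarity between the image and the text
similarity = (image_embed @ text_embed.T).item()
print(f"image-text cosine similarity: {similarity:.3f}")
```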

### Zero-Shot Classification

You can use the provided `test_models.py` script to perform zero-shot classification. See the [GitHub README](https://github.com/hammoudhasan/DiffCLIP) for details.
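
If you prefer to do it inline, the sketch below follows the Basic Usage example above. It assumes a CLIP-style setup where images and texts are encoded independently, so the text batch (one prompt per class) can be larger than the image batch; the `tokenize` helper is a placeholder for whatever tokenizer the repository uses (77-token context, vocabulary size 49408) and is not part of this model card.

```python
import torch
import torch.nn.functional as F
from diff_clip import DiffCLIP_VITB16

model = DiffCLIP_VITB16()
model.eval()

class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {name}" for name in class_names]

def tokenize(texts):
    # Placeholder: substitute the repository's tokenizer (CLIP-style, 77-token context)
    return torch.randint(0, 49408, (len(texts), 77))

image = torch.randn(1, 3, 224, 224)  # replace with a preprocessed image tensor
text_tokens = tokenize(prompts)      # [num_classes, 77]

with torch.no_grad():
    outputs = model(image, text_tokens)

image_embed = F.normalize(outputs["image_embed"], dim=-1)  # [1, 512]
text_embed = F.normalize(outputs["text_embed"], dim=-1)    # [num_classes, 512]

# Scaled cosine similarities turned into class probabilities
probs = (100.0 * image_embed @ text_embed.T).softmax(dim=-1)  # [1, num_classes]
print(dict(zip(class_names, probs.squeeze(0).tolist())))
```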