---
pipeline_tag: image-text-to-text
library_name: transformers
license: mit
---

# DiffCLIP: Differential Attention Meets CLIP

This repository contains the DiffCLIP model presented in [DiffCLIP: Differential Attention Meets CLIP](https://huggingface.co/papers/2503.06626).

Project Page: https://hammoudhasan.github.io/DiffCLIP

Code: https://github.com/hammoudhasan/DiffCLIP

## How to Use

### Installation

```bash
# Clone the repository
git clone https://github.com/hammoudhasan/DiffCLIP.git
cd DiffCLIP

# Install dependencies
pip install -r requirements.txt
```

### Basic Usage

```python
import torch

from diff_clip import DiffCLIP_VITB16

# Create model
model = DiffCLIP_VITB16()

# Process image and text
image = torch.randn(1, 3, 224, 224)
text = torch.randint(0, 49408, (1, 77))  # Tokenized text

# Get embeddings
with torch.no_grad():
    outputs = model(image, text)

print(outputs["image_embed"].shape)  # Should be [1, 512]
print(outputs["text_embed"].shape)   # Should be [1, 512]
```

### Zero-Shot Classification

You can use the provided `test_models.py` script to perform zero-shot classification. See the [GitHub README](https://github.com/hammoudhasan/DiffCLIP) for details.
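
If you prefer to stay in Python, the snippet below is a minimal sketch of the idea behind zero-shot classification: encode a set of prompt-wrapped class names, encode the image, and pick the class whose text embedding is most similar to the image embedding. The tokenizer here is an assumption (OpenAI's `clip` package, whose BPE vocabulary size matches the 49408 used in the basic-usage example); `test_models.py` shows the preprocessing this repo actually uses. The sketch also assumes the forward pass encodes image and text independently, as the basic-usage example suggests.

```python
import torch
import torch.nn.functional as F

# Assumption: OpenAI's `clip` package supplies a compatible tokenizer
# (pip install git+https://github.com/openai/CLIP.git); check
# test_models.py for the tokenizer this repo actually uses.
import clip

from diff_clip import DiffCLIP_VITB16

model = DiffCLIP_VITB16()
model.eval()

class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {name}" for name in class_names]
text = clip.tokenize(prompts)  # [3, 77] token IDs

# Stand-in for a real preprocessed image batch.
image = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    outputs = model(image, text)
    # L2-normalize so the dot product is a cosine similarity
    # (harmless if the model already normalizes its embeddings).
    image_embed = F.normalize(outputs["image_embed"], dim=-1)  # [1, 512]
    text_embed = F.normalize(outputs["text_embed"], dim=-1)    # [3, 512]
    similarity = image_embed @ text_embed.t()                  # [1, 3]

pred = class_names[similarity.argmax(dim=-1).item()]
print(f"Predicted class: {pred}")
```

With a real image, loaded and preprocessed to 224×224 with CLIP-style normalization, the prediction should be meaningful; with the random tensor above the output is arbitrary.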