---
license: apache-2.0
tags:
- vision
- image-text
- contrastive-learning
- zero-shot
- feature-extraction
library_name: transformers
pipeline_tag: zero-shot-image-classification
---

# TIPSv2 — L/14

TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of
contrastive vision-language models that produce spatially rich image features
aligned with text embeddings. This is the **Large** variant, with a ViT-L/14
vision encoder and a matching text encoder.

| Variant | Vision params | Text params | Embed dim |
|---------|---------------|-------------|-----------|
| B/14 | 86M | 110M | 768 |
| **L/14 (this model)** | **303M** | **184M** | **1024** |
| SO400m/14 | 412M | 448M | 1152 |
| g/14 | 1.1B | 389M | 1536 |

## Usage

### Install dependencies

```bash
pip install transformers torch torchvision sentencepiece
```

### Load the model

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("google/tipsv2-l14", trust_remote_code=True)
model.eval()
```

### Encode images

Images should be tensors in the `[0, 1]` range (just `ToTensor()`, no ImageNet normalization).

```python
from torchvision import transforms
from PIL import Image

transform = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])

image = transform(Image.open("photo.jpg")).unsqueeze(0)  # (1, 3, 448, 448)
out = model.encode_image(image)

out.cls_token    # (B, 1, 1024) — global CLS feature
out.patch_tokens # (B, 1024, 1024) — spatial features (32x32 grid)
```

### Encode text

Pass strings directly — tokenization is handled internally.

```python
text_emb = model.encode_text(["a photo of a cat", "a photo of a dog"])
# text_emb: (2, 1024)
```
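
Zero-shot quality often improves when class names are wrapped in prompt templates and the resulting embeddings are averaged. This is a common CLIP-style trick rather than anything documented for this model; the templates below are illustrative. A minimal sketch:

```python
import torch
import torch.nn.functional as F

templates = ["a photo of a {}.", "a blurry photo of a {}."]  # illustrative
classes = ["cat", "dog"]

class_embs = []
for name in classes:
    # Encode all templated prompts for this class, then average them
    e = model.encode_text([t.format(name) for t in templates])  # (T, 1024)
    class_embs.append(F.normalize(e, dim=-1).mean(dim=0))
class_emb = F.normalize(torch.stack(class_embs), dim=-1)  # (2, 1024)
```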

### Zero-shot classification

```python
import torch.nn.functional as F

cls = F.normalize(out.cls_token[:, 0, :], dim=-1)
text_emb = F.normalize(model.encode_text(["cat", "dog", "car"]), dim=-1)

similarity = cls @ text_emb.T  # (B, num_classes)
prediction = similarity.argmax(dim=-1)
```
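
To turn the raw similarities into class probabilities, a softmax over temperature-scaled logits is the usual recipe. The factor of 100 below matches a common CLIP convention and is an assumption; the card does not state this model's trained logit scale.

```python
probs = (similarity * 100.0).softmax(dim=-1)  # (B, num_classes); scale is an assumption
print(probs[0])  # per-class probabilities for the first image
```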

### Zero-shot segmentation

Use the spatial patch tokens for dense prediction tasks.

```python
from sklearn.decomposition import PCA

# Spatial features, reshaped to the 32x32 patch grid
spatial = out.patch_tokens.reshape(1, 32, 32, 1024)  # (B, H, W, D)

# PCA visualization of the patch features
feat = spatial[0].detach().numpy().reshape(-1, 1024)
rgb = PCA(n_components=3).fit_transform(feat).reshape(32, 32, 3)
rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min())  # scale to [0, 1] for display
```
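
The PCA map above is only a feature visualization. For an actual zero-shot segmentation, one minimal sketch (assuming the patch tokens share the text embedding space, as the alignment described above suggests) is to score each patch against class prompts and take the argmax:

```python
import torch.nn.functional as F

class_names = ["sky", "building", "road"]  # illustrative labels
text = F.normalize(model.encode_text(class_names), dim=-1)  # (3, 1024)
patches = F.normalize(out.patch_tokens, dim=-1)             # (1, 1024, 1024)

logits = patches @ text.T                    # (1, 1024, 3) per-patch scores
seg = logits.argmax(dim=-1).reshape(32, 32)  # per-patch class ids

# Upsample the patch-level mask to the input resolution
seg_full = F.interpolate(
    seg[None, None].float(), size=(448, 448), mode="nearest"
)[0, 0].long()
```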

### GPU inference

```python
model = model.cuda()
out = model.encode_image(image.cuda())
text_emb = model.encode_text(["a city"])  # auto-moved to model device
```
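
Half-precision autocast can speed things up further. This is standard PyTorch rather than a documented feature of this checkpoint, so treat it as an assumption and verify the outputs:

```python
import torch

# Assumption: the model runs correctly under fp16 autocast like a
# standard PyTorch module; not stated on this card.
with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
    out = model.encode_image(image.cuda())
    text_emb = model.encode_text(["a city"])
```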

## Model details

- **Architecture**: ViT-L/14 vision encoder (24 layers) + Transformer text encoder (12 layers)
- **Pre-training**: Contrastive image-text learning with spatial awareness
- **Image preprocessing**: Resize to 448x448, convert to `[0, 1]` (no normalization)
- **Text preprocessing**: SentencePiece tokenizer, lowercased, max 64 tokens
- **Patch size**: 14x14 pixels
- **Register tokens**: 1
- **Position embeddings**: Interpolated at inference, so other input resolutions are supported (see the sketch below)

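Since the position embeddings are interpolated, other input resolutions should work; the patch grid then follows from the 14-pixel patch size. A sketch of the expected shapes (an assumption based on the details above, not output documented on this card):

```python
from torchvision import transforms
from PIL import Image

# 224x224 input -> a 16x16 patch grid (224 / 14 = 16), i.e. 256 patch tokens
small = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])(Image.open("photo.jpg")).unsqueeze(0)

out = model.encode_image(small)
print(out.patch_tokens.shape)  # expected: (1, 256, 1024)
```
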
## License

Apache 2.0