google
/

tipsv2-g14

+---
+license: apache-2.0
+tags:
+- vision
+- image-text
+- contrastive-learning
+- zero-shot
+- feature-extraction
+library_name: transformers
+pipeline_tag: zero-shot-image-classification
+---
+# TIPSv2 — G/14
+TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the Giant variant with 1.1B vision params and 389M text params.
+| Variant | Vision params | Text params | Embed dim |
+|---------|--------------|-------------|-----------|
+| B/14 | 86M | 110M | 768 |
+| L/14 | 303M | 184M | 1024 |
+| SO400m/14 | 412M | 448M | 1152 |
+| g/14 | 1.1B | 389M | 1536 |
+## Usage
+```bash
+pip install transformers torch torchvision sentencepiece
+```
+### Load the model
+```python
+from transformers import AutoModel
+model = AutoModel.from_pretrained("google/tipsv2-g14", trust_remote_code=True)
+model.eval()
+```
+### Encode images
+Images should be tensors in `[0, 1]` range (just `ToTensor()`, no ImageNet normalization).
+```python
+from torchvision import transforms
+from PIL import Image
+transform = transforms.Compose([
+    transforms.Resize((448, 448)),
+    transforms.ToTensor(),
+])
+image = transform(Image.open("photo.jpg")).unsqueeze(0)
+out = model.encode_image(image)
+out.cls_token      # (B, 1, 1536)
+out.patch_tokens   # (B, N, 1536)
+```
+### Encode text
+```python
+text_emb = model.encode_text(["a photo of a cat", "a photo of a dog"])
+# (2, 1536)
+```
+### Zero-shot classification
+```python
+import torch.nn.functional as F
+cls = F.normalize(out.cls_token[:, 0, :], dim=-1)
+text_emb = F.normalize(model.encode_text(["cat", "dog", "car"]), dim=-1)
+similarity = cls @ text_emb.T
+prediction = similarity.argmax(dim=-1)
+```
+### GPU inference
+```python
+model = model.cuda()
+out = model.encode_image(image.cuda())
+text_emb = model.encode_text(["a city"])
+```
+## Model details
+- **Architecture**: ViT vision encoder (40 layers) + Transformer text encoder (12 layers)
+- **Image preprocessing**: resize to any resolution, convert to `[0, 1]` (no ImageNet normalization)
+- **Text preprocessing**: SentencePiece tokenizer, lowercased, max 64 tokens
+- **Patch size**: 14x14 pixels
+## License
+Apache 2.0