---
license: apache-2.0
tags:
- vision
- image-text
- contrastive-learning
- zero-shot
- feature-extraction
library_name: transformers
pipeline_tag: zero-shot-image-classification
arxiv: 2604.12012
---
# TIPSv2 — B/14
TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the Base (B/14) variant, with 86M vision parameters and 110M text parameters. Try the code snippets below, or check out the [GitHub repo](https://github.com/google-deepmind/tips) for more use cases and visualizations, including zero-shot segmentation.
| Variant | Vision params | Text params | Embed dim | DPT Heads |
|---------|--------------|-------------|-----------|-----------|
| [B/14](https://huggingface.co/google/tipsv2-b14) | 86M | 110M | 768 | [B/14-dpt](https://huggingface.co/google/tipsv2-b14-dpt) |
| [L/14](https://huggingface.co/google/tipsv2-l14) | 303M | 184M | 1024 | [L/14-dpt](https://huggingface.co/google/tipsv2-l14-dpt) |
| [SO400m/14](https://huggingface.co/google/tipsv2-so400m14) | 412M | 448M | 1152 | [SO400m/14-dpt](https://huggingface.co/google/tipsv2-so400m14-dpt) |
| [g/14](https://huggingface.co/google/tipsv2-g14) | 1.1B | 389M | 1536 | [g/14-dpt](https://huggingface.co/google/tipsv2-g14-dpt) |
## Usage
```bash
pip install transformers torch torchvision sentencepiece scikit-learn
```
### Load the model
```python
from transformers import AutoModel
model = AutoModel.from_pretrained("google/tipsv2-b14", trust_remote_code=True)
model.eval()
```
### Encode images
Images should be tensors in the `[0, 1]` range (just `ToTensor()`; no ImageNet normalization).
```python
from torchvision import transforms
from PIL import Image
import requests
transform = transforms.Compose([
transforms.Resize((448, 448)),
transforms.ToTensor(),
])
url = "https://huggingface.co/spaces/google/TIPSv2/resolve/main/examples/zeroseg/pascal_context_00049_image.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")  # drop any alpha channel
pixel_values = transform(image).unsqueeze(0)
out = model.encode_image(pixel_values)
print(out.cls_token.shape) # (1, 1, 768) — global image embedding
print(out.patch_tokens.shape) # (1, 1024, 768) — per-patch spatial features
```
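The patch grid scales with the input resolution: with 14-pixel patches, an `H x W` input yields `(H/14) * (W/14)` patch tokens. A minimal sketch, assuming the model accepts other resolutions as the preprocessing notes below suggest (here 224, giving a 16x16 grid):
```python
# Sketch: a 224x224 input should give 224/14 = 16 patches per side, i.e. 256 tokens.
transform_224 = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
out_224 = model.encode_image(transform_224(image).unsqueeze(0))
print(out_224.patch_tokens.shape)  # expected: (1, 256, 768)
```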
### Encode text
```python
text_emb = model.encode_text(["a photo of a bus", "a photo of a dog"])
print(text_emb.shape) # (2, 768) — one embedding per query
```
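For zero-shot tasks it is common (a CLIP-style convention, not an official TIPSv2 recipe) to embed each class with several prompt templates and average the normalized embeddings. A sketch with illustrative templates:
```python
import torch
import torch.nn.functional as F

templates = ["a photo of a {}", "a close-up photo of a {}"]  # illustrative templates
class_names = ["bus", "dog"]
embs = []
for name in class_names:
    prompts = [t.format(name) for t in templates]
    e = F.normalize(model.encode_text(prompts), dim=-1)  # (num_templates, 768)
    embs.append(F.normalize(e.mean(dim=0), dim=-1))      # average, then re-normalize
class_embs = torch.stack(embs)                           # (2, 768), one row per class
```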
### Zero-shot classification
```python
import torch.nn.functional as F
classes = ["bus", "car", "dog", "cat"]
cls = F.normalize(out.cls_token[:, 0, :], dim=-1)
text_emb = F.normalize(model.encode_text(classes), dim=-1)
similarity = cls @ text_emb.T
print(classes[similarity.argmax()]) # bus — predicted class
```
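To turn similarities into a distribution over the candidate labels, apply a softmax; the temperature below is an illustrative choice, not the model's learned logit scale:
```python
probs = (similarity / 0.01).softmax(dim=-1)  # 0.01 is an illustrative temperature
for name, p in zip(classes, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
```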
### Visualize spatial features
```python
import numpy as np
from sklearn.decomposition import PCA
feat = out.patch_tokens[0].detach().cpu().numpy()  # (1024, 768) patch features
rgb = PCA(n_components=3, whiten=True).fit_transform(feat).reshape(32, 32, 3)
rgb = 1 / (1 + np.exp(-2.0 * rgb)) # sigmoid for [0, 1] range with good contrast
print(rgb.shape) # (32, 32, 3) — PCA of patch features as RGB
```
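Because the patch tokens are aligned with the text embeddings, a coarse zero-shot segmentation map can be sketched by labeling each patch with its most similar class embedding. This is a minimal illustration with a hypothetical label set; see the GitHub repo for the full recipe:
```python
import torch.nn.functional as F

labels = ["bus", "road", "sky", "building"]  # illustrative label set
patch = F.normalize(out.patch_tokens[0], dim=-1)       # (1024, 768) patch features
text = F.normalize(model.encode_text(labels), dim=-1)  # (4, 768) label embeddings
seg = (patch @ text.T).argmax(dim=-1).reshape(32, 32)  # most similar label per patch
print(seg.shape)  # (32, 32) coarse label map, one class id per 14x14 patch
```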
### GPU inference
```python
model = model.cuda()
out = model.encode_image(pixel_values.cuda())
text_emb = model.encode_text(["a city"])
```
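For throughput, disable autograd with `torch.inference_mode()` and batch images along the first dimension; a sketch reusing the transform and image from above:
```python
import torch

with torch.inference_mode():  # skips autograd bookkeeping during inference
    batch = torch.stack([transform(image), transform(image)]).cuda()  # (2, 3, 448, 448)
    out = model.encode_image(batch)
    print(out.patch_tokens.shape)  # (2, 1024, 768)
```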
## Model details
- **Architecture**: ViT vision encoder (12 layers) + Transformer text encoder (12 layers)
- **Image preprocessing**: resize to the target resolution, scale pixel values to `[0, 1]` (no ImageNet mean/std normalization)
- **Text preprocessing**: SentencePiece tokenizer, lowercased, max 64 tokens
- **Patch size**: 14x14 pixels
## License
Apache 2.0
## Citation
```bibtex
@inproceedings{cao2026tipsv2,
title = {{TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment}},
author = {Cao, Bingyi and Chen, Koert and Maninis, Kevis-Kokitsi and Chen, Kaifeng and Karpur, Arjun and Xia, Ye and Dua, Sahil and Dabral, Tanmaya and Han, Guangxing and Han, Bohyung and Ainslie, Joshua and Bewley, Alex and Jacob, Mithun and Wagner, Rene and Ramos, Washington and Choromanski, Krzysztof and Seyedhosseini, Mojtaba and Zhou, Howard and Araujo, Andre},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}
```