---
license: apache-2.0
tags:
- vision
- image-text
- contrastive-learning
- zero-shot
- feature-extraction
library_name: transformers
pipeline_tag: zero-shot-image-classification
---

# TIPSv2 — g/14

TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the Giant (g/14) variant, with 1.1B vision parameters and 389M text parameters. Try the code snippets below, or see the [GitHub repo](https://github.com/google-deepmind/tips) for more use cases and visualizations, including zero-shot segmentation.

| Variant | Vision params | Text params | Embed dim | DPT Heads |
|---------|--------------|-------------|-----------|-----------|
| [B/14](https://huggingface.co/google/tipsv2-b14) | 86M | 110M | 768 | [B/14-dpt](https://huggingface.co/google/tipsv2-b14-dpt) |
| [L/14](https://huggingface.co/google/tipsv2-l14) | 303M | 184M | 1024 | [L/14-dpt](https://huggingface.co/google/tipsv2-l14-dpt) |
| [SO400m/14](https://huggingface.co/google/tipsv2-so400m14) | 412M | 448M | 1152 | [SO400m/14-dpt](https://huggingface.co/google/tipsv2-so400m14-dpt) |
| [g/14](https://huggingface.co/google/tipsv2-g14) | 1.1B | 389M | 1536 | [g/14-dpt](https://huggingface.co/google/tipsv2-g14-dpt) |

## Usage

```bash
pip install transformers torch torchvision sentencepiece scikit-learn
```

### Load the model

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("google/tipsv2-g14", trust_remote_code=True)
model.eval()
```

### Encode images

Images should be float tensors scaled to the `[0, 1]` range: apply `ToTensor()` only, with no ImageNet mean/std normalization.

```python
from torchvision import transforms
from PIL import Image
import requests

transform = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])

url = "https://huggingface.co/spaces/google/TIPSv2/resolve/main/examples/zeroseg/pascal_context_00049_image.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")  # drop any alpha channel
pixel_values = transform(image).unsqueeze(0)  # add a batch dimension: (1, 3, 448, 448)
out = model.encode_image(pixel_values)

print(out.cls_token.shape)     # (1, 1, 1536) — global image embedding
print(out.patch_tokens.shape)  # (1, 1024, 1536) — per-patch spatial features
```

### Encode text

```python
text_emb = model.encode_text(["a photo of a bus", "a photo of a dog"])
print(text_emb.shape)  # (2, 1536) — one embedding per query
```

### Zero-shot classification

```python
import torch.nn.functional as F

classes = ["bus", "car", "dog", "cat"]
# L2-normalize both sides so the dot product is a cosine similarity.
cls = F.normalize(out.cls_token[:, 0, :], dim=-1)           # (1, 1536)
text_emb = F.normalize(model.encode_text(classes), dim=-1)  # (4, 1536)
similarity = cls @ text_emb.T                                # (1, 4)
print(classes[similarity.argmax().item()])  # bus (predicted class)
```
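
The normalization step above matters because an unnormalized dot product favors embeddings with a large norm, regardless of direction. The following is a minimal NumPy sketch of that effect on made-up 2-D vectors. `l2_normalize` is a hypothetical stand-in for `F.normalize(..., dim=-1)`, and the toy values are illustrative, not model outputs.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (NumPy stand-in for F.normalize(..., dim=-1))."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

# Toy 2-D vectors, made up for illustration; real embeddings here are 1536-D.
image_emb = np.array([[3.0, 0.5]])
text_emb = np.array([
    [1.0, 0.0],    # closest in direction to the image, small norm
    [0.0, 1.0],    # nearly orthogonal to the image
    [10.0, 10.0],  # further off in direction, but large norm
])

raw_scores = image_emb @ text_emb.T  # plain dot products: [[3.0, 0.5, 35.0]]
cos_scores = l2_normalize(image_emb) @ l2_normalize(text_emb).T  # cosine similarities

raw_pick = int(raw_scores.argmax())  # 2: the large-norm vector wins
cos_pick = int(cos_scores.argmax())  # 0: the best-aligned vector wins
print(raw_pick, cos_pick)
```

With cosine similarity the ranking depends only on direction, which is what the classification snippet relies on.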

### Visualize spatial features

```python
import numpy as np
from sklearn.decomposition import PCA

spatial = out.patch_tokens.reshape(1, 32, 32, 1536)
feat = spatial[0].detach().cpu().numpy().reshape(-1, 1536)
rgb = PCA(n_components=3, whiten=True).fit_transform(feat).reshape(32, 32, 3)
rgb = 1 / (1 + np.exp(-2.0 * rgb))  # sigmoid for [0, 1] range with good contrast
print(rgb.shape)  # (32, 32, 3) — PCA of patch features as RGB
```
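
Because the patch tokens share the 1536-dimensional space of the text embeddings, one plausible way to get a coarse per-patch label map is to apply the same cosine-similarity argmax to every patch instead of only the CLS token. The sketch below assumes that approach and runs on random stand-in arrays so it is self-contained. `label_patches` is a hypothetical helper, and the zero-shot segmentation pipeline in the GitHub repo may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def label_patches(patch_tokens, text_emb, grid):
    """Assign each patch the index of its most similar text embedding.

    patch_tokens: (N, D) array, text_emb: (C, D) array, grid: (rows, cols)
    with rows * cols == N. Returns an integer array of shape (rows, cols).
    """
    sims = l2_normalize(patch_tokens) @ l2_normalize(text_emb).T  # (N, C)
    return sims.argmax(axis=-1).reshape(grid)

# Random stand-ins with the shapes from the snippets above. In practice these
# would be out.patch_tokens[0] and model.encode_text(classes) as NumPy arrays.
rng = np.random.default_rng(0)
patches = rng.normal(size=(32 * 32, 1536)).astype(np.float32)
texts = rng.normal(size=(4, 1536)).astype(np.float32)

seg = label_patches(patches, texts, (32, 32))
print(seg.shape)  # (32, 32)
```

The resulting 32x32 map can be upsampled to the input resolution for display, in the same way as the PCA image.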

### GPU inference

```python
model = model.cuda()
out = model.encode_image(pixel_values.cuda())
text_emb = model.encode_text(["a city"])
```

## Model details

- **Architecture**: ViT vision encoder (40 layers) + Transformer text encoder (12 layers)
- **Image preprocessing**: resize to any resolution, convert to `[0, 1]` (no ImageNet normalization)
- **Text preprocessing**: SentencePiece tokenizer, lowercased, max 64 tokens
- **Patch size**: 14x14 pixels
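
The patch size determines how many patch tokens an input produces: a 448x448 image gives a 32x32 grid, which matches the 1024 patch tokens shown in the usage section. The helper below is a hypothetical sketch of that arithmetic, not part of the model API. It assumes non-overlapping patches and rejects sizes that are not multiples of 14, since this card does not say how such inputs are handled.

```python
from typing import Tuple

PATCH_SIZE = 14  # from the model details above

def patch_grid(height: int, width: int, patch_size: int = PATCH_SIZE) -> Tuple[int, int]:
    """Return (rows, cols) of the patch grid for an input of the given size.

    Assumption: patches do not overlap, so each side must be a multiple of
    the patch size. Other sizes are rejected rather than guessed at.
    """
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"{height}x{width} is not a multiple of the patch size {patch_size}"
        )
    return height // patch_size, width // patch_size

rows, cols = patch_grid(448, 448)
print(rows, cols, rows * cols)  # 32 32 1024
```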

## License

Apache 2.0

## Citation

```bibtex
@inproceedings{cao2026tipsv2,
  title     = {{TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment}},
  author    = {Cao, Bingyi and Chen, Koert and Maninis, Kevis-Kokitsi and Chen, Kaifeng and Karpur, Arjun and Xia, Ye and Dua, Sahil and Dabral, Tanmaya and Han, Guangxing and Han, Bohyung and Ainslie, Joshua and Bewley, Alex and Jacob, Mithun and Wagner, Rene and Ramos, Washington and Choromanski, Krzysztof and Seyedhosseini, Mojtaba and Zhou, Howard and Araujo, Andre},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```