---
license: apache-2.0
tags:
- vision
- image-text
- contrastive-learning
- zero-shot
- feature-extraction
library_name: transformers
pipeline_tag: zero-shot-image-classification
---
# TIPSv2 — SO400m/14

TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the SO400m variant, with 412M vision params and 448M text params. Try the code snippets below, or check out the GitHub repo for more use cases and visualizations, including zero-shot segmentation.

| Variant | Vision params | Text params | Embed dim | DPT Heads |
|---|---|---|---|---|
| B/14 | 86M | 110M | 768 | B/14-dpt |
| L/14 | 303M | 184M | 1024 | L/14-dpt |
| SO400m/14 | 412M | 448M | 1152 | SO400m/14-dpt |
| g/14 | 1.1B | 389M | 1536 | g/14-dpt |
## Usage
```bash
pip install transformers torch torchvision sentencepiece scikit-learn
```
### Load the model
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("google/tipsv2-so400m14", trust_remote_code=True)
model.eval()
```
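If memory is a concern, half-precision loading via the standard `torch_dtype` argument of `from_pretrained` is worth trying; this is only a sketch, assuming the remote code tolerates bf16 weights.

```python
import torch

# Assumption: the checkpoint runs correctly in bfloat16; fall back to float32 if outputs degrade.
model = AutoModel.from_pretrained(
    "google/tipsv2-so400m14",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()
```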
### Encode images
Images should be tensors in the `[0, 1]` range (just `ToTensor()`, no ImageNet normalization).
```python
from torchvision import transforms
from PIL import Image
import requests

transform = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])

url = "https://huggingface.co/spaces/google/TIPSv2/resolve/main/examples/zeroseg/pascal_context_00049_image.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")  # force 3 channels (PNGs may carry alpha)
pixel_values = transform(image).unsqueeze(0)

out = model.encode_image(pixel_values)
print(out.cls_token.shape)     # (1, 1, 1152) — global image embedding
print(out.patch_tokens.shape)  # (1, 1024, 1152) — per-patch spatial features
```
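Multiple images can be encoded in one call by stacking the transformed tensors into a batch; a minimal sketch, assuming `encode_image` accepts a standard `(N, 3, H, W)` tensor.

```python
import torch

urls = [url]  # extend with more image URLs
images = [Image.open(requests.get(u, stream=True).raw).convert("RGB") for u in urls]
batch = torch.stack([transform(im) for im in images])  # (N, 3, 448, 448)

batch_out = model.encode_image(batch)
print(batch_out.cls_token.shape)  # expected (N, 1, 1152)
```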
### Encode text
```python
text_emb = model.encode_text(["a photo of a bus", "a photo of a dog"])
print(text_emb.shape)  # (2, 1152) — one embedding per query
```
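As with other CLIP-style models, wrapping class names in a prompt template before encoding often helps zero-shot accuracy; the template below is a common heuristic, not something prescribed by this card.

```python
class_names = ["bus", "car", "dog", "cat"]
prompts = [f"a photo of a {name}" for name in class_names]  # example template, not model-specific
prompt_emb = model.encode_text(prompts)
print(prompt_emb.shape)  # (4, 1152), following the shape pattern above
```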
### Zero-shot classification
```python
import torch.nn.functional as F

classes = ["bus", "car", "dog", "cat"]
cls = F.normalize(out.cls_token[:, 0, :], dim=-1)
text_emb = F.normalize(model.encode_text(classes), dim=-1)

similarity = cls @ text_emb.T
print(classes[similarity.argmax()])  # bus — predicted class
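Because the patch tokens live in the same embedding space as the text, the zero-shot segmentation mentioned above can be approximated with per-patch similarity. The sketch below reuses `out`, `classes`, and the normalized `text_emb` from the previous blocks and simply takes an argmax over cosine similarity; the recipe in the GitHub repo may differ.

```python
# Per-patch cosine similarity against the class embeddings
patches = F.normalize(out.patch_tokens[0], dim=-1)   # (1024, 1152)
patch_sim = patches @ text_emb.T                     # (1024, 4)

seg_map = patch_sim.argmax(dim=-1).reshape(32, 32)   # class index per 14x14 patch
print(seg_map.shape)  # (32, 32) grid of indices into `classes`
```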
### Visualize spatial features
```python
import numpy as np
from sklearn.decomposition import PCA

spatial = out.patch_tokens.reshape(1, 32, 32, 1152)
feat = spatial[0].detach().cpu().numpy().reshape(-1, 1152)

rgb = PCA(n_components=3, whiten=True).fit_transform(feat).reshape(32, 32, 3)
rgb = 1 / (1 + np.exp(-2.0 * rgb))  # sigmoid for [0, 1] range with good contrast
print(rgb.shape)  # (32, 32, 3) — PCA of patch features as RGB
```
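To view the map alongside the input, one option is to upsample the 32x32 PCA image back to the input resolution and save it; nearest-neighbor resampling keeps the patch boundaries visible. The file name is just an example.

```python
vis = Image.fromarray((rgb * 255).astype(np.uint8))  # reuse `rgb` from the block above
vis.resize((448, 448), resample=Image.NEAREST).save("tipsv2_pca.png")
```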
### GPU inference
```python
model = model.cuda()
out = model.encode_image(pixel_values.cuda())
text_emb = model.encode_text(["a city"])
```
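Mixed-precision autocast can speed this up further on recent GPUs; a sketch under the assumption that the remote code is autocast-compatible.

```python
import torch

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    out = model.encode_image(pixel_values.cuda())
    text_emb = model.encode_text(["a city"])
```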
## Model details
- Architecture: ViT vision encoder (27 layers) + Transformer text encoder (27 layers)
- Image preprocessing: resize to any resolution, convert to [0, 1] (no ImageNet normalization)
- Text preprocessing: SentencePiece tokenizer, lowercased, max 64 tokens
- Patch size: 14x14 pixels
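The patch-token count in the examples above follows directly from the patch size: a 448x448 input with 14x14 patches gives a 32x32 grid, i.e. 1024 tokens. A quick helper, assuming the input resolution is a multiple of 14:

```python
def patch_grid(height: int, width: int, patch_size: int = 14) -> tuple[int, int]:
    # Patches per side; their product is the length of patch_tokens.
    return height // patch_size, width // patch_size

gh, gw = patch_grid(448, 448)
print(gh, gw, gh * gw)  # 32 32 1024  (matches out.patch_tokens.shape[1])
```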
## License
Apache 2.0
## Citation
```bibtex
@inproceedings{cao2026tipsv2,
  title = {{TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment}},
  author = {Cao, Bingyi and Chen, Koert and Maninis, Kevis-Kokitsi and Chen, Kaifeng and Karpur, Arjun and Xia, Ye and Dua, Sahil and Dabral, Tanmaya and Han, Guangxing and Han, Bohyung and Ainslie, Joshua and Bewley, Alex and Jacob, Mithun and Wagner, Rene and Ramos, Washington and Choromanski, Krzysztof and Seyedhosseini, Mojtaba and Zhou, Howard and Araujo, Andre},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2026}
}
```