google/tipsv2-b14

@@ -12,7 +12,7 @@ pipeline_tag: zero-shot-image-classification
 # TIPSv2 — B/14
-TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the Base variant with 86M vision params and 110M text params.
+TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the Base variant with 86M vision params and 110M text params. Try the code snippets below or check out the [GitHub repo](https://github.com/google-deepmind/tips) for more use cases and visualizations, including zero-shot segmentation.
 | Variant | Vision params | Text params | Embed dim | DPT Heads |
 |---------|--------------|-------------|-----------|-----------|
@@ -78,16 +78,6 @@ similarity = cls @ text_emb.T
 print(classes[similarity.argmax()])  # bus — predicted class
 ```
-### Zero-shot segmentation
-```python
-classes = ["bus", "snow", "mountain", "house", "road"]
-patch_feats = F.normalize(out.patch_tokens, dim=-1)
-text_emb = F.normalize(model.encode_text(classes), dim=-1)
-seg_map = (patch_feats @ text_emb.T).reshape(32, 32, len(classes)).argmax(dim=-1)
-print(seg_map.shape)  # (32, 32) — per-patch class prediction
-```
 ### Visualize spatial features
 ```python