google/tipsv2-so400m14

@@ -12,7 +12,7 @@ pipeline_tag: zero-shot-image-classification
 # TIPSv2 — SO400m/14
-TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the SO400m variant with 412M vision params and 448M text params.
 | Variant | Vision params | Text params | Embed dim | DPT Heads |
 |---------|--------------|-------------|-----------|-----------|
@@ -78,16 +78,6 @@ similarity = cls @ text_emb.T
 print(classes[similarity.argmax()])  # bus — predicted class
 ```
-### Zero-shot segmentation
-```python
-classes = ["bus", "snow", "mountain", "house", "road"]
-patch_feats = F.normalize(out.patch_tokens, dim=-1)
-text_emb = F.normalize(model.encode_text(classes), dim=-1)
-seg_map = (patch_feats @ text_emb.T).reshape(32, 32, len(classes)).argmax(dim=-1)
-print(seg_map.shape)  # (32, 32) — per-patch class prediction
-```
 ### Visualize spatial features
 ```python

 # TIPSv2 — SO400m/14
+TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the SO400m variant with 412M vision params and 448M text params. Try the code snippets below or check out the [GitHub repo](https://github.com/google-deepmind/tips) for more use cases and visualizations, including zero-shot segmentation.
 | Variant | Vision params | Text params | Embed dim | DPT Heads |
 |---------|--------------|-------------|-----------|-----------|
 print(classes[similarity.argmax()])  # bus — predicted class
 ```
 ### Visualize spatial features
 ```python