Gabriele committed
Commit 6b864bd · Parent(s): 190b924
Remove zero-shot segmentation snippet, link to GitHub for advanced use cases
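For context, the snippet removed by this commit labeled each image patch with the class whose text embedding had the highest cosine similarity to that patch's feature. Below is a minimal, dependency-free sketch of that idea. It does not call the model: `segment_patches`, `l2_normalize`, and the toy embeddings are hypothetical names and values made up for illustration, and plain Python lists stand in for the `out.patch_tokens` and `model.encode_text(classes)` tensors. The removed snippet reshaped to a fixed 32 x 32 grid; the sketch takes the grid size as parameters instead.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def segment_patches(patch_feats, text_embs, grid_h, grid_w):
    """Assign each patch the index of its most similar text embedding.

    patch_feats: list of grid_h * grid_w feature vectors (row-major order).
    text_embs:   list with one embedding vector per class name.
    Returns a grid_h x grid_w nested list of class indices.
    """
    if len(patch_feats) != grid_h * grid_w:
        raise ValueError("patch count does not match the grid size")
    patches = [l2_normalize(p) for p in patch_feats]
    texts = [l2_normalize(t) for t in text_embs]
    labels = []
    for p in patches:
        # Cosine similarity reduces to a dot product after normalization.
        sims = [sum(a * b for a, b in zip(p, t)) for t in texts]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return [labels[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy data (hypothetical, not model outputs): a 2x2 patch grid, 3-dim features.
classes = ["bus", "snow"]
toy_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_patch_feats = [
    [0.9, 0.1, 0.0],  # closest to "bus"
    [0.2, 0.8, 0.1],  # closest to "snow"
    [0.1, 0.7, 0.3],  # closest to "snow"
    [2.0, 0.5, 0.5],  # closest to "bus"
]
seg_map = segment_patches(toy_patch_feats, toy_text_embs, grid_h=2, grid_w=2)
print([[classes[i] for i in row] for row in seg_map])
```

Passing the grid size explicitly avoids the silent mismatch a hard-coded reshape can produce when the input resolution, and therefore the number of patches, changes. For real model usage, the GitHub repository linked in the updated README is the place to look.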
README.md
CHANGED
@@ -12,7 +12,7 @@ pipeline_tag: zero-shot-image-classification
 
 # TIPSv2 — SO400m/14
 
-TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the SO400m variant with 412M vision params and 448M text params.
+TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the SO400m variant with 412M vision params and 448M text params. Try the code snippets below or check out the [GitHub repo](https://github.com/google-deepmind/tips) for more use cases and visualizations, including zero-shot segmentation.
 
 | Variant | Vision params | Text params | Embed dim | DPT Heads |
 |---------|--------------|-------------|-----------|-----------|
@@ -78,16 +78,6 @@ similarity = cls @ text_emb.T
 print(classes[similarity.argmax()]) # bus — predicted class
 ```
 
-### Zero-shot segmentation
-
-```python
-classes = ["bus", "snow", "mountain", "house", "road"]
-patch_feats = F.normalize(out.patch_tokens, dim=-1)
-text_emb = F.normalize(model.encode_text(classes), dim=-1)
-seg_map = (patch_feats @ text_emb.T).reshape(32, 32, len(classes)).argmax(dim=-1)
-print(seg_map.shape) # (32, 32) — per-patch class prediction
-```
-
 ### Visualize spatial features
 
 ```python