google/tipsv2-b14

@@ -14,12 +14,12 @@ pipeline_tag: zero-shot-image-classification
 TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the Base variant with 86M vision params and 110M text params.
-| Variant | Vision params | Text params | Embed dim |
-|---------|--------------|-------------|-----------|
-| B/14 | 86M | 110M | 768 |
-| L/14 | 303M | 184M | 1024 |
-| SO400m/14 | 412M | 448M | 1152 |
-| g/14 | 1.1B | 389M | 1536 |
+| Variant | Vision params | Text params | Embed dim | DPT Heads |
+|---------|--------------|-------------|-----------|-----------|
+| [B/14](https://huggingface.co/google/tipsv2-b14) | 86M | 110M | 768 | [B/14-dpt](https://huggingface.co/google/tipsv2-b14-dpt) |
+| [L/14](https://huggingface.co/google/tipsv2-l14) | 303M | 184M | 1024 | [L/14-dpt](https://huggingface.co/google/tipsv2-l14-dpt) |
+| [SO400m/14](https://huggingface.co/google/tipsv2-so400m14) | 412M | 448M | 1152 | [SO400m/14-dpt](https://huggingface.co/google/tipsv2-so400m14-dpt) |
+| [g/14](https://huggingface.co/google/tipsv2-g14) | 1.1B | 389M | 1536 | [g/14-dpt](https://huggingface.co/google/tipsv2-g14-dpt) |
 ## Usage