---
pipeline_tag: zero-shot-image-classification
---

# TIPSv2 — L/14

TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of
contrastive vision-language models that produce spatially rich image features
aligned with text embeddings. This is the **Large** variant with a ViT-L/14
vision encoder and a matching text encoder.

| Variant   | Vision params | Text params | Embed dim |
|-----------|---------------|-------------|-----------|
| B/14      | 86M           | 110M        | 768       |
| L/14      | 303M          | 184M        | 1024      |
| SO400m/14 | 412M          | 448M        | 1152      |
| g/14      | 1.1B          | 389M        | 1536      |

## Usage

### Install dependencies

```bash
pip install transformers torch torchvision sentencepiece
```

### Encode an image

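Loading the model and building the input transform is sketched below; the repo id and the `trust_remote_code` flag are assumptions (substitute the published checkpoint path), and the transform follows the preprocessing described under Model details.

```python
from PIL import Image
from torchvision import transforms
from transformers import AutoModel

# Hypothetical repo id; replace with the actual TIPSv2 L/14 checkpoint path.
model = AutoModel.from_pretrained("tipsv2-l14", trust_remote_code=True)
model.eval()

# Preprocessing per "Model details": resize, scale to [0, 1],
# no ImageNet normalization.
transform = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),  # PIL image -> float tensor in [0, 1]
])
```

With the model and transform in place, encode an image and read off both the global and the spatial features:
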
```python
image = transform(Image.open("photo.jpg")).unsqueeze(0)
out = model.encode_image(image)

out.cls_token     # (B, 1, 1024)
out.patch_tokens  # (B, 1024, 1024)
```

### Encode text

Pass strings directly; tokenization is handled internally.

```python
text_emb = model.encode_text(["a photo of a cat", "a photo of a dog"])
# (2, 1024)
```
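
Class-name prompts usually score better inside a short natural-language template. A small helper (illustrative only, not part of the model's API):

```python
def build_prompts(classnames, template="a photo of a {}"):
    # Wrap each class name in the template, e.g. "a photo of a cat".
    return [template.format(c) for c in classnames]

text_emb = model.encode_text(build_prompts(["cat", "dog", "car"]))  # (3, 1024)
```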

### Zero-shot classification

```python
import torch.nn.functional as F

cls = F.normalize(out.cls_token[:, 0, :], dim=-1)
text_emb = F.normalize(model.encode_text(["cat", "dog", "car"]), dim=-1)

similarity = cls @ text_emb.T
prediction = similarity.argmax(dim=-1)
```
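
To turn cosine similarities into class probabilities, apply a softmax; scaling by a learned temperature first is standard for CLIP-style models, and the `logit_scale` attribute below is an assumption about this checkpoint:

```python
# Assumes the checkpoint exposes its learned temperature as `logit_scale`.
probs = (model.logit_scale.exp() * similarity).softmax(dim=-1)
```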

### Zero-shot segmentation

Use spatial patch tokens for dense prediction tasks.

```python
import numpy as np
from sklearn.decomposition import PCA

spatial = out.patch_tokens.reshape(1, 32, 32, 1024)  # (B, H, W, D)

# PCA visualization
feat = spatial[0].detach().numpy().reshape(-1, 1024)
rgb = PCA(n_components=3).fit_transform(feat).reshape(32, 32, 3)
```
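
The PCA image only visualizes feature structure. For an actual zero-shot segmentation map, a common recipe with CLIP-style patch features (a standard technique, not something this card prescribes) is to score each patch token against the class text embeddings and upsample the winning class per patch:

```python
import torch.nn.functional as F

classes = ["cat", "grass", "sky"]
text_emb = F.normalize(model.encode_text(classes), dim=-1)  # (3, 1024)
patches = F.normalize(out.patch_tokens[0], dim=-1)          # (1024, 1024)

scores = patches @ text_emb.T                 # (1024, 3) patch-text cosines
mask = scores.argmax(dim=-1).reshape(32, 32)  # per-patch class index

# Upsample the coarse 32x32 map back to the 448x448 input resolution.
mask_full = F.interpolate(
    mask[None, None].float(), size=(448, 448), mode="nearest"
)[0, 0].long()
```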

### Run on GPU

```python
model = model.cuda()
out = model.encode_image(image.cuda())
text_emb = model.encode_text(["a city"])
```
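
For throughput, wrapping inference in `torch.no_grad()` and autocast is a standard PyTorch pattern; nothing model-specific is assumed beyond float16 support:

```python
import torch

with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
    out = model.encode_image(image.cuda())
```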

## Model details

- **Architecture**: ViT-L/14 vision encoder (24 layers) + Transformer text encoder (12 layers)
- **Image preprocessing**: Resize to any resolution (the examples above use 448x448), convert to `[0, 1]` (no ImageNet normalization)
- **Text preprocessing**: SentencePiece tokenizer, lowercased, max 64 tokens
- **Patch size**: 14x14 pixels
- **Register tokens**: 1
- **Position embeddings**: Interpolated at inference, supports any resolution; see the sketch below
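
Because position embeddings are interpolated, changing the input resolution only changes the patch-grid shape. A quick shape check, reusing the hypothetical loading sketch from the usage section:

```python
from PIL import Image
from torchvision import transforms

# 644 / 14 = 46 rows, 448 / 14 = 32 columns -> 46 * 32 = 1472 patch tokens.
hi_res = transforms.Compose([transforms.Resize((644, 448)), transforms.ToTensor()])
out = model.encode_image(hi_res(Image.open("photo.jpg")).unsqueeze(0))
out.patch_tokens.shape  # expected: (1, 1472, 1024)
```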
## License