gberton committed
Commit 40a199e · verified · 1 parent: 4981685

Upload README.md with huggingface_hub

Files changed (1): README.md (+121 −3)
---
license: apache-2.0
tags:
- vision
- image-text
- contrastive-learning
- zero-shot
- feature-extraction
library_name: transformers
pipeline_tag: zero-shot-image-classification
---

# TIPSv2 — L/14

TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the **Large** variant, with a ViT-L/14 vision encoder and a matching text encoder.

| Variant | Vision params | Text params | Embed dim |
|---------|--------------|-------------|-----------|
| B/14 | 86M | 110M | 768 |
| **L/14 (this)** | **303M** | **184M** | **1024** |
| SO400m/14 | 412M | 448M | 1152 |
| g/14 | 1.1B | 389M | 1536 |

## Usage

### Install dependencies

```bash
pip install transformers torch torchvision sentencepiece
```

### Load the model

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("google/tipsv2-l14", trust_remote_code=True)
model.eval()
```

### Encode images

Images should be tensors in the `[0, 1]` range (just `ToTensor()`, no ImageNet normalization).

```python
from torchvision import transforms
from PIL import Image

transform = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])

image = transform(Image.open("photo.jpg")).unsqueeze(0)
out = model.encode_image(image)

out.cls_token    # (B, 1, 1024) — global CLS feature
out.patch_tokens # (B, 1024, 1024) — spatial features (32x32 grid)
```

### Encode text

Pass strings directly — tokenization is handled internally.

```python
text_emb = model.encode_text(["a photo of a cat", "a photo of a dog"])
# text_emb: (2, 1024)
```

### Zero-shot classification

```python
import torch.nn.functional as F

cls = F.normalize(out.cls_token[:, 0, :], dim=-1)
text_emb = F.normalize(model.encode_text(["cat", "dog", "car"]), dim=-1)

similarity = cls @ text_emb.T  # (B, num_classes)
prediction = similarity.argmax(dim=-1)
```
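Raw cosine similarities can be turned into class probabilities with a temperature-scaled softmax. A minimal self-contained sketch using random stand-in embeddings; the temperature of 100 follows the common CLIP-style logit scaling and is an assumption here, not a documented TIPSv2 value:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for L2-normalized image/text embeddings (D = 1024, as for L/14)
img_emb = F.normalize(torch.randn(1, 1024), dim=-1)
txt_emb = F.normalize(torch.randn(3, 1024), dim=-1)

logits = 100.0 * (img_emb @ txt_emb.T)  # temperature-scaled cosine similarities
probs = logits.softmax(dim=-1)          # (1, 3); each row sums to 1
```

With real embeddings, substitute `cls` and `text_emb` from the block above for the random tensors.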

### Zero-shot segmentation

Use spatial patch tokens for dense prediction tasks.

```python
import numpy as np
from sklearn.decomposition import PCA

# Spatial features
spatial = out.patch_tokens.reshape(1, 32, 32, 1024)  # (B, H, W, D)

# PCA visualization
feat = spatial[0].detach().numpy().reshape(-1, 1024)
rgb = PCA(n_components=3).fit_transform(feat).reshape(32, 32, 3)
```
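To actually view the PCA map, min-max normalize each channel to `[0, 1]` and convert to 8-bit. A sketch using random stand-in features in place of the real patch tokens:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
feat = rng.standard_normal((32 * 32, 1024)).astype(np.float32)  # stand-in patch features

rgb = PCA(n_components=3).fit_transform(feat).reshape(32, 32, 3)

# Per-channel min-max normalization to [0, 1], then 8-bit for display
lo, hi = rgb.min(axis=(0, 1)), rgb.max(axis=(0, 1))
rgb = (rgb - lo) / (hi - lo + 1e-8)
img = (rgb * 255).astype(np.uint8)
```

The resulting `img` can be displayed or upsampled back to the input resolution, e.g. with `PIL.Image.fromarray(img).resize((448, 448))`.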

### GPU inference

```python
model = model.cuda()
out = model.encode_image(image.cuda())
text_emb = model.encode_text(["a city"])  # auto-moved to model device
```
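Mixed precision can further speed up inference. This is the generic PyTorch autocast pattern, not anything TIPSv2-specific; it is sketched here with a stand-in linear layer on CPU so it runs anywhere:

```python
import torch

# Generic autocast pattern (stand-in layer, not the real model)
layer = torch.nn.Linear(1024, 1024)
x = torch.randn(2, 1024)

with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    y = layer(x)  # matmuls run in bfloat16 where autocast allows
```

On GPU, replace `"cpu"` with `"cuda"` and wrap the `encode_image` / `encode_text` calls the same way.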

## Model details

- **Architecture**: ViT-L/14 vision encoder (24 layers) + Transformer text encoder (12 layers)
- **Pre-training**: Contrastive image-text learning with spatial awareness
- **Image preprocessing**: Resize to 448x448, convert to `[0, 1]` (no normalization)
- **Text preprocessing**: SentencePiece tokenizer, lowercased, max 64 tokens
- **Patch size**: 14x14 pixels
- **Register tokens**: 1
- **Position embeddings**: Interpolated at inference; supports any resolution
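Because position embeddings are interpolated, the patch grid simply scales with the input size. A hypothetical helper (not part of the model API) showing the arithmetic for 14-pixel patches:

```python
def patch_grid(height: int, width: int, patch: int = 14) -> tuple[int, int]:
    """Patches per axis; assumes height and width are multiples of `patch`."""
    return height // patch, width // patch

print(patch_grid(448, 448))  # (32, 32) -> 1024 patch tokens, matching the examples above
print(patch_grid(224, 224))  # (16, 16) -> 256 patch tokens
```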

## License

Apache 2.0