nebulette committed on
Commit
2f42f2e
·
verified ·
1 Parent(s): 882764d

Update README.md

Files changed (1)
  1. README.md +90 -3
README.md CHANGED
@@ -1,3 +1,90 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ tags:
+ - vision
+ - contrastive-learning
+ - zero-shot
+ - feature-extraction
+ library_name: transformers
+ pipeline_tag: zero-shot-image-classification
+ ---
+
+ This is a copy of [toilaluan's repo](https://huggingface.co/toilaluan/tipsv2-b14-vision), updated for the latest `transformers` release (v5.5.4).
+
+ TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the Base variant with 86M vision params and 110M text params. Try the code snippets below or check out the [GitHub repo](https://github.com/google-deepmind/tips) for more use cases and visualizations, including zero-shot segmentation.
+
+ | Variant | Vision params | Text params | Embed dim | DPT Heads |
+ |---------|--------------|-------------|-----------|-----------|
+ | [B/14](https://huggingface.co/google/tipsv2-b14) | 86M | 110M | 768 | [B/14-dpt](https://huggingface.co/google/tipsv2-b14-dpt) |
+ | [L/14](https://huggingface.co/google/tipsv2-l14) | 303M | 184M | 1024 | [L/14-dpt](https://huggingface.co/google/tipsv2-l14-dpt) |
+ | [SO400m/14](https://huggingface.co/google/tipsv2-so400m14) | 412M | 448M | 1152 | [SO400m/14-dpt](https://huggingface.co/google/tipsv2-so400m14-dpt) |
+ | [g/14](https://huggingface.co/google/tipsv2-g14) | 1.1B | 389M | 1536 | [g/14-dpt](https://huggingface.co/google/tipsv2-g14-dpt) |
+
+ ### Load the model
+
+ Rename the downloaded folder to "path/to/your/project/tipsv2_b14_vision_module". Keep the underscores, because the folder name must be a valid Python module name for the import below to work.
+
+ ```python
+ from tipsv2_b14_vision_module import TIPSv2ImageModel
+
+ model = TIPSv2ImageModel.from_pretrained("google/tipsv2-b14", trust_remote_code=True)
+ model.eval()
+ ```
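The underscores matter because the folder is imported as a Python package, and package names must be valid identifiers. A minimal sketch of that check, using only the standard library. The hyphenated name is taken from the upstream repo URL, and the helper `is_importable_name` is hypothetical, not part of this repo:

```python
import keyword


def is_importable_name(folder_name: str) -> bool:
    """Return True if folder_name can be used in an `import` statement."""
    # A module name must be a valid identifier and must not be a reserved keyword.
    return folder_name.isidentifier() and not keyword.iskeyword(folder_name)


# Hyphens are parsed as minus signs, so the hyphenated repo name cannot be imported.
print(is_importable_name("tipsv2-b14-vision"))         # False
print(is_importable_name("tipsv2_b14_vision_module"))  # True
```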
+
+ ### Encode images
+
+ Images should be tensors with values in the `[0, 1]` range (apply `ToTensor()` only, with no ImageNet normalization).
+
+ ```python
+ from torchvision import transforms
+ from PIL import Image
+ import requests
+
+ transform = transforms.Compose([
+ transforms.Resize((448, 448)),
+ transforms.ToTensor(),
+ ])
+
+ url = "https://huggingface.co/spaces/google/TIPSv2/resolve/main/examples/zeroseg/pascal_context_00049_image.png"
+ image = Image.open(requests.get(url, stream=True).raw).convert("RGB")  # drop any alpha channel
+ pixel_values = transform(image).unsqueeze(0)
+ out = model.encode_image(pixel_values)
+
+ print(out.cls_token.shape) # (1, 1, 768): global image embedding
+ print(out.patch_tokens.shape) # (1, 1024, 768): per-patch spatial features
+ ```
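The `patch_tokens` shape follows from the input resolution and the 14x14 patch size listed under Model details. A small sketch of the arithmetic, assuming the encoder tiles the image into non-overlapping patches and that both sides are multiples of the patch size. The helper `patch_grid` is hypothetical, not part of the model's API:

```python
def patch_grid(height: int, width: int, patch_size: int = 14) -> tuple[int, int, int]:
    """Return (rows, cols, num_patches) for a non-overlapping patch grid."""
    # Assumption: inputs are exact multiples of the patch size.
    if height % patch_size or width % patch_size:
        raise ValueError("height and width are assumed to be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols


print(patch_grid(448, 448))  # (32, 32, 1024)
```

Under these assumptions, a 448x448 input gives a 32x32 grid, which matches the 1024 patch tokens printed above.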
+
+ ### Visualize spatial features
+
+ ```python
+ import numpy as np
+ from sklearn.decomposition import PCA
+
+ spatial = out.patch_tokens.reshape(1, 32, 32, 768)
+ feat = spatial[0].detach().cpu().numpy().reshape(-1, 768)
+ rgb = PCA(n_components=3, whiten=True).fit_transform(feat).reshape(32, 32, 3)
+ rgb = 1 / (1 + np.exp(-2.0 * rgb)) # sigmoid for [0, 1] range with good contrast
+ print(rgb.shape) # (32, 32, 3): PCA of patch features as RGB
+ ```
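With `whiten=True`, each PCA component has roughly unit variance, so the `2.0` factor controls how strongly typical values are pushed toward 0 or 1. A quick illustrative check of the mapping on a few sample values (the helper name `squash` is hypothetical):

```python
import math


def squash(x: float, gain: float = 2.0) -> float:
    """Sigmoid with the same gain as the snippet above; maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-gain * x))


# One standard deviation either side of the mean, plus the mean itself.
for value in (-1.0, 0.0, 1.0):
    print(f"{value:+.1f} -> {squash(value):.3f}")
# -1.0 -> 0.119
# +0.0 -> 0.500
# +1.0 -> 0.881
```

Values within one standard deviation therefore span most of the colour range, which is where the contrast comes from.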
+
+ ## Model details
+
+ - **Architecture**: ViT vision encoder (12 layers) + Transformer text encoder (12 layers)
+ - **Image preprocessing**: resize to any resolution, convert to `[0, 1]` (no ImageNet normalization)
+ - **Text preprocessing**: SentencePiece tokenizer, lowercased, max 64 tokens
+ - **Patch size**: 14x14 pixels
+
+ ## License
+
+ Apache 2.0
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{cao2026tipsv2,
+ title = {{TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment}},
+ author = {Cao, Bingyi and Chen, Koert and Maninis, Kevis-Kokitsi and Chen, Kaifeng and Karpur, Arjun and Xia, Ye and Dua, Sahil and Dabral, Tanmaya and Han, Guangxing and Han, Bohyung and Ainslie, Joshua and Bewley, Alex and Jacob, Mithun and Wagner, Rene and Ramos, Washington and Choromanski, Krzysztof and Seyedhosseini, Mojtaba and Zhou, Howard and Araujo, Andre},
+ booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+ year = {2026}
+ }
+ ```