---
license: apache-2.0
tags:
- vision
- image-text
- contrastive-learning
- zero-shot
- feature-extraction
library_name: transformers
pipeline_tag: zero-shot-image-classification
---

# TIPSv2 — B/14

TIPSv2 (Text-Image Pretraining with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the Base variant, with an 86M-parameter vision encoder and a 110M-parameter text encoder.

| Variant   | Vision params | Text params | Embed dim |
|-----------|---------------|-------------|-----------|
| B/14      | 86M           | 110M        | 768       |
| L/14      | 303M          | 184M        | 1024      |
| SO400m/14 | 412M          | 448M        | 1152      |
| g/14      | 1.1B          | 389M        | 1536      |

## Usage

```bash
pip install transformers torch torchvision sentencepiece
```

### Load the model

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("google/tipsv2-b14", trust_remote_code=True)
model.eval()
```

### Encode images

Images should be tensors in the `[0, 1]` range (just `ToTensor()`; no ImageNet normalization).

```python
from torchvision import transforms
from PIL import Image

transform = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])

image = transform(Image.open("photo.jpg")).unsqueeze(0)
out = model.encode_image(image)

out.cls_token     # (B, 1, 768)
out.patch_tokens  # (B, N, 768)
```
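
At 448×448 with 14×14 patches, the patch grid is 32×32, so N = 1024. For dense tasks, the flat patch sequence can be reshaped back into that grid. A minimal sketch, using a dummy tensor as a stand-in for `out.patch_tokens`:

```python
import torch

# Assumed from the model card: 448x448 input, 14x14 patches, embed dim 768.
image_size, patch_size, embed_dim = 448, 14, 768
grid = image_size // patch_size   # 32 patches per side
num_patches = grid * grid         # 1024

# Dummy stand-in for out.patch_tokens with shape (B, N, 768).
patch_tokens = torch.randn(1, num_patches, embed_dim)

# Reshape the token sequence to a (B, H', W', C) spatial feature map.
feature_map = patch_tokens.reshape(1, grid, grid, embed_dim)
print(feature_map.shape)  # torch.Size([1, 32, 32, 768])
```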

### Encode text

```python
text_emb = model.encode_text(["a photo of a cat", "a photo of a dog"])
# (2, 768)
```
65
+
66
+ ### Zero-shot classification
67
+
68
+ ```python
69
+ import torch.nn.functional as F
70
+
71
+ cls = F.normalize(out.cls_token[:, 0, :], dim=-1)
72
+ text_emb = F.normalize(model.encode_text(["cat", "dog", "car"]), dim=-1)
73
+ similarity = cls @ text_emb.T
74
+ prediction = similarity.argmax(dim=-1)
75
+ ```
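
To turn the cosine similarities into class probabilities, apply a softmax, optionally sharpened with a temperature (the value below is an arbitrary assumption, not a constant shipped with the model). A self-contained sketch using random embeddings in place of real model outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Random stand-ins for one image embedding and three text embeddings (dim 768).
cls = F.normalize(torch.randn(1, 768), dim=-1)
text_emb = F.normalize(torch.randn(3, 768), dim=-1)

similarity = cls @ text_emb.T      # cosine similarities, shape (1, 3)
temperature = 0.01                 # assumed value; unit-norm similarities need sharpening
probs = (similarity / temperature).softmax(dim=-1)

print(probs.sum(dim=-1))     # each row sums to 1
print(probs.argmax(dim=-1))  # index of the best-matching label
```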
76
+
77
+ ### GPU inference
78
+
79
+ ```python
80
+ model = model.cuda()
81
+ out = model.encode_image(image.cuda())
82
+ text_emb = model.encode_text(["a city"])
83
+ ```
84
+
85
+ ## Model details
86
+
87
+ - **Architecture**: ViT vision encoder (12 layers) + Transformer text encoder (12 layers)
88
+ - **Image preprocessing**: resize to any resolution, convert to `[0, 1]` (no ImageNet normalization)
89
+ - **Text preprocessing**: SentencePiece tokenizer, lowercased, max 64 tokens
90
+ - **Patch size**: 14x14 pixels
91
+
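Because features are computed on 14×14 patches, the patch-token count scales with input resolution. The helper below is a hypothetical utility (not part of the model API) that computes the grid size, under the assumption that height and width are multiples of the patch size:

```python
# Hypothetical helper, not part of the model API: patch-grid size for a
# given input resolution, assuming dimensions are multiples of 14.
PATCH_SIZE = 14

def patch_grid(height: int, width: int, patch_size: int = PATCH_SIZE):
    if height % patch_size or width % patch_size:
        raise ValueError("height and width should be multiples of the patch size")
    return height // patch_size, width // patch_size

rows, cols = patch_grid(448, 448)
print(rows, cols, rows * cols)  # 32 32 1024

rows, cols = patch_grid(224, 336)
print(rows, cols, rows * cols)  # 16 24 384
```
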
## License

Apache 2.0