gberton committed · verified
Commit fda36e6 · Parent(s): 5b2721b

Upload README.md with huggingface_hub

Files changed (1): README.md (+9 −24)

README.md CHANGED
@@ -12,22 +12,17 @@ pipeline_tag: zero-shot-image-classification

# TIPSv2 — L/14

- TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of
- contrastive vision-language models that produce spatially rich image features
- aligned with text embeddings. This is the **Large** variant with a ViT-L/14
- vision encoder and a matching text encoder.
+ TIPSv2 (Text-Image Pre-training with Spatial awareness) is a family of contrastive vision-language models that produce spatially rich image features aligned with text embeddings. This is the Large variant with a ViT-L/14 vision encoder and a matching text encoder.

| Variant | Vision params | Text params | Embed dim |
|---------|--------------|-------------|-----------|
| B/14 | 86M | 110M | 768 |
- | **L/14 (this)** | **303M** | **184M** | **1024** |
+ | L/14 | 303M | 184M | 1024 |
| SO400m/14 | 412M | 448M | 1152 |
| g/14 | 1.1B | 389M | 1536 |

## Usage

- ### Install dependencies
-
```bash
pip install transformers torch torchvision sentencepiece
```
@@ -57,17 +52,15 @@ transform = transforms.Compose([
image = transform(Image.open("photo.jpg")).unsqueeze(0)
out = model.encode_image(image)

- out.cls_token # (B, 1, 1024) — global CLS feature
- out.patch_tokens # (B, 1024, 1024) — spatial features (32x32 grid)
+ out.cls_token # (B, 1, 1024)
+ out.patch_tokens # (B, 1024, 1024)
```

### Encode text

- Pass strings directly — tokenization is handled internally.
-
```python
text_emb = model.encode_text(["a photo of a cat", "a photo of a dog"])
- # text_emb: (2, 1024)
+ # (2, 1024)
```

### Zero-shot classification
@@ -78,22 +71,17 @@ import torch.nn.functional as F
cls = F.normalize(out.cls_token[:, 0, :], dim=-1)
text_emb = F.normalize(model.encode_text(["cat", "dog", "car"]), dim=-1)

- similarity = cls @ text_emb.T # (B, num_classes)
+ similarity = cls @ text_emb.T
prediction = similarity.argmax(dim=-1)
```

### Zero-shot segmentation

- Use spatial patch tokens for dense prediction tasks.
-
```python
import numpy as np
from sklearn.decomposition import PCA

- # Spatial features
- spatial = out.patch_tokens.reshape(1, 32, 32, 1024) # (B, H, W, D)
-
- # PCA visualization
+ spatial = out.patch_tokens.reshape(1, 32, 32, 1024)
feat = spatial[0].detach().numpy().reshape(-1, 1024)
rgb = PCA(n_components=3).fit_transform(feat).reshape(32, 32, 3)
```
@@ -103,18 +91,15 @@ rgb = PCA(n_components=3).fit_transform(feat).reshape(32, 32, 3)
```python
model = model.cuda()
out = model.encode_image(image.cuda())
- text_emb = model.encode_text(["a city"]) # auto-moved to model device
+ text_emb = model.encode_text(["a city"])
```

## Model details

- **Architecture**: ViT-L/14 vision encoder (24 layers) + Transformer text encoder (12 layers)
- - **Pre-training**: Contrastive image-text learning with spatial awareness
- - **Image preprocessing**: Resize to 448x448, convert to `[0, 1]` (no normalization)
+ - **Image preprocessing**: resize to any resolution, convert to `[0, 1]` (no ImageNet normalization)
- **Text preprocessing**: SentencePiece tokenizer, lowercased, max 64 tokens
- **Patch size**: 14x14 pixels
- - **Register tokens**: 1
- - **Position embeddings**: Interpolated at inference, supports any resolution

## License
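The zero-shot classification path in the README (both before and after this commit) reduces to cosine similarity between L2-normalized embeddings. A minimal sketch of that logic with random numpy vectors standing in for the model's image and text embeddings — no TIPSv2 weights involved; only the shapes (embed dim 1024 for L/14) come from the README:

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """L2-normalize along an axis, mirroring F.normalize(..., dim=-1)."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Stand-ins for model outputs: one image CLS feature, three text embeddings.
cls = l2_normalize(rng.standard_normal((1, 1024)))       # (B, 1024)
text_emb = l2_normalize(rng.standard_normal((3, 1024)))  # (num_classes, 1024)

# Cosine similarity between unit vectors is a plain dot product.
similarity = cls @ text_emb.T            # (B, num_classes), values in [-1, 1]
prediction = similarity.argmax(axis=-1)  # index of the best-matching prompt
```

With real embeddings the only change is where `cls` and `text_emb` come from (`out.cls_token[:, 0, :]` and `model.encode_text(...)`); the similarity and argmax steps are identical.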
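The shape handling in the README's segmentation snippet can likewise be checked without the model. A sketch with random features in place of real patch tokens, using a plain SVD where the README uses sklearn's `PCA` (same projection up to component sign); the 32x32 grid assumes the default 448x448 input resolution (448 / 14 = 32 patches per side):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for out.patch_tokens: (B, num_patches, dim) = (1, 1024, 1024).
patch_tokens = rng.standard_normal((1, 1024, 1024))

# Arrange the 1024 patch tokens on their 32x32 spatial grid.
spatial = patch_tokens.reshape(1, 32, 32, 1024)  # (B, H, W, D)
feat = spatial[0].reshape(-1, 1024)              # (H*W, D)

# Project onto the top 3 principal components via SVD of the centered data.
centered = feat - feat.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
rgb = (centered @ vt[:3].T).reshape(32, 32, 3)

# Min-max scale to [0, 1] so the components can be viewed as an RGB image.
rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min())
```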