---
license: apache-2.0
pipeline_tag: image-text-matching
model-index:
- name: SPEC-CLIP-ViT-B-32
  results:
  - task:
      type: image-text-matching
    dataset:
      name: SPEC
      type: compositional-reasoning
    metrics:
    - name: Absolute Size I2T
      type: Image to Text Matching
      value: 68.9
    - name: Absolute Size T2I
      type: Image to Text Matching
      value: 60.7
    - name: Relative Size I2T
      type: Image to Text Matching
      value: 40.3
    - name: Relative Size T2I
      type: Image to Text Matching
      value: 44.1
    - name: Absolute Position I2T
      type: Image to Text Matching
      value: 30.6
    - name: Absolute Position T2I
      type: Image to Text Matching
      value: 34.2
    - name: Relative Position I2T
      type: Image to Text Matching
      value: 46.6
    - name: Relative Position T2I
      type: Image to Text Matching
      value: 46.9
    - name: Existence I2T
      type: Image to Text Matching
      value: 83.4
    - name: Existence T2I
      type: Image to Text Matching
      value: 53.1
    - name: Count I2T
      type: Image to Text Matching
      value: 55.6
    - name: Count T2I
      type: Image to Text Matching
      value: 57.8
    source:
      name: SPEC paper
      url: https://arxiv.org/pdf/2312.00081
---
# SPEC-CLIP-ViT-B-32

### Model Sources
[**Code**](https://github.com/wjpoom/SPEC) | [**Paper**](https://huggingface.co/papers/2312.00081) | [**arXiv**](https://arxiv.org/abs/2312.00081)

### Model Usage
* Download the checkpoint:
```shell
huggingface-cli download wjpoom/SPEC-CLIP-ViT-B-32 --local-dir checkpoints/SPEC-CLIP-ViT-B-32
```
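
The same files can also be fetched from Python via `huggingface_hub.snapshot_download` (a sketch; the target directory mirrors the CLI command above and is only a suggestion):

```python
from huggingface_hub import snapshot_download

# Download the full model repository into a local checkpoint directory.
local_dir = snapshot_download(
    repo_id="wjpoom/SPEC-CLIP-ViT-B-32",
    local_dir="checkpoints/SPEC-CLIP-ViT-B-32",
)
print(local_dir)
```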

* Load the model:
```python
# pip install open_clip_torch
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='checkpoints/SPEC-CLIP-ViT-B-32', load_weights_only=False
)
model.eval()  # model is in train mode by default; matters for models with BatchNorm or stochastic depth
tokenizer = open_clip.get_tokenizer('ViT-B-32')

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1., 0., 0.]]
```
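
The scoring at the end of the snippet is just L2-normalized dot products followed by a softmax over the candidate captions. A minimal NumPy sketch of that step, using made-up 3-d feature vectors in place of real model outputs (actual ViT-B-32 features are 512-dimensional):

```python
import numpy as np

# Toy vectors standing in for encode_image / encode_text outputs.
image_features = np.array([[0.9, 0.1, 0.0]])
text_features = np.array([[1.0, 0.0, 0.0],   # "a diagram"
                          [0.0, 1.0, 0.0],   # "a dog"
                          [0.0, 0.0, 1.0]])  # "a cat"

# L2-normalize, as in the snippet above.
image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)

# Scaled cosine similarities, then a numerically stable softmax over the texts.
logits = 100.0 * image_features @ text_features.T
text_probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
text_probs = text_probs / text_probs.sum(axis=-1, keepdims=True)

print(text_probs)  # probability mass concentrates on the first (closest) text
```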

## Contact
Feel free to contact us if you have any questions or suggestions.
- Email (Wujian Peng): wjpeng24@m.fudan.edu.cn

## Citation
```bibtex
@inproceedings{peng2024synthesize,
  title={Synthesize, diagnose, and optimize: Towards fine-grained vision-language understanding},
  author={Peng, Wujian and Xie, Sicheng and You, Zuyao and Lan, Shiyi and Wu, Zuxuan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={13279--13288},
  year={2024}
}
```