Improve FG-CLIP 2 model card: Add Chinese language, project page, and enhanced sample usage

#1
by nielsr HF Staff - opened
Files changed (1): README.md (+31 -21)

README.md CHANGED
@@ -1,6 +1,7 @@
 ---
 language:
 - en
+- zh
 library_name: transformers
 license: apache-2.0
 pipeline_tag: zero-shot-image-classification
@@ -9,14 +10,16 @@ tags:
 ---
 
 # FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model
+
 Code: https://github.com/360CVGroup/FG-CLIP
+Project page: https://360cvgroup.github.io/FG-CLIP
 
 FG-CLIP 2 is the foundation model for fine-grained vision-language understanding in both English and Chinese.
 Across 29 datasets and 8 diverse tasks, it consistently surpasses recent strong baselines such as SigLIP 2 and MetaCLIP 2, achieving the best reported performance to date in both languages.
 
 **[FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model](https://arxiv.org/abs/2510.10921)**
 </br>
-Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng†, Yuhui Yin(*Equal Contribution, Corresponding Author)
+Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
 </br>
 [![arXiv](https://img.shields.io/badge/arXiv-2510.10921-b31b1b.svg)](https://arxiv.org/abs/2510.10921)
 [![HF-model](https://img.shields.io/badge/Model-Collection🤗-yellow.svg)](https://huggingface.co/collections/qihoo360/fg-clip-2-68ecbf9c548623bb78bc7913)
@@ -25,18 +28,22 @@ Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Len
 
 **[FG-CLIP: Fine-Grained Visual and Textual Alignment](https://arxiv.org/abs/2505.05071)** ([code branch: v1.0](https://github.com/360CVGroup/FG-CLIP/tree/v1.0))
 </br>
-Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, Corresponding Author)
+Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
 </br>
 [![arXiv](https://img.shields.io/badge/arXiv-2505.05071-b31b1b.svg)](https://arxiv.org/abs/2505.05071)
 [![ICML](https://img.shields.io/badge/ICML-2025-blue.svg)](https://icml.cc/Conferences/2025)
 [![HF-model](https://img.shields.io/badge/Model-Collection🤗-yellow.svg)](https://huggingface.co/collections/qihoo360/fg-clip-681da45d4acfb65c240a6d08)
 [![HF-data](https://img.shields.io/badge/Data-FineHARD🤗-yellow.svg)](https://huggingface.co/datasets/qihoo360/FineHARD)
 [![DeepWiki](https://img.shields.io/badge/DeepWiki-FG--CLIP-blue.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACwAAAAyCAYAAAAnWDnqAAAAAXNSR0IArs4c6QAAA05JREFUaEPtmUtyEzEQhtWTQyQLHNak2AB7ZnyXZMEjXMGeK/AIi+QuHrMnbChYY7MIh8g01fJoopFb0uhhEqqcbWTp06/uv1saEDv4O3n3dV60RfP947Mm9/SQc0ICFQgzfc4CYZoTPAswgSJCCUJUnAAoRHOAUOcATwbmVLWdGoH//PB8mnKqScAhsD0kYP3j/Yt5LPQe2KvcXmGvRHcDnpxfL2zOYJ1mFwrryWTz0advv1Ut4CJgf5uhDuDj5eUcAUoahrdY/56ebRWeraTjMt/00Sh3UDtjgHtQNHwcRGOC98BJEAEymycmYcWwOprTgcB6VZ5JK5TAJ+fXGLBm3FDAmn6oPPjR4rKCAoJCal2eAiQp2x0vxTPB3ALO2CRkwmDy5WohzBDwSEFKRwPbknEggCPB/imwrycgxX2NzoMCHhPkDwqYMr9tRcP5qNrMZHkVnOjRMWwLCcr8ohBVb1OMjxLwGCvjTikrsBOiA6fNyCrm8V1rP93iVPpwaE+gO0SsWmPiXB+jikdf6SizrT5qKasx5j8ABbHpFTx+vFXp9EnYQmLx02h1QTTrl6eDqxLnGjporxl3NL3agEvXdT0WmEost648sQOYAeJS9Q7bfUVoMGnjo4AZdUMQku50McDcMWcBPvr0SzbTAFDfvJqwLzgxwATnCgnp4wDl6Aa+Ax283gghmj+vj7feE2KBBRMW3FzOpLOADl0Isb5587h/U4gGvkt5v60Z1VLG8BhYjbzRwyQZemwAd6cCR5/XFWLYZRIMpX39AR0tjaGGiGzLVyhse5C9RKC6ai42ppWPKiBagOvaYk8lO7DajerabOZP46Lby5wKjw1HCRx7p9sVMOWGzb/vA1hwiWc6jm3MvQDTogQkiqIhJV0nBQBTU+3okKCFDy9WwferkHjtxib7t3xIUQtHxnIwtx4mpg26/HfwVNVDb4oI9RHmx5WGelRVlrtiw43zboCLaxv46AZeB3IlTkwouebTr1y2NjSpHz68WNFjHvupy3q8TFn3Hos2IAk4Ju5dCo8B3wP7VPr/FGaKiG+T+v+TQqIrOqMTL1VdWV1DdmcbO8KXBz6esmYWYKPwDL5b5FA1a0hwapHiom0r/cKaoqr+27/XcrS5UwSMbQAAAABJRU5ErkJggg==)](https://deepwiki.com/360CVGroup/FG-CLIP)
+
+<p align="center">
+  <img src="https://huggingface.co/qihoo360/fg-clip2-base/resolve/main/use_imgs/FGCLIP2_compare_all_n.png" width="500" height="440"/>
+</p>
 
 ## Quick Start 🤗
 
 ### Load Model
-```Shell
+```python
 import torch
 from PIL import Image
 from transformers import (
@@ -59,12 +66,10 @@ image_processor = AutoImageProcessor.from_pretrained(model_root)
 
 ### Retrieval
 
-```Shell
+```python
 def determine_max_value(image):
-
     w,h = image.size
     max_val = (w//16)*(h//16)
-
     if max_val > 784:
         return 1024
     elif max_val > 576:
@@ -81,32 +86,38 @@ image = Image.open(img_root).convert("RGB")
 
 image_input = image_processor(images=image, max_num_patches=determine_max_value(image), return_tensors="pt").to(device)
 
-# NOTE Short captions: max_length=64
+# NOTE Short captions: max_length=64, walk_type="short" (default)
+# NOTE Long captions:  max_length=196, walk_type="long"
 
-captions = ["a photo of two cats", "a photo of a cat"]
+captions = [
+    "一个简约风格的卧室角落,黑色金属衣架上挂着多件米色和白色的衣物,下方架子放着两双浅色鞋子,旁边是一盆绿植,左侧可见一张铺有白色床单和灰色枕头的床。",
+    "一个简约风格的卧室角落,黑色金属衣架上挂着多件红色和蓝色的衣物,下方架子放着两双黑色高跟鞋,旁边是一盆绿植,左侧可见一张铺有白色床单和灰色枕头的床。",
+    "一个简约风格的卧室角落,黑色金属衣架上挂着多件米色和白色的衣物,下方架子放着两双运动鞋,旁边是一盆仙人掌,左侧可见一张铺有白色床单和灰色枕头的床。",
+    "一个繁忙的街头市场,摊位上摆满水果,背景是高楼大厦,人们在喧闹中购物。"
+]
 captions = [caption.lower() for caption in captions]
 
-caption_input = tokenizer(captions, padding="max_length", max_length=64, truncation=True, return_tensors="pt").to(device)
+caption_input = tokenizer(captions, padding="max_length", max_length=196, truncation=True, return_tensors="pt").to(device)
 
 
 with torch.no_grad():
     image_feature = model.get_image_features(**image_input)
-    text_feature = model.get_text_features(**caption_input)
+    text_feature = model.get_text_features(**caption_input, walk_type="long")
     image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
     text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
 
 logits_per_image = image_feature @ text_feature.T
 logit_scale, logit_bias = model.logit_scale.to(text_feature.device), model.logit_bias.to(text_feature.device)
 logits_per_image = logits_per_image * logit_scale.exp() + logit_bias
-probs = torch.sigmoid(logits_per_image)
-# [[0.5322, 0.0048]]
-print(probs)
-
+# NOTE The upstream GitHub example stops at the raw logits for retrieval.
 ```
+<p align="left">
+  <img src="https://huggingface.co/qihoo360/fg-clip2-base/resolve/main/use_imgs/cn_re_demo.png" width=100%/>
+</p>
 
 ### Dense feature effect display
 
-```Shell
+```python
 
 import math
 import matplotlib
@@ -116,7 +127,9 @@ import matplotlib.pyplot as plt
 
 img_root = "cat_dfclor.jpg"
 image = Image.open(img_root).convert("RGB")
-image = resize_short_edge(image,target_size=2048)
+# NOTE 'resize_short_edge' is not defined in this snippet or elsewhere in the card;
+# the input image is assumed to be pre-sized appropriately.
+# image = resize_short_edge(image, target_size=2048)
 
 image_input = image_processor(images=image, max_num_patches=16384, return_tensors="pt").to(device)
 captions = ["电脑","黑猫","窗户","window","white cat","book"]
@@ -129,8 +142,6 @@ with torch.no_grad():
     real_w = spatial_values[1].item()
     real_pixel_tokens_num = real_w*real_h
     dense_image_feature = dense_image_feature[0][:real_pixel_tokens_num]
-
-
 captions = [caption.lower() for caption in captions]
 caption_input = tokenizer(captions, padding="max_length", max_length=64, truncation=True, return_tensors="pt").to(device)
 
@@ -180,7 +191,7 @@ plt.close()
 ```
 
 <p align="left">
-  <img src="FGCLIP2_dfcolor_cat_all_2K.png" width=50%/>
+  <img src="https://huggingface.co/qihoo360/fg-clip2-base/resolve/main/use_imgs/FGCLIP2_dfcolor_cat_all_2K.png" width=100%/>
 </p>
 
 ## Citation
@@ -204,7 +215,6 @@ If you find FG-CLIP 2 useful for your research and applications, please cite usi
 ```
 
 
-
 ## License
 
 This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.
 
README.md (after changes; changed sections rendered below):

---
language:
- en
- zh
library_name: transformers
license: apache-2.0
pipeline_tag: zero-shot-image-classification
---

# FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model

Code: https://github.com/360CVGroup/FG-CLIP
Project page: https://360cvgroup.github.io/FG-CLIP

FG-CLIP 2 is the foundation model for fine-grained vision-language understanding in both English and Chinese.
Across 29 datasets and 8 diverse tasks, it consistently surpasses recent strong baselines such as SigLIP 2 and MetaCLIP 2, achieving the best reported performance to date in both languages.

**[FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model](https://arxiv.org/abs/2510.10921)**
</br>
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
</br>
[![arXiv](https://img.shields.io/badge/arXiv-2510.10921-b31b1b.svg)](https://arxiv.org/abs/2510.10921)
[![HF-model](https://img.shields.io/badge/Model-Collection🤗-yellow.svg)](https://huggingface.co/collections/qihoo360/fg-clip-2-68ecbf9c548623bb78bc7913)

**[FG-CLIP: Fine-Grained Visual and Textual Alignment](https://arxiv.org/abs/2505.05071)** ([code branch: v1.0](https://github.com/360CVGroup/FG-CLIP/tree/v1.0))
</br>
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
</br>
[![arXiv](https://img.shields.io/badge/arXiv-2505.05071-b31b1b.svg)](https://arxiv.org/abs/2505.05071)
[![ICML](https://img.shields.io/badge/ICML-2025-blue.svg)](https://icml.cc/Conferences/2025)
[![HF-model](https://img.shields.io/badge/Model-Collection🤗-yellow.svg)](https://huggingface.co/collections/qihoo360/fg-clip-681da45d4acfb65c240a6d08)
[![HF-data](https://img.shields.io/badge/Data-FineHARD🤗-yellow.svg)](https://huggingface.co/datasets/qihoo360/FineHARD)
[![DeepWiki](https://img.shields.io/badge/DeepWiki-FG--CLIP-blue.svg?logo=data:image/png;base64,...)](https://deepwiki.com/360CVGroup/FG-CLIP)

<p align="center">
  <img src="https://huggingface.co/qihoo360/fg-clip2-base/resolve/main/use_imgs/FGCLIP2_compare_all_n.png" width="500" height="440"/>
</p>

## Quick Start 🤗

### Load Model
```python
import torch
from PIL import Image
from transformers import (
    ...  # remaining imports and the model/tokenizer/image-processor setup are elided by the diff
```
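The diff truncates the Load Model snippet right after the import list. For orientation, here is a minimal loading sketch: the `AutoImageProcessor.from_pretrained(model_root)` line is confirmed by the diff context above, while the checkpoint id, the `AutoModelForCausalLM` class, and the `trust_remote_code=True` flag follow the pattern of the FG-CLIP v1 card and are assumptions, not part of this PR:

```python
import torch
from transformers import AutoImageProcessor, AutoModelForCausalLM, AutoTokenizer

model_root = "qihoo360/fg-clip2-base"  # assumed checkpoint id, matching the image URLs in this card

# Assumed model class and flag, mirroring the FG-CLIP v1 card.
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)  # confirmed by the diff context
```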
 
### Retrieval

```python
def determine_max_value(image):
    # Choose a patch budget from the native resolution: the processor tiles
    # the image into 16x16 patches, so (w//16)*(h//16) is the patch count at
    # full size, which is then rounded up to the next supported budget.
    w, h = image.size
    max_val = (w//16)*(h//16)
    if max_val > 784:
        return 1024
    elif max_val > 576:
        ...  # remaining thresholds and the image loading are elided by the diff

image_input = image_processor(images=image, max_num_patches=determine_max_value(image), return_tensors="pt").to(device)

# NOTE Short captions: max_length=64, walk_type="short" (default)
# NOTE Long captions:  max_length=196, walk_type="long"

captions = [
    "一个简约风格的卧室角落,黑色金属衣架上挂着多件米色和白色的衣物,下方架子放着两双浅色鞋子,旁边是一盆绿植,左侧可见一张铺有白色床单和灰色枕头的床。",
    "一个简约风格的卧室角落,黑色金属衣架上挂着多件红色和蓝色的衣物,下方架子放着两双黑色高跟鞋,旁边是一盆绿植,左侧可见一张铺有白色床单和灰色枕头的床。",
    "一个简约风格的卧室角落,黑色金属衣架上挂着多件米色和白色的衣物,下方架子放着两双运动鞋,旁边是一盆仙人掌,左侧可见一张铺有白色床单和灰色枕头的床。",
    "一个繁忙的街头市场,摊位上摆满水果,背景是高楼大厦,人们在喧闹中购物。"
]
# The first three captions describe the same minimalist bedroom corner and differ
# only in fine-grained details (clothing colors, shoe type, plant species); the
# fourth, a busy street market, is an unrelated distractor.
captions = [caption.lower() for caption in captions]

caption_input = tokenizer(captions, padding="max_length", max_length=196, truncation=True, return_tensors="pt").to(device)

with torch.no_grad():
    image_feature = model.get_image_features(**image_input)
    text_feature = model.get_text_features(**caption_input, walk_type="long")
    image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)

logits_per_image = image_feature @ text_feature.T
logit_scale, logit_bias = model.logit_scale.to(text_feature.device), model.logit_bias.to(text_feature.device)
logits_per_image = logits_per_image * logit_scale.exp() + logit_bias
# NOTE The upstream GitHub example stops at the raw logits for retrieval.
```
<p align="left">
  <img src="https://huggingface.co/qihoo360/fg-clip2-base/resolve/main/use_imgs/cn_re_demo.png" width=100%/>
</p>
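The retrieval snippet stops at the raw sigmoid-style logits. A minimal sketch for turning them into match probabilities and a ranking, reusing `logits_per_image` from the block above; the `torch.sigmoid` step mirrors what the previous version of this card printed:

```python
import torch

# logits_per_image: [num_images, num_captions], already scaled and biased above.
probs = torch.sigmoid(logits_per_image)  # independent match probability per image-caption pair
best = probs.argmax(dim=-1)              # index of the best-matching caption for each image
for img_idx, cap_idx in enumerate(best.tolist()):
    print(f"image {img_idx}: caption #{cap_idx} (p={probs[img_idx, cap_idx].item():.4f})")
```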
### Dense feature effect display

```python
import math
import matplotlib
...  # remaining imports and the dense-feature forward pass are elided by the diff

img_root = "cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
# NOTE 'resize_short_edge' is not defined in this snippet or elsewhere in the card;
# the input image is assumed to be pre-sized appropriately.
# image = resize_short_edge(image, target_size=2048)

image_input = image_processor(images=image, max_num_patches=16384, return_tensors="pt").to(device)
# Mixed Chinese/English probes: "computer", "black cat", "window", "window", "white cat", "book".
captions = ["电脑","黑猫","窗户","window","white cat","book"]

    # (inside the torch.no_grad() block elided by the diff:)
    real_w = spatial_values[1].item()
    real_pixel_tokens_num = real_w*real_h
    dense_image_feature = dense_image_feature[0][:real_pixel_tokens_num]
captions = [caption.lower() for caption in captions]
caption_input = tokenizer(captions, padding="max_length", max_length=64, truncation=True, return_tensors="pt").to(device)

...  # similarity computation and matplotlib plotting elided by the diff
```

<p align="left">
  <img src="https://huggingface.co/qihoo360/fg-clip2-base/resolve/main/use_imgs/FGCLIP2_dfcolor_cat_all_2K.png" width=100%/>
</p>
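The similarity-map step between the tokenizer call and `plt.close()` is elided by the diff. A minimal sketch of what it computes, assuming `dense_image_feature` holds one feature vector per image patch and `real_h`/`real_w` give the patch grid from the block above; the exact upstream plotting differs:

```python
import torch
import matplotlib.pyplot as plt

with torch.no_grad():
    # Short captions: default walk_type, max_length=64 (as tokenized above).
    text_feature = model.get_text_features(**caption_input)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    patch_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)
    similarity = patch_feature @ text_feature.T  # [num_patches, num_captions]

# One heatmap per caption, reshaped onto the real_h x real_w patch grid.
for i, caption in enumerate(captions):
    heatmap = similarity[:, i].reshape(real_h, real_w).float().cpu().numpy()
    plt.imshow(heatmap)
    plt.title(caption)
    plt.axis("off")
    plt.savefig(f"dense_heatmap_{i}.png", bbox_inches="tight")
plt.close("all")
```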
## Citation

(BibTeX entries unchanged; elided by the diff.)

## License

This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.