kimyoungjune committed
Commit 97e3347 · verified · Parent: 1ebf474

Update README.md

Files changed (1): README.md (+7 −18)

````diff
@@ -1,8 +1,6 @@
-## About `GME-VARCO-VISION-Embedding`
+## About GME-VARCO-VISION-Embedding
 `GME-VARCO-VISION-Embedding` is a multimodal embedding model that computes semantic similarity between text, images, and videos in a high-dimensional embedding space. In particular, the model focuses on video retrieval, which demands greater complexity and contextual understanding than image retrieval. It achieves high retrieval accuracy and strong generalization performance across various scenarios, such as scene-based search, description-based search, and question-answering-based search.
 
-<br>
-
 ## Demo Video
 Check out our demo videos showcasing our multimodal embedding model in action:
 - [English Demo Video](https://www.youtube.com/watch?v=kCvz82Y1BQg)
@@ -10,8 +8,6 @@ Check out our demo videos showcasing our multimodal embedding model in action:
 
 The demo demonstrates how our embedding model works together with an AI agent to search for relevant videos based on user queries and generate responses using the retrieved video content.
 
-<br>
-
 ### Model Architecture and Training Method
 `GME-VARCO-VISION-Embedding` is based on [`Qwen/Qwen2-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), and uses the parameters of [`Alibaba-NLP/gme-Qwen2-VL-7B-Instruct`](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-7B-Instruct) to improve the model's general retrieval ability.
 
@@ -22,11 +18,11 @@ as hard-negatives for the gold answers. Each query is trained with in-batch ne
 #### 2. Adding Retrieval Vector
 To compensate for the insufficiency of training instances and enhance the generalization ability of the fine-tuned model, we compute a retrieval vector 𝜏 by subtracting the weights of the original [`Qwen/Qwen2-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) model from those of [`Alibaba-NLP/gme-Qwen2-VL-7B-Instruct`](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-7B-Instruct), a Qwen2-VL-based image-text retrieval model. This approach is inspired by Chat Vector, a method that equips pre-trained language models with chat capabilities in new languages by adding a vector obtained from the weight difference between a base model and its chat-optimized counterpart.
 
-<br>
 
 ### Performance
 Our model achieves **state-of-the-art (SOTA) zero-shot performance** on the MultiVENT2.0 dataset as of July 2025. See the [official leaderboard](https://eval.ai/web/challenges/challenge-page/2507/leaderboard/6262) for detailed results.
 
+
 <br>
 
 ## Code Examples
@@ -49,14 +45,7 @@ model = Qwen2VLForConditionalGeneration.from_pretrained(
     device_map="auto",
 )
 
-tokenizer = AutoTokenizer.from_pretrained(
-    model_name,
-    padding_side="left",
-)
-
-
-processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
-processor.tokenizer = tokenizer
+processor = AutoProcessor.from_pretrained(model_name)
 device = model.device
 
 
@@ -77,7 +66,7 @@ qry_input = processor(
     text=[qry_txt],
     padding=True,
     return_tensors="pt",
-).to("cuda")
+).to(device)
 
 
 img_msg = [
@@ -134,7 +123,7 @@ image_inputs = processor(
     # videos=,
     padding=True,
     return_tensors="pt",
-).to("cuda")
+).to(device)
 
 with torch.inference_mode():
     qry_emb = model(
@@ -149,7 +138,7 @@ qry_emb = F.normalize(qry_emb, dim=-1)
 img_emb = F.normalize(img_emb, dim=-1)
 
 score = qry_emb @ img_emb.t()
-# tensor([[0.2432, 0.0962, 0.0747, 0.0898]], device='cuda:0', dtype=torch.bfloat16)
+# tensor([[0.3066, 0.1108, 0.1226, 0.1245]], device='cuda:0', dtype=torch.bfloat16)
 # corresponding to the score of photos (cat, dog, baseball, crowd)
 ```
 <br>
@@ -179,7 +168,7 @@ video_input = processor(
     videos=video_input,
     padding=True,
     return_tensors="pt",
-).to("cuda")
+).to(device)
 
 with torch.inference_mode():
     video_emb = model(
````
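The hunk context at `@@ -22,11 +18,11 @@` is the only glimpse the diff gives of training step 1: each query is paired with hard negatives for the gold answers and trained with in-batch negatives. For orientation, a minimal sketch of that kind of contrastive (InfoNCE-style) objective follows; the function name, the assumption of one mined hard negative per query, and the temperature value are illustrative, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(qry_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """InfoNCE-style loss: the gold answer is the positive; the other
    queries' gold answers (in-batch negatives) and mined hard negatives
    are the competing candidates.

    qry_emb, pos_emb, hard_neg_emb: (B, D) tensors, where pos_emb[i] is
    the gold answer for qry_emb[i]. One hard negative per query is an
    assumption made for this sketch.
    """
    qry = F.normalize(qry_emb, dim=-1)
    cand = F.normalize(torch.cat([pos_emb, hard_neg_emb]), dim=-1)  # (2B, D)

    logits = (qry @ cand.t()) / temperature                 # (B, 2B) similarities
    labels = torch.arange(qry.size(0), device=qry.device)   # i-th candidate is gold
    return F.cross_entropy(logits, labels)
```

Under this layout each query scores its gold answer against 2B − 1 alternatives, which is the usual reading of "trained with in-batch negatives."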
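The "Adding Retrieval Vector" paragraph (old line 23 / new line 19 in the diff) describes plain weight arithmetic in the style of Chat Vector: 𝜏 = θ_gme − θ_base, added on top of the fine-tuned weights. A minimal sketch of that merge is below, assuming all three checkpoints share the Qwen2-VL architecture; the fine-tuned checkpoint path and the output path are placeholders, and loading three 7B state dicts at once is done only for clarity.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

def load_sd(name):
    # CPU load in bfloat16 to keep the three 7B state dicts manageable.
    return Qwen2VLForConditionalGeneration.from_pretrained(
        name, torch_dtype=torch.bfloat16
    ).state_dict()

base_sd = load_sd("Qwen/Qwen2-VL-7B-Instruct")
gme_sd = load_sd("Alibaba-NLP/gme-Qwen2-VL-7B-Instruct")

# Placeholder: stands in for the video-retrieval fine-tune the README describes.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "path/to/video-retrieval-finetune", torch_dtype=torch.bfloat16
)

merged = model.state_dict()
for name, base_w in base_sd.items():
    tau = gme_sd[name] - base_w        # retrieval vector 𝜏 for this tensor
    merged[name] = merged[name] + tau  # shift the fine-tuned weights by 𝜏

model.load_state_dict(merged)
model.save_pretrained("merged-checkpoint")  # hypothetical output path
```

In practice the difference can be applied tensor-by-tensor without holding three full models in memory; the loop above only illustrates the arithmetic.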