Commit ffeebf7 (verified) · kimyoungjune · Update README.md
Files changed (1): README.md (+183 −7)
## About `GME-VARCO-VISION-Embedding`
`GME-VARCO-VISION-Embedding` is a multimodal embedding model that maps text, images, and videos into a shared high-dimensional embedding space and computes semantic similarity between them. The model focuses in particular on video retrieval, which demands greater complexity and contextual understanding than image retrieval. It achieves high retrieval accuracy and strong generalization across diverse scenarios, such as scene-based, description-based, and question-answering-based search.

### Model Architecture and Training Method
`GME-VARCO-VISION-Embedding` is based on [`Qwen/Qwen2-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) and is initialized from the parameters of [`Alibaba-NLP/gme-Qwen2-VL-7B-Instruct`](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-7B-Instruct) to strengthen its general retrieval ability.

#### 1. Fine-Tuning (Contrastive Learning) on a Video Preference Dataset
To fine-tune the model efficiently, we use [ShareGPTVideo's 17k video preference dataset](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction), which includes prompts, videos, gold answers, and chosen-rejected text pairs. We treat the prompts and videos as queries, and the rejected responses as hard negatives for the gold answers. Each query is trained against its in-batch negatives plus one hard negative using the InfoNCE loss. The model is fully fine-tuned for two epochs on 8 A100 GPUs with a batch size of 8, requiring only a few hours of training.
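
A minimal sketch of this objective, assuming L2-normalized embeddings; the function name, tensor layout, and temperature value are illustrative assumptions, not the released training code:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(qry_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """qry_emb, pos_emb, hard_neg_emb: (batch, dim), L2-normalized."""
    # Diagonal of qry @ pos.T scores each query against its gold answer;
    # off-diagonal entries act as in-batch negatives.
    in_batch = qry_emb @ pos_emb.t()                       # (batch, batch)
    # Each query also gets one hard negative: its rejected response.
    hard = (qry_emb * hard_neg_emb).sum(-1, keepdim=True)  # (batch, 1)
    logits = torch.cat([in_batch, hard], dim=1) / temperature
    labels = torch.arange(qry_emb.size(0), device=qry_emb.device)
    return F.cross_entropy(logits, labels)
```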

#### 2. Adding Retrieval Vector
To compensate for the limited number of training instances and to enhance the generalization ability of the fine-tuned model, we compute a retrieval vector τ by subtracting the weights of the original [`Qwen/Qwen2-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) model from those of [`Alibaba-NLP/gme-Qwen2-VL-7B-Instruct`](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-7B-Instruct), a Qwen2-VL-based image-text retrieval model. This approach is inspired by Chat Vector, a method that equips pre-trained language models with chat capabilities in new languages by adding the weight difference between a base model and its chat-optimized counterpart.
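
A minimal sketch of this weight arithmetic, following the Chat Vector formulation (final = fine-tuned + τ); the fine-tuned checkpoint path is a hypothetical placeholder, and the exact merge recipe used for the released model is an assumption:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

base = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
gme = Qwen2VLForConditionalGeneration.from_pretrained("Alibaba-NLP/gme-Qwen2-VL-7B-Instruct")
# Hypothetical path to the checkpoint produced by step 1.
merged = Qwen2VLForConditionalGeneration.from_pretrained("path/to/finetuned-checkpoint")

base_sd, gme_sd = base.state_dict(), gme.state_dict()
merged_sd = merged.state_dict()
with torch.no_grad():
    for name, weight in merged_sd.items():
        tau = gme_sd[name] - base_sd[name]  # retrieval vector for this tensor
        merged_sd[name] = weight + tau
merged.load_state_dict(merged_sd)
```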

<br>

### Performance
Our model achieves **state-of-the-art (SOTA) zero-shot performance** on the MultiVENT2.0 dataset as of July 2025. See the [official leaderboard](https://eval.ai/web/challenges/challenge-page/2507/leaderboard/6262) for detailed results.

<br>

## Demo Video
Check out our demo videos showcasing our multimodal embedding model in action:
- [English Demo Video](https://www.youtube.com/watch?v=kCvz82Y1BQg)

The demo demonstrates how our embedding model works together with an AI agent to search for relevant videos based on user queries and generate responses using the retrieved video content.

<br>

## Code Examples
`GME-VARCO-VISION-Embedding` adopts the inference pipeline of [`Qwen/Qwen2-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

### Image-Text Retrieval

```python
import torch
import torch.nn.functional as F
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, AutoTokenizer
from qwen_vl_utils import process_vision_info

model_name = "NCSOFT/GME-VARCO-VISION-Embedding"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    padding_side="left",
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
processor.tokenizer = tokenizer
device = model.device

# Text query.
qry_msg = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Find a photo of a cat."},
        ],
    },
]

qry_txt = processor.apply_chat_template(
    qry_msg, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token

qry_input = processor(
    text=[qry_txt],
    padding=True,
    return_tensors="pt",
).to(device)

# Image prompt template; the placeholder string is only used to render the
# chat template, and the same prompt text is reused for every candidate image.
img_msg = [
    {
        "role": "user",
        "content": [{"type": "image", "image": "image"}],
    }
]

img_txt = processor.apply_chat_template(
    img_msg, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token

candidate_imgs = [
    # Photo of two cats
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "http://images.cocodataset.org/val2017/000000039769.jpg"}],
    },
    # Photo of two dogs
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "https://farm1.staticflickr.com/116/290755713_a5de6c1079_z.jpg"}],
    },
    # Photo of two people playing baseball
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "http://farm3.staticflickr.com/2418/2193688811_d9f5e23bbd_z.jpg"}],
    },
    # Photo of a large crowd in a busy city street
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "http://farm7.staticflickr.com/6049/6329686751_997c68fff9_z.jpg"}],
    },
]

candidate_images, _ = process_vision_info(candidate_imgs)

image_inputs = processor(
    text=[img_txt] * len(candidate_images),
    images=candidate_images,
    padding=True,
    return_tensors="pt",
).to(device)

with torch.inference_mode():
    # Last-token pooling: take the final hidden state of the last token.
    qry_emb = model(
        **qry_input, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]

    img_emb = model(
        **image_inputs, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]

qry_emb = F.normalize(qry_emb, dim=-1)
img_emb = F.normalize(img_emb, dim=-1)

score = qry_emb @ img_emb.t()
# tensor([[0.2432, 0.0962, 0.0747, 0.0898]], device='cuda:0', dtype=torch.bfloat16)
# corresponding to the scores of the photos (cat, dog, baseball, crowd)
```
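
Note that the embedding is read from the final hidden state of the last token (last-token pooling), which is why the tokenizer is loaded with `padding_side="left"` and an EOS token is appended to every input.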
<br>

### Video Embedding
```python
# Placeholder: path or URL of the video to embed.
video_path = "path/to/video.mp4"

vid_message = [
    {
        "role": "user",
        "content": [{
            "type": "video",
            "video": video_path,
            "max_pixels": 262144,
            "fps": 2.0,
        }],
    }
]

video_text = processor.apply_chat_template(
    vid_message, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token

image_input, video_input = process_vision_info(vid_message)

video_inputs = processor(
    text=[video_text],
    images=image_input,
    videos=video_input,
    padding=True,
    return_tensors="pt",
).to(device)

with torch.inference_mode():
    video_emb = model(
        **video_inputs, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]

video_emb = F.normalize(video_emb, dim=-1)
```
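
The normalized video embedding can then be scored against a query embedding exactly as in the image-text example; for instance, reusing `qry_emb` from above:

```python
# Rank candidate videos by cosine similarity to the text query.
scores = qry_emb @ video_emb.t()
best = scores.argmax(dim=-1)  # index of the most similar video
```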

<br>

---
license: cc-by-nc-4.0
---