---
license: cc-by-nc-4.0
base_model:
- Qwen/Qwen2-VL-7B-Instruct
library_name: transformers
tags:
- multimodal
- video embedding
- ncsoft
- ncai
- varco
pipeline_tag: feature-extraction
language:
- en
---

## About GME-VARCO-VISION-Embedding

<div align="center">
    <img src="./varco-vision-Embedding.png" width="100%" style="background-color:white; padding:10px;"/>
</div>

`GME-VARCO-VISION-Embedding` is a multimodal embedding model that maps text, images, and videos into a shared high-dimensional embedding space and computes semantic similarity between them. The model focuses in particular on video retrieval, which requires deeper contextual understanding and is more challenging than image retrieval. It achieves high retrieval accuracy and strong generalization across a variety of scenarios, such as scene-based search, description-based search, and question-answering-based search.

## Demo Video
Check out our demo videos showcasing our multimodal embedding model in action:
- [English Demo Video](https://www.youtube.com/watch?v=kCvz82Y1BQg)
- [Korean Demo Video](https://youtube.com/shorts/jC2J7rbAfxs)

The demos show how our embedding model works together with an AI agent to search for relevant videos based on user queries and generate responses from the retrieved video content.

### Model Architecture and Training Method
`GME-VARCO-VISION-Embedding` is based on [`Qwen/Qwen2-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), and uses the parameters of [`Alibaba-NLP/gme-Qwen2-VL-7B-Instruct`](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-7B-Instruct) to improve the model's general retrieval ability.

#### 1. Fine-tuning (Contrastive Learning) on video preference dataset
To fine-tune the model efficiently, we use [ShareGPTVideo's 17k video preference dataset](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction), which includes prompts, videos, gold answers, and chosen-rejected text pairs. We treat the prompts and videos as queries, and the rejected responses as hard negatives for the gold answers. Each query is trained with in-batch negatives as well as one hard negative using the InfoNCE loss. The model is fully fine-tuned for two epochs on 8 A100 GPUs with a batch size of 8, requiring only a few hours of training.
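Below is a minimal sketch of this contrastive objective, assuming L2-normalized embeddings, one hard negative per query, and an illustrative temperature value (the exact training code and hyperparameters are not part of this release):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """InfoNCE with in-batch negatives plus one explicit hard negative per query.

    query_emb, pos_emb, hard_neg_emb: (batch, dim) L2-normalized embeddings.
    """
    # Similarity of each query to every positive in the batch: (batch, batch).
    # The diagonal holds the true (query, gold answer) pairs; off-diagonal
    # entries act as in-batch negatives.
    sim_pos = query_emb @ pos_emb.t()

    # Similarity of each query to its own rejected response: (batch, 1).
    sim_hard = (query_emb * hard_neg_emb).sum(dim=-1, keepdim=True)

    logits = torch.cat([sim_pos, sim_hard], dim=1) / temperature
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)
```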

#### 2. Adding Retrieval Vector
To compensate for the limited amount of training data and to enhance the generalization ability of the fine-tuned model, we compute a retrieval vector τ by subtracting the weights of the original [`Qwen/Qwen2-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) model from those of [`Alibaba-NLP/gme-Qwen2-VL-7B-Instruct`](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-7B-Instruct), a Qwen2-VL-based image-text retrieval model. This approach is inspired by Chat Vector, a method that equips pre-trained language models with chat capabilities in new languages by adding the weight difference between a base model and its chat-optimized counterpart.
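A rough sketch of this weight arithmetic is shown below. The checkpoint path of the fine-tuned model and the output directory are hypothetical, and adding τ with a coefficient of 1.0 is an assumption rather than the released recipe; loading three full 7B checkpoints also requires substantial CPU memory.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16
)
gme = Qwen2VLForConditionalGeneration.from_pretrained(
    "Alibaba-NLP/gme-Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16
)
finetuned = Qwen2VLForConditionalGeneration.from_pretrained(
    "path/to/video-finetuned-qwen2-vl",  # hypothetical path to the contrastively fine-tuned model
    torch_dtype=torch.bfloat16,
)

base_sd = base.state_dict()
gme_sd = gme.state_dict()
merged_sd = finetuned.state_dict()

with torch.no_grad():
    for name, param in merged_sd.items():
        if name in base_sd and torch.is_floating_point(param):
            tau = gme_sd[name] - base_sd[name]  # retrieval vector for this tensor
            merged_sd[name] = param + tau

finetuned.load_state_dict(merged_sd)
finetuned.save_pretrained("gme-varco-vision-embedding-merged")  # hypothetical output directory
```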


### Performance
Our model achieves **state-of-the-art (SOTA) zero-shot performance** on the MultiVENT2.0 dataset as of July 2025. See the [official leaderboard](https://eval.ai/web/challenges/challenge-page/2507/leaderboard/6262) for detailed results.


<br>

## Code Examples
`GME-VARCO-VISION-Embedding` adopts the inference pipeline of [`Qwen/Qwen2-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

### Image-Text Retrieval

```python
import torch
import torch.nn.functional as F
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "NCSOFT/GME-VARCO-VISION-Embedding"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, 
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer
device = model.device


qry_msg = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Find a photo of a cat."},
        ],
    },
]

qry_txt = processor.apply_chat_template(
    qry_msg, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token

qry_input = processor(
    text=[qry_txt],
    padding=True,
    return_tensors="pt",
).to(device)


# Template message for image candidates; the "image" value here is only a placeholder
# used to build the shared prompt text. The actual images are passed to the processor below.
img_msg = [
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "image"
        }]
    }
]

img_txt = processor.apply_chat_template(
    img_msg, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token


candidate_imgs = [
    # Photo of two cats
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "http://images.cocodataset.org/val2017/000000039769.jpg"}]
    },
    # Photo of two dogs
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "https://farm1.staticflickr.com/116/290755713_a5de6c1079_z.jpg"}]
    },
    # Photo of two people playing baseball
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "http://farm3.staticflickr.com/2418/2193688811_d9f5e23bbd_z.jpg"}]
    },
    # Photo of a large crowd in a busy city street
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "http://farm7.staticflickr.com/6049/6329686751_997c68fff9_z.jpg"}]
    },
]

candidate_images, _ = process_vision_info(candidate_imgs)

image_inputs = processor(
    text=[img_txt] * len(candidate_images),
    images=candidate_images,
    padding=True,
    return_tensors="pt",
).to(device)

with torch.inference_mode():
    qry_emb = model(
        **qry_input, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]

    img_emb = model(
        **image_inputs, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]

qry_emb = F.normalize(qry_emb, dim=-1)    
img_emb = F.normalize(img_emb, dim=-1)

score = qry_emb @ img_emb.t()
# tensor([[0.3066, 0.1108, 0.1226, 0.1245]], device='cuda:0', dtype=torch.bfloat16)
# corresponding to the score of photos (cat, dog, baseball, crowd) 
```
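To turn the similarity scores into a retrieval result, the candidates can simply be ranked (a small follow-up to the example above):

```python
ranking = score.argsort(dim=-1, descending=True)  # candidate indices, best match first
best_idx = ranking[0, 0].item()                   # 0 -> the cat photo for this query
```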
<br>

### Video Embedding
```python
# Path to a local video file (or URL); replace with your own video.
video_path = "path/to/your/video.mp4"

vid_message = [
    {
        "role": "user",
        "content": [{
            "type": "video",
            "video": video_path,
            "max_pixels": 262144,
            "fps": 2.0,
        }]
    }
]

video_text = processor.apply_chat_template(
    vid_message, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token

# For a video-only message, image_input is None; the processor accepts this.
image_input, video_input = process_vision_info(vid_message)

video_input = processor(
    text=[video_text],
    images=image_input,
    videos=video_input,
    padding=True,
    return_tensors="pt",
).to(device)

with torch.inference_mode():
    video_emb = model(
        **video_input, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]

video_emb = F.normalize(video_emb, dim=-1)

```
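Scoring text queries against videos works exactly like the image example: encode the query with the same chat template plus EOS token, then take the dot product of the normalized embeddings. A minimal sketch, where `query_embs` is a hypothetical batch of text query embeddings produced the same way as `qry_emb` above and each candidate video has been embedded as shown:

```python
# Stack the embeddings of several candidate videos (each computed as above).
video_embs = torch.cat([video_emb], dim=0)   # (num_videos, dim), already L2-normalized

# query_embs: (num_queries, dim) normalized text query embeddings.
scores = query_embs @ video_embs.t()         # cosine similarities, (num_queries, num_videos)
best_video = scores.argmax(dim=-1)           # index of the best-matching video per query
```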


<br>
