---
language: en
license: apache-2.0
library_name: transformers
tags:
- pytorch
- video
- retrieval
- embedding
- multimodal
- qwen2.5-vl
pipeline_tag: sentence-similarity
datasets:
- Alibaba-NLP/UVRB
- Vividbot/vast-2m-vi
- TempoFunk/webvid-10M
- OpenGVLab/InternVid
metrics:
- recall
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---

# 🎯 General Video Embedder (GVE)

> **One Embedder for All Video Retrieval Scenarios**  
> Whether the query is text, an image, a video, or any combination of these modalities, GVE embeds it into a single representation zero-shot, without in-domain training.

GVE is the first video embedding model that **generalizes across 9 abilities, including 3 diverse retrieval tasks and 6 domains**: from coarse text-to-video retrieval to fine-grained spatial/temporal queries, composed (text+image) queries, and long-context retrieval, all evaluated on our new **Universal Video Retrieval Benchmark (UVRB)**.

Built on **Qwen2.5-VL** and trained with LoRA only, on **13M** collected and synthesized multimodal samples, GVE achieves **SOTA zero-shot performance**, outperforming competing video embedders.

---

## 🌟 Why GVE?

| Capability | Existing Works | **GVE** |
|-----------|-------------------|--------|
| **Query Flexibility** | Text only | ✅ Text, ✅ Image, ✅ Video, ✅ Text+Image, ✅ Text+Video (see the sketch below) |
| **Fine-grained Understanding** | Weak on spatial-temporal details | **S: 0.821**, **T: 0.469** (SOTA) |
| **Training Data** | Uses in-domain test data (e.g., MSRVTT) | **Synthesized data**: true zero-shot |
| **Performance** | Unite-7B (8.3B): 0.559 | **GVE-3B (3.8B): 0.571** → **better at less than half the size**; **GVE-7B (8.3B): 0.600** |
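
Below is a minimal sketch of how a composed (text+image) query can be expressed. It follows the same Qwen2.5-VL chat-message format used in the Get Started section; the image path and query text are illustrative placeholders only:

```python
# Hedged sketch: a composed (text + image) query payload in the Qwen2.5-VL
# chat-message format used in the Get Started section below.
# The image path and query text are illustrative placeholders.
composed_query_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./asset/query_image.jpg"},
            {"type": "text", "text": "Find the video where this scene happens at night."},
        ],
    },
]
```

A text+video query uses the same structure, swapping the `image` entry for a `video` entry with the keys shown in step 2 below.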

---

## 📊 Performance on UVRB

- TXT: Textual Video Retrieval
- CMP: Composed Video Retrieval
- VIS: Visual Video Retrieval
- CG: Coarse-grained Video Retrieval
- FG: Fine-grained Video Retrieval
- LC: Long-Context Video Retrieval
- S: Spatial Video Retrieval
- T: Temporal Video Retrieval
- PR: Partially Relevant Video Retrieval

> For each column: highest score is **bolded**, second-highest is <u>underlined</u>.

| Model | **AVG** | TXT | CMP | VIS | CG | FG | LC | S | T | PR |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| CLIP4Clip | 0.416 | 0.401 | 0.178 | **0.714** | 0.380 | 0.360 | 0.463 | 0.559 | 0.285 | 0.236 |
| ViCLIP | 0.375 | 0.336 | 0.263 | 0.640 | 0.380 | 0.315 | 0.313 | 0.484 | 0.289 | 0.171 |
| VideoCLIP-XL | 0.510 | 0.550 | 0.227 | 0.632 | <u>0.558</u> | 0.493 | 0.600 | 0.787 | 0.381 | 0.310 |
| LanguageBind | 0.508 | 0.543 | 0.231 | 0.645 | 0.539 | 0.479 | 0.610 | 0.723 | 0.378 | 0.336 |
| InternVideo2-1B | 0.420 | 0.422 | 0.248 | 0.581 | 0.480 | 0.403 | 0.383 | 0.606 | 0.413 | 0.189 |
| InternVideo2-6B | 0.445 | 0.448 | 0.220 | 0.660 | 0.504 | 0.417 | 0.423 | 0.631 | 0.400 | 0.220 |
| GME-2B | 0.416 | 0.539 | **0.345** | 0.597 | 0.461 | 0.471 | 0.685 | 0.716 | 0.349 | 0.347 |
| Unite-2B | 0.507 | 0.536 | 0.242 | 0.654 | 0.455 | 0.471 | 0.681 | 0.725 | 0.347 | 0.341 |
| VLM2Vec-V2 | 0.538 | 0.587 | 0.263 | 0.613 | 0.498 | 0.502 | 0.762 | 0.809 | 0.348 | 0.348 |
| BGE-VL | 0.480 | 0.497 | 0.268 | 0.622 | 0.448 | 0.406 | 0.636 | 0.664 | 0.292 | 0.261 |
| UniME-7B | 0.542 | 0.561 | 0.308 | <u>0.702</u> | 0.500 | 0.518 | 0.664 | 0.785 | 0.396 | 0.373 |
| B3-7B | 0.538 | 0.570 | 0.270 | 0.678 | 0.482 | 0.505 | 0.722 | 0.797 | 0.364 | 0.355 |
| GME-7B | 0.562 | 0.604 | <u>0.341</u> | 0.615 | 0.518 | 0.507 | <u>0.788</u> | 0.749 | 0.373 | 0.398 |
| Unite-7B | 0.559 | 0.609 | 0.254 | 0.666 | 0.541 | 0.539 | 0.746 | 0.779 | 0.412 | **0.425** |
| **GVE-3B** | <u>0.571</u> | <u>0.619</u> | 0.304 | 0.647 | 0.552 | <u>0.541</u> | 0.764 | <u>0.816</u> | <u>0.430</u> | 0.377 |
| **GVE-7B** | **0.600** | **0.657** | 0.312 | 0.657 | **0.587** | **0.570** | **0.814** | **0.821** | **0.469** | <u>0.419</u> |

---

## 🚀 Get Started

1. Loading the model

```python
import torch
from transformers import AutoModel, AutoProcessor

model_path = 'Alibaba-NLP/GVE-3B'
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map='auto', low_cpu_mem_usage=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = 'left'  # left padding keeps the last token position consistent for pooling
```

2. Processing inputs

```python
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "./asset/video_example.mp4",
                "max_pixels": 200 * 28 * 28,
                "fps": 1.0,
                "max_frames": 8,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
texts = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[texts],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    truncation=True,
    max_length=1200,
    return_tensors="pt",
    **video_kwargs,
).to("cuda")
```

3. Embedding

```python
import torch.nn.functional as F

outputs = model(**inputs)
embedding = F.normalize(outputs['last_hidden_state'][:, -1, :], p=2, dim=1)  # last-token pooling + L2 normalization
```
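
For retrieval, a query is embedded with the same pipeline and compared to candidate video embeddings by cosine similarity; since the embeddings are L2-normalized, this reduces to a dot product. The following is a minimal sketch under the same setup as the snippets above; the query text and variable names are illustrative:

```python
# Hedged sketch: embed a text-only query and score it against the video
# embedding computed above. The query text is an illustrative placeholder.
query_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [{"type": "text", "text": "a person skiing down a snowy slope"}]},
]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=True)
query_inputs = processor(
    text=[query_text],
    padding=True,
    truncation=True,
    max_length=1200,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    query_outputs = model(**query_inputs)
query_embedding = F.normalize(query_outputs['last_hidden_state'][:, -1, :], p=2, dim=1)

# Cosine similarity between L2-normalized embeddings is a plain dot product.
similarity = (query_embedding @ embedding.T).item()
print(f"query-video similarity: {similarity:.4f}")
```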

## 📚 Citation

```bibtex
@misc{guo2025gve,
  title={Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum}, 
  author={Zhuoning Guo and Mingxin Li and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Xiaowen Chu},
  year={2025},
  eprint={2510.27571},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.27571}, 
}
```