---
language: en
license: apache-2.0
library_name: transformers
tags:
- pytorch
- video
- retrieval
- embedding
- multimodal
- qwen2.5-vl
pipeline_tag: sentence-similarity
datasets:
- Alibaba-NLP/UVRB
- Vividbot/vast-2m-vi
- TempoFunk/webvid-10M
- OpenGVLab/InternVid
metrics:
- recall
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---

# General Video Embedder (GVE)

> **One Embedder for All Video Retrieval Scenarios**
> Queries of text, image, video, or any combination of modalities: GVE embeds them all, zero-shot, without in-domain training.

GVE is the first video embedding model that **generalizes across 9 abilities, spanning 3 diverse retrieval tasks and 6 domains**, from coarse text-to-video retrieval to fine-grained spatial/temporal queries, composed (text+image) queries, and long-context retrieval, all evaluated on our new **Universal Video Retrieval Benchmark (UVRB)**.

Built on **Qwen2.5-VL** and trained with LoRA only, using **13M** collected and synthesized multimodal samples, GVE achieves **SOTA zero-shot performance**, outperforming all competitors.

---

## Why GVE?

| Capability | Existing Works | **GVE** |
|-----------|-------------------|--------|
| **Query Flexibility** | Only text | ✅ Text, ✅ Image, ✅ Video, ✅ Text+Image, ✅ Text+Video |
| **Fine-grained Understanding** | Weak on spatial-temporal details | **S: 0.821**, **T: 0.469** (SOTA) |
| **Training Data** | Uses in-domain test data (e.g., MSRVTT) | **Synthesized data** → true zero-shot |
| **Performance (UVRB average)** | Unite-7B (8.3B params): 0.559 | **GVE-3B (3.8B params): 0.571**, better at less than half the size; **GVE-7B: 0.600** |

---

## Performance on UVRB

- TXT: Textual Video Retrieval
- CMP: Composed Video Retrieval
- VIS: Visual Video Retrieval
- CG: Coarse-grained Video Retrieval
- FG: Fine-grained Video Retrieval
- LC: Long-Context Video Retrieval
- S: Spatial Video Retrieval
- T: Temporal Video Retrieval
- PR: Partially Relevant Video Retrieval

> For each column, the highest score is **bolded** and the second-highest is <u>underlined</u>.

| Model | **AVG** | TXT | CMP | VIS | CG | FG | LC | S | T | PR |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| CLIP4Clip | 0.416 | 0.401 | 0.178 | **0.714** | 0.380 | 0.360 | 0.463 | 0.559 | 0.285 | 0.236 |
| ViCLIP | 0.375 | 0.336 | 0.263 | 0.640 | 0.380 | 0.315 | 0.313 | 0.484 | 0.289 | 0.171 |
| VideoCLIP-XL | 0.510 | 0.550 | 0.227 | 0.632 | <u>0.558</u> | 0.493 | 0.600 | 0.787 | 0.381 | 0.310 |
| LanguageBind | 0.508 | 0.543 | 0.231 | 0.645 | 0.539 | 0.479 | 0.610 | 0.723 | 0.378 | 0.336 |
| InternVideo2-1B | 0.420 | 0.422 | 0.248 | 0.581 | 0.480 | 0.403 | 0.383 | 0.606 | 0.413 | 0.189 |
| InternVideo2-6B | 0.445 | 0.448 | 0.220 | 0.660 | 0.504 | 0.417 | 0.423 | 0.631 | 0.400 | 0.220 |
| GME-2B | 0.416 | 0.539 | **0.345** | 0.597 | 0.461 | 0.471 | 0.685 | 0.716 | 0.349 | 0.347 |
| Unite-2B | 0.507 | 0.536 | 0.242 | 0.654 | 0.455 | 0.471 | 0.681 | 0.725 | 0.347 | 0.341 |
| VLM2Vec-V2 | 0.538 | 0.587 | 0.263 | 0.613 | 0.498 | 0.502 | 0.762 | 0.809 | 0.348 | 0.348 |
| BGE-VL | 0.480 | 0.497 | 0.268 | 0.622 | 0.448 | 0.406 | 0.636 | 0.664 | 0.292 | 0.261 |
| UniME-7B | 0.542 | 0.561 | 0.308 | <u>0.702</u> | 0.500 | 0.518 | 0.664 | 0.785 | 0.396 | 0.373 |
| B3-7B | 0.538 | 0.570 | 0.270 | 0.678 | 0.482 | 0.505 | 0.722 | 0.797 | 0.364 | 0.355 |
| GME-7B | 0.562 | 0.604 | <u>0.341</u> | 0.615 | 0.518 | 0.507 | <u>0.788</u> | 0.749 | 0.373 | 0.398 |
| Unite-7B | 0.559 | 0.609 | 0.254 | 0.666 | 0.541 | 0.539 | 0.746 | 0.779 | 0.412 | **0.425** |
| **GVE-3B** | <u>0.571</u> | <u>0.619</u> | 0.304 | 0.647 | 0.552 | <u>0.541</u> | 0.764 | <u>0.816</u> | <u>0.430</u> | 0.377 |
| **GVE-7B** | **0.600** | **0.657** | 0.312 | 0.657 | **0.587** | **0.570** | **0.814** | **0.821** | **0.469** | <u>0.419</u> |

---

## Get Started

1. Loading model

```python
import torch
from transformers import AutoModel, AutoProcessor

model_path = 'Alibaba-NLP/GVE-3B'
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map='auto', low_cpu_mem_usage=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, use_fast=True)
# Left padding keeps the last real token at the final position, which is where the embedding is read from.
processor.tokenizer.padding_side = 'left'
```

2. Processing inputs

```python
from qwen_vl_utils import process_vision_info

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "./asset/video_example.mp4",
                "max_pixels": 200 * 28 * 28,
                "fps": 1.0,
                "max_frames": 8,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    },
]
texts = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[texts],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    truncation=True,
    max_length=1200,
    return_tensors="pt",
    **video_kwargs,
).to("cuda")
```

3. Embedding

```python
import torch.nn.functional as F

outputs = model(**inputs)
# Take the hidden state of the last token (left padding) and L2-normalize it to obtain the embedding.
embedding = F.normalize(outputs['last_hidden_state'][:, -1, :], p=2, dim=1)
```
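
The resulting embedding can be compared to other embeddings by dot product, which equals cosine similarity since the vectors are L2-normalized. Below is a minimal retrieval sketch for a text-only query; the example query string and the plain-text query prompt are illustrative assumptions, not the official evaluation setup. Image, text+image, or text+video queries follow the same message format as step 2, just with different `content` entries.

```python
# Hypothetical text query (illustrative only), embedded with the same pipeline as above.
query_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [{"type": "text", "text": "a person skateboarding in a park"}]},
]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=True)
query_inputs = processor(
    text=[query_text],
    padding=True,
    truncation=True,
    max_length=1200,
    return_tensors="pt",
).to("cuda")
query_embedding = F.normalize(model(**query_inputs)['last_hidden_state'][:, -1, :], p=2, dim=1)

# Dot product of unit vectors = cosine similarity; higher means more relevant.
score = (query_embedding @ embedding.T).item()
print(f"query-video similarity: {score:.4f}")
```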

## Citation

```bibtex
@misc{guo2025gve,
      title={Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum},
      author={Zhuoning Guo and Mingxin Li and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Xiaowen Chu},
      year={2025},
      eprint={2510.27571},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.27571},
}
```