arxiv:2606.14958

MVEB: Massive Video Embedding Benchmark

Published on Jun 12 · Submitted by Kenneth C. Enevoldsen on Jun 16

Abstract

A large-scale video embedding benchmark evaluates diverse models across multiple video understanding tasks, revealing that different model architectures excel in specific domains and demonstrating the nuanced impact of audio on performance based on dataset characteristics.

We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering. We evaluate 33 models and find that no single model dominates: MLLM-based embeddings lead on classification, clustering, pair classification, and QA; multimodal binding leads on retrieval and zero-shot classification; generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only vs. audio+video evaluations show that audio's contribution depends on dataset annotation provenance: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a six-point gap consistent across model families. MVEB is derived from MVEB+, a 184-task pool, and is designed to maintain task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.
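The paired video-only vs. audio+video comparison described above can be sketched as a per-provenance average of score differences. This is a minimal illustration, not the paper's analysis code: the record fields, the provenance labels, the helper name `audio_gain_by_provenance`, and all scores below are hypothetical placeholders, and the paper's exact definition of the gap may differ.

```python
from collections import defaultdict
from statistics import mean


def audio_gain_by_provenance(records):
    """Average (audio+video score - video-only score) per provenance group.

    `records` is an iterable of dicts with the hypothetical keys
    "provenance", "video_only", and "audio_video".
    """
    deltas = defaultdict(list)
    for record in records:
        # Positive delta: adding audio helped on this task.
        delta = record["audio_video"] - record["video_only"]
        deltas[record["provenance"]].append(delta)
    return {provenance: mean(values) for provenance, values in deltas.items()}


# Hypothetical placeholder scores, NOT results from the paper.
records = [
    {"task": "task_a", "provenance": "audio+visual", "video_only": 60.0, "audio_video": 63.0},
    {"task": "task_b", "provenance": "audio+visual", "video_only": 50.0, "audio_video": 52.0},
    {"task": "task_c", "provenance": "visual-only", "video_only": 70.0, "audio_video": 67.0},
    {"task": "task_d", "provenance": "visual-only", "video_only": 40.0, "audio_video": 36.0},
]

gains = audio_gain_by_provenance(records)
# Difference between the two group averages, one plausible reading of a "gap".
gap = gains["audio+visual"] - gains["visual-only"]
print(gains, gap)
```

Grouping by how labels were produced, rather than by model, is what lets a consistent effect across model families show up as a single number per group.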

Community

Code for running the benchmark can be found in mteb, while scripts for reproducing paper artifacts will be made available at mveb-paper once the paper has been reviewed and finalized.

This is a really helpful addition to the MTEB ecosystem. It is interesting to see how MLLM-based embeddings are outperforming others in classification and QA, while multimodal binding is still holding the edge for retrieval tasks.

Since the audio contribution depends so heavily on how the labels were originally sourced, do you think we need to start filtering or labeling the datasets by their annotation provenance to avoid performance drops?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/f2a81122-a5ff-40f2-b4c6-cdb2c10c648b
