arxiv:2508.08237

VGGSounder: Audio-Visual Evaluations for Foundation Models

Published on Oct 18, 2025

Abstract

VGGSounder is introduced as an improved benchmark for audio-visual foundation models, addressing limitations in the original VGGSound dataset through comprehensive re-annotation and multi-label testing.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These limitations distort evaluations of auditory and visual capabilities. To address them, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.
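
The idea of measuring degradation when a second modality is added can be illustrated with a small sketch. The abstract does not give the formula for the modality confusion metric, so the definition below is an assumption made for illustration only: the share of samples that a model classifies correctly from a single modality but misclassifies once both modalities are supplied. The function name `degradation_rate` and its inputs are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def degradation_rate(
    unimodal_correct: Sequence[bool],
    multimodal_correct: Sequence[bool],
) -> float:
    """Fraction of samples that are right with one modality but wrong with both.

    ASSUMPTION: this is one plausible reading of "performance degradation
    when adding another input modality", not the paper's exact metric.
    """
    if len(unimodal_correct) != len(multimodal_correct):
        raise ValueError("both sequences must describe the same samples")
    if len(unimodal_correct) == 0:
        return 0.0

    # Count samples where adding the second modality flipped a hit into a miss.
    flipped = sum(
        1
        for uni, multi in zip(unimodal_correct, multimodal_correct)
        if uni and not multi
    )
    return flipped / len(unimodal_correct)


# Hypothetical per-sample outcomes for four test clips.
audio_only = [True, True, False, True]
audio_visual = [True, False, False, False]
rate = degradation_rate(audio_only, audio_visual)
```

In this toy input, two of the four clips are lost when video is added, so `rate` is 0.5. A model that integrates modalities well would keep this value close to zero.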

Community

ewriji

VGGSounder is a re-annotated benchmark built upon the VGGSound dataset, designed to rigorously evaluate audio-visual foundation models and understand how they utilize modalities. VGGSounder introduces:

🔍 Per-label modality tags (audible / visible / both) for all classes in the sample
🎵 Meta labels for background music, voice-over, and static images
📊 Multiple classes per sample
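
As a rough illustration of how such annotations could be consumed, the sketch below models one sample with per-label modality tags and the three meta labels, then selects the ground-truth labels that apply to a given input view. All names here (`Modality`, `Sample`, `labels_for`, the field names, and the example labels) are hypothetical; the actual file format and API of the VGGSounder release may differ, so check the project repository before relying on any of this.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Set


class Modality(Enum):
    AUDIBLE = "audible"
    VISIBLE = "visible"
    BOTH = "both"


@dataclass
class Sample:
    # Hypothetical record layout, not the released annotation schema.
    video_id: str
    labels: Dict[str, Modality]  # several classes per sample, each tagged
    background_music: bool = False
    voice_over: bool = False
    static_image: bool = False


def labels_for(sample: Sample, view: str) -> Set[str]:
    """Return the classes a model could recognise from the given input view.

    view is "audio", "video", or "audio_visual".
    """
    if view == "audio":
        allowed = {Modality.AUDIBLE, Modality.BOTH}
    elif view == "video":
        allowed = {Modality.VISIBLE, Modality.BOTH}
    elif view == "audio_visual":
        allowed = set(Modality)
    else:
        raise ValueError(f"unknown view: {view}")
    return {name for name, tag in sample.labels.items() if tag in allowed}


# Made-up example clip with three classes and background music.
clip = Sample(
    video_id="demo",
    labels={
        "playing guitar": Modality.BOTH,
        "dog barking": Modality.AUDIBLE,
        "car": Modality.VISIBLE,
    },
    background_music=True,
)
audio_targets = labels_for(clip, "audio")
video_targets = labels_for(clip, "video")
```

With this layout, an audio-only model is scored against "playing guitar" and "dog barking" but is not penalised for missing the silent "car", which is the kind of modality-specific analysis the tags are meant to enable.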

🌐 Project: https://vggsounder.github.io/
📄 Paper: https://arxiv.org/abs/2508.08237
👨‍💻 Code: https://github.com/Bizilizi/VGGSounder
