arxiv:2510.11178

BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models

Published on Oct 13, 2025

Authors:

Abstract

Vision-language models demonstrate significant vulnerabilities in cultural understanding when evaluated through a multimodal, multicultural benchmark that tests robustness across linguistic variations and visual modalities.

AI-generated summary

As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce BLEnD-Vis, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region to Entity, (ii) an inverted text-only variant (Entity to Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation. BLEnD-Vis reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing and, whilst visual cues often aid performance, low cross-modal consistency highlights challenges in robustly integrating textual and visual understanding, particularly for lower-resource regions. BLEnD-Vis thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.11178 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.11178 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.11178 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.