A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning
Abstract
Multi-modal typography attacks demonstrate significantly higher success rates than unimodal attacks by exploiting cross-modal vulnerabilities in audio-visual multi-modal large language models.
As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that a coordinated multi-modal attack poses a significantly more potent threat than single-modality attacks (attack success rate = 83.43% vs 34.93%). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.
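To put the two reported rates side by side, the short sketch below computes the absolute and relative gap between them. The helper `attack_success_rate` is hypothetical and assumes the common definition of ASR (the share of attacked examples on which the attack succeeds); the abstract does not state the paper's exact success criterion.

```python
def attack_success_rate(outcomes):
    """Percentage of attacked examples on which the attack succeeded.

    Assumed definition: `outcomes` holds one boolean per attacked example.
    The paper's own success criterion is not given in the abstract.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


# Rates reported in the abstract (percent).
multi_modal_asr = 83.43
single_modality_asr = 34.93

# Gap in percentage points, and how many times higher the multi-modal rate is.
absolute_gain = multi_modal_asr - single_modality_asr
relative_gain = multi_modal_asr / single_modality_asr

print(f"absolute gain: {absolute_gain:.2f} percentage points")
print(f"relative gain: {relative_gain:.2f}x")
```

On the reported numbers, the coordinated attack is about 48.5 percentage points higher, or roughly 2.4 times the single-modality rate.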
Community
This paper shows that typographic attacks in audio-visual MLLMs can transfer across modalities, and aligned multimodal attacks are much stronger than single-modality ones. The jump from 34.93% to 83.43% ASR is especially striking.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Adversarial Prompt Injection Attack on Multimodal Large Language Models (2026)
- MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos (2026)
- DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models (2026)
- SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models (2026)
- Evaluation of Audio Language Models for Fairness, Safety, and Security (2026)
- OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models (2026)
- Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models (2026)
Get this paper in your agent:
hf papers read 2604.03995
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash