arxiv:2601.20618

GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection

AI-generated summary

A multimodal sarcasm detection approach uses generative models to create stable semantic anchors and measures cross-modal discrepancies for improved accuracy and robustness.

Abstract

Multimodal sarcasm detection (MSD) aims to identify sarcasm within image-text pairs by modeling semantic incongruities across modalities. Existing methods often exploit cross-modal embedding misalignment to detect inconsistency but struggle when visual and textual content are loosely related or semantically indirect. While recent approaches leverage large language models (LLMs) to generate sarcastic cues, the inherent diversity and subjectivity of these generations often introduce noise. To address these limitations, we propose the Generative Discrepancy Comparison Network (GDCNet). This framework captures cross-modal conflicts by utilizing descriptive, factually grounded image captions generated by Multimodal LLMs (MLLMs) as stable semantic anchors. Specifically, GDCNet computes semantic and sentiment discrepancies between the generated objective description and the original text, alongside measuring visual-textual fidelity. These discrepancy features are then fused with visual and textual representations via a gated module to adaptively balance modality contributions. Extensive experiments on MSD benchmarks demonstrate GDCNet's superior accuracy and robustness, establishing a new state-of-the-art on the MMSD2.0 benchmark.
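
The abstract outlines a concrete pipeline: compute a semantic and a sentiment discrepancy between an MLLM-generated factual caption and the original text, measure visual-textual fidelity, and fuse these discrepancy features with the visual and textual representations through a gated module. The paper page gives no implementation details, so the following PyTorch sketch only illustrates that description and is not the authors' code: the cosine-similarity scores, the sigmoid gate, the feature dimensions, and all module names are assumptions.

```python
# Hypothetical sketch of the pipeline described in the abstract. Encoder
# choices, cosine-similarity scores, and the sigmoid gate are illustrative
# assumptions, not details taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedDiscrepancyFusion(nn.Module):
    """Fuses discrepancy features with visual/textual embeddings via a gate."""

    def __init__(self, dim: int = 512, num_classes: int = 2):
        super().__init__()
        # Gate conditioned on both modalities plus the 3 discrepancy scores.
        self.gate = nn.Sequential(nn.Linear(dim * 2 + 3, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim + 3, num_classes)

    def forward(self, img_feat, txt_feat, cap_feat, txt_sent, cap_sent):
        # img_feat, txt_feat, cap_feat: pooled embeddings, shape (B, dim);
        # cap_feat embeds the MLLM-generated factual caption (the "anchor").
        # txt_sent, cap_sent: scalar sentiment scores, shape (B, 1).

        # Semantic discrepancy: objective caption vs. original (sarcastic) text.
        sem_disc = 1.0 - F.cosine_similarity(cap_feat, txt_feat, dim=-1)
        # Sentiment discrepancy between text and caption sentiment.
        sent_disc = (txt_sent - cap_sent).abs().squeeze(-1)
        # Visual-textual fidelity: how faithfully the caption reflects the image.
        fidelity = F.cosine_similarity(cap_feat, img_feat, dim=-1)
        disc = torch.stack([sem_disc, sent_disc, fidelity], dim=-1)  # (B, 3)

        # Gated fusion: adaptively balance the two modality contributions.
        g = self.gate(torch.cat([img_feat, txt_feat, disc], dim=-1))
        fused = g * img_feat + (1.0 - g) * txt_feat
        return self.classifier(torch.cat([fused, disc], dim=-1))

if __name__ == "__main__":
    B, D = 4, 512
    model = GatedDiscrepancyFusion(dim=D)
    logits = model(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D),
                   torch.rand(B, 1), torch.rand(B, 1))
    print(logits.shape)  # torch.Size([4, 2])
```

In a full system the embeddings would come from pretrained image/text encoders and the sentiment scores from a sentiment head; here they are stand-in tensors used only to show the shapes flowing through the fusion step.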

Community

Paper submitter

Existing multimodal sarcasm detection methods struggle with loosely related image-text pairs and with noisy LLM-generated cues. GDCNet addresses this by using MLLM-generated factual captions as stable semantic anchors: it computes semantic and sentiment discrepancies between each caption and the original text, then fuses these discrepancy features with the visual and textual representations via a gated module. The approach achieves state-of-the-art performance on the MMSD2.0 benchmark, demonstrating superior robustness in capturing cross-modal incongruities.
