DiaDem: Advancing Dialogue Descriptions in Audiovisual Video Captioning for Multimodal Large Language Models

📖 Overview

Accurate dialogue description is a critical yet underexplored aspect of audiovisual video captioning, with profound implications for downstream multimodal understanding and generation tasks. Despite rapid progress in multimodal large language models (MLLMs), existing approaches often struggle to faithfully capture who says what in complex audiovisual scenes. To address this limitation, we propose DiaDem, an audiovisual video captioning model that generates captions with more precise dialogue descriptions while maintaining strong overall captioning performance on general audiovisual content.

To enable systematic evaluation of dialogue description capabilities, we further introduce DiaDemBench, a comprehensive benchmark designed to evaluate models across diverse dialogue scenarios, emphasizing both speaker attribution accuracy and utterance transcription fidelity in audiovisual captions. Extensive experiments on DiaDemBench reveal that even commercial models still exhibit substantial room for improvement in dialogue-aware captioning. Notably, DiaDem not only outperforms the Gemini series in dialogue description accuracy but also achieves competitive performance on general audiovisual captioning benchmarks, demonstrating its overall effectiveness.
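To make the two evaluation axes named above concrete, the following is a minimal illustrative sketch of how speaker attribution accuracy and utterance transcription fidelity could be scored for one clip. It is not the DiaDemBench protocol: the `Turn` type, the `score_dialogue` helper, the use of word error rate as the fidelity measure, and the assumption that predicted turns are already aligned one-to-one with reference turns are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One dialogue turn: who spoke and what they said (hypothetical type)."""
    speaker: str
    text: str


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def word_error_rate(ref_text: str, hyp_text: str) -> float:
    """Word error rate of hyp_text against ref_text (case-insensitive)."""
    ref = ref_text.lower().split()
    hyp = hyp_text.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return word_edit_distance(ref, hyp) / len(ref)


def score_dialogue(reference: list[Turn], predicted: list[Turn]) -> tuple[float, float]:
    """Return (speaker accuracy, mean WER) over turns paired by position.

    Assumes turns are pre-aligned by index; unmatched trailing turns are ignored.
    """
    pairs = list(zip(reference, predicted))
    if not pairs:
        raise ValueError("need at least one aligned turn")
    speaker_acc = sum(r.speaker == p.speaker for r, p in pairs) / len(pairs)
    mean_wer = sum(word_error_rate(r.text, p.text) for r, p in pairs) / len(pairs)
    return speaker_acc, mean_wer


if __name__ == "__main__":
    reference = [Turn("woman in red", "where are you going"),
                 Turn("man", "to the station")]
    predicted = [Turn("woman in red", "where are you going"),
                 Turn("woman in red", "to the train station")]
    print(score_dialogue(reference, predicted))
```

In this toy example the second turn is attributed to the wrong speaker and contains one inserted word, so the two scores move independently: a caption can transcribe an utterance almost perfectly while still assigning it to the wrong person.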

🚀 Getting Started

Please refer to our GitHub repository for more details.

🖊️ Citation

If you find DiaDem or DiaDemBench helpful for your research, please consider giving this repo a star ⭐ and citing our paper. We appreciate your support!

@article{chen2026diadem,
  title={DiaDem: Advancing Dialogue Descriptions in Audiovisual Video Captioning for Multimodal Large Language Models},
  author={Chen, Xinlong and Lin, Weihong and Hua, Jingyun and Yao, Linli and Ding, Yue and Li, Bozhou and Zeng, Bohan and Shi, Yang and Liu, Qiang and Zhang, Yuanxing and others},
  journal={arXiv preprint arXiv:2601.19267},
  year={2026}
}

Safetensors

Model size: 9B params
Tensor type: BF16
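
As a rough worked example, the parameter count and tensor type listed here give a back-of-the-envelope size for the weights alone. BF16 stores each parameter in 2 bytes; the 9B figure is the rounded count shown on this page, and the helper name `weight_footprint_gb` is hypothetical. The estimate excludes activations, the KV cache, and any audio or video preprocessing buffers, so actual memory use at inference time will be higher.

```python
BYTES_PER_BF16 = 2  # bfloat16 is a 16-bit format


def weight_footprint_gb(num_params: int, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Approximate size of the model weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # 9B is the rounded parameter count from the model page, so this is an estimate.
    print(f"{weight_footprint_gb(9_000_000_000):.1f} GB")
```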

Model tree for DiaDem-Captioner/DiaDem

Base model: Qwen/Qwen2.5-Omni-7B
Finetuned: AVoCaDO-Captioner/AVoCaDO
Finetuned: this model

Paper for DiaDem-Captioner/DiaDem

DiaDem: Advancing Dialogue Descriptions in Audiovisual Video Captioning for Multimodal Large Language Models

arXiv:2601.19267, published Jan 27