AVoCaDO / README.md

nielsr HF Staff

Improve model card: Add pipeline tag, library name, and complete citation

3bdfde1 verified 5 months ago

2 kB

base_model:
  - Qwen/Qwen2.5-Omni-7B
license: apache-2.0
tags:
  - audiovisual
  - video
  - captioner
pipeline_tag: video-text-to-text
library_name: transformers

AVoCaDO: An AudioVisual Video Captioner Driven by Temporal Orchestration

✨ Overview

Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. We introduce AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance under visual-only settings.

🚀 Getting Started

Please refer to our Github repository for more details.

✒️ Citation

If you find our work helpful for your research, please consider giving a star ⭐ and citing our paper. We appreciate your support!

@article{wu2025avocado,
  author    = {Zhiyong Wu and Zichen Ding and Zhenyu Wu and Yian Wang and Peng Li and Chengyou Jia and Zicheng Zhang and Paul Pu Liang and Hu Xu and Hyunwoo J. Kim and Lemeng Wu and Chenchen Zhu and Paul Pu Liang and Mohit Bansal and Liheng Chen},
  title     = {AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration},
  journal   = {arXiv preprint arXiv:2510.10395},
  year      = {2025},
}