---
language:
- en
tags:
- multimodal
- vision-language
- audio
- dialogue
- conversation
base_model:
- Qwen/Qwen2-VL-2B-Instruct
license: cc-by-4.0
datasets:
- jihyoung/M3C
---
# Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions

## Overview
This repository contains the Retrieval Module of our model, enabling chatbots to understand and respond to both visual and audio inputs in immersive, real-time interactions.
The Retrieval Module includes:
- **Processor** – handles multimodal input preprocessing (both visual and audio inputs)
- **Adapter** – bridges audio features to the vision-language model
- **Audio Embeddings** – audio representations
## Model Architecture
Our model consists of a Dialogue Module and a Retrieval Module. Both modules extend Qwen2-VL-2B-Instruct with audio comprehension via a CLAP-based linear adapter.
## Description
Our model is built on Qwen2-VL-2B-Instruct and extends it with audio understanding capabilities using CLAP via a lightweight linear layer adapter. This design allows the model to process and respond to textual, visual, and audio inputs within a unified framework.
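As a rough illustration of what such a lightweight linear adapter looks like (this is not the released code, and the embedding dimensions below are hypothetical placeholders), the adapter is a single learned projection from the CLAP audio embedding space into the vision-language model's hidden space:

```python
import numpy as np

rng = np.random.default_rng(0)

CLAP_DIM = 512   # hypothetical CLAP audio embedding size
VLM_DIM = 1536   # hypothetical Qwen2-VL-2B hidden size

# A "linear layer adapter" is just a projection matrix plus a bias;
# in training these would be learned, here they are random placeholders.
W = rng.standard_normal((CLAP_DIM, VLM_DIM)) * 0.02
b = np.zeros(VLM_DIM)

def adapt_audio(clap_embedding: np.ndarray) -> np.ndarray:
    """Project a CLAP audio embedding into the VLM embedding space."""
    return clap_embedding @ W + b

audio_emb = rng.standard_normal(CLAP_DIM)
token_like = adapt_audio(audio_emb)
print(token_like.shape)  # (1536,)
```

Because the adapter is a single linear map, it adds very few parameters relative to the 2B-parameter backbone, which is what makes this bridging approach lightweight.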
The full model comprises two modules:
- **Dialogue Module** – generates contextually appropriate responses grounded in multimodal perception
- **Retrieval Module** (this repository) – retrieves relevant memories to support long-term, dynamic conversations
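The paper's exact retrieval scoring is not reproduced here; as a generic sketch, retrieving relevant memories from a bank of stored embeddings is commonly done via nearest-neighbor search by cosine similarity. All names, dimensions, and data below are hypothetical:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: np.ndarray, memory: list, top_k: int = 2) -> list:
    """Return indices of the top_k memory embeddings most similar to the query."""
    scores = [cosine_sim(query, m) for m in memory]
    return sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)[:top_k]

# Toy memory bank of three 4-dimensional embeddings.
memory = [np.array([1.0, 0.0, 0.0, 0.0]),
          np.array([0.0, 1.0, 0.0, 0.0]),
          np.array([0.9, 0.1, 0.0, 0.0])]
query = np.array([1.0, 0.0, 0.0, 0.0])
print(retrieve(query, memory))  # [0, 2]
```

In a full system the memory bank would hold embeddings of past multimodal conversation turns, and the retrieved entries would be passed to the Dialogue Module as context.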
## Citation
If you find this work useful, please cite:
```bibtex
@inproceedings{jang-etal-2025-enabling,
    title = "Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions",
    author = {Jang, Jihyoung and
      Bae, Minwook and
      Kim, Minji and
      Hakkani-T{\"u}r, Dilek and
      Kim, Hyounghun},
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.1519/",
    doi = "10.18653/v1/2025.acl-long.1519",
    pages = "31481--31512",
    ISBN = "979-8-89176-251-0"
}
```
