---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
---

<p align="center">
  <img src="./assets/raven_logo.png" width="100" style="margin-bottom: 0.2;"/>
</p>

<h3 align="center">
  <a href="https://arxiv.org/pdf/2505.17114" style="color:#825987">
    RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language
  </a>
</h3>
<h5 align="center">
  Project Page: <a href="https://bashlab.github.io/raven_project/" style="color:#825987">https://bashlab.github.io/raven_project/</a>
  •
  Code: <a href="https://github.com/BASHLab/RAVEN" style="color:#825987">https://github.com/BASHLab/RAVEN</a>
</h5>
<p align="center">
  <img src="./assets/raven_architecture.png" width="800" />
</p>

---

## Abstract
Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns a scalar relevance score to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning; each stage targets a distinct challenge in multimodal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio-Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multimodal QA benchmarks, including egocentric and exocentric tasks, show that RAVEN achieves accuracy gains of up to 14.5% and 8.0%, respectively, over state-of-the-art multimodal large language models. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at https://github.com/BASHLab/RAVEN.
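
To make the gating idea concrete, here is a minimal, self-contained PyTorch sketch of query-conditioned token gating. The module name, shapes, and the sigmoid scoring form are illustrative assumptions for intuition only, not the released QuART implementation (see the code repository for that):

```python
import torch
import torch.nn as nn

class QueryGate(nn.Module):
    """Toy query-conditioned gate: one relevance scalar per token (hypothetical)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # scores a (token, query) pair

    def forward(self, tokens: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) from one modality; query: (batch, dim)
        q = query.unsqueeze(1).expand(-1, tokens.size(1), -1)
        relevance = torch.sigmoid(self.score(torch.cat([tokens, q], dim=-1)))
        return tokens * relevance  # amplify relevant tokens, damp distractors

# Gate each modality stream with the same question embedding, then fuse.
video, audio, query = torch.randn(2, 16, 256), torch.randn(2, 8, 256), torch.randn(2, 256)
gate = QueryGate(256)
fused_input = torch.cat([gate(video, query), gate(audio, query)], dim=1)
```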

---
## Main Results
##### Comparison of **RAVEN** and prior MLLMs on *exocentric* open-ended video QA (MSVD-QA, MSRVTT-QA, ActivityNet-QA) and audio-visual QA (AVSD, MUSIC-QA) benchmarks. Best and second-best scores are shown in $\textbf{bold}$ and $\underline{\text{underline}}$. $^*$ indicates scores reproduced by us.
<p><img src="./assets/main_result_exo.png" width="800"></p>

##### Comparison of **RAVEN** with MLLMs on the *egocentric* EgoThink (Reasoning) and AVS-QA benchmarks. **RAVEN** outperforms prior models across metrics and is strongest on reasoning. $\textbf{Bold}$ and $\underline{\text{underline}}$ indicate the best and second-best scores.
<p><img src="./assets/main_result_ego.png" width="800"></p>

---
## **AVS-QA** Dataset
The train and test splits of **AVS-QA** are provided [here](./avs-qa-dataset/).<br>
More details are available in the [dataset README](./avs-qa-dataset/README.md).

## Requirements and Installation
Basic dependencies:
* Python >= 3.8
* PyTorch >= 2.2.0
* CUDA >= 11.8
* transformers == 4.40.0 (for reproducing paper results)
* tokenizers == 0.19.1
|
```bash
cd RAVEN
pip install -r requirements.txt
# FlashAttention builds against the installed torch; hence --no-build-isolation
pip install flash-attn==2.5.8 --no-build-isolation
pip install opencv-python==4.5.5.64
# ffmpeg and the X11 libs are runtime dependencies for video decoding with OpenCV
apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
```
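
A quick way to confirm that the pinned versions above took effect (plain Python, no RAVEN code involved; assumes a CUDA-capable machine):

```python
# Sanity-check the environment against the pinned versions above.
import torch, transformers, tokenizers

print("torch:", torch.__version__)                   # expect >= 2.2.0
print("CUDA available:", torch.cuda.is_available())  # expect True (CUDA >= 11.8)
print("transformers:", transformers.__version__)     # expect 4.40.0 for paper results
print("tokenizers:", tokenizers.__version__)         # expect 0.19.1
```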
---
|
## Model Zoo
| Model Name | Modal Type |
|:----------------|:------------:|
| [RAVEN-7B-AV](https://huggingface.co/BASH-Lab/RAVEN-AV-7B) | AV |
| RAVEN-7B-AVS | AVS |
|
## Sample Usage
- **STEP 1:** Download the vision encoder from [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384).
- **STEP 2:** Download a **RAVEN** checkpoint from the Model Zoo above.
- **STEP 3:** Run inference:
```bash
CUDA_VISIBLE_DEVICES=0 python inference.py --model-path=<MODEL PATH> --modal-type=<MODAL TYPE>
```
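
If you prefer to script the downloads, here is a minimal sketch using `huggingface_hub` (the repo IDs are the ones linked above; the inference command itself is unchanged):

```python
# Fetch the vision encoder and a RAVEN checkpoint; snapshot_download returns
# the local directory, which can then be passed to inference.py as --model-path.
from huggingface_hub import snapshot_download

encoder_path = snapshot_download("google/siglip-so400m-patch14-384")
model_path = snapshot_download("BASH-Lab/RAVEN-AV-7B")
print(encoder_path, model_path)
# Then: CUDA_VISIBLE_DEVICES=0 python inference.py --model-path=<model_path> --modal-type=<MODAL TYPE>
```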
|
## Acknowledgement
The codebase of RAVEN is adapted from [**VideoLLaMA2**](https://github.com/DAMO-NLP-SG/VideoLLaMA2). We are grateful for their contributions.