Update README.md
README.md
CHANGED
|
@@ -1,3 +1,48 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 1 |
+
<p align="center">
|
| 2 |
+
<img src="./assets/raven_logo.png" width="100" style="margin-bottom: 0.2;"/>
|
| 3 |
+
</p>
|
| 4 |
+
|
| 5 |
+
<h3 align="center">
|
| 6 |
+
<a href="https://arxiv.org/pdf/2505.17114" style="color:#825987">
|
| 7 |
+
RAVEN: Query-Guided Representation Alignment for Question
|
| 8 |
+
Answering over Audio, Video, Embedded Sensors, and Natural Language
|
| 9 |
+
</a>
|
| 10 |
+
</h3>
|
| 11 |
+
<h5 align="center">
|
| 12 |
+
Project Page:
|
| 13 |
+
<a href="https://bashlab.github.io/raven_project/" style="color:#825987">
|
| 14 |
+
https://bashlab.github.io/raven_project/
|
| 15 |
+
</a>
|
| 16 |
+
</h5>
|
| 17 |
+
<p align="center">
|
| 18 |
+
<img src="./assets/raven_architecture.png" width="800" />
|
| 19 |
+
</p>
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## 🛠️ Requirements and Installation
|
| 24 |
+
Basic Dependencies:
|
| 25 |
+
* Python >= 3.8
|
| 26 |
+
* PyTorch >= 2.2.0
|
| 27 |
+
* CUDA Version >= 11.8
|
| 28 |
+
* transformers == 4.40.0 (for reproducing paper results)
|
| 29 |
+
* tokenizers == 0.19.1
|
| 30 |
+
|
| 31 |
+
```bash
|
| 32 |
+
cd RAVEN
|
| 33 |
+
pip install -r requirements.txt
|
| 34 |
+
pip install flash-attn==2.5.8 --no-build-isolation
|
| 35 |
+
pip install opencv-python==4.5.5.64
|
| 36 |
+
apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
|
| 37 |
+
```
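
After installing, it can help to confirm that the environment matches the versions listed above. The following is a minimal sketch, not part of the RAVEN repository: `check_environment`, `satisfies`, and `version_tuple` are hypothetical helper names, and the distribution names `torch`, `transformers`, and `tokenizers` are assumed to be the ones pip installs. It does not check the CUDA version.

```python
# Hypothetical helper (not shipped with RAVEN): compare installed package
# versions against the minimums and pins listed in the requirements above.
import sys
from importlib import metadata

# (distribution name, operator, version), copied from the dependency list.
REQUIREMENTS = [
    ("torch", ">=", "2.2.0"),
    ("transformers", "==", "4.40.0"),
    ("tokenizers", "==", "0.19.1"),
]


def version_tuple(version):
    """Turn '2.2.0+cu118' into (2, 2, 0); local and pre-release suffixes are ignored."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(installed, op, required):
    """Check 'installed op required' for the two operators used above (>= and ==)."""
    have, want = version_tuple(installed), version_tuple(required)
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))  # pad so '2.2' compares equal to '2.2.0'
    want += (0,) * (width - len(want))
    return have >= want if op == ">=" else have == want


def check_environment():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python >= 3.8 is required")
    for name, op, required in REQUIREMENTS:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need {op} {required})")
            continue
        if not satisfies(installed, op, required):
            problems.append(f"{name} {installed} found, need {op} {required}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["All listed Python dependencies look fine."]:
        print(line)
```

Run it inside the environment where the `pip install` commands above were executed. Any reported mismatch in `transformers` or `tokenizers` matters mainly when reproducing the paper results.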
|
| 38 |
+
---
|
| 39 |
+
|
| 40 |
+
## 🤖 Inference
|
| 41 |
+
- **STEP 1:** Download `siglip-so400m-patch14-384` from [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
|
| 42 |
+
- **STEP 2:** Download the **RAVEN** checkpoint
|
| 43 |
+
```bash
|
| 44 |
+
CUDA_VISIBLE_DEVICES=0 python inference.py --model-path=<MODEL PATH> --modal-type=<MODAL TYPE>
|
| 45 |
+
```
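
To launch the same command from a Python script (for example, to loop over several checkpoints), a thin wrapper along these lines may be enough. This is a sketch under assumptions: `build_inference_command` and `run_inference` are hypothetical names, not part of the repository, and only the two flags shown above are used. The README does not list the accepted `--modal-type` values, so the caller must supply one.

```python
# Hypothetical wrapper around the inference command shown above.
import os
import subprocess
import sys


def build_inference_command(model_path, modal_type, gpu="0"):
    """Return (argv, env) mirroring the shell command: same script, same two flags."""
    argv = [
        sys.executable,
        "inference.py",
        f"--model-path={model_path}",
        f"--modal-type={modal_type}",
    ]
    # Copy the current environment and restrict the visible GPUs.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    return argv, env


def run_inference(model_path, modal_type, gpu="0"):
    """Run inference.py from the current directory; raises CalledProcessError on failure."""
    argv, env = build_inference_command(model_path, modal_type, gpu)
    return subprocess.run(argv, env=env, check=True)
```

Call `run_inference(model_path, modal_type)` from the repository root, passing the checkpoint path from STEP 2 and a modal type that `inference.py` accepts.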
|
| 46 |
+
|
| 47 |
+
## 👍 Acknowledgement
|
| 48 |
+
The RAVEN codebase is adapted from [**VideoLLaMA2**](https://github.com/DAMO-NLP-SG/VideoLLaMA2). We are grateful to its authors for their contribution.
|