---
license: apache-2.0
library_name: transformers
tags:
- multimodal
- video-understanding
- audio-understanding
- streaming
- real-time
- omni-modal
pipeline_tag: video-text-to-text
---

# ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

<div align="center">
<img src="architecture.png" width="800"/>
<p>Figure: ROMA processes streaming inputs as aligned multimodal units, using a 'Speak Head' to decide when to respond.</p>
</div>

## Model Summary
**ROMA** is a Real-time Omni-Multimodal Assistant designed for unified streaming audio-video understanding. Unlike traditional video LLMs, which answer only after receiving a query, ROMA integrates both **Reactive** (Question Answering) and **Proactive** (Event-Driven Alert, Real-Time Narration) capabilities within a single framework.

ROMA introduces a "Speak Head" mechanism to decouple response timing from content generation, allowing it to autonomously decide *when* to speak based on the continuous audio-visual stream (see the sketch below).

- **Paper:** [ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding](https://arxiv.org/abs/2601.10323)
- **Project Page:** [Link](https://eureka-maggie.github.io/ROMA_show/)
- **Repository:** [GitHub (Coming Soon)](https://github.com/Eureka-Maggie/ROMA)
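
The "Speak Head" is described above only at a high level, and the code release is still pending, so the snippet below is just a minimal sketch of how such a gate could sit on top of a streaming audio-visual encoder: time-aligned audio and video units update a shared context, and a small binary head decides at each step whether to trigger generation. All identifiers here (`SpeakHead`, `encode_unit`, `generate_response`, the 0.5 threshold) are illustrative assumptions, not ROMA's released API.

```python
# Minimal sketch of Speak-Head-style gating; illustrative only, not ROMA's API.
import torch
import torch.nn as nn


class SpeakHead(nn.Module):
    """Scores whether the assistant should respond at the current timestep.

    This decouples *when* to speak (this head) from *what* to say
    (left to the ordinary language-model head).
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (1, hidden_size) summary of the stream so far.
        return torch.sigmoid(self.classifier(hidden_state))


def streaming_loop(model, speak_head, stream, threshold=0.5):
    """Consume time-aligned (video, audio) units; respond only when gated.

    `stream` yields video frames plus the audio chunk covering the same
    time window, keeping the two modalities aligned. `model.encode_unit`
    and `model.generate_response` are hypothetical placeholders.
    """
    for video_frames, audio_chunk in stream:
        # Fuse the new multimodal unit into the running context.
        hidden = model.encode_unit(video_frames, audio_chunk)
        # Proactive behavior: the head fires on its own when the stream
        # warrants a narration step or an event-driven alert.
        if speak_head(hidden).item() >= threshold:
            yield model.generate_response(hidden)
```

In this sketch, reactive question answering would reuse the same loop: the user query is folded into the running context, and the gate fires once enough audio-visual evidence has streamed in.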