---
license: apache-2.0
library_name: transformers
tags:
- multimodal
- video-understanding
- audio-understanding
- streaming
- real-time
- omni-modal
pipeline_tag: video-text-to-text
---

# ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

<div align="center">
  <img src="architecture.png" width="800"/>
  <p>Figure: ROMA processes streaming inputs as aligned multimodal units, using a 'Speak Head' to decide when to respond.</p>
</div>

## Model Summary

**ROMA** is a Real-time Omni-Multimodal Assistant designed for unified streaming audio-video understanding. Unlike traditional video LLMs, which answer only after receiving a query, ROMA integrates both **Reactive** (Question Answering) and **Proactive** (Event-Driven Alert, Real-Time Narration) capabilities within a single framework.

ROMA introduces a "Speak Head" mechanism to decouple response timing from content generation, allowing it to autonomously decide *when* to speak based on the continuous audio-visual stream.
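The decoupling idea can be illustrated with a minimal, self-contained sketch. This is **not** ROMA's actual code: the `speak_head` function, the scalar `salience` field, and the 0.5 threshold are all simplifying assumptions standing in for a learned head over the model's fused audio-visual hidden states. The point is only the control flow: a lightweight head runs on every streamed unit, and the expensive content generator is invoked only when the head fires.

```python
import math

def speak_head(hidden_state: float, threshold: float = 0.5) -> bool:
    """Toy 'Speak Head': in the real model this is a learned head over
    transformer hidden states; here a sigmoid over a scalar stands in."""
    prob = 1.0 / (1.0 + math.exp(-hidden_state))
    return prob >= threshold

def stream_loop(units, generate):
    """Consume aligned multimodal units one at a time; only call the
    (expensive) content generator when the Speak Head decides to talk."""
    responses = []
    for t, unit in enumerate(units):
        if speak_head(unit["salience"]):
            responses.append((t, generate(unit)))
    return responses

# Simulated stream: positive 'salience' loosely means "something worth
# narrating happened in this audio-video unit" (hypothetical values).
units = [{"salience": s} for s in (-2.0, -1.5, 3.0, -0.5, 2.5)]
out = stream_loop(units, lambda u: f"alert(salience={u['salience']})")
print(out)  # fires only at timesteps 2 and 4
```

In this sketch, *when* to speak (the gate) and *what* to say (the generator callback) are separate components, mirroring the separation the Speak Head provides.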

- **Paper:** [ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding](https://arxiv.org/abs/2601.10323)
- **Project Page:** [Link](https://eureka-maggie.github.io/ROMA_show/)
- **Repository:** [GitHub (Coming Soon)](https://github.com/Eureka-Maggie/ROMA)