---
license: apache-2.0
library_name: transformers
tags:
- multimodal
- video-understanding
- audio-understanding
- streaming
- real-time
- omni-modal
pipeline_tag: video-text-to-text
---

# ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

<div align="center">
<img src="architecture.png" width="800"/>
<p>Figure: ROMA processes streaming inputs as aligned multimodal units, using a 'Speak Head' to decide when to respond.</p>
</div>

## Model Summary
**ROMA** is a Real-time Omni-Multimodal Assistant designed for unified streaming audio-video understanding. Unlike traditional video LLMs, which answer only after receiving a query, ROMA integrates both **Reactive** (Question Answering) and **Proactive** (Event-Driven Alert, Real-Time Narration) capabilities within a single framework.

ROMA introduces a "Speak Head" mechanism to decouple response timing from content generation, allowing it to autonomously decide *when* to speak based on the continuous audio-visual stream (see the sketch below).

- **Paper:** [ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding](https://arxiv.org/abs/2601.10323)
- **Project Page:** [Link](https://eureka-maggie.github.io/ROMA_show/)
- **Repository:** [GitHub (Coming Soon)](https://github.com/Eureka-Maggie/ROMA)
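
The "Speak Head" is described above only at a high level, and the code release is still pending, so the snippet below is just a minimal sketch of how such a gate could sit on top of a streaming audio-visual encoder: time-aligned audio and video units update a shared context, and a small binary head decides at each step whether to trigger generation. All identifiers here (`SpeakHead`, `encode_unit`, `generate_response`, the 0.5 threshold) are illustrative assumptions, not ROMA's released API.

```python
# Minimal sketch of Speak-Head-style gating; illustrative only, not ROMA's API.
import torch
import torch.nn as nn


class SpeakHead(nn.Module):
    """Scores whether the assistant should respond at the current timestep.

    This decouples *when* to speak (this head) from *what* to say
    (left to the ordinary language-model head).
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (1, hidden_size) summary of the stream so far.
        return torch.sigmoid(self.classifier(hidden_state))


def streaming_loop(model, speak_head, stream, threshold=0.5):
    """Consume time-aligned (video, audio) units; respond only when gated.

    `stream` yields video frames plus the audio chunk covering the same
    time window, keeping the two modalities aligned. `model.encode_unit`
    and `model.generate_response` are hypothetical placeholders.
    """
    for video_frames, audio_chunk in stream:
        # Fuse the new multimodal unit into the running context.
        hidden = model.encode_unit(video_frames, audio_chunk)
        # Proactive behavior: the head fires on its own when the stream
        # warrants a narration step or an event-driven alert.
        if speak_head(hidden).item() >= threshold:
            yield model.generate_response(hidden)
```

In this sketch, reactive question answering would reuse the same loop: the user query is folded into the running context, and the gate fires once enough audio-visual evidence has streamed in.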