EurekaTian

metadata

license: apache-2.0
library_name: transformers
tags:
  - multimodal
  - video-understanding
  - audio-understanding
  - streaming
  - real-time
  - omni-modal
pipeline_tag: video-text-to-text

ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

Figure: ROMA processes streaming inputs as aligned multimodal units, using a 'Speak Head' to decide when to respond.

Model Summary

ROMA is a Real-time Omni-Multimodal Assistant designed for unified streaming audio-video understanding. Unlike traditional video LLMs, which respond only after receiving a query, ROMA integrates both Reactive (Question Answering) and Proactive (Event-Driven Alert, Real-Time Narration) capabilities within a single framework.

ROMA introduces a "Speak Head" mechanism to decouple response timing from content generation, allowing it to autonomously decide when to speak based on the continuous audio-visual stream.
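The idea of separating "when to speak" from "what to say" can be illustrated with a toy sketch. This is not ROMA's implementation: the `SpeakHead` class, its linear-plus-sigmoid gate, the two-element feature vectors, the weights, and the `run_stream` helper are all hypothetical names and values chosen for illustration. The sketch only shows the control flow, in which a lightweight gate scores every incoming unit and the (potentially expensive) generator runs only when the gate fires.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass
class SpeakHead:
    """Hypothetical gate: maps a per-unit feature vector to P(speak)."""
    weights: Sequence[float]
    bias: float = 0.0
    threshold: float = 0.5

    def probability(self, features: Sequence[float]) -> float:
        # Linear score followed by a sigmoid (an assumed form, for illustration).
        score = sum(w * x for w, x in zip(self.weights, features)) + self.bias
        return 1.0 / (1.0 + math.exp(-score))

    def should_speak(self, features: Sequence[float]) -> bool:
        return self.probability(features) >= self.threshold


def run_stream(
    units: Iterable[Sequence[float]],
    head: SpeakHead,
    generate: Callable[[int, Sequence[float]], str],
) -> List[Tuple[int, str]]:
    """Consume units in order; call the generator only when the gate fires."""
    responses = []
    for t, features in enumerate(units):
        if head.should_speak(features):
            responses.append((t, generate(t, features)))
    return responses


# Toy stream: each unit is [event_score, query_pending] (made-up features).
stream = [[0.1, 0.0], [0.2, 0.0], [0.9, 0.0], [0.1, 1.0]]
head = SpeakHead(weights=[6.0, 6.0], bias=-4.0)
responses = run_stream(stream, head, lambda t, f: f"response at unit {t}")
print(responses)
```

With these made-up weights, the gate stays silent for the first two units, fires at unit 2 because of a salient event (a proactive response), and fires at unit 3 because a query is pending (a reactive response). The generator is never invoked for the silent units, which is the point of decoupling timing from content.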

Paper: ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding
Project Page: Link
Repository: GitHub (Coming Soon)