Video-Text-to-Text
Transformers
Safetensors
qwen2_5_omni
multimodal
video-understanding
audio-understanding
streaming
real-time
omni-modal
Instructions to use EurekaTian/ROMA with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use EurekaTian/ROMA with Transformers:
# Load model directly
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("EurekaTian/ROMA")
model = AutoModel.from_pretrained("EurekaTian/ROMA")
- Notebooks
- Google Colab
- Kaggle
Update README.md
README.md
CHANGED
@@ -24,6 +24,6 @@ pipeline_tag: video-text-to-text
 
 ROMA introduces a "Speak Head" mechanism to decouple response timing from content generation, allowing it to autonomously decide *when* to speak based on the continuous audio-visual stream.
 
-- **Paper:** [ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding](https://arxiv.org/abs/
+- **Paper:** [ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding](https://arxiv.org/abs/2601.10323)
 - **Project Page:** [Link](https://eureka-maggie.github.io/ROMA_show/)
 - **Repository:** [[Github (Coming Soon)](https://github.com/Eureka-Maggie/ROMA)]
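
The README's description of the "Speak Head" (deciding *when* to speak separately from *what* to say) can be illustrated with a toy loop. This is only a sketch of the general idea, not ROMA's implementation: `StreamChunk`, `run_stream`, `speak_score`, `generate`, and the `threshold` parameter are all hypothetical names introduced here, and the scoring and generation functions are trivial stand-ins rather than model calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class StreamChunk:
    """One time step of a hypothetical audio-visual stream."""
    timestamp: float
    features: List[float]


def run_stream(
    chunks: Iterable[StreamChunk],
    speak_score: Callable[[List[StreamChunk]], float],
    generate: Callable[[List[StreamChunk]], str],
    threshold: float = 0.5,
) -> List[Tuple[float, str]]:
    """Gate response timing separately from response content.

    Assumption: timing is a scalar score compared against a threshold.
    The real model may decide differently.
    """
    history: List[StreamChunk] = []
    responses: List[Tuple[float, str]] = []
    for chunk in chunks:
        history.append(chunk)
        # Timing decision: evaluated on every incoming chunk.
        if speak_score(history) >= threshold:
            # Content decision: only invoked once the timing gate opens.
            responses.append((chunk.timestamp, generate(history)))
    return responses


# Toy stand-ins (hypothetical): speak when the latest chunk is "salient".
def toy_speak_score(history: List[StreamChunk]) -> float:
    return 1.0 if sum(history[-1].features) > 1.0 else 0.0


def toy_generate(history: List[StreamChunk]) -> str:
    return f"response after {len(history)} chunks"


stream = [
    StreamChunk(0.0, [0.1, 0.2]),
    StreamChunk(1.0, [0.9, 0.8]),
    StreamChunk(2.0, [0.0, 0.3]),
]
events = run_stream(stream, toy_speak_score, toy_generate)
```

In this toy run only the second chunk crosses the threshold, so a single response is emitted at that timestamp. Keeping the gate and the generator as separate callables mirrors the decoupling the README describes, under the assumptions stated above.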