--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- zh |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
|
- Qwen/Qwen2-Audio-7B-Instruct |
|
|
--- |
|
|
|
|
|
<h1 align="center">NEXUS-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision</h1>

<p align="center">
  <strong>Che Liu</strong>, <strong>Yingji Zhang</strong>, <strong>Dong Zhang</strong>, <strong>Weijie Zhang</strong>, <strong>Chenggong Gong</strong>, <strong>Yu Lu</strong>, <strong>Shilin Zhou</strong>, <strong>Ziliang Gan</strong>,
  <br>
  <strong>Ziao Wang</strong>, <strong>Haipang Wu</strong>, <strong>Ji Liu</strong>, <strong>Andre Freitas</strong>, <strong>Qifan Wang</strong>, <strong>Zenglin Xu</strong>,
  <br>
  <strong>Rongjunchen Zhang</strong><sup>♠</sup>, <strong>Yong Dai</strong><sup>♠</sup>
</p>
|
|
<div class="is-size-5 publication-authors" align="center"> |
|
|
<span class="author-block"> |
|
|
<sup>♠</sup>Corresponding author, daiyongya@outlook.com, zhangrongjunchen@myhexin.com |
|
|
</span> |
|
|
</div> |
|
|
<br> |
|
|
📖 <a href="https://arxiv.org/pdf/2503.01879">Paper</a> | 🤗 <a href="https://huggingface.co/HiThink-Research/NEXUS-O">Model</a> | 🤗 <a href="https://huggingface.co/HiThink-Research/NEXUS-O">Training Data (Coming Soon)</a>
|
|
|
|
|
<p>
|
|
NEXUS-O is an industry-scale omni-modal large language model (LLM) that unifies audio, vision, and language understanding into a single modular framework. |
|
|
Human perception integrates sight, sound, and language; NEXUS-O aims to bring the same integrated ability to intelligent agents across real-world scenarios such as ASR, speech-to-speech chat, and multimodal reasoning.
|
|
</p> |
|
|
<img src="static/omni.png"> |
|
|
<p>Architecture of NEXUS-O</p> |
|
|
|
|
|
<img src="static/train_stage.png"> |
|
|
<p>Training Stages</p> |
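## 🛠️ Quick Start

The snippet below is a minimal inference sketch, not the authoritative API: it assumes the HiThink-Research/NEXUS-O checkpoint ships custom modeling code loadable through the `transformers` `Auto*` classes (hence `trust_remote_code=True`) and accepts a Qwen-style multimodal chat template. Please consult the model repository for the exact interface.

```python
# Hypothetical inference sketch for NEXUS-O; the loading interface and the
# message format are assumptions based on Qwen-style multimodal models.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "HiThink-Research/NEXUS-O"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# One tri-modal chat turn: an image, an audio clip, and a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "example.jpg"},
            {"type": "audio", "audio": "example.wav"},
            {"type": "text", "text": "Describe what you see and hear."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```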
|
|
|
|
|
## 📢 News |
|
|
- 🚀 [08/01/2025] Our paper has been accepted to ACM MM 2025.
|
|
|
|
|
## 💡 Highlights |
|
|
- 🧩 Modular End-to-End Framework. A highly configurable encoder–LLM–decoder architecture that supports flexible modality combinations and rapid iteration for industry applications (see the sketch after this list).

- 💡 Lightweight Alignment Strategy. Efficient audio–language pre-training built on the state-of-the-art Qwen2.5-VL model, eliminating the need for costly vision pre-training while retaining strong tri-modal performance.

- 🎧 Synthetic Audio Data Pipeline. A scalable audio synthesis system that generates diverse, high-fidelity audio–text pairs from real-world scenes, enabling robust downstream ASR and S2S tasks.
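To make the modular framework concrete, here is a conceptual sketch of how an encoder–LLM–decoder stack can be composed so that each modality plugs in independently. Every name in it is illustrative, not the released NEXUS-O code.

```python
# Conceptual illustration of the encoder-LLM-decoder modularity described
# above; all names here are hypothetical, not the released implementation.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class OmniPipeline:
    """Routes each input modality through its own encoder, feeds the
    embeddings to the LLM, and optionally decodes the reply to speech."""
    encoders: Dict[str, Callable]      # e.g. {"vision": ..., "audio": ...}
    llm: Callable                      # embeddings -> response text
    audio_decoder: Optional[Callable] = None  # text -> waveform, for S2S

    def run(self, inputs: Dict[str, object]) -> object:
        # Each modality is encoded by its own module, so encoders can be
        # added, swapped, or frozen without touching the rest of the stack.
        embeddings: List[object] = [
            self.encoders[modality](data) for modality, data in inputs.items()
        ]
        text = self.llm(embeddings)
        # The audio decoder is optional: omit it for text-only tasks such
        # as ASR or VQA, attach it for speech-to-speech chat.
        return self.audio_decoder(text) if self.audio_decoder else text
```

Because every encoder sits behind the same embedding interface, the vision branch inherited from Qwen2.5-VL can stay frozen while only the audio path is aligned, which is what makes the lightweight alignment strategy above possible.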
|
|
|
|
|
## TODO |
|
|
* [x] Release NEXUS-O full model weights on HuggingFace

* [ ] Release Audio Encoder Training Data

* [ ] Release Audio Decoder Training Data
|
|
|
|
|
## ✒️Citation |
|
|
```bibtex
|
|
@article{liu2025nexus, |
|
|
title={Nexus: An Omni-Perceptive And-Interactive Model for Language, Audio, And Vision}, |
|
|
author={Liu, Che and Zhang, Yingji and Zhang, Dong and Zhang, Weijie and Gong, Chenggong and Li, Haohan and Lu, Yu and Zhou, Shilin and Lu, Yue and Gan, Ziliang and others}, |
|
|
journal={arXiv preprint arXiv:2503.01879}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
## 📄 License |
|
|
**Usage and License Notices**: The data and code are intended and licensed for research use only.

They are released under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license and must also abide by OpenAI's terms of use: https://openai.com/policies/terms-of-use
|
|
|
|
|
## 💖 Acknowledgement |
|
|
|