---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
- Qwen/Qwen2-Audio-7B-Instruct
---

<p align="center">
<h1 align="center">
<img src="static/logo.png" alt="Nexus-o" height="40" style="position:relative; top:6px;">
NEXUS-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision</h1>
<p align="center">
<strong>Che Liu</strong>, <strong>Yingji Zhang</strong>, <strong>Dong Zhang</strong>, <strong>Weijie Zhang</strong>, <strong>Chenggong Gong</strong>, <strong>Yu Lu</strong>, <strong>Shilin Zhou</strong>, <strong>Ziliang Gan</strong>,
<br>
<strong>Ziao Wang</strong>, <strong>Haipang Wu</strong>, <strong>Ji Liu</strong>, <strong>Andre Freitas</strong>, <strong>Qifan Wang</strong>, <strong>Zenglin Xu</strong>,
<br>
<strong>Rongjunchen Zhang</strong><sup>♠</sup>, <strong>Yong Dai</strong><sup>♠</sup>
</p>
<div class="is-size-5 publication-authors" align="center">
<span class="author-block">
<sup>♠</sup>Corresponding author, daiyongya@outlook.com, zhangrongjunchen@myhexin.com
</span>
</div>
<br>
📖 <a href="https://arxiv.org/pdf/2503.01879">Paper</a> | 🤗 <a href="https://github.com/HiThink-Research/NEXUS-O">Coming Soon</a>

<div align="center"></div>
<p align="center">
NEXUS-O is an industry-scale omni-modal large language model (LLM) that unifies audio, vision, and language understanding in a single modular framework.
Human perception integrates sight, sound, and language; NEXUS-O aims to give intelligent agents the same ability across real-world scenarios such as ASR, speech-to-speech chat, and multimodal reasoning.
</p>
<img src="static/omni.png">
<p>Architecture of NEXUS-O</p>

<img src="static/train_stage.png">
<p>Training Stages</p>

## 📢 News
- 🚀 [08/01/2025] Our paper has been accepted to ACM MM 2025.

## 💡 Highlights
- 🧩 Modular End-to-End Framework. A highly configurable encoder–LLM–decoder architecture that supports flexible modality combinations and rapid iteration for industry applications.
- 💡 Lightweight Alignment Strategy. Efficient audio–language pre-training built on the state-of-the-art Qwen2.5-VL model, eliminating the need for costly vision pre-training while retaining strong tri-modal performance.
- 🎧 Synthetic Audio Data Pipeline. A scalable audio-synthesis system that generates diverse, high-fidelity audio–text pairs from real-world scenes, enabling robust downstream ASR and S2S tasks.
## TODO

* [x] Release the full NEXUS-O model weights on Hugging Face
* [ ] Release the audio encoder training data
* [ ] Release the audio decoder training data
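With the model weights available on Hugging Face, omni-modal checkpoints in this family are typically queried through a chat-style message list that mixes text, image, and audio entries. The sketch below shows one plausible way to assemble such a payload; the field names and file paths are illustrative assumptions, not an API confirmed by this README — consult the official model page for the exact processor interface.

```python
# Sketch of an omni-modal chat payload (text + image + audio).
# NOTE: the content-entry field names and file paths below are illustrative
# assumptions; check the official NEXUS-O Hugging Face page for the real API.

def build_omni_message(text, image_path=None, audio_path=None):
    """Assemble a single user turn mixing text, image, and audio content."""
    content = []
    if image_path is not None:
        content.append({"type": "image", "image": image_path})
    if audio_path is not None:
        content.append({"type": "audio", "audio": audio_path})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}

# Example: one tri-modal user turn for an "describe the scene" query.
messages = [
    build_omni_message(
        "Describe what you see and hear.",
        image_path="scene.png",
        audio_path="scene.wav",
    )
]
```

A processor's chat template would then flatten such a list into model inputs, with each non-text entry pointing at the media file to encode.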
## ✒️Citation

```bibtex
@article{liu2025nexus,
  title={Nexus: An Omni-Perceptive And-Interactive Model for Language, Audio, And Vision},
  author={Liu, Che and Zhang, Yingji and Zhang, Dong and Zhang, Weijie and Gong, Chenggong and Li, Haohan and Lu, Yu and Zhou, Shilin and Lu, Yue and Gan, Ziliang and others},
  journal={arXiv preprint arXiv:2503.01879},
  year={2025}
}
```

## 📄 License

**Usage and License Notices**: The data and code are intended and licensed for research use only.
License: Attribution-NonCommercial 4.0 International. Use must also abide by the OpenAI terms of use: https://openai.com/policies/terms-of-use

## 💖 Acknowledgement
|