--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- zh |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
|
- Qwen/Qwen2-Audio-7B-Instruct |
|
|
--- |
|
|
|
|
|
<h1 align="center">NEXUS-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision</h1>

<p align="center">
  <strong>Che Liu</strong>, <strong>Yingji Zhang</strong>, <strong>Dong Zhang</strong>, <strong>Weijie Zhang</strong>, <strong>Chenggong Gong</strong>, <strong>Yu Lu</strong>, <strong>Shilin Zhou</strong>, <strong>Ziliang Gan</strong>,
  <br>
  <strong>Ziao Wang</strong>, <strong>Haipang Wu</strong>, <strong>Ji Liu</strong>, <strong>Andre Freitas</strong>, <strong>Qifan Wang</strong>, <strong>Zenglin Xu</strong>,
  <br>
  <strong>Rongjunchen Zhang</strong><sup>♠</sup>, <strong>Yong Dai</strong><sup>♠</sup>
</p>
|
|
<div class="is-size-5 publication-authors" align="center"> |
|
|
<span class="author-block"> |
|
|
<sup>♠</sup>Corresponding author, daiyongya@outlook.com, zhangrongjunchen@myhexin.com |
|
|
</span> |
|
|
</div> |
|
|
<br> |
|
|
📖 <a href="https://arxiv.org/pdf/2503.01879">Paper</a> | 🤗 <a href="https://huggingface.co/HiThink-Research/NEXUS-O">Model</a> | 🤗 <a href="https://huggingface.co/HiThink-Research/NEXUS-O">Training Data (Coming Soon)</a>
|
|
|
|
|
<p>
|
|
NEXUS-O is an industry-scale omni-modal large language model (LLM) that unifies audio, vision, and language understanding into a single modular framework. |
|
|
Human perception integrates sight, sound, and language; NEXUS-O aims to bring the same integrated ability to intelligent agents across real-world scenarios such as ASR, speech-to-speech chat, and multimodal reasoning.
|
|
</p> |
|
|
<img src="static/omni.png"> |
|
|
<p>Architecture of NEXUS-O</p> |
|
|
|
|
|
<img src="static/train_stage.png"> |
|
|
<p>Training Stages</p> |
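## 🛠️ Quick Start

The snippet below is a minimal inference sketch, not the authoritative API: it assumes the HiThink-Research/NEXUS-O checkpoint ships custom modeling code loadable through the `transformers` `Auto*` classes (hence `trust_remote_code=True`) and accepts a Qwen-style multimodal chat template. Please consult the model repository for the exact interface.

```python
# Hypothetical inference sketch for NEXUS-O; the loading interface and the
# message format are assumptions based on Qwen-style multimodal models.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "HiThink-Research/NEXUS-O"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# One tri-modal chat turn: an image, an audio clip, and a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "example.jpg"},
            {"type": "audio", "audio": "example.wav"},
            {"type": "text", "text": "Describe what you see and hear."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```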
|
|
|
|
|
## 📢 News |
|
|
- 🚀 [08/01/2025] Our paper has been accepted to ACM MM 2025.
|
|
|
|
|
## 💡 Highlights |
|
|
- 🧩 Modular End-to-End Framework. A highly configurable encoder–LLM–decoder architecture that supports flexible modality combinations and rapid iteration for industry applications (see the sketch after this list).

- 💡 Lightweight Alignment Strategy. Efficient audio–language pre-training built on the state-of-the-art Qwen2.5-VL model, eliminating the need for costly vision pre-training while retaining strong tri-modal performance.

- 🎧 Synthetic Audio Data Pipeline. A scalable audio synthesis system that generates diverse, high-fidelity audio–text pairs from real-world scenes, enabling robust downstream ASR and S2S tasks.
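To make the modular framework concrete, here is a conceptual sketch of how an encoder–LLM–decoder stack can be composed so that each modality plugs in independently. Every name in it is illustrative, not the released NEXUS-O code.

```python
# Conceptual illustration of the encoder-LLM-decoder modularity described
# above; all names here are hypothetical, not the released implementation.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class OmniPipeline:
    """Routes each input modality through its own encoder, feeds the
    embeddings to the LLM, and optionally decodes the reply to speech."""
    encoders: Dict[str, Callable]      # e.g. {"vision": ..., "audio": ...}
    llm: Callable                      # embeddings -> response text
    audio_decoder: Optional[Callable] = None  # text -> waveform, for S2S

    def run(self, inputs: Dict[str, object]) -> object:
        # Each modality is encoded by its own module, so encoders can be
        # added, swapped, or frozen without touching the rest of the stack.
        embeddings: List[object] = [
            self.encoders[modality](data) for modality, data in inputs.items()
        ]
        text = self.llm(embeddings)
        # The audio decoder is optional: omit it for text-only tasks such
        # as ASR or VQA, attach it for speech-to-speech chat.
        return self.audio_decoder(text) if self.audio_decoder else text
```

Because every encoder sits behind the same embedding interface, the vision branch inherited from Qwen2.5-VL can stay frozen while only the audio path is aligned, which is what makes the lightweight alignment strategy above possible.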
|
|
|
|
|
## TODO |
|
|
* [x] Release NEXUS-O full model weights on HuggingFace

* [ ] Release Audio Encoder Training Data

* [ ] Release Audio Decoder Training Data
|
|
|
|
|
## ✒️Citation |
|
|
```bibtex
|
|
@article{liu2025nexus, |
|
|
title={Nexus: An Omni-Perceptive And-Interactive Model for Language, Audio, And Vision}, |
|
|
author={Liu, Che and Zhang, Yingji and Zhang, Dong and Zhang, Weijie and Gong, Chenggong and Li, Haohan and Lu, Yu and Zhou, Shilin and Lu, Yue and Gan, Ziliang and others}, |
|
|
journal={arXiv preprint arXiv:2503.01879}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
## 📄 License |
|
|
**Usage and License Notices**: The data and code are intended and licensed for research use only.

They are released under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license and must also abide by OpenAI's terms of use: https://openai.com/policies/terms-of-use
|
|
|
|
|
## 💖 Acknowledgement |
|
|
|