	---
	license: other
	license_name: license-term-of-unified-audio-schema
	language:
	- en
	- zh
	tags:
	- audio
	- speech
	- sound
	- music
	- audio-understanding
	- ASR
	- audio-captioning
	- TTS
	- audio-language-model
	- audio-llm
	- speech-to-text
	- text-to-speech
	- multimodal
	base_model:
	- Qwen/Qwen2.5-7B
	pipeline_tag: audio-text-to-text
	---

	# Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

	Unified Audio Schema (UAS) is a holistic framework that disentangles and restructures audio supervision across transcription, paralinguistics, and non-linguistic events.

	📄 [Paper](https://arxiv.org/abs/2604.12506) | 💻 [GitHub](https://github.com/Tencent/Unified_Audio_Schema)

	This repository provides our model checkpoints trained using Unified Audio Schema. For the complete codebase, please refer to the corresponding [GitHub repository](https://github.com/Tencent/Unified_Audio_Schema).

	## Model Details

	| Attribute | Value |
	|:----------|:------|
	| Input Modality | Text and audio |
	| Output Modality | Text and audio |
	| Base LLM | [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) |
	| Audio Encoder | AuT encoder |
	| Input Audio Representation Frame Rate | 12.5 Hz |
	| Output Audio Token Codebook Size | 8,192 |
	| Output Audio Token Frame Rate | 25 Hz |

	Notes:
	- The model supports interleaved text and audio input/output, enabling flexible multimodal interactions.
	- Speech waveform reconstruction for generated audio tokens relies on the [StableToken](https://huggingface.co/tencent/StableToken) decoder.
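The frame rates in the table translate directly into sequence lengths. The following is a minimal sketch of that arithmetic, assuming the rates apply uniformly over a clip and that a partial frame is rounded up (an assumption, not something stated in this model card). The function names are illustrative and are not part of the repository.

```python
import math

INPUT_FRAME_RATE_HZ = 12.5   # input audio representation frame rate (from the table)
OUTPUT_TOKEN_RATE_HZ = 25.0  # output audio token frame rate (from the table)


def input_frames(duration_s: float) -> int:
    """Approximate number of input frames for a clip.

    Rounding up for a partial frame is an assumption of this sketch.
    """
    return math.ceil(duration_s * INPUT_FRAME_RATE_HZ)


def output_audio_seconds(num_audio_tokens: int) -> float:
    """Duration of audio represented by a number of generated audio tokens."""
    return num_audio_tokens / OUTPUT_TOKEN_RATE_HZ


print(input_frames(10.0))        # 125
print(output_audio_seconds(52))  # 2.08
```

Under these assumptions, a 10-second input clip yields about 125 input frames, and the 52-audio-token chunk requested by the system prompt in the Inference section corresponds to roughly 2.08 seconds of speech.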

	## Quick Start

	### Installation

	```bash
	git clone --recursive https://github.com/Tencent/Unified_Audio_Schema.git
	cd Unified_Audio_Schema && pip install -r requirements.txt
	```

	### Download Checkpoints

	```bash
	# Model weights
	huggingface-cli download tencent/Unified_Audio_Schema --local-dir checkpoints/Unified_Audio_Schema

	# StableToken decoder (required for speech waveform reconstruction)
	huggingface-cli download tencent/StableToken --local-dir checkpoints/StableToken
	```
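The same downloads can also be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function with the repository IDs and target directories from the commands above. `download_checkpoints` and `missing_paths` are hypothetical helper names, and the `decoder` subdirectory is taken from the `audio_decoder_path` used in the Inference example.

```python
from pathlib import Path

# Repository IDs and local directories copied from the CLI commands above.
CHECKPOINTS = {
    "tencent/Unified_Audio_Schema": Path("checkpoints/Unified_Audio_Schema"),
    "tencent/StableToken": Path("checkpoints/StableToken"),
}

# Paths the Inference example expects to exist after downloading.
EXPECTED_PATHS = [
    Path("checkpoints/Unified_Audio_Schema"),
    Path("checkpoints/StableToken/decoder"),
]


def download_checkpoints(checkpoints=CHECKPOINTS):
    """Download each repository into its local directory and return the paths."""
    # Imported lazily so the helpers below work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_paths = {}
    for repo_id, local_dir in checkpoints.items():
        local_paths[repo_id] = snapshot_download(
            repo_id=repo_id, local_dir=str(local_dir)
        )
    return local_paths


def missing_paths(paths=EXPECTED_PATHS):
    """Return the expected paths that are not present on disk."""
    return [p for p in paths if not p.exists()]
```

Calling `download_checkpoints()` followed by `missing_paths()` gives a quick pre-flight check before loading the model: an empty list means both directories used in the Inference example are in place.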

	## Inference

	```python
	import torch
	import torchaudio
	from src.model import UASAudio

	# Load the model together with the StableToken decoder used for waveform reconstruction.
	model = UASAudio(
	    model_path="checkpoints/Unified_Audio_Schema",
	    audio_decoder_path="checkpoints/StableToken/decoder",
	    device="cuda" if torch.cuda.is_available() else "cpu",
	)

	# System prompt for speech-input conversation with interleaved text/audio output.
	dialogue_system_prompt = (
	    "User will provide you with a speech instruction. Do it step by step. "
	    "First, think about the instruction and respond in a interleaved manner, "
	    "with 13 text token followed by 52 audio tokens."
	)

	# Single-turn dialogue: the user turn carries an audio file, the assistant turn is left empty.
	messages = [
	    {"role": "system", "content": dialogue_system_prompt},
	    {
	        "role": "user",
	        "content": [
	            {"type": "audio", "audio": "assets/give_me_a_brief_introduction_to_the_great_wall.wav"},
	        ],
	    },
	    {"role": "assistant", "content": None},
	]

	# Sampling parameters for generation.
	generation_config = {
	    "max_new_tokens": 4096,
	    "temperature": 0.7,
	    "repetition_penalty": 1.05,
	    "top_p": 0.9,
	    "do_sample": True,
	}

	# The first return value is unused here; the others are the generated text and audio tokens.
	_, text, audio_tokens = model(messages, **generation_config)
	print(text)

	# Decode the generated audio tokens to a waveform and save it, if any were produced.
	if len(audio_tokens) > 0:
	    audio_array, sampling_rate = model.tokens_to_audio(audio_tokens)
	    torchaudio.save("response.wav", audio_array, sampling_rate)
	```
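When running the example on several audio files, the message list can be built by a small helper. This is only a sketch that mirrors the structure shown above for a single spoken instruction. `build_speech_messages` is a hypothetical name and is not part of the repository, and message formats for other scenarios (such as text input, ASR, or TTS) are not covered here.

```python
from typing import Any, Dict, List


def build_speech_messages(audio_path: str, system_prompt: str) -> List[Dict[str, Any]]:
    """Build a single-turn message list for one spoken instruction.

    Hypothetical helper: it only reproduces the message structure
    from the inference example above.
    """
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [{"type": "audio", "audio": audio_path}]},
        # Empty assistant turn, kept exactly as in the example above.
        {"role": "assistant", "content": None},
    ]
```

For example, `build_speech_messages("assets/give_me_a_brief_introduction_to_the_great_wall.wav", dialogue_system_prompt)` reproduces the `messages` list used above, which can then be passed to `model(messages, **generation_config)`.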

	## Supported Scenarios

	Our model can be applied to a wide range of audio understanding and generation tasks, including:

	- Text-input conversation
	- Speech-input conversation
	- Automatic Speech Recognition (ASR)
	- Audio captioning
	- Text-to-Speech (TTS)

	For more runnable examples, please refer to [`example_usage.ipynb`](https://github.com/Tencent/Unified_Audio_Schema/blob/main/example_usage.ipynb) in the GitHub repository.

	## Evaluation Highlights

	UAS-Audio demonstrates strong performance on audio understanding, ASR, and TTS benchmarks. In the audio understanding table, underlined values mark second-best results.

	### Audio Understanding

	| Model | MMSU<br>(Percep.) | MMSU<br>(Reason.) | MMSU<br>(Overall) | MMAR<br>(Speech) | MMAR<br>(Sound) | MMAR<br>(Music) | MMAR<br>(Overall) | MMAU<br>(Speech) | MMAU<br>(Sound) | MMAU<br>(Music) | MMAU<br>(Overall) | Avg. |
	| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
	| [Kimi-Audio](https://github.com/MoonshotAI/Kimi-Audio) | <u>44.8</u> | 75.7 | <u>59.8</u> | 58.5 | 49.7 | 33.0 | 48.0 | 62.2 | 75.7 | 66.8 | 68.2 | 58.7 |
	| [Qwen2.5-Omni](https://github.com/QwenLM/Qwen2.5-Omni) | 42.7 | 77.6 | 58.1 | 59.9 | 58.8 | 40.8 | 56.7 | 70.6 | <u>78.1</u> | 65.9 | <u>71.5</u> | <u>62.1</u> |
	| [Step-Audio2](https://github.com/stepfun-ai/Step-Audio2) | 42.9 | 73.2 | 57.6 | <u>61.2</u> | 54.6 | <u>42.2</u> | <u>56.8</u> | <u>68.2</u> | 79.3 | <u>68.4</u> | 72.7 | 61.9 |
	| Ours | 55.7 | <u>77.4</u> | 66.2 | 66.0 | 58.8 | 45.2 | 60.1 | 67.0 | 70.0 | 71.3 | 69.4 | 65.2 |

	### ASR & TTS

	| Model | ASR<br>(LS-clean) | ASR<br>(AISHELL-1) | TTS<br>(SeedTTS-en) | TTS<br>(SeedTTS-zh) |
	| :--- | :---: | :---: | :---: | :---: |
	| [Qwen2.5-Omni](https://github.com/QwenLM/Qwen2.5-Omni) | - | - | 2.3 | 1.4 |
	| [Step-Audio2](https://github.com/stepfun-ai/Step-Audio2) | 1.9 | 1.0 | 2.1 | 3.2 |
	| [MiMo-Audio](https://github.com/XiaomiMiMo/MiMo-Audio) | 3.8 | 1.8 | 5.4 | 2.0 |
	| Ours | 2.2 | 2.3 | 1.7 | 1.4 |

	## Citation

	If you find Unified Audio Schema or our model useful for your research, please cite:

	```bibtex
	@misc{zhang2026transcriptionunifiedaudioschema,
	title={Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs},
	author={Linhao Zhang and Yuhan Song and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou},
	year={2026},
	eprint={2604.12506},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2604.12506},
	}

	@inproceedings{song2026stabletoken,
	title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient Speech{LLM}s},
	author={Yuhan Song and Linhao Zhang and Chuhan Wu and Aiwei Liu and Wei Jia and Houfeng Wang and Zhou Xiao},
	booktitle={The Fourteenth International Conference on Learning Representations},
	year={2026},
	url={https://openreview.net/forum?id=17DNmdQ9aU}
	}
	```

	## License

	This project is licensed under the [License Term of Unified_Audio_Schema](LICENSE).