| --- |
| license: other |
| license_name: license-term-of-unified-audio-schema |
| language: |
| - en |
| - zh |
| tags: |
| - audio |
| - speech |
| - sound |
| - music |
| - audio-understanding |
| - ASR |
| - audio-captioning |
| - TTS |
| - audio-language-model |
| - audio-llm |
| - speech-to-text |
| - text-to-speech |
| - multimodal |
| base_model: |
| - Qwen/Qwen2.5-7B |
| pipeline_tag: audio-text-to-text |
| --- |
| |
| # Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs |
|
|
| **Unified Audio Schema** is a holistic framework that disentangles and restructures audio supervision across **transcription**, **paralinguistics**, and **non-linguistic events**. |
|
|
| 📄 [Paper](https://arxiv.org/abs/2604.12506) | 💻 [GitHub](https://github.com/Tencent/Unified_Audio_Schema) |
|
|
| This repository provides our model checkpoints trained using **Unified Audio Schema**. For the complete codebase, please refer to the corresponding [GitHub repository](https://github.com/Tencent/Unified_Audio_Schema). |
|
|
| ## Model Details |
|
|
| | Attribute | Value | |
| |:----------|:------| |
| | Input Modality | Text and audio | |
| | Output Modality | Text and audio | |
| | Base LLM | [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) | |
| | Audio Encoder | AuT encoder | |
| | Input Audio Representation Frame Rate | 12.5 Hz | |
| | Output Audio Token Codebook Size | 8,192 | |
| | Output Audio Token Frame Rate | 25 Hz | |
|
|
| Notes: |
| - The model supports interleaved text and audio input/output, enabling flexible multimodal interactions. |
| - Speech waveform reconstruction for generated audio tokens relies on the [StableToken](https://huggingface.co/tencent/StableToken) decoder. |
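
The frame rates in the table translate directly into sequence lengths. As a rough sketch (the helper names are illustrative and not part of the repository), the snippet below estimates how many input frames a clip produces at 12.5 Hz and how much speech a run of output audio tokens covers at 25 Hz. It assumes one output token per frame and rounds partial input frames up; the actual padding behaviour of the encoder may differ.

```python
import math

INPUT_FRAME_RATE_HZ = 12.5   # input audio representation frame rate (from the table)
OUTPUT_TOKEN_RATE_HZ = 25.0  # output audio token frame rate (from the table)


def input_frames(duration_s: float) -> int:
    """Approximate number of input frames for a clip (assumes rounding up)."""
    return math.ceil(duration_s * INPUT_FRAME_RATE_HZ)


def output_duration_s(num_audio_tokens: int) -> float:
    """Approximate speech duration of a run of output tokens (assumes one token per frame)."""
    return num_audio_tokens / OUTPUT_TOKEN_RATE_HZ


print(input_frames(10.0))      # 125
print(output_duration_s(52))   # 2.08
```

Under these assumptions, the 52-token audio chunks requested by the system prompt in the Inference section each correspond to roughly 2 seconds of speech.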
|
|
| ## Quick Start |
|
|
| ### Installation |
|
|
| ```bash |
| git clone --recursive https://github.com/Tencent/Unified_Audio_Schema.git |
| cd Unified_Audio_Schema && pip install -r requirements.txt |
| ``` |
|
|
| ### Download Checkpoints |
|
|
| ```bash |
| # Model weights |
| huggingface-cli download tencent/Unified_Audio_Schema --local-dir checkpoints/Unified_Audio_Schema |
| |
| # StableToken decoder (required for speech waveform reconstruction) |
| huggingface-cli download tencent/StableToken --local-dir checkpoints/StableToken |
| ``` |
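
The same downloads can also be scripted from Python. The sketch below mirrors the two CLI commands using `snapshot_download` from `huggingface_hub`; the `download_checkpoints` helper and its `dry_run` flag are illustrative names, not part of the repository.

```python
# Repositories and target directories, mirroring the CLI commands above.
CHECKPOINTS = {
    "tencent/Unified_Audio_Schema": "checkpoints/Unified_Audio_Schema",
    "tencent/StableToken": "checkpoints/StableToken",
}


def download_checkpoints(dry_run: bool = False) -> list[str]:
    """Download each repository into its local directory and return the directories."""
    local_dirs = []
    for repo_id, local_dir in CHECKPOINTS.items():
        if not dry_run:
            # Imported lazily so a dry run works without huggingface_hub installed.
            from huggingface_hub import snapshot_download

            snapshot_download(repo_id=repo_id, local_dir=local_dir)
        local_dirs.append(local_dir)
    return local_dirs


# Dry run: only list the target directories without touching the network.
print(download_checkpoints(dry_run=True))
```

Calling `download_checkpoints()` without arguments performs the actual downloads.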
|
|
| ## Inference |
|
|
| ```python |
| import torch |
| import torchaudio |
| from src.model import UASAudio |
| |
| # Load the model together with the StableToken decoder used for waveform reconstruction. |
| model = UASAudio( |
|     model_path="checkpoints/Unified_Audio_Schema", |
|     audio_decoder_path="checkpoints/StableToken/decoder", |
|     device="cuda" if torch.cuda.is_available() else "cpu", |
| ) |
| |
| # System prompt requesting an interleaved text/audio response (kept verbatim). |
| dialogue_system_prompt = ( |
|     "User will provide you with a speech instruction. Do it step by step. " |
|     "First, think about the instruction and respond in a interleaved manner, " |
|     "with 13 text token followed by 52 audio tokens." |
| ) |
| |
| # One speech-input turn; the trailing assistant entry is left empty. |
| messages = [ |
|     {"role": "system", "content": dialogue_system_prompt}, |
|     { |
|         "role": "user", |
|         "content": [ |
|             {"type": "audio", "audio": "assets/give_me_a_brief_introduction_to_the_great_wall.wav"}, |
|         ], |
|     }, |
|     {"role": "assistant", "content": None}, |
| ] |
| |
| generation_config = { |
|     "max_new_tokens": 4096, |
|     "temperature": 0.7, |
|     "repetition_penalty": 1.05, |
|     "top_p": 0.9, |
|     "do_sample": True, |
| } |
| |
| # The call returns the generated text and the generated audio tokens. |
| _, text, audio_tokens = model(messages, **generation_config) |
| print(text) |
| |
| # Decode the audio tokens to a waveform only if the model produced any. |
| if len(audio_tokens) > 0: |
|     audio_array, sampling_rate = model.tokens_to_audio(audio_tokens) |
|     torchaudio.save("response.wav", audio_array, sampling_rate) |
| ``` |
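
When running several speech-input turns, it can be convenient to wrap the message layout from the example above in a small function. The helper below is a hypothetical sketch, not part of the repository: `build_speech_messages` is an illustrative name, and it assumes the message schema is exactly the one shown above. The system prompt is copied verbatim from the example.

```python
DIALOGUE_SYSTEM_PROMPT = (
    "User will provide you with a speech instruction. Do it step by step. "
    "First, think about the instruction and respond in a interleaved manner, "
    "with 13 text token followed by 52 audio tokens."
)


def build_speech_messages(audio_path: str, system_prompt: str = DIALOGUE_SYSTEM_PROMPT) -> list[dict]:
    """Build the message list for one speech-input turn (hypothetical helper)."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [{"type": "audio", "audio": audio_path}]},
        # Mirrors the trailing empty assistant entry in the example above.
        {"role": "assistant", "content": None},
    ]


messages = build_speech_messages("assets/give_me_a_brief_introduction_to_the_great_wall.wav")
print(len(messages))  # 3
```

The resulting list can be passed to the model in the same way as in the example, for instance `model(messages, **generation_config)`.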
|
|
| ## Supported Scenarios |
|
|
| Our model can be applied to a wide range of audio understanding and generation tasks, including: |
|
|
| - Text-input conversation |
| - Speech-input conversation |
| - Automatic Speech Recognition (ASR) |
| - Audio captioning |
| - Text-to-Speech (TTS) |
|
|
| For more runnable examples, please refer to [`example_usage.ipynb`](https://github.com/Tencent/Unified_Audio_Schema/blob/main/example_usage.ipynb) in the GitHub repository. |
|
|
| ## Evaluation Highlights |
|
|
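| UAS-Audio achieves the highest average score (65.2) across MMSU, MMAR, and MMAU among the compared models, along with competitive ASR and TTS results. In the audio understanding table, bold marks the best result in each column and underline marks the second best. |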
|
|
| ### Audio Understanding |
|
|
| | **Model** | MMSU<br>(Percep.) | MMSU<br>(Reason.) | **MMSU<br>(Overall)** | MMAR<br>(Speech) | MMAR<br>(Sound) | MMAR<br>(Music) | **MMAR<br>(Overall)** | MMAU<br>(Speech) | MMAU<br>(Sound) | MMAU<br>(Music) | **MMAU<br>(Overall)** | **Avg.** | |
| | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | |
| | [Kimi-Audio](https://github.com/MoonshotAI/Kimi-Audio) | <u>44.8</u> | 75.7 | <u>59.8</u> | 58.5 | 49.7 | 33.0 | 48.0 | 62.2 | 75.7 | 66.8 | 68.2 | 58.7 | |
| | [Qwen2.5-Omni](https://github.com/QwenLM/Qwen2.5-Omni) | 42.7 | **77.6** | 58.1 | 59.9 | **58.8** | 40.8 | 56.7 | **70.6** | <u>78.1</u> | 65.9 | <u>71.5</u> | <u>62.1</u> | |
| | [Step-Audio2](https://github.com/stepfun-ai/Step-Audio2) | 42.9 | 73.2 | 57.6 | <u>61.2</u> | 54.6 | <u>42.2</u> | <u>56.8</u> | <u>68.2</u> | **79.3** | <u>68.4</u> | **72.7** | 61.9 | |
| | **Ours** | **55.7** | <u>77.4</u> | **66.2** | **66.0** | **58.8** | **45.2** | **60.1** | 67.0 | 70.0 | **71.3** | 69.4 | **65.2** | |
|
|
| ### ASR & TTS |
|
| Values are word or character error rates (%), so lower is better. |
|
| | Model | ASR<br>(LS-clean) | ASR<br>(AISHELL-1) | TTS<br>(SeedTTS-en) | TTS<br>(SeedTTS-zh) | |
| | :--- | :---: | :---: | :---: | :---: | |
| | [Qwen2.5-Omni](https://github.com/QwenLM/Qwen2.5-Omni) | - | - | 2.3 | 1.4 | |
| | [Step-Audio2](https://github.com/stepfun-ai/Step-Audio2) | 1.9 | 1.0 | 2.1 | 3.2 | |
| | [MiMo-Audio](https://github.com/XiaomiMiMo/MiMo-Audio) | 3.8 | 1.8 | 5.4 | 2.0 | |
| | **Ours** | 2.2 | 2.3 | 1.7 | 1.4 | |
|
|
| ## Citation |
|
|
| If you find Unified Audio Schema or our model useful for your research, please cite: |
|
|
| ```bibtex |
| @misc{zhang2026transcriptionunifiedaudioschema, |
| title={Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs}, |
| author={Linhao Zhang and Yuhan Song and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou}, |
| year={2026}, |
| eprint={2604.12506}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CL}, |
| url={https://arxiv.org/abs/2604.12506}, |
| } |
| |
| @inproceedings{song2026stabletoken, |
| title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient Speech{LLM}s}, |
| author={Yuhan Song and Linhao Zhang and Chuhan Wu and Aiwei Liu and Wei Jia and Houfeng Wang and Xiao Zhou}, |
| booktitle={The Fourteenth International Conference on Learning Representations}, |
| year={2026}, |
| url={https://openreview.net/forum?id=17DNmdQ9aU} |
| } |
| ``` |
|
|
| ## License |
|
|
| This project is licensed under the [License Term of Unified_Audio_Schema](LICENSE). |
|
|