---
language:
- en
- zh
license: other
license_name: license-term-of-universal-audio-tokenizer
tags:
- audio
- audio-tokenizer
- speech-tokenizer
- speech
- sound
- music
pipeline_tag: audio-to-audio
---
# Universal Audio Tokenizer: Empowering Semantic Speech Tokenizers with General Audio Perception
**Universal Audio Tokenizer** (UniAudio-Token) is a compact single-codebook audio tokenizer that unifies general audio perception and linguistic alignment for downstream Audio-LLMs.
[Paper](https://arxiv.org/abs/2605.31521) | [GitHub](https://github.com/Tencent/Universal_Audio_Tokenizer)
For code and more detailed information, please refer to the corresponding [GitHub repository](https://github.com/Tencent/Universal_Audio_Tokenizer).
## Highlights
Existing semantic speech tokenizers often suffer from *acoustic blindness*, while acoustic tokenizers typically lack *linguistic alignment*.
**Universal Audio Tokenizer** bridges this gap through:
- 🧩 **Semantic-Acoustic Primitives (SAP) supervision** that decomposes raw audio into fundamental linguistic content, vocal attributes, and auditory-scene primitives
- ⚖️ **Semantic-Acoustic Equilibrium (SAE) mechanism** that adaptively injects fine-grained acoustic details from shallow encoder layers into deep semantic streams
This results in a compact single-codebook audio tokenizer that **simultaneously** enables:
* 🧠 **Seamless LLM Integration**: A unified audio input/output interface in Audio-LLMs
* 🗣️ **Linguistic Alignment**: Superior performance on speech reconstruction and TTS synthesis tasks
* 🎯 **General Audio Perception**: Discriminative representations for diverse audio events and strong performance on downstream audio understanding benchmarks
## Model Details
| Attribute | Value |
|:----------|:------|
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| Bits Per Second (BPS) | 325 |
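
The three values in the table are related: for a single-codebook tokenizer, the bitrate is the frame rate times the number of bits needed to index the codebook. The sketch below checks that arithmetic. The function names are illustrative and not part of the repository, and the token-count estimate ignores any padding or trimming the actual tokenizer may apply.

```python
import math


def bits_per_second(frame_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook tokenizer: tokens per second times bits per token."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * bits_per_token


def approx_token_count(duration_s: float, frame_rate_hz: float) -> int:
    """Rough token count for a clip (assumption: no padding or trimming)."""
    return math.ceil(duration_s * frame_rate_hz)


print(bits_per_second(25, 8192))      # 25 * log2(8192) = 25 * 13 -> 325.0
print(approx_token_count(10.0, 25))   # -> 250
```
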
## Quick Start
To use Universal Audio Tokenizer, please clone the official repository and install dependencies.
### Installation
```bash
# 1. Clone the repository with all submodules
git clone --recursive https://github.com/Tencent/Universal_Audio_Tokenizer.git
cd Universal_Audio_Tokenizer
# If you have already cloned the repository without --recursive,
# initialize submodules with:
git submodule update --init --recursive
# 2. Create a conda environment
conda create -n universal-audio-tokenizer python=3.10.13 -y
conda activate universal-audio-tokenizer
# 3. Install dependencies
conda install -c conda-forge libsndfile -y
pip install -r requirements.txt
```
### Download Pretrained Checkpoints
Using `huggingface-cli`:
```bash
huggingface-cli download tencent/Universal_Audio_Tokenizer \
  --local-dir checkpoints/Universal_Audio_Tokenizer
```
Or using Python:
```python
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="tencent/Universal_Audio_Tokenizer",
    local_dir="checkpoints/Universal_Audio_Tokenizer",
)
```
### Run Inference
We provide a simple inference demo in `example_usage.py`.
```bash
python example_usage.py \
  --device auto \
  --model_path checkpoints/Universal_Audio_Tokenizer \
  --audio_path /path/to/audio.wav
```
The script will:
- load the tokenizer and feature extractor;
- extract discrete audio tokens from input audio clips;
- reconstruct waveforms from the tokens and save reconstructed audio under `reconstruction/`.
Alternatively, you can run the inference snippet below directly:
```python
import os
import torch
from huggingface_hub import snapshot_download
from transformers import WhisperFeatureExtractor
from src.model.modeling_whisper import WhisperVQEncoder
from src.model.flow_inference import AudioDecoder
from src.model.utils import extract_audio_token, speech_token_to_wav
# 1. Download and load models
# Use the GPU when one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model_dir = snapshot_download("tencent/Universal_Audio_Tokenizer")

# Load tokenizer and feature extractor
tokenizer_path = os.path.join(model_dir, "tokenizer")
tokenizer = WhisperVQEncoder.from_pretrained(tokenizer_path).eval().to(device)
feature_extractor = WhisperFeatureExtractor.from_pretrained(tokenizer_path)

# Load decoder
decoder_path = os.path.join(model_dir, "decoder")
decoder = AudioDecoder(
    config_path=os.path.join(decoder_path, "config.yaml"),
    flow_ckpt_path=os.path.join(decoder_path, "flow.pt"),
    hift_ckpt_path=os.path.join(decoder_path, "hift.pt"),
    device=device,
)

# 2. Tokenize (the helper takes a list of paths; [0] selects the first result)
tokens = extract_audio_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device=device)[0]

# 3. Reconstruct the waveform from the tokens
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```
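
The snippet above leaves the reconstructed waveform in memory. A minimal sketch for writing it to disk with only the Python standard library follows. It assumes the waveform can be flattened to a mono sequence of floats in [-1, 1]; the helper name `save_pcm16_wav` is hypothetical and not part of the repository.

```python
import struct
import wave


def save_pcm16_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))
```

If `tts_speech` is a PyTorch tensor (an assumption about the return type of `speech_token_to_wav`), a call such as `save_pcm16_wav("reconstruction.wav", tts_speech.squeeze().cpu().tolist(), sampling_rate)` would write it out.
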
## Performance
Universal Audio Tokenizer learns discriminative representations for diverse audio events, and achieves strong performance on speech reconstruction, downstream audio understanding, and TTS synthesis tasks.
### Latent Space Disentanglement
We run cluster analysis on high-dimensional token histogram vectors from ESC-10 and ESC-50. Our model is the only one with a positive Silhouette Score on both datasets and has the highest Cluster Purity, indicating that it encodes general audio with clearer cluster separation in the latent space.
| Model | ESC-10 Sil. (↑) | ESC-10 Purity (↑) | ESC-50 Sil. (↑) | ESC-50 Purity (↑) |
|:---|:---:|:---:|:---:|:---:|
| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | -0.030 | 0.450 | -0.108 | 0.215 |
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | -0.182 | 0.373 | -0.304 | 0.133 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | -0.016 | 0.413 | -0.100 | 0.216 |
| [StableToken](https://github.com/Tencent/StableToken) | -0.035 | 0.468 | -0.096 | 0.174 |
| **Ours** | **0.091** | **0.730** | **0.023** | **0.390** |
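
To make the analysis above concrete, here is a minimal sketch of two of its ingredients: turning a token sequence into a histogram vector and scoring a clustering by purity. It is not the evaluation code behind the table. The histogram normalization, the choice of clustering algorithm (not shown), and both function names are assumptions.

```python
from collections import Counter


def token_histogram(tokens, codebook_size):
    """Normalized frequency of each codebook index in one token sequence."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    return [counts[i] / len(tokens) for i in range(codebook_size)]


def cluster_purity(cluster_ids, labels):
    """Fraction of clips whose label matches the majority label of their cluster."""
    label_counts = {}
    for cluster_id, label in zip(cluster_ids, labels):
        label_counts.setdefault(cluster_id, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in label_counts.values())
    return majority_total / len(labels)


print(token_histogram([1, 1, 3, 1], codebook_size=4))               # [0.0, 0.75, 0.0, 0.25]
print(cluster_purity([0, 0, 1, 1], ["dog", "dog", "rain", "dog"]))  # 0.75
```

With the real tokenizer, each histogram would have 8,192 dimensions, one per codebook entry.
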
### High-Quality Speech Reconstruction
Our Universal Audio Tokenizer achieves high-quality speech reconstruction with a compact single-codebook design. Among the tokenizers compared below, it attains the lowest Word Error Rate (WER) on all four test sets and the highest Mean Opinion Score (MOS) on three of them; GLM-4-Voice-Tokenizer scores slightly higher on SEED-en.
| Model | Frame<br>Rate | BPS | WER (↓)<br>LS-clean | WER (↓)<br>LS-other | WER (↓)<br>SEED-en | WER (↓)<br>SEED-zh | MOS (↑)<br>LS-clean | MOS (↑)<br>LS-other | MOS (↑)<br>SEED-en | MOS (↑)<br>SEED-zh |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | 75Hz | 900 | 5.07 | 13.09 | 5.60 | 4.02 | 3.37 | 3.09 | 3.01 | 3.13 |
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | 3.99 | **4.16** | 4.10 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| [StableToken](https://github.com/Tencent/StableToken) | 25Hz | 325 | 3.84 | 7.99 | 3.44 | 2.62 | 4.09 | 3.83 | 4.01 | 4.18 |
| **Ours** | 25Hz | 325 | **3.47** | **6.79** | **2.55** | **1.90** | **4.19** | **4.18** | 4.13 | **4.25** |
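
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the transcript of the reconstructed audio, divided by the number of reference words. The sketch below shows that computation only. The ASR model and text normalization used for the table are not stated here, so none are assumed, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(round(100 * word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
```
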
### Superior Downstream Audio-LLM Performance
When integrated with the Qwen2.5 LLM backbone, our Universal Audio Tokenizer yields superior performance on a wide range of downstream audio understanding benchmarks and controllable TTS synthesis tasks.
#### Audio Understanding
Accuracy (%) on audio understanding benchmarks. Values in parentheses are absolute gains over the best baseline in that column:
| **Tokenizer** | MMAU<br>(Speech) | MMAU<br>(Sound) | MMAU<br>(Music) | **MMAU<br>(Overall)** | MMAR<br>(Speech) | MMAR<br>(Sound) | MMAR<br>(Music) | **MMAR<br>(Overall)** | MMSU<br>(Perception) | MMSU<br>(Reasoning) | **MMSU<br>(Overall)** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | 36.94 | 60.36 | 57.78 | 51.70 | 39.80 | 31.52 | 29.61 | 36.30 | 32.83 | 45.37 | 38.90 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 39.94 | 61.56 | 62.57 | 54.70 | 41.50 | 35.76 | 30.58 | 38.10 | 27.44 | 45.83 | 36.34 |
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 43.24 | 60.06 | 62.28 | 55.20 | 39.46 | 40.00 | 36.89 | 40.10 | 32.40 | 47.64 | 39.78 |
| [StableToken](https://github.com/Tencent/StableToken) | **45.05** | 58.56 | 55.99 | 53.20 | 42.18 | 39.39 | 31.07 | 39.10 | 31.98 | 49.71 | 40.56 |
| **Ours** | **45.05** | **70.27** | **67.96** | **61.10** (+5.90) | **45.24** | **43.64** | **40.29** | **45.80** (+5.70) | **35.54** | **52.07** | **43.54** (+2.98) |
#### Controllable TTS Synthesis
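Results on SEED-TTS, measured by speaker similarity (SIM), word error rate (WER), and mean opinion score (MOS). Each cell lists three values; the third is the mean of the first two, which presumably correspond to the English and Chinese test sets.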
| Tokenizer | SIM (↑) | WER (↓) | MOS (↑) |
|:---|:---:|:---:|:---:|
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | .758 \| **.762** \| .760 | 2.71 \| 1.39 \| 2.05 | 3.75 \| 3.37 \| 3.56 |
| **Ours** | **.792** \| .742 \| **.767** | **1.78** \| **1.29** \| **1.54** | **4.07** \| **3.68** \| **3.88** |
## Citation
If you find our code or model useful for your research, please cite:
```bibtex
@misc{song2026uniaudiotokenempoweringsemanticspeech,
  title={UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception},
  author={Yuhan Song and Linhao Zhang and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou},
  year={2026},
  eprint={2605.31521},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.31521},
}
```
## License
This project is licensed under the [License Term of Universal_Audio_Tokenizer](LICENSE).