---
license: other
license_name: license-term-of-universal-audio-tokenizer
language:
- en
- zh
tags:
- audio
- audio-tokenizer
- speech-tokenizer
- speech
- sound
- music
---
# Universal Audio Tokenizer: Empowering Semantic Speech Tokenizers with General Audio Perception
**Universal Audio Tokenizer** is a compact single-codebook audio tokenizer that unifies general audio perception and
linguistic alignment for downstream Audio-LLMs.
📄 [Paper](https://arxiv.org/abs/2605.31521) | 💻 [GitHub](https://github.com/Tencent/Universal_Audio_Tokenizer)
For code and more detailed information, please refer to the corresponding [GitHub repository](https://github.com/Tencent/Universal_Audio_Tokenizer).
## 💡 Highlights
Existing semantic speech tokenizers often suffer from *acoustic blindness*, while acoustic tokenizers typically lack *linguistic alignment*.
**Universal Audio Tokenizer** bridges this gap through:
- 🧩 **Semantic-Acoustic Primitives (SAP) supervision** that decomposes raw audio into fundamental linguistic content, vocal attributes, and auditory-scene primitives
- ⚖️ **Semantic-Acoustic Equilibrium (SAE) mechanism** that adaptively injects fine-grained acoustic details from shallow encoder layers into deep semantic streams
This results in a compact single-codebook audio tokenizer that **simultaneously** enables:
* 🧠 **Seamless LLM Integration**: A unified audio input/output interface in Audio-LLMs
* 🗣️ **Linguistic Alignment**: Superior performance on speech reconstruction and TTS synthesis tasks
* 🎯 **General Audio Perception**: Discriminative representations for diverse audio events and strong performance on downstream audio understanding benchmarks
## 📌 Model Details
| Attribute | Value |
|:----------|:------|
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| Bits Per Second (BPS) | 325 |
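The bitrate in the table follows from the other two attributes: a single codebook of 8,192 entries carries log2(8192) = 13 bits per token, and 25 tokens per second gives 325 bits per second. The snippet below is a minimal sketch of that arithmetic; the helper names are illustrative and are not part of the repository, and the token-count estimate assumes the last partial frame is rounded up, which may not match the tokenizer's actual padding behavior.

```python
import math


def bits_per_second(frame_rate_hz: float, codebook_size: int, num_codebooks: int = 1) -> float:
    """Bitrate of a discrete tokenizer: tokens per second times bits per token."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token


def approx_num_tokens(duration_s: float, frame_rate_hz: float) -> int:
    """Rough token count for a clip (assumption: a partial last frame is rounded up)."""
    return math.ceil(duration_s * frame_rate_hz)


bps = bits_per_second(25, 8192)          # values from the Model Details table
tokens_10s = approx_num_tokens(10.0, 25)  # roughly how many tokens a 10 s clip yields
print(bps, tokens_10s)
```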
## 🚀 Quick Start
To use Universal Audio Tokenizer, please clone the official repository and install dependencies.
### Installation
```bash
# 1. Clone the repository with all submodules
git clone --recursive https://github.com/Tencent/Universal_Audio_Tokenizer.git
cd Universal_Audio_Tokenizer
# If you have already cloned the repository without --recursive,
# initialize submodules with:
git submodule update --init --recursive
# 2. Create a conda environment
conda create -n universal-audio-tokenizer python=3.10.13 -y
conda activate universal-audio-tokenizer
# 3. Install dependencies
conda install -c conda-forge libsndfile -y
pip install -r requirements.txt
```
### Download Pretrained Checkpoints
Using `huggingface-cli`:
```bash
huggingface-cli download tencent/Universal_Audio_Tokenizer \
--local-dir checkpoints/Universal_Audio_Tokenizer
```
Or using Python:
```python
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="tencent/Universal_Audio_Tokenizer",
    local_dir="checkpoints/Universal_Audio_Tokenizer",
)
```
### Run Inference
We provide a simple inference demo in `example_usage.py`.
```bash
python example_usage.py \
--device auto \
--model_path checkpoints/Universal_Audio_Tokenizer \
--audio_path /path/to/audio.wav
```
The script will:
- load the tokenizer and feature extractor;
- extract discrete audio tokens from input audio clips;
- reconstruct waveforms from the tokens and save reconstructed audio under `reconstruction/`.
Alternatively, run the snippet below from the repository root (it imports modules from `src/`):
```python
import os
import torch
from huggingface_hub import snapshot_download
from transformers import WhisperFeatureExtractor
from src.model.modeling_whisper import WhisperVQEncoder
from src.model.flow_inference import AudioDecoder
from src.model.utils import extract_audio_token, speech_token_to_wav
# 1. Download & Load Models
model_dir = snapshot_download("tencent/Universal_Audio_Tokenizer")
# Load tokenizer and feature extractor
tokenizer_path = os.path.join(model_dir, "tokenizer")
tokenizer = WhisperVQEncoder.from_pretrained(tokenizer_path).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(tokenizer_path)
# Load decoder
decoder_path = os.path.join(model_dir, "decoder")
decoder = AudioDecoder(
    config_path=os.path.join(decoder_path, "config.yaml"),
    flow_ckpt_path=os.path.join(decoder_path, "flow.pt"),
    hift_ckpt_path=os.path.join(decoder_path, "hift.pt"),
    device="cuda",
)
# 2. Tokenize
tokens = extract_audio_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]
# 3. Reconstruct
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```
## 📊 Performance
Universal Audio Tokenizer learns discriminative representations for diverse audio events, and achieves strong performance on speech reconstruction, downstream audio understanding, and TTS synthesis tasks.
### Latent Space Disentanglement
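We represent each audio clip as a high-dimensional token histogram vector and run cluster analysis on these vectors. Measured by Silhouette Score and Cluster Purity, our model shows clearer cluster separation in the latent space: it is the only tokenizer with positive Silhouette Scores on both ESC-10 and ESC-50, and it reaches the highest Purity on both.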
| Model | ESC-10 Sil. (↑) | ESC-10 Purity (↑) | ESC-50 Sil. (↑) | ESC-50 Purity (↑) |
|:---|:---:|:---:|:---:|:---:|
| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | -0.030 | 0.450 | -0.108 | 0.215 |
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | -0.182 | 0.373 | -0.304 | 0.133 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | -0.016 | 0.413 | -0.100 | 0.216 |
| [StableToken](https://github.com/Tencent/StableToken) | -0.035 | 0.468 | -0.096 | 0.174 |
| **Ours** | **0.091** | **0.730** | **0.023** | **0.390** |
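To make the two ingredients of this analysis concrete, here is a minimal sketch of a per-clip token histogram and of Cluster Purity. It is illustrative only: the function names are hypothetical, the histogram normalization is an assumption, and the cluster assignments are taken as given (the clustering algorithm used for the table is not specified here).

```python
from collections import Counter


def token_histogram(tokens, codebook_size):
    """Normalized histogram of token ids for one clip (normalization is an assumption)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter(tokens)
    total = len(tokens)
    return [counts.get(i, 0) / total for i in range(codebook_size)]


def cluster_purity(cluster_ids, labels):
    """Fraction of clips whose label equals the majority label of their cluster."""
    per_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(labels)


# Toy example with a 4-entry codebook and two clusters.
hist = token_histogram([0, 2, 2, 3], codebook_size=4)
purity = cluster_purity(
    cluster_ids=[0, 0, 0, 1, 1, 1],
    labels=["dog", "dog", "rain", "rain", "rain", "dog"],
)
print(hist, purity)
```

With the real tokenizer, `codebook_size` would be 8,192 and each histogram would be built from the token sequence of one ESC-10 or ESC-50 clip.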
### High-Quality Speech Reconstruction
Our Universal Audio Tokenizer achieves high-quality speech reconstruction with a compact single-codebook design. Compared with the baselines below, it attains the lowest Word Error Rate (WER) on all four test sets and the highest Mean Opinion Score (MOS) on three of the four; GLM-4-Voice-Tokenizer is slightly higher on SEED-en MOS.
| Model | Frame<br>Rate | BPS | WER (↓)<br>LS-clean | WER (↓)<br>LS-other | WER (↓)<br>SEED-en | WER (↓)<br>SEED-zh | MOS (↑)<br>LS-clean | MOS (↑)<br>LS-other | MOS (↑)<br>SEED-en | MOS (↑)<br>SEED-zh |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | 75Hz | 900 | 5.07 | 13.09 | 5.60 | 4.02 | 3.37 | 3.09 | 3.01 | 3.13 |
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | 3.99 | **4.16** | 4.10 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| [StableToken](https://github.com/Tencent/StableToken) | 25Hz | 325 | 3.84 | 7.99 | 3.44 | 2.62 | 4.09 | 3.83 | 4.01 | 4.18 |
| **Ours** | 25Hz | 325 | **3.47** | **6.79** | **2.55** | **1.90** | **4.19** | **4.18** | 4.13 | **4.25** |
### Superior Downstream Audio-LLM Performance
When integrated with the Qwen2.5 LLM backbone, our Universal Audio Tokenizer achieves the best overall accuracy on the MMAU, MMAR, and MMSU audio understanding benchmarks and improves WER and MOS over CosyVoice2 on controllable TTS synthesis, demonstrating its effectiveness as a unified audio input/output interface for Audio-LLMs.
#### Audio Understanding
Accuracy (%) on audio understanding benchmarks. Gains in parentheses are relative to the best baseline in the same column:
| **Tokenizer** | MMAU<br>(Speech) | MMAU<br>(Sound) | MMAU<br>(Music) | **MMAU<br>(Overall)** | MMAR<br>(Speech) | MMAR<br>(Sound) | MMAR<br>(Music) | **MMAR<br>(Overall)** | MMSU<br>(Perception) | MMSU<br>(Reasoning) | **MMSU<br>(Overall)** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | 36.94 | 60.36 | 57.78 | 51.70 | 39.80 | 31.52 | 29.61 | 36.30 | 32.83 | 45.37 | 38.90 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 39.94 | 61.56 | 62.57 | 54.70 | 41.50 | 35.76 | 30.58 | 38.10 | 27.44 | 45.83 | 36.34 |
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 43.24 | 60.06 | 62.28 | 55.20 | 39.46 | 40.00 | 36.89 | 40.10 | 32.40 | 47.64 | 39.78 |
| [StableToken](https://github.com/Tencent/StableToken) | **45.05** | 58.56 | 55.99 | 53.20 | 42.18 | 39.39 | 31.07 | 39.10 | 31.98 | 49.71 | 40.56 |
| **Ours** | **45.05** | **70.27** | **67.96** | **61.10** (+5.90) | **45.24** | **43.64** | **40.29** | **45.80** (+5.70) | **35.54** | **52.07** | **43.54** (+2.98) |
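The parenthesized gains in the last row match the difference between our Overall score and the strongest baseline in each Overall column. The short check below reproduces them from the table values; the dictionary layout and function name are illustrative only.

```python
# Overall accuracy per benchmark, copied from the table above.
overall = {
    "MMAU": {"WavTokenizer": 51.70, "CosyVoice2": 54.70,
             "GLM-4-Voice-Tokenizer": 55.20, "StableToken": 53.20, "Ours": 61.10},
    "MMAR": {"WavTokenizer": 36.30, "CosyVoice2": 38.10,
             "GLM-4-Voice-Tokenizer": 40.10, "StableToken": 39.10, "Ours": 45.80},
    "MMSU": {"WavTokenizer": 38.90, "CosyVoice2": 36.34,
             "GLM-4-Voice-Tokenizer": 39.78, "StableToken": 40.56, "Ours": 43.54},
}


def gain_over_best_baseline(scores, ours_key="Ours"):
    """Difference between our score and the best competing score, in points."""
    best_baseline = max(v for k, v in scores.items() if k != ours_key)
    return round(scores[ours_key] - best_baseline, 2)


gains = {bench: gain_over_best_baseline(scores) for bench, scores in overall.items()}
print(gains)
```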
#### Controllable TTS Synthesis
Results on SEED-TTS, measured by speaker similarity (SIM), word error rate (WER), and mean opinion score (MOS). Each cell lists three values separated by `|`; the third is the average of the first two.
| Tokenizer | SIM (↑) | WER (↓) | MOS (↑) |
|:---|:---:|:---:|:---:|
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | .758 \| **.762** \| .760 | 2.71 \| 1.39 \| 2.05 | 3.75 \| 3.37 \| 3.56 |
| **Ours** | **.792** \| .742 \| **.767** | **1.78** \| **1.29** \| **1.54** | **4.07** \| **3.68** \| **3.88** |
## Citation
If you find our code or model useful for your research, please cite:
```bibtex
@misc{song2026uniaudiotokenempoweringsemanticspeech,
title={UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception},
author={Yuhan Song and Linhao Zhang and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou},
year={2026},
eprint={2605.31521},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.31521},
}
```
## License
This project is licensed under the [License Term of Universal_Audio_Tokenizer](LICENSE).