---
language:
- en
- zh
license: other
license_name: license-term-of-universal-audio-tokenizer
tags:
- audio
- audio-tokenizer
- speech-tokenizer
- speech
- sound
- music
pipeline_tag: audio-to-audio
---

# Universal Audio Tokenizer: Empowering Semantic Speech Tokenizers with General Audio Perception

**Universal Audio Tokenizer** (UniAudio-Token) is a compact single-codebook audio tokenizer that unifies general audio perception and linguistic alignment for downstream Audio-LLMs.

📄 [Paper](https://arxiv.org/abs/2605.31521) | 💻 [GitHub](https://github.com/Tencent/Universal_Audio_Tokenizer)

For code and more detailed information, please refer to the corresponding [GitHub repository](https://github.com/Tencent/Universal_Audio_Tokenizer).

## 💡 Highlights

Existing semantic speech tokenizers often suffer from *acoustic blindness*, while acoustic tokenizers typically lack *linguistic alignment*.

**Universal Audio Tokenizer** bridges this gap through:
-   🧩 **Semantic-Acoustic Primitives (SAP) supervision** that decomposes raw audio into fundamental linguistic content, vocal attributes, and auditory-scene primitives
-   ⚖️ **Semantic-Acoustic Equilibrium (SAE) mechanism** that adaptively injects fine-grained acoustic details from shallow encoder layers into deep semantic streams

This results in a compact single-codebook audio tokenizer that **simultaneously** enables:
*   🧠 **Seamless LLM Integration**: A unified audio input/output interface in Audio-LLMs
*   🗣️ **Linguistic Alignment**: Superior performance on speech reconstruction and TTS synthesis tasks
*   🎯 **General Audio Perception**: Discriminative representations for diverse audio events and strong performance on downstream audio understanding benchmarks

## 📌 Model Details

| Attribute | Value |
|:----------|:------|
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| Bits Per Second (BPS) | 325 |
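
The bitrate in the table follows from the frame rate and codebook size: a single codebook with 8,192 entries needs log2(8192) = 13 bits per token, and 25 tokens per second gives 325 bits per second. The short sketch below reproduces that arithmetic; the function name and the `num_codebooks` parameter are illustrative and not part of the repository.

```python
import math


def bits_per_second(frame_rate_hz, codebook_size, num_codebooks=1):
    """Bitrate of a discrete tokenizer: tokens/s times bits per token.

    Hypothetical helper for illustration only.
    """
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token


# Values taken from the Model Details table above.
bps = bits_per_second(frame_rate_hz=25, codebook_size=8192)
print(bps)
```

At 25 tokens per second, a 10-second clip corresponds to roughly 250 tokens; the exact count may differ slightly depending on how the feature extractor pads the input.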

## 🚀 Quick Start

To use Universal Audio Tokenizer, please clone the official repository and install dependencies.

### Installation

```bash
# 1. Clone the repository with all submodules
git clone --recursive https://github.com/Tencent/Universal_Audio_Tokenizer.git
cd Universal_Audio_Tokenizer

# If you have already cloned the repository without --recursive,
# initialize submodules with:
git submodule update --init --recursive

# 2. Create a conda environment
conda create -n universal-audio-tokenizer python=3.10.13 -y
conda activate universal-audio-tokenizer

# 3. Install dependencies
conda install -c conda-forge libsndfile -y
pip install -r requirements.txt
```

### Download Pretrained Checkpoints

Using `huggingface-cli`:

```bash
huggingface-cli download tencent/Universal_Audio_Tokenizer \
  --local-dir checkpoints/Universal_Audio_Tokenizer
```

Or using Python:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tencent/Universal_Audio_Tokenizer", 
    local_dir="checkpoints/Universal_Audio_Tokenizer"
)
```

### Run Inference

We provide a simple inference demo in `example_usage.py`.

```bash
python example_usage.py \
  --device auto \
  --model_path checkpoints/Universal_Audio_Tokenizer \
  --audio_path /path/to/audio.wav
```

The script will:

- load the tokenizer and feature extractor;
- extract discrete audio tokens from input audio clips;
- reconstruct waveforms from the tokens and save reconstructed audio under `reconstruction/`.

Alternatively, run the inference snippet below directly from the repository root:

```python
import os
import torch
from huggingface_hub import snapshot_download
from transformers import WhisperFeatureExtractor
from src.model.modeling_whisper import WhisperVQEncoder
from src.model.flow_inference import AudioDecoder
from src.model.utils import extract_audio_token, speech_token_to_wav

# 1. Download & Load Models
model_dir = snapshot_download("tencent/Universal_Audio_Tokenizer")

# Load tokenizer and feature extractor
tokenizer_path = os.path.join(model_dir, "tokenizer")
tokenizer = WhisperVQEncoder.from_pretrained(tokenizer_path).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(tokenizer_path)

# Load decoder
decoder_path = os.path.join(model_dir, "decoder")
decoder = AudioDecoder(
    config_path=os.path.join(decoder_path, "config.yaml"),
    flow_ckpt_path=os.path.join(decoder_path, "flow.pt"),
    hift_ckpt_path=os.path.join(decoder_path, "hift.pt"),
    device="cuda",
)

# 2. Tokenize
tokens = extract_audio_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]

# 3. Reconstruct
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```
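
The snippet above stops once the waveform is reconstructed. As a minimal sketch of how the result could be written to disk, the helper below saves a mono float waveform as 16-bit PCM using only the standard library and NumPy. `save_wav` is a hypothetical helper, not part of the repository, and it assumes the samples lie in [-1, 1] and form a single channel; the actual shape and dtype of `tts_speech` are not documented here, so check them before use.

```python
import wave

import numpy as np


def save_wav(path, samples, sample_rate):
    """Write a mono float waveform in [-1, 1] as a 16-bit PCM WAV file.

    Hypothetical helper; assumes a single channel (input is flattened).
    Returns the number of samples written.
    """
    samples = np.asarray(samples, dtype=np.float32).reshape(-1)
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit
        f.setframerate(int(sample_rate))
        f.writeframes(pcm.tobytes())
    return len(pcm)


# Possible usage, ASSUMING `tts_speech` is a torch tensor holding one clip:
# save_wav("reconstructed.wav", tts_speech.detach().cpu().numpy(), sampling_rate)
```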

## 📊 Performance

Universal Audio Tokenizer learns discriminative representations for diverse audio events, and achieves strong performance on speech reconstruction, downstream audio understanding, and TTS synthesis tasks.

### Latent Space Disentanglement

We use high-dimensional token histogram vectors for cluster analysis. The results (Silhouette Score and Cluster Purity) show that our model effectively encodes general audio, with clearer cluster separation in the latent space.

| Model | ESC-10 Sil. (↑) | ESC-10 Purity (↑) | ESC-50 Sil. (↑) | ESC-50 Purity (↑) |
|:---|:---:|:---:|:---:|:---:|
| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | -0.030 | 0.450 | -0.108 | 0.215 |
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | -0.182 | 0.373 | -0.304 | 0.133 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | -0.016 | 0.413 | -0.100 | 0.216 |
| [StableToken](https://github.com/Tencent/StableToken) | -0.035 | 0.468 | -0.096 | 0.174 |
| **Ours** | **0.091** | **0.730** | **0.023** | **0.390** |
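
To make the analysis above concrete, the sketch below shows one way to turn a clip's token sequence into a histogram vector over the codebook and to compute cluster purity from cluster assignments and ground-truth classes. It is an illustration under assumptions: the function names are hypothetical, the histogram normalization is a choice made here, and the clustering algorithm used for the reported numbers is not specified in this card. The Silhouette Score can be computed on the same vectors with, for example, `sklearn.metrics.silhouette_score`.

```python
import numpy as np

CODEBOOK_SIZE = 8192  # from the Model Details table


def token_histogram(tokens, codebook_size=CODEBOOK_SIZE):
    """Normalized count of each codebook index in one clip (hypothetical helper)."""
    counts = np.bincount(np.asarray(tokens, dtype=np.int64), minlength=codebook_size)
    return counts / max(counts.sum(), 1)


def cluster_purity(cluster_ids, class_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    cluster_ids = np.asarray(cluster_ids)
    class_labels = np.asarray(class_labels)
    matched = 0
    for c in np.unique(cluster_ids):
        # Count how many members of cluster c share its most common class.
        _, counts = np.unique(class_labels[cluster_ids == c], return_counts=True)
        matched += counts.max()
    return matched / len(class_labels)


# Toy example: 4 clips, 2 clusters, 2 classes.
print(cluster_purity([0, 0, 1, 1], [0, 0, 0, 1]))
```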

### High-Quality Speech Reconstruction

Universal Audio Tokenizer achieves high-quality speech reconstruction with a compact single-codebook design. Among the compared tokenizers, it attains the lowest Word Error Rate (WER) on all four test sets and the highest Mean Opinion Score (MOS) on three of them.

| Model | Frame<br>Rate | BPS | WER (↓)<br>LS-clean | WER (↓)<br>LS-other | WER (↓)<br>SEED-en | WER (↓)<br>SEED-zh | MOS (↑)<br>LS-clean | MOS (↑)<br>LS-other | MOS (↑)<br>SEED-en | MOS (↑)<br>SEED-zh |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | 75Hz | 900 | 5.07 | 13.09 | 5.60 | 4.02 | 3.37 | 3.09 | 3.01 | 3.13 |
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | 3.99 | **4.16** | 4.10 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| [StableToken](https://github.com/Tencent/StableToken) | 25Hz | 325 | 3.84 | 7.99 | 3.44 | 2.62 | 4.09 | 3.83 | 4.01 | 4.18 |
| **Ours** | 25Hz | 325 | **3.47** | **6.79** | **2.55** | **1.90** | **4.19** | **4.18** | 4.13 | **4.25** |
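
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the evaluation pipeline behind the table: the ASR model and text normalization used there are not described in this card, and the whitespace split shown here would need a different tokenization for Chinese text.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("the cat sat", "the dog sat"))
```

Multiplying the returned fraction by 100 gives a percentage.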

### Superior Downstream Audio-LLM Performance

When integrated with the Qwen2.5 LLM backbone, our Universal Audio Tokenizer yields superior performance on a wide range of downstream audio understanding benchmarks and controllable TTS synthesis tasks.

#### Audio Understanding

Accuracy (%) on audio understanding benchmarks. Values in parentheses give the margin over the best baseline in that column:

| **Tokenizer** | MMAU<br>(Speech) | MMAU<br>(Sound) | MMAU<br>(Music) | **MMAU<br>(Overall)** | MMAR<br>(Speech) | MMAR<br>(Sound) | MMAR<br>(Music) | **MMAR<br>(Overall)** | MMSU<br>(Perception) | MMSU<br>(Reasoning) | **MMSU<br>(Overall)** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | 36.94 | 60.36 | 57.78 | 51.70 | 39.80 | 31.52 | 29.61 | 36.30 | 32.83 | 45.37 | 38.90 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 39.94 | 61.56 | 62.57 | 54.70 | 41.50 | 35.76 | 30.58 | 38.10 | 27.44 | 45.83 | 36.34 |
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 43.24 | 60.06 | 62.28 | 55.20 | 39.46 | 40.00 | 36.89 | 40.10 | 32.40 | 47.64 | 39.78 |
| [StableToken](https://github.com/Tencent/StableToken) | **45.05** | 58.56 | 55.99 | 53.20 | 42.18 | 39.39 | 31.07 | 39.10 | 31.98 | 49.71 | 40.56 |
| **Ours** | **45.05** | **70.27** | **67.96** | **61.10** (+5.90) | **45.24** | **43.64** | **40.29** | **45.80** (+5.70) | **35.54** | **52.07** | **43.54** (+2.98) |

#### Controllable TTS Synthesis

Results on SEED-TTS, measured by speaker similarity (SIM), word error rate (WER), and mean opinion score (MOS). Each cell lists three values; the third is the average of the first two.

| Tokenizer | SIM (↑) | WER (↓) | MOS (↑) |
|:---|:---:|:---:|:---:|
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | .758 \| **.762** \| .760 | 2.71 \| 1.39 \| 2.05 | 3.75 \| 3.37 \| 3.56 |
| **Ours** | **.792** \| .742 \| **.767** | **1.78** \| **1.29** \| **1.54** | **4.07** \| **3.68** \| **3.88** |

## Citation

If you find our code or model useful for your research, please cite:

```bibtex
@misc{song2026uniaudiotokenempoweringsemanticspeech,
      title={UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception}, 
      author={Yuhan Song and Linhao Zhang and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou},
      year={2026},
      eprint={2605.31521},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.31521}, 
}
```

## License

This project is licensed under the [License Term of Universal_Audio_Tokenizer](LICENSE).