| --- |
| language: |
| - en |
| - zh |
| license: other |
| license_name: license-term-of-universal-audio-tokenizer |
| tags: |
| - audio |
| - audio-tokenizer |
| - speech-tokenizer |
| - speech |
| - sound |
| - music |
| pipeline_tag: audio-to-audio |
| --- |
| |
| # Universal Audio Tokenizer: Empowering Semantic Speech Tokenizers with General Audio Perception |
|
|
| **Universal Audio Tokenizer** (UniAudio-Token) is a compact single-codebook audio tokenizer that unifies general audio perception and linguistic alignment for downstream Audio-LLMs. |
|
|
| [Paper](https://arxiv.org/abs/2605.31521) | [GitHub](https://github.com/Tencent/Universal_Audio_Tokenizer) |
|
|
| For code and more detailed information, please refer to the corresponding [GitHub repository](https://github.com/Tencent/Universal_Audio_Tokenizer). |
|
|
| ## Highlights |
|
|
| Existing semantic speech tokenizers often suffer from *acoustic blindness*, while acoustic tokenizers typically lack *linguistic alignment*. |
|
|
| **Universal Audio Tokenizer** bridges this gap through: |
| - **Semantic-Acoustic Primitives (SAP) supervision** that decomposes raw audio into fundamental linguistic content, vocal attributes, and auditory-scene primitives |
| - **Semantic-Acoustic Equilibrium (SAE) mechanism** that adaptively injects fine-grained acoustic details from shallow encoder layers into deep semantic streams |
|
|
| This results in a compact single-codebook audio tokenizer that **simultaneously** enables: |
| * **Seamless LLM Integration**: A unified audio input/output interface in Audio-LLMs |
| * **Linguistic Alignment**: Superior performance on speech reconstruction and TTS synthesis tasks |
| * **General Audio Perception**: Discriminative representations for diverse audio events and strong performance on downstream audio understanding benchmarks |
|
|
| ## Model Details |
|
|
| | Attribute | Value | |
| |:----------|:------| |
| | Frame Rate | 25 Hz | |
| | Codebook Size | 8,192 | |
| | Bits Per Second (BPS) | 325 | |
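The bitrate in the table follows from the other two rows: a single codebook of 8,192 entries needs log2(8192) = 13 bits per token, and 25 tokens per second gives 325 bits per second. The sketch below reproduces that arithmetic. The helper names (`bits_per_second`, `approx_num_tokens`) are illustrative and not part of the repository, and the token-count estimate assumes no padding or truncation by the feature extractor.

```python
import math


def bits_per_second(frame_rate_hz: float, codebook_size: int, num_codebooks: int = 1) -> float:
    """Bitrate of a discrete tokenizer: tokens per second times bits per token."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token


def approx_num_tokens(duration_s: float, frame_rate_hz: float = 25.0) -> int:
    """Rough token count for a clip (assumption: no padding or truncation)."""
    return math.ceil(duration_s * frame_rate_hz)


print(bits_per_second(25, 8192))  # 325.0
print(approx_num_tokens(10))      # 250
```

Under these assumptions, a 10-second clip maps to roughly 250 tokens, which is the sequence length an Audio-LLM would see for that clip.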
|
|
| ## Quick Start |
|
|
| To use Universal Audio Tokenizer, please clone the official repository and install dependencies. |
|
|
| ### Installation |
|
|
| ```bash |
| # 1. Clone the repository with all submodules |
| git clone --recursive https://github.com/Tencent/Universal_Audio_Tokenizer.git |
| cd Universal_Audio_Tokenizer |
| |
| # If you have already cloned the repository without --recursive, |
| # initialize submodules with: |
| git submodule update --init --recursive |
| |
| # 2. Create a conda environment |
| conda create -n universal-audio-tokenizer python=3.10.13 -y |
| conda activate universal-audio-tokenizer |
| |
| # 3. Install dependencies |
| conda install -c conda-forge libsndfile -y |
| pip install -r requirements.txt |
| ``` |
|
|
| ### Download Pretrained Checkpoints |
|
|
| Using `huggingface-cli`: |
|
|
| ```bash |
| huggingface-cli download tencent/Universal_Audio_Tokenizer \ |
| --local-dir checkpoints/Universal_Audio_Tokenizer |
| ``` |
|
|
| Or using Python: |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| |
| snapshot_download( |
| repo_id="tencent/Universal_Audio_Tokenizer", |
| local_dir="checkpoints/Universal_Audio_Tokenizer" |
| ) |
| ``` |
|
|
| ### Run Inference |
|
|
| We provide a simple inference demo in `example_usage.py`. |
|
|
| ```bash |
| python example_usage.py \ |
| --device auto \ |
| --model_path checkpoints/Universal_Audio_Tokenizer \ |
| --audio_path /path/to/audio.wav |
| ``` |
|
|
| The script will: |
|
|
| - load the tokenizer and feature extractor; |
| - extract discrete audio tokens from input audio clips; |
| - reconstruct waveforms from the tokens and save reconstructed audio under `reconstruction/`. |
|
|
| Alternatively, run the snippet below from the repository root (it imports from `src/`) on a CUDA-capable machine: |
|
|
| ```python |
| import os |
| import torch |
| from huggingface_hub import snapshot_download |
| from transformers import WhisperFeatureExtractor |
| from src.model.modeling_whisper import WhisperVQEncoder |
| from src.model.flow_inference import AudioDecoder |
| from src.model.utils import extract_audio_token, speech_token_to_wav |
| |
| # 1. Download & Load Models |
| model_dir = snapshot_download("tencent/Universal_Audio_Tokenizer") |
| |
| # Load tokenizer and feature extractor |
| tokenizer_path = os.path.join(model_dir, "tokenizer") |
| tokenizer = WhisperVQEncoder.from_pretrained(tokenizer_path).eval().cuda() |
| feature_extractor = WhisperFeatureExtractor.from_pretrained(tokenizer_path) |
| |
| # Load decoder |
| decoder_path = os.path.join(model_dir, "decoder") |
| decoder = AudioDecoder( |
| config_path=os.path.join(decoder_path, "config.yaml"), |
| flow_ckpt_path=os.path.join(decoder_path, "flow.pt"), |
| hift_ckpt_path=os.path.join(decoder_path, "hift.pt"), |
| device="cuda" |
| ) |
| |
| # 2. Tokenize |
| tokens = extract_audio_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0] |
| |
| # 3. Reconstruct |
| tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens) |
| ``` |
|
|
| ## Performance |
|
|
| Universal Audio Tokenizer learns discriminative representations for diverse audio events, and achieves strong performance on speech reconstruction, downstream audio understanding, and TTS synthesis tasks. |
|
|
| ### Latent Space Disentanglement |
|
|
| We use high-dimensional token histogram vectors for cluster analysis on ESC-10 and ESC-50. The results (Silhouette Score and Cluster Purity) show that our model effectively encodes general audio, with clearer cluster separation in the latent space: it is the only tokenizer with a positive Silhouette Score on both datasets. |
|
|
| | Model | ESC-10 Sil. (↑) | ESC-10 Purity (↑) | ESC-50 Sil. (↑) | ESC-50 Purity (↑) | |
| |:---|:---:|:---:|:---:|:---:| |
| | [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | -0.030 | 0.450 | -0.108 | 0.215 | |
| | [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | -0.182 | 0.373 | -0.304 | 0.133 | |
| | [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | -0.016 | 0.413 | -0.100 | 0.216 | |
| | [StableToken](https://github.com/Tencent/StableToken) | -0.035 | 0.468 | -0.096 | 0.174 | |
| | **Ours** | **0.091** | **0.730** | **0.023** | **0.390** | |
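As a rough illustration of the analysis above, the sketch below builds a normalized token histogram for a clip and computes cluster purity from cluster assignments and class labels. It is a minimal sketch under assumptions: the function names are hypothetical, and the clustering algorithm, distance metric, and normalization actually used for the reported numbers are not stated here. The Silhouette Score could be computed on the same histogram vectors with a standard implementation such as `sklearn.metrics.silhouette_score`.

```python
from collections import Counter


def token_histogram(tokens, codebook_size=8192):
    """Normalized histogram of token indices for one clip (one bin per codebook entry)."""
    hist = [0.0] * codebook_size
    for t in tokens:
        hist[t] += 1.0
    total = len(tokens)
    return [count / total for count in hist] if total else hist


def cluster_purity(cluster_ids, labels):
    """Fraction of samples whose label is the majority label of their cluster."""
    per_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(labels)


# Toy example: two clusters, each with one mislabeled clip.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["dog", "dog", "rain", "rain", "rain", "dog"]
print(cluster_purity(clusters, classes))  # 4 of 6 clips match their cluster's majority class
```

A purity of 1.0 means every cluster contains clips from a single class, so higher values indicate that clips of the same sound category share similar token distributions.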
|
|
| ### High-Quality Speech Reconstruction |
|
|
| Our Universal Audio Tokenizer achieves high-quality speech reconstruction with a compact single-codebook design. Compared to existing supervised semantic tokenizers, it obtains the lowest Word Error Rate (WER) on all four test sets and the highest Mean Opinion Score (MOS) on three of the four. |
|
|
| | Model | Frame<br>Rate | BPS | WER (↓)<br>LS-clean | WER (↓)<br>LS-other | WER (↓)<br>SEED-en | WER (↓)<br>SEED-zh | MOS (↑)<br>LS-clean | MOS (↑)<br>LS-other | MOS (↑)<br>SEED-en | MOS (↑)<br>SEED-zh | |
| |:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| |
| | [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | 75Hz | 900 | 5.07 | 13.09 | 5.60 | 4.02 | 3.37 | 3.09 | 3.01 | 3.13 | |
| | [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | 3.99 | **4.16** | 4.10 | |
| | [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 | |
| | [StableToken](https://github.com/Tencent/StableToken) | 25Hz | 325 | 3.84 | 7.99 | 3.44 | 2.62 | 4.09 | 3.83 | 4.01 | 4.18 | |
| | **Ours** | 25Hz | 325 | **3.47** | **6.79** | **2.55** | **1.90** | **4.19** | **4.18** | 4.13 | **4.25** | |
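For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. For reconstruction, the hypothesis is typically an ASR transcript of the reconstructed audio. The sketch below is a generic implementation, not the evaluation script behind the table: the ASR model and text normalization used for the reported numbers are not specified here, and for Chinese the unit is usually the character rather than the whitespace-separated word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.2f}%")
```

The table reports WER as a percentage, so a value of 3.47 corresponds to about 3.5 errors per 100 reference words.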
|
|
| ### Superior Downstream Audio-LLM Performance |
|
|
| When integrated with the Qwen2.5 LLM backbone, our Universal Audio Tokenizer outperforms the baseline tokenizers on downstream audio understanding benchmarks and on most controllable TTS synthesis metrics. |
|
|
| #### Audio Understanding |
|
|
| Accuracy (%) on audio understanding benchmarks. Gains in parentheses are measured against the best baseline in the same column: |
|
|
| | **Tokenizer** | MMAU<br>(Speech) | MMAU<br>(Sound) | MMAU<br>(Music) | **MMAU<br>(Overall)** | MMAR<br>(Speech) | MMAR<br>(Sound) | MMAR<br>(Music) | **MMAR<br>(Overall)** | MMSU<br>(Perception) | MMSU<br>(Reasoning) | **MMSU<br>(Overall)** | |
| |:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| |
| | [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | 36.94 | 60.36 | 57.78 | 51.70 | 39.80 | 31.52 | 29.61 | 36.30 | 32.83 | 45.37 | 38.90 | |
| | [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 39.94 | 61.56 | 62.57 | 54.70 | 41.50 | 35.76 | 30.58 | 38.10 | 27.44 | 45.83 | 36.34 | |
| | [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 43.24 | 60.06 | 62.28 | 55.20 | 39.46 | 40.00 | 36.89 | 40.10 | 32.40 | 47.64 | 39.78 | |
| | [StableToken](https://github.com/Tencent/StableToken) | **45.05** | 58.56 | 55.99 | 53.20 | 42.18 | 39.39 | 31.07 | 39.10 | 31.98 | 49.71 | 40.56 | |
| | **Ours** | **45.05** | **70.27** | **67.96** | **61.10** (+5.90) | **45.24** | **43.64** | **40.29** | **45.80** (+5.70) | **35.54** | **52.07** | **43.54** (+2.98) | |
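The parenthesized gains in the last row can be checked directly from the Overall columns: each one is our score minus the strongest baseline score for that benchmark. The sketch below copies the Overall values from the table; the dictionary layout and the function name are illustrative only.

```python
# Overall accuracy (%) per benchmark, copied from the table above.
baselines = {
    "MMAU": {"WavTokenizer": 51.70, "CosyVoice2": 54.70, "GLM-4-Voice-Tokenizer": 55.20, "StableToken": 53.20},
    "MMAR": {"WavTokenizer": 36.30, "CosyVoice2": 38.10, "GLM-4-Voice-Tokenizer": 40.10, "StableToken": 39.10},
    "MMSU": {"WavTokenizer": 38.90, "CosyVoice2": 36.34, "GLM-4-Voice-Tokenizer": 39.78, "StableToken": 40.56},
}
ours = {"MMAU": 61.10, "MMAR": 45.80, "MMSU": 43.54}


def gain_over_best_baseline(benchmark: str):
    """Return the strongest baseline and our absolute gain over it, in accuracy points."""
    scores = baselines[benchmark]
    best_name = max(scores, key=scores.get)
    return best_name, round(ours[benchmark] - scores[best_name], 2)


for name in ours:
    best, gain = gain_over_best_baseline(name)
    print(f"{name}: +{gain:.2f} over {best}")
```

The strongest baseline differs by benchmark (GLM-4-Voice-Tokenizer on MMAU and MMAR, StableToken on MMSU), so the gains are not measured against a single fixed model.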
|
|
| #### Controllable TTS Synthesis |
|
|
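| Results on SEED-TTS, measured by speaker similarity (SIM), word error rate (WER), and mean opinion score (MOS). Each cell lists three values; the third is the mean of the first two, consistent with the English and Chinese test sets followed by their average. |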
|
|
| | Tokenizer | SIM (↑) | WER (↓) | MOS (↑) | |
| |:---|:---:|:---:|:---:| |
| | [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | .758 \| **.762** \| .760 | 2.71 \| 1.39 \| 2.05 | 3.75 \| 3.37 \| 3.56 | |
| | **Ours** | **.792** \| .742 \| **.767** | **1.78** \| **1.29** \| **1.54** | **4.07** \| **3.68** \| **3.88** | |
|
|
| ## Citation |
|
|
| If you find our code or model useful for your research, please cite: |
|
|
| ```bibtex |
| @misc{song2026uniaudiotokenempoweringsemanticspeech, |
| title={UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception}, |
| author={Yuhan Song and Linhao Zhang and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou}, |
| year={2026}, |
| eprint={2605.31521}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CL}, |
| url={https://arxiv.org/abs/2605.31521}, |
| } |
| ``` |
|
|
| ## License |
|
|
| This project is licensed under the [License Term of Universal_Audio_Tokenizer](LICENSE). |