---
language:
- en
- zh
license: other
license_name: license-term-of-universal-audio-tokenizer
tags:
- audio
- audio-tokenizer
- speech-tokenizer
- speech
- sound
- music
pipeline_tag: audio-to-audio
---

# Universal Audio Tokenizer: Empowering Semantic Speech Tokenizers with General Audio Perception

**Universal Audio Tokenizer** (UniAudio-Token) is a compact single-codebook audio tokenizer that unifies general audio perception and linguistic alignment for downstream Audio-LLMs.

📄 [Paper](https://arxiv.org/abs/2605.31521) | 💻 [GitHub](https://github.com/Tencent/Universal_Audio_Tokenizer)

For code and more detailed information, please refer to the corresponding [GitHub repository](https://github.com/Tencent/Universal_Audio_Tokenizer).

## 💡 Highlights

Existing semantic speech tokenizers often suffer from *acoustic blindness*, while acoustic tokenizers typically lack *linguistic alignment*. **Universal Audio Tokenizer** bridges this gap through:

- 🧩 **Semantic-Acoustic Primitives (SAP) supervision** that decomposes raw audio into fundamental linguistic content, vocal attributes, and auditory-scene primitives
- ⚖️ **Semantic-Acoustic Equilibrium (SAE) mechanism** that adaptively injects fine-grained acoustic details from shallow encoder layers into deep semantic streams

This results in a compact single-codebook audio tokenizer that **simultaneously** enables:

* 🧠 **Seamless LLM Integration**: A unified audio input/output interface in Audio-LLMs
* 🗣️ **Linguistic Alignment**: Superior performance on speech reconstruction and TTS synthesis tasks
* 🎯 **General Audio Perception**: Discriminative representations for diverse audio events and strong performance on downstream audio understanding benchmarks

## 📌 Model Details

| Attribute | Value |
|:----------|:------|
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| Bits Per Second (BPS) | 325 |

## 🚀 Quick Start

To use Universal Audio Tokenizer, please clone the official repository and install dependencies.
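
As a rough guide to what the tokenizer produces once it is set up, the frame rate and codebook size listed under Model Details determine both the bitrate and the approximate number of tokens per clip. The sketch below is illustrative only: the helper names are hypothetical and not part of the repository, and the token-count estimate assumes no padding or chunking, which the actual tokenizer may apply.

```python
import math

FRAME_RATE_HZ = 25    # tokens per second, from Model Details
CODEBOOK_SIZE = 8192  # entries in the single codebook, from Model Details


def bits_per_token(codebook_size: int) -> float:
    """Bits needed to index one entry of a single codebook."""
    return math.log2(codebook_size)


def bits_per_second(frame_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook token stream."""
    return frame_rate_hz * bits_per_token(codebook_size)


def approx_token_count(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Estimated token count for a clip (assumption: no padding or chunking)."""
    return math.ceil(duration_s * frame_rate_hz)


print(bits_per_token(CODEBOOK_SIZE))                  # 13.0
print(bits_per_second(FRAME_RATE_HZ, CODEBOOK_SIZE))  # 325.0
print(approx_token_count(10.0))                       # 250
```

With 8,192 codebook entries each token carries 13 bits, so 25 tokens per second gives the 325 BPS listed in the table, and a 10-second clip maps to roughly 250 tokens.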

### Installation

```bash
# 1. Clone the repository with all submodules
git clone --recursive https://github.com/Tencent/Universal_Audio_Tokenizer.git
cd Universal_Audio_Tokenizer

# If you have already cloned the repository without --recursive,
# initialize submodules with:
git submodule update --init --recursive

# 2. Create a conda environment
conda create -n universal-audio-tokenizer python=3.10.13 -y
conda activate universal-audio-tokenizer

# 3. Install dependencies
conda install -c conda-forge libsndfile -y
pip install -r requirements.txt
```

### Download Pretrained Checkpoints

Using `huggingface-cli`:

```bash
huggingface-cli download tencent/Universal_Audio_Tokenizer \
    --local-dir checkpoints/Universal_Audio_Tokenizer
```

Or using Python:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tencent/Universal_Audio_Tokenizer",
    local_dir="checkpoints/Universal_Audio_Tokenizer"
)
```

### Run Inference

We provide a simple inference demo in `example_usage.py`.

```bash
python example_usage.py \
    --device auto \
    --model_path checkpoints/Universal_Audio_Tokenizer \
    --audio_path /path/to/audio.wav
```

The script will:

- load the tokenizer and feature extractor;
- extract discrete audio tokens from input audio clips;
- reconstruct waveforms from the tokens and save reconstructed audio under `reconstruction/`.

Also, you can directly run the inference code snippet below:

```python
import os
import torch
from huggingface_hub import snapshot_download
from transformers import WhisperFeatureExtractor

from src.model.modeling_whisper import WhisperVQEncoder
from src.model.flow_inference import AudioDecoder
from src.model.utils import extract_audio_token, speech_token_to_wav

# 1. Download & Load Models
model_dir = snapshot_download("tencent/Universal_Audio_Tokenizer")

# Load tokenizer and feature extractor
tokenizer_path = os.path.join(model_dir, "tokenizer")
tokenizer = WhisperVQEncoder.from_pretrained(tokenizer_path).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(tokenizer_path)

# Load decoder
decoder_path = os.path.join(model_dir, "decoder")
decoder = AudioDecoder(
    config_path=os.path.join(decoder_path, "config.yaml"),
    flow_ckpt_path=os.path.join(decoder_path, "flow.pt"),
    hift_ckpt_path=os.path.join(decoder_path, "hift.pt"),
    device="cuda"
)

# 2. Tokenize
tokens = extract_audio_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]

# 3. Reconstruct
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```

## 📊 Performance

Universal Audio Tokenizer learns discriminative representations for diverse audio events, and achieves strong performance on speech reconstruction, downstream audio understanding, and TTS synthesis tasks.

### Latent Space Disentanglement

We use high-dimensional token histogram vectors for cluster analysis. The results (Silhouette Score and Cluster Purity) show that our model effectively encodes general audio, with clearer cluster separation in the latent space.

| Model | ESC-10 Sil. (↑) | ESC-10 Purity (↑) | ESC-50 Sil. (↑) | ESC-50 Purity (↑) |
|:---|:---:|:---:|:---:|:---:|
| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | -0.030 | 0.450 | -0.108 | 0.215 |
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | -0.182 | 0.373 | -0.304 | 0.133 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | -0.016 | 0.413 | -0.100 | 0.216 |
| [StableToken](https://github.com/Tencent/StableToken) | -0.035 | 0.468 | -0.096 | 0.174 |
| **Ours** | **0.091** | **0.730** | **0.023** | **0.390** |

### High-Quality Speech Reconstruction

Our Universal Audio Tokenizer achieves high-quality speech reconstruction with a compact single-codebook design, significantly improving Word Error Rate (WER) and Mean Opinion Score (MOS) compared to existing supervised semantic tokenizers.

| Model | Frame Rate | BPS | WER (↓) LS-clean | WER (↓) LS-other | WER (↓) SEED-en | WER (↓) SEED-zh | MOS (↑) LS-clean | MOS (↑) LS-other | MOS (↑) SEED-en | MOS (↑) SEED-zh |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | 75Hz | 900 | 5.07 | 13.09 | 5.60 | 4.02 | 3.37 | 3.09 | 3.01 | 3.13 |
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | 3.99 | **4.16** | 4.10 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| [StableToken](https://github.com/Tencent/StableToken) | 25Hz | 325 | 3.84 | 7.99 | 3.44 | 2.62 | 4.09 | 3.83 | 4.01 | 4.18 |
| **Ours** | 25Hz | 325 | **3.47** | **6.79** | **2.55** | **1.90** | **4.19** | **4.18** | 4.13 | **4.25** |

### Superior Downstream Audio-LLM Performance

When integrated with the Qwen2.5 LLM backbone, our Universal Audio Tokenizer yields superior performance on a wide range of downstream audio understanding benchmarks and controllable TTS synthesis tasks.

#### Audio Understanding

Accuracy on audio understanding benchmarks:

| **Tokenizer** | MMAU (Speech) | MMAU (Sound) | MMAU (Music) | **MMAU (Overall)** | MMAR (Speech) | MMAR (Sound) | MMAR (Music) | **MMAR (Overall)** | MMSU (Perception) | MMSU (Reasoning) | **MMSU (Overall)** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | 36.94 | 60.36 | 57.78 | 51.70 | 39.80 | 31.52 | 29.61 | 36.30 | 32.83 | 45.37 | 38.90 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 39.94 | 61.56 | 62.57 | 54.70 | 41.50 | 35.76 | 30.58 | 38.10 | 27.44 | 45.83 | 36.34 |
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 43.24 | 60.06 | 62.28 | 55.20 | 39.46 | 40.00 | 36.89 | 40.10 | 32.40 | 47.64 | 39.78 |
| [StableToken](https://github.com/Tencent/StableToken) | **45.05** | 58.56 | 55.99 | 53.20 | 42.18 | 39.39 | 31.07 | 39.10 | 31.98 | 49.71 | 40.56 |
| **Ours** | **45.05** | **70.27** | **67.96** | **61.10** (+5.90) | **45.24** | **43.64** | **40.29** | **45.80** (+5.70) | **35.54** | **52.07** | **43.54** (+2.98) |

#### Controllable TTS Synthesis

Results on SEED-TTS, measured by speaker similarity (SIM), word error rate (WER), and mean opinion score (MOS).

| Tokenizer | SIM (↑) | WER (↓) | MOS (↑) |
|:---|:---:|:---:|:---:|
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | .758 \| **.762** \| .760 | 2.71 \| 1.39 \| 2.05 | 3.75 \| 3.37 \| 3.56 |
| **Ours** | **.792** \| .742 \| **.767** | **1.78** \| **1.29** \| **1.54** | **4.07** \| **3.68** \| **3.88** |

## Citation

If you find our code or model useful for your research, please cite:

```bibtex
@misc{song2026uniaudiotokenempoweringsemanticspeech,
      title={UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception},
      author={Yuhan Song and Linhao Zhang and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou},
      year={2026},
      eprint={2605.31521},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.31521},
}
```

## License

This project is licensed under the [License Term of Universal_Audio_Tokenizer](LICENSE).