	---
	language:
	- en
	- zh
	license: other
	license_name: license-term-of-universal-audio-tokenizer
	tags:
	- audio
	- audio-tokenizer
	- speech-tokenizer
	- speech
	- sound
	- music
	pipeline_tag: audio-to-audio
	---

	# Universal Audio Tokenizer: Empowering Semantic Speech Tokenizers with General Audio Perception

	Universal Audio Tokenizer (UniAudio-Token) is a compact single-codebook audio tokenizer that unifies general audio perception and linguistic alignment for downstream Audio-LLMs.

	📄 [Paper](https://arxiv.org/abs/2605.31521) | 💻 [GitHub](https://github.com/Tencent/Universal_Audio_Tokenizer)

	For code and more detailed information, please refer to the corresponding [GitHub repository](https://github.com/Tencent/Universal_Audio_Tokenizer).

	## 💡 Highlights

	Existing semantic speech tokenizers often suffer from acoustic blindness, while acoustic tokenizers typically lack linguistic alignment.

	Universal Audio Tokenizer bridges this gap through:
	- 🧩 Semantic-Acoustic Primitives (SAP) supervision that decomposes raw audio into fundamental linguistic content, vocal attributes, and auditory-scene primitives
	- ⚖️ Semantic-Acoustic Equilibrium (SAE) mechanism that adaptively injects fine-grained acoustic details from shallow encoder layers into deep semantic streams

	This results in a compact single-codebook audio tokenizer that simultaneously enables:
	- 🧠 Seamless LLM Integration: A unified audio input/output interface in Audio-LLMs
	- 🗣️ Linguistic Alignment: Superior performance on speech reconstruction and TTS synthesis tasks
	- 🎯 General Audio Perception: Discriminative representations for diverse audio events and strong performance on downstream audio understanding benchmarks

	## 📌 Model Details

	| Attribute | Value |
	|:----------|:------|
	| Frame Rate | 25 Hz |
	| Codebook Size | 8,192 |
	| Bits Per Second (BPS) | 325 |
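
The bitrate in the table follows from the other two rows: a single codebook with 8,192 entries carries log2(8192) = 13 bits per token, and 25 tokens per second gives 325 bits per second. A minimal sketch of that arithmetic (the helper name `bits_per_second` is ours for illustration and is not part of the repository):

```python
import math

def bits_per_second(frame_rate_hz: float, codebook_size: int, num_codebooks: int = 1) -> float:
    """Bitrate of a discrete tokenizer: tokens/s x bits/token x number of codebooks."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * bits_per_token * num_codebooks

# Values from the Model Details table: 25 Hz, 8,192 codes, one codebook.
print(bits_per_second(frame_rate_hz=25, codebook_size=8192))  # 325.0
```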

	## 🚀 Quick Start

	To use Universal Audio Tokenizer, please clone the official repository and install dependencies.

	### Installation

	```bash
	# 1. Clone the repository with all submodules
	git clone --recursive https://github.com/Tencent/Universal_Audio_Tokenizer.git
	cd Universal_Audio_Tokenizer

	# If you have already cloned the repository without --recursive,
	# initialize submodules with:
	git submodule update --init --recursive

	# 2. Create a conda environment
	conda create -n universal-audio-tokenizer python=3.10.13 -y
	conda activate universal-audio-tokenizer

	# 3. Install dependencies
	conda install -c conda-forge libsndfile -y
	pip install -r requirements.txt
	```

	### Download Pretrained Checkpoints

	Using `huggingface-cli`:

	```bash
	huggingface-cli download tencent/Universal_Audio_Tokenizer \
	    --local-dir checkpoints/Universal_Audio_Tokenizer
	```

	Or using Python:

	```python
	from huggingface_hub import snapshot_download

	snapshot_download(
	    repo_id="tencent/Universal_Audio_Tokenizer",
	    local_dir="checkpoints/Universal_Audio_Tokenizer",
	)
	```

	### Run Inference

	We provide a simple inference demo in `example_usage.py`.

	```bash
	python example_usage.py \
	    --device auto \
	    --model_path checkpoints/Universal_Audio_Tokenizer \
	    --audio_path /path/to/audio.wav
	```

	The script will:

	- load the tokenizer and feature extractor;
	- extract discrete audio tokens from input audio clips;
	- reconstruct waveforms from the tokens and save reconstructed audio under `reconstruction/`.

	Alternatively, run the snippet below from the repository root, since it imports from the local `src` package:

	```python
	import os
	import torch
	from huggingface_hub import snapshot_download
	from transformers import WhisperFeatureExtractor
	from src.model.modeling_whisper import WhisperVQEncoder
	from src.model.flow_inference import AudioDecoder
	from src.model.utils import extract_audio_token, speech_token_to_wav

	# Use the GPU when available; otherwise fall back to the CPU (slower)
	device = "cuda" if torch.cuda.is_available() else "cpu"

	# 1. Download & Load Models
	model_dir = snapshot_download("tencent/Universal_Audio_Tokenizer")

	# Load tokenizer and feature extractor
	tokenizer_path = os.path.join(model_dir, "tokenizer")
	tokenizer = WhisperVQEncoder.from_pretrained(tokenizer_path).eval().to(device)
	feature_extractor = WhisperFeatureExtractor.from_pretrained(tokenizer_path)

	# Load decoder
	decoder_path = os.path.join(model_dir, "decoder")
	decoder = AudioDecoder(
	    config_path=os.path.join(decoder_path, "config.yaml"),
	    flow_ckpt_path=os.path.join(decoder_path, "flow.pt"),
	    hift_ckpt_path=os.path.join(decoder_path, "hift.pt"),
	    device=device,
	)

	# 2. Tokenize: the function takes a list of paths; [0] selects the first clip's tokens
	tokens = extract_audio_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device=device)[0]

	# 3. Reconstruct: returns the waveform and its sampling rate
	tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
	```
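
A quick sanity check on the extracted tokens is to compare their count with the clip duration multiplied by the 25 Hz frame rate. The sketch below is a rough estimate under stated assumptions: `expected_token_count` is a hypothetical helper (not part of the repository), it only reads PCM WAV files through the standard `wave` module, and the tokenizer's real output length may differ slightly depending on padding or chunking.

```python
import math
import os
import tempfile
import wave

FRAME_RATE_HZ = 25  # tokens per second, from the Model Details table

def expected_token_count(wav_path: str) -> int:
    """Rough token count for a PCM WAV file: ceil(duration in seconds * 25)."""
    with wave.open(wav_path, "rb") as wf:
        duration_s = wf.getnframes() / wf.getframerate()
    return math.ceil(duration_s * FRAME_RATE_HZ)

# Demo: write 2 seconds of 16 kHz mono silence, then estimate its token count.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit PCM
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 32000)

print(expected_token_count(demo_path))  # 50
```

With real audio, the estimate can be compared against `len(tokens)` from the snippet above.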

	## 📊 Performance

	Universal Audio Tokenizer learns discriminative representations for diverse audio events, and achieves strong performance on speech reconstruction, downstream audio understanding, and TTS synthesis tasks.

	### Latent Space Disentanglement

	We use high-dimensional token histogram vectors for cluster analysis. The results (Silhouette Score and Cluster Purity) show that our model effectively encodes general audio, with clearer cluster separation in the latent space.

	| Model | ESC-10 Sil. (↑) | ESC-10 Purity (↑) | ESC-50 Sil. (↑) | ESC-50 Purity (↑) |
	|:---|:---:|:---:|:---:|:---:|
	| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | -0.030 | 0.450 | -0.108 | 0.215 |
	| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | -0.182 | 0.373 | -0.304 | 0.133 |
	| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | -0.016 | 0.413 | -0.100 | 0.216 |
	| [StableToken](https://github.com/Tencent/StableToken) | -0.035 | 0.468 | -0.096 | 0.174 |
	| Ours | 0.091 | 0.730 | 0.023 | 0.390 |
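
The token-histogram representation and the Cluster Purity metric used in this analysis can be sketched as follows. This is an illustration under assumptions, not the evaluation code: the helper names are ours, the histogram is normalized by clip length (the exact normalization used for the table is not stated here), and the clustering step that produces `cluster_ids` is left out.

```python
from collections import Counter

def token_histogram(tokens, codebook_size):
    """Normalized count of each token ID in one clip (a bag-of-tokens vector)."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for an empty clip
    return [counts.get(i, 0) / total for i in range(codebook_size)]

def cluster_purity(cluster_ids, labels):
    """Fraction of samples whose label is the majority label of their cluster."""
    per_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(labels)

# Toy example with a hypothetical 4-entry codebook.
print(token_histogram([0, 0, 3, 1], codebook_size=4))  # [0.5, 0.25, 0.0, 0.25]

# Two clusters; cluster 0 contains one sample from a different class.
print(cluster_purity([0, 0, 0, 1, 1], ["dog", "dog", "rain", "rain", "rain"]))  # 0.8
```

For the real model, `codebook_size` would be 8,192 and each clip's `tokens` would come from the tokenizer.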

	### High-Quality Speech Reconstruction

	Universal Audio Tokenizer achieves high-quality speech reconstruction with a compact single-codebook design. Among the compared tokenizers, it obtains the lowest Word Error Rate (WER) on all four test sets and the highest Mean Opinion Score (MOS) on three of them; on SEED-en, GLM-4-Voice-Tokenizer scores marginally higher (4.16 vs. 4.13).

	| Model | Frame<br>Rate | BPS | WER (↓)<br>LS-clean | WER (↓)<br>LS-other | WER (↓)<br>SEED-en | WER (↓)<br>SEED-zh | MOS (↑)<br>LS-clean | MOS (↑)<br>LS-other | MOS (↑)<br>SEED-en | MOS (↑)<br>SEED-zh |
	|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
	| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | 75 Hz | 900 | 5.07 | 13.09 | 5.60 | 4.02 | 3.37 | 3.09 | 3.01 | 3.13 |
	| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5 Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | 3.99 | 4.16 | 4.10 |
	| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25 Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
	| [StableToken](https://github.com/Tencent/StableToken) | 25 Hz | 325 | 3.84 | 7.99 | 3.44 | 2.62 | 4.09 | 3.83 | 4.01 | 4.18 |
	| Ours | 25 Hz | 325 | 3.47 | 6.79 | 2.55 | 1.90 | 4.19 | 4.18 | 4.13 | 4.25 |
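
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the recognized transcript of the reconstructed audio, divided by the number of reference words. The sketch below shows only that formula; the ASR model and text normalization behind the table are not specified in this card, and `word_error_rate` is an illustrative helper rather than the evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(100 * wer, 2))  # 33.33
```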

	### Superior Downstream Audio-LLM Performance

	When integrated with the Qwen2.5 LLM backbone, our Universal Audio Tokenizer yields superior performance on a wide range of downstream audio understanding benchmarks and controllable TTS synthesis tasks.

	#### Audio Understanding

	Accuracy (%) on audio understanding benchmarks. Values in parentheses are the absolute gain over the best baseline in that column:

	| Tokenizer | MMAU<br>(Speech) | MMAU<br>(Sound) | MMAU<br>(Music) | MMAU<br>(Overall) | MMAR<br>(Speech) | MMAR<br>(Sound) | MMAR<br>(Music) | MMAR<br>(Overall) | MMSU<br>(Perception) | MMSU<br>(Reasoning) | MMSU<br>(Overall) |
	|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
	| [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) | 36.94 | 60.36 | 57.78 | 51.70 | 39.80 | 31.52 | 29.61 | 36.30 | 32.83 | 45.37 | 38.90 |
	| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 39.94 | 61.56 | 62.57 | 54.70 | 41.50 | 35.76 | 30.58 | 38.10 | 27.44 | 45.83 | 36.34 |
	| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 43.24 | 60.06 | 62.28 | 55.20 | 39.46 | 40.00 | 36.89 | 40.10 | 32.40 | 47.64 | 39.78 |
	| [StableToken](https://github.com/Tencent/StableToken) | 45.05 | 58.56 | 55.99 | 53.20 | 42.18 | 39.39 | 31.07 | 39.10 | 31.98 | 49.71 | 40.56 |
	| Ours | 45.05 | 70.27 | 67.96 | 61.10 (+5.90) | 45.24 | 43.64 | 40.29 | 45.80 (+5.70) | 35.54 | 52.07 | 43.54 (+2.98) |
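
The parenthesized gains in the last row can be reproduced from the Overall columns as the difference between our score and the best baseline score. A small check, with the numbers copied from the table (the dictionary layout and the `gain_over_best` helper are ours for illustration):

```python
# Overall accuracies copied from the table above.
baselines = {
    "WavTokenizer": {"MMAU": 51.70, "MMAR": 36.30, "MMSU": 38.90},
    "CosyVoice2": {"MMAU": 54.70, "MMAR": 38.10, "MMSU": 36.34},
    "GLM-4-Voice-Tokenizer": {"MMAU": 55.20, "MMAR": 40.10, "MMSU": 39.78},
    "StableToken": {"MMAU": 53.20, "MMAR": 39.10, "MMSU": 40.56},
}
ours = {"MMAU": 61.10, "MMAR": 45.80, "MMSU": 43.54}

def gain_over_best(benchmark: str) -> float:
    """Absolute gain of our Overall score over the strongest baseline."""
    best = max(scores[benchmark] for scores in baselines.values())
    return round(ours[benchmark] - best, 2)

for name in ours:
    print(name, gain_over_best(name))
```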

	#### Controllable TTS Synthesis

	Results on SEED-TTS, measured by speaker similarity (SIM), word error rate (WER), and mean opinion score (MOS). Each cell lists three values; the third is the average of the first two.

	| Tokenizer | SIM (↑) | WER (↓) | MOS (↑) |
	|:---|:---:|:---:|:---:|
	| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | .758 \| .762 \| .760 | 2.71 \| 1.39 \| 2.05 | 3.75 \| 3.37 \| 3.56 |
	| Ours | .792 \| .742 \| .767 | 1.78 \| 1.29 \| 1.54 | 4.07 \| 3.68 \| 3.88 |

	## Citation

	If you find our code or model useful for your research, please cite:

	```bibtex
	@misc{song2026uniaudiotokenempoweringsemanticspeech,
	  title={UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception},
	  author={Yuhan Song and Linhao Zhang and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou},
	  year={2026},
	  eprint={2605.31521},
	  archivePrefix={arXiv},
	  primaryClass={cs.CL},
	  url={https://arxiv.org/abs/2605.31521},
	}
	```

	## License

	This project is licensed under the [License Term of Universal_Audio_Tokenizer](LICENSE).