---
language:
- en
- zh
license: other
license_name: license-term-of-stabletoken
tags:
- speech tokenizer
pipeline_tag: audio-to-audio
---

# StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)

StableToken is a noise-robust semantic speech tokenizer that learns discrete speech representations. It uses a bitwise voting mechanism to form a single, stable token sequence, achieving state-of-the-art token stability in noisy environments.

📄 [Paper](https://huggingface.co/papers/2509.22220) | 💻 [GitHub](https://github.com/Tencent/StableToken)

For code and more detailed information, please refer to the corresponding [GitHub repository](https://github.com/Tencent/StableToken).

## Model Details

| Attribute | Value |
|:----------|:------|
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| BPS (Bits Per Second) | 325 |
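
The bitrate follows from the other two rows. As a minimal sketch, assuming each token is stored with a whole number of bits (so one token costs ⌈log2(codebook size)⌉ bits), multiplying the frame rate by the bits per token gives the 325 BPS figure. The function name `bits_per_second` is illustrative and not part of the StableToken codebase.

```python
def bits_per_second(frame_rate_hz, codebook_size):
    """Bitrate of a single-codebook tokenizer, assuming whole bits per token."""
    # (n - 1).bit_length() equals ceil(log2(n)) for any integer n >= 1.
    bits_per_token = (codebook_size - 1).bit_length()
    return frame_rate_hz * bits_per_token

# StableToken: 25 tokens per second, 8,192-entry codebook (13 bits per token).
print(bits_per_second(25, 8192))  # 325
```

Under the same assumption, a 10-second clip yields about 250 tokens, or 3,250 bits.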

## Quick Start

To use StableToken, clone the official repository and install its dependencies.

### Installation

```bash
git clone --recursive https://github.com/Tencent/StableToken.git
cd StableToken && pip install -r requirements.txt
```

### Inference

```python
# Run from the root of the cloned repository so that the `src` package is importable.
# The snippet places both models on a CUDA device.
import os
from huggingface_hub import snapshot_download
from transformers import WhisperFeatureExtractor
from src.model.modeling_whisper import WhisperLFQEncoder
from src.utils.flow_inference import AudioDecoder
from src.utils.utils import extract_speech_token, speech_token_to_wav

# 1. Download & Load Models
model_dir = snapshot_download("tencent/StableToken")

# Load Tokenizer
tokenizer = WhisperLFQEncoder.from_pretrained(os.path.join(model_dir, "tokenizer")).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(os.path.join(model_dir, "tokenizer"))

# Load Decoder
decoder = AudioDecoder(
    config_path=os.path.join(model_dir, "decoder", "config.yaml"),
    flow_ckpt_path=os.path.join(model_dir, "decoder", "flow.pt"),
    hift_ckpt_path=os.path.join(model_dir, "decoder", "hift.pt"),
    device="cuda",
)

# 2. Tokenize (a list of paths goes in; [0] selects the tokens of the first file)
tokens = extract_speech_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]

# 3. Reconstruct
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```
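
The snippet above ends with the reconstructed waveform in memory. To listen to it, the samples need to be written to disk; the sketch below does this with only the Python standard library. It assumes a flat sequence of float samples in [-1, 1]. The shape and type of `tts_speech` are not documented here, so the conversion shown in the trailing comment is an assumption to check against the repository, and `save_wav` is a hypothetical helper rather than part of StableToken.

```python
import array
import sys
import wave

def save_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    # Clip each sample to the valid range, then scale to signed 16-bit integers.
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())

# Assumed conversion from a tensor to a flat list of floats:
# save_wav("reconstructed.wav", tts_speech.squeeze().cpu().tolist(), sampling_rate)
```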

## Performance

StableToken achieves roughly 60% lower UED (Unit Edit Distance) than the best existing supervised semantic tokenizers: 10.17% versus 26.17% for the S3 Tokenizer, the strongest baseline listed below.

### Noise Robustness (UED ↓)

| Model | Frame Rate | Codebook Size | UED (%, ↓) |
|:---|:---:|:---:|:---:|
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5 Hz | 16,384 | 31.10 |
| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25 Hz | 4,096 | 26.17 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25 Hz | 6,561 | 38.66 |
| StableToken | 25 Hz | 8,192 | 10.17 🏆 |
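
UED compares the token sequence of a clean utterance with that of a noise-perturbed copy of the same utterance. The sketch below assumes a common definition: the Levenshtein distance between the two sequences, divided by the length of the clean sequence and expressed as a percentage. The exact normalization and noise conditions behind the table are defined in the paper, and `unit_edit_distance` is an illustrative name, not a function from the repository.

```python
def unit_edit_distance(clean_tokens, noisy_tokens):
    """Levenshtein distance between token sequences, as a percentage of the clean length."""
    if not clean_tokens:
        raise ValueError("clean_tokens must be non-empty")
    # prev[j] is the edit distance between the clean prefix seen so far and noisy_tokens[:j].
    prev = list(range(len(noisy_tokens) + 1))
    for i, c in enumerate(clean_tokens, start=1):
        curr = [i]
        for j, n in enumerate(noisy_tokens, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a clean token
                curr[j - 1] + 1,         # insert a noisy token
                prev[j - 1] + (c != n),  # substitute, or match at no cost
            ))
        prev = curr
    return 100.0 * prev[-1] / len(clean_tokens)

clean = [12, 7, 7, 301, 45, 9, 88, 88, 5, 14]
noisy = [12, 7, 7, 301, 46, 9, 88, 88, 5, 14]  # one token changed by noise
print(unit_edit_distance(clean, noisy))  # 10.0
```

A lower value means the tokenizer emits nearly the same tokens with and without noise, which is the property the table measures.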

### Reconstruction Quality

Word error rate (WER) and mean opinion score (MOS) of reconstructed audio, measured on the LibriSpeech (LS) and SEED benchmarks.

| Model | Frame<br>Rate | BPS | WER (↓)<br>LS-clean | WER (↓)<br>LS-other | WER (↓)<br>SEED-en | WER (↓)<br>SEED-zh | MOS (↑)<br>LS-clean | MOS (↑)<br>LS-other | MOS (↑)<br>SEED-en | MOS (↑)<br>SEED-zh |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5 Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | 3.99 | 4.16 | 4.10 |
| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25 Hz | 300 | 5.78 | 13.38 | 5.91 | 4.26 | 3.40 | 3.31 | 3.40 | 3.31 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25 Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| StableToken | 25 Hz | 325 | 3.84 | 7.99 | 3.44 | 2.62 | 4.09 | 3.83 | 4.01 | 4.18 |

## Citation

```bibtex
@article{song2025stabletoken,
  title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs},
  author={Song, Yuhan and Zhang, Linhao and Wu, Chuhan and Liu, Aiwei and Jia, Wei and Wang, Houfeng and Zhou, Xiao},
  journal={arXiv preprint arXiv:2509.22220},
  year={2025}
}
```

## License

This project is licensed under the [License Term of StableToken](LICENSE).