---
language:
- en
- zh
license: other
license_name: license-term-of-stabletoken
tags:
- speech tokenizer
pipeline_tag: audio-to-audio
---

# StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)

**StableToken** is a noise-robust semantic speech tokenizer that performs discrete speech representation learning, achieving state-of-the-art stability in noisy environments.

[Paper](https://huggingface.co/papers/2509.22220) | [GitHub](https://github.com/Tencent/StableToken)

For code and more detailed information, please refer to the corresponding [GitHub repository](https://github.com/Tencent/StableToken).

## Model Details

| Attribute | Value |
|:----------|:------|
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| BPS (Bits Per Second) | 325 |
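
The BPS figure follows from the other two attributes: one token is emitted per frame, and indexing a codebook of 8,192 entries takes log2(8192) = 13 bits, so 25 Hz × 13 bits = 325 bits per second. The sketch below reproduces that arithmetic. The helper name `bits_per_second` is hypothetical and not part of the StableToken code, and rounding the per-token bit count up to a whole number is an assumption (it is what makes a non-power-of-two codebook such as 6,561 also come out at 325).

```python
def bits_per_second(frame_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook tokenizer that emits one token per frame."""
    if codebook_size < 2:
        raise ValueError("codebook_size must be at least 2")
    # Whole bits needed to index any codebook entry, i.e. ceil(log2(size)),
    # computed with integer arithmetic to avoid floating-point rounding.
    bits_per_token = (codebook_size - 1).bit_length()
    return float(frame_rate_hz) * bits_per_token


print(bits_per_second(25, 8192))  # 325.0
```

The same formula gives 175 for a 12.5 Hz tokenizer with 16,384 entries and 300 for a 25 Hz tokenizer with 4,096 entries, which matches the BPS column reported for the baselines in the reconstruction table.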

## Quick Start

To use StableToken, please clone the official repository and install dependencies.

### Installation

```bash
git clone --recursive https://github.com/Tencent/StableToken.git
cd StableToken && pip install -r requirements.txt
```

### Inference

```python
import os
from huggingface_hub import snapshot_download
from transformers import WhisperFeatureExtractor
from src.model.modeling_whisper import WhisperLFQEncoder
from src.utils.flow_inference import AudioDecoder
from src.utils.utils import extract_speech_token, speech_token_to_wav

# 1. Download & Load Models
model_dir = snapshot_download("tencent/StableToken")

# Load Tokenizer
tokenizer = WhisperLFQEncoder.from_pretrained(os.path.join(model_dir, "tokenizer")).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(os.path.join(model_dir, "tokenizer"))

# Load Decoder
decoder = AudioDecoder(
    config_path=os.path.join(model_dir, "decoder", "config.yaml"),
    flow_ckpt_path=os.path.join(model_dir, "decoder", "flow.pt"),
    hift_ckpt_path=os.path.join(model_dir, "decoder", "hift.pt"),
    device="cuda"
)

# 2. Tokenize
tokens = extract_speech_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]

# 3. Reconstruct
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```
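
Before passing tokens to a downstream model, it can help to sanity-check them against the numbers in Model Details. The sketch below is an illustration only: `summarize_tokens` is a hypothetical helper, not part of the repository, and it assumes the tokenizer output can be converted to a flat list of integer IDs in the range 0 to 8,191 (for example with `.tolist()` if it is a tensor), which this card does not state.

```python
from typing import Sequence


def summarize_tokens(
    token_ids: Sequence[int],
    codebook_size: int = 8192,    # "Codebook Size" from Model Details
    frame_rate_hz: float = 25.0,  # "Frame Rate" from Model Details
) -> dict:
    """Check the ID range and estimate how much audio the tokens cover."""
    # Assumption: valid IDs are zero-based indices into the codebook.
    out_of_range = [t for t in token_ids if not 0 <= t < codebook_size]
    if out_of_range:
        raise ValueError(
            f"{len(out_of_range)} token IDs fall outside [0, {codebook_size})"
        )
    return {
        "num_tokens": len(token_ids),
        "approx_duration_s": len(token_ids) / frame_rate_hz,
        "distinct_tokens": len(set(token_ids)),
    }


# Toy input standing in for real tokenizer output: 50 tokens, about 2 s at 25 Hz.
print(summarize_tokens([17, 4095, 8191, 0, 17] * 10))
```

At 25 Hz, a token count far from 25 times the clip length in seconds would suggest that the audio was padded, truncated, or resampled somewhere along the way.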

## Performance

StableToken achieves **60% lower UED** (Unit Edit Distance) than the best existing supervised semantic tokenizers.

### Noise Robustness (UED ↓)

| Model | Frame Rate | Codebook Size | UED (%, ↓) |
|:---|:---:|:---:|:---:|
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 16,384 | 31.10 |
| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 4,096 | 26.17 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 6,561 | 38.66 |
| **StableToken** | 25Hz | 8,192 | **10.17** |
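
UED compares the token sequence obtained from a clean recording with the one obtained from a noise-perturbed copy of the same recording. The following is a minimal sketch, assuming UED is the Levenshtein distance between the two sequences divided by the length of the clean sequence and expressed as a percentage. This card does not spell out the normalization, so treat that choice and the function names as assumptions rather than the authors' evaluation code.

```python
from typing import Sequence


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def unit_edit_distance(clean: Sequence[int], noisy: Sequence[int]) -> float:
    """UED in percent, normalized by the clean sequence length (assumption)."""
    if not clean:
        raise ValueError("clean sequence must be non-empty")
    return 100.0 * edit_distance(clean, noisy) / len(clean)


# Made-up token IDs: the noisy copy has one substitution and one deletion.
clean_tokens = [5, 5, 912, 44, 44, 7031, 18, 18, 3, 260]
noisy_tokens = [5, 5, 912, 44, 61, 7031, 18, 3, 260]
print(unit_edit_distance(clean_tokens, noisy_tokens))  # 20.0
```

Read against the table, StableToken's 10.17 versus the next-best 26.17 is a relative reduction of about 61%, consistent with the 60% headline figure.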

### Reconstruction Quality

Measurements on LibriSpeech (LS) and SEED benchmarks.

| Model | Frame<br>Rate | BPS | WER (↓)<br>LS-clean | WER (↓)<br>LS-other | WER (↓)<br>SEED-en | WER (↓)<br>SEED-zh | MOS (↑)<br>LS-clean | MOS (↑)<br>LS-other | MOS (↑)<br>SEED-en | MOS (↑)<br>SEED-zh |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | **3.99** | **4.16** | 4.10 |
| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 300 | 5.78 | 13.38 | 5.91 | 4.26 | 3.40 | 3.31 | 3.40 | 3.31 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| **StableToken** | 25Hz | 325 | **3.84** | **7.99** | **3.44** | **2.62** | **4.09** | 3.83 | 4.01 | **4.18** |

## Citation

```bibtex
@article{song2025stabletoken,
  title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs},
  author={Song, Yuhan and Zhang, Linhao and Wu, Chuhan and Liu, Aiwei and Jia, Wei and Wang, Houfeng and Zhou, Xiao},
  journal={arXiv preprint arXiv:2509.22220},
  year={2025}
}
```

[...] bitwise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks. Our code and model are publicly available at this https URL.

# Current model card

The README of the model repository currently looks like this:

## Metadata
```yaml
language:
- en
- zh
license: other
license_name: license-term-of-stabletoken
tags:
- speech tokenizer
```

## Content
# StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs (ICLR 2026)

**StableToken** is a noise-robust semantic speech tokenizer that performs discrete speech representation learning, achieving state-of-the-art stability in noisy environments.

[Paper](https://arxiv.org/abs/2509.22220) | [GitHub](https://github.com/Tencent/StableToken)

For code and more detailed information, please refer to the corresponding [GitHub repository](https://github.com/Tencent/StableToken).

## Model Details

| Attribute | Value |
|:----------|:------|
| Frame Rate | 25 Hz |
| Codebook Size | 8,192 |
| BPS (Bits Per Second) | 325 |

## Quick Start

To use StableToken, please clone the official repository and install dependencies.

### Installation

```bash
git clone --recursive https://github.com/Tencent/StableToken.git
cd StableToken && pip install -r requirements.txt
```

### Inference

```python
import os
from huggingface_hub import snapshot_download
from transformers import WhisperFeatureExtractor
from src.model.modeling_whisper import WhisperLFQEncoder
from src.utils.flow_inference import AudioDecoder
from src.utils.utils import extract_speech_token, speech_token_to_wav

# 1. Download & Load Models
model_dir = snapshot_download("tencent/StableToken")

# Load Tokenizer
tokenizer = WhisperLFQEncoder.from_pretrained(os.path.join(model_dir, "tokenizer")).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(os.path.join(model_dir, "tokenizer"))

# Load Decoder
decoder = AudioDecoder(
    config_path=os.path.join(model_dir, "decoder", "config.yaml"),
    flow_ckpt_path=os.path.join(model_dir, "decoder", "flow.pt"),
    hift_ckpt_path=os.path.join(model_dir, "decoder", "hift.pt"),
    device="cuda"
)

# 2. Tokenize
tokens = extract_speech_token(tokenizer, feature_extractor, ["/path/to/audio.wav"], device="cuda")[0]

# 3. Reconstruct
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```

## Performance

StableToken achieves **60% lower UED** (Unit Edit Distance) than the best existing supervised semantic tokenizers.

### Noise Robustness (UED ↓)

| Model | Frame Rate | Codebook Size | UED (%, ↓) |
|:---|:---:|:---:|:---:|
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 16,384 | 31.10 |
| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 4,096 | 26.17 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 6,561 | 38.66 |
| **StableToken** | 25Hz | 8,192 | **10.17** |

### Reconstruction Quality

Measurements on LibriSpeech (LS) and SEED benchmarks.

| Model | Frame<br>Rate | BPS | WER (↓)<br>LS-clean | WER (↓)<br>LS-other | WER (↓)<br>SEED-en | WER (↓)<br>SEED-zh | MOS (↑)<br>LS-clean | MOS (↑)<br>LS-other | MOS (↑)<br>SEED-en | MOS (↑)<br>SEED-zh |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [GLM-4-Voice-Tokenizer](https://github.com/zai-org/GLM-4-Voice) | 12.5Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | **3.99** | **4.16** | 4.10 |
| [S3 Tokenizer](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 300 | 5.78 | 13.38 | 5.91 | 4.26 | 3.40 | 3.31 | 3.40 | 3.31 |
| [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice) | 25Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| **StableToken** | 25Hz | 325 | **3.84** | **7.99** | **3.44** | **2.62** | **4.09** | 3.83 | 4.01 | **4.18** |

## Citation

```bibtex
@article{song2025stabletoken,
  title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs},
  author={Song, Yuhan and Zhang, Linhao and Wu, Chuhan and Liu, Aiwei and Jia, Wei and Wang, Houfeng and Zhou, Xiao},
  journal={arXiv preprint arXiv:2509.22220},
  year={2025}
}
```

## License

This project is licensed under the [License Term of StableToken](LICENSE).