license: apache-2.0
pipeline_tag: audio-to-audio
XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs
This repository contains the model presented in the paper XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs.
The official code is available at https://github.com/gyt1145028706/XY-Tokenizer.
Overview
XY-Tokenizer is a novel speech codec designed to bridge the gap between speech signals and large language models by simultaneously modeling both semantic and acoustic information. It operates at a bitrate of 1 kbps (1000 bps), using 8-layer Residual Vector Quantization (RVQ8) at a 12.5 Hz frame rate.
At this ultra-low bitrate, XY-Tokenizer achieves performance comparable to state-of-the-art speech codecs that focus on only one aspect, either semantic or acoustic, while performing strongly on both. For detailed information about the model and demos, please refer to our paper.
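The figures quoted above (1 kbps, 8 RVQ layers, 12.5 Hz) can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes all eight layers use codebooks of the same size, which this page does not state, and the helper `tokens_for_duration` is a hypothetical name that rounds partial frames up, which may differ from how the model pads audio.

```python
import math

FRAME_RATE_HZ = 12.5   # frames per second (stated above)
NUM_RVQ_LAYERS = 8     # RVQ8 (stated above)
BITRATE_BPS = 1000     # 1 kbps (stated above)

# Bits each codebook index must carry so that
# frame rate * layers * bits per index equals the stated bitrate.
bits_per_code = BITRATE_BPS / (FRAME_RATE_HZ * NUM_RVQ_LAYERS)

# ASSUMPTION: every layer uses the same codebook size.
implied_codebook_size = 2 ** int(bits_per_code)


def tokens_for_duration(seconds: float) -> int:
    """Hypothetical helper: total RVQ tokens for a clip, rounding frames up."""
    frames = math.ceil(seconds * FRAME_RATE_HZ)
    return frames * NUM_RVQ_LAYERS


print(bits_per_code)            # 10.0
print(implied_codebook_size)    # 1024
print(tokens_for_duration(10))  # 1000
```

Under these assumptions, ten seconds of speech corresponds to 125 frames and 1000 tokens in total, which is what makes such a low frame rate attractive for language-model context lengths.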
Highlights
- Low frame rate, low bitrate with high fidelity and text alignment: Achieves strong semantic alignment and acoustic quality at 12.5 Hz and 1 kbps.
- Multilingual training on the full Emilia dataset: Trained on a large-scale multilingual dataset, supporting robust performance across diverse languages.
- Designed for Speech LLMs: Can be used for zero-shot TTS, dialogue TTS (e.g., MOSS-TTSD), and speech large language models.
News
- [2025-06-28] We released the code and checkpoints of XY-Tokenizer. See our paper for details and demos!
Installation
To use XY-Tokenizer, you need to install the required dependencies. You can use either pip or conda to set up your environment.
Using conda
# Clone repository
git clone git@github.com:gyt1145028706/XY-Tokenizer.git && cd XY-Tokenizer
# Create and activate conda environment
conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer
# Install dependencies
pip install -r requirements.txt
Available Models
| Model Name | Hugging Face | Training Data |
|---|---|---|
| XY-Tokenizer | 🤗 | Emilia |
| XY-Tokenizer-TTSD-V0 (used in MOSS-TTSD) | 🤗 | Emilia + Internal Data (containing general audio) |
Usage
Download XY-Tokenizer
Download the XY-Tokenizer model weights from the XY_Tokenizer Hugging Face repository (fdugyt/XY_Tokenizer):
mkdir -p ./weights && huggingface-cli download fdugyt/XY_Tokenizer xy_tokenizer.ckpt --local-dir ./weights/
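If you prefer to fetch the checkpoint from Python, the same download can likely be done with `hf_hub_download` from the `huggingface_hub` package (the package that provides `huggingface-cli`). This is a minimal sketch rather than part of the official code: the wrapper name `download_checkpoint` is hypothetical, and the repository and file names are simply copied from the command above.

```python
REPO_ID = "fdugyt/XY_Tokenizer"   # repository named in the command above
FILENAME = "xy_tokenizer.ckpt"    # checkpoint file named in the command above


def download_checkpoint(local_dir: str = "./weights") -> str:
    """Hypothetical wrapper: download the checkpoint and return its local path."""
    # Imported lazily so that defining this function does not require the package.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)
```

Calling `download_checkpoint()` should place the file under `./weights/`, matching the layout produced by the command-line download.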
Local Inference
First, set the Python path to include this repository:
export PYTHONPATH=$PYTHONPATH:./
Then you can tokenize audio to speech tokens and generate reconstructed audio from these tokens by running:
python inference.py
The reconstructed audio files will be available in the output_wavs/ directory.
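To sanity-check the results, a short script along these lines could list the reconstructed files with their sample rate and duration. It is only a sketch under stated assumptions: it assumes the outputs are PCM-encoded `.wav` files, since Python's standard `wave` module rejects other encodings such as 32-bit float, and `summarize_wavs` is a hypothetical helper, not part of the repository.

```python
import wave
from pathlib import Path


def summarize_wavs(directory):
    """Return (file name, sample rate in Hz, duration in seconds) per readable .wav file."""
    root = Path(directory)
    if not root.is_dir():
        return []
    rows = []
    for path in sorted(root.glob("*.wav")):
        try:
            with wave.open(str(path), "rb") as wav:
                rate = wav.getframerate()
                duration = wav.getnframes() / rate
        except wave.Error:
            # ASSUMPTION: non-PCM files cannot be read by `wave`; skip them.
            continue
        rows.append((path.name, rate, duration))
    return rows


# Directory name taken from the text above.
for name, rate, duration in summarize_wavs("output_wavs"):
    print(f"{name}: {rate} Hz, {duration:.2f} s")
```

If a file is missing from the listing, it may use an encoding that `wave` cannot parse, in which case an audio library with broader format support would be needed.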
License
XY-Tokenizer is released under the Apache 2.0 license.
Citation
@misc{gong2025xytokenizermitigatingsemanticacousticconflict,
title={XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs},
author={Yitian Gong and Luozhijie Jin and Ruifan Deng and Dong Zhang and Xin Zhang and Qinyuan Cheng and Zhaoye Fei and Shimin Li and Xipeng Qiu},
year={2025},
eprint={2506.23325},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2506.23325},
}