nielsr HF Staff committed on
Commit 3b704cb · verified · 1 Parent(s): a643119

Enhance model card for XY-Tokenizer with metadata and content


This PR enhances the model card for the XY-Tokenizer model. It adds the `pipeline_tag: audio-to-audio` to the metadata, improving discoverability on the Hub.

The content now includes comprehensive information ported from the project's GitHub README, such as an overview, highlights, installation instructions, usage examples, and citation details. Links to the official paper on Hugging Face and the GitHub repository are also provided for easy access to relevant resources. Placeholder links for demos/blogs have been updated to point to the paper URL for consolidated information.

Files changed (1)
  1. README.md +100 -3
README.md CHANGED
@@ -1,3 +1,100 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: audio-to-audio
+ ---
+
+ # XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs
+
+ This repository contains the model presented in the paper [XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs](https://huggingface.co/papers/2506.23325).
+
+ The official code is available at [https://github.com/gyt1145028706/XY-Tokenizer](https://github.com/gyt1145028706/XY-Tokenizer).
+
+ ## Overview 🔍
+
+ **XY-Tokenizer** is a novel speech codec designed to bridge the gap between speech signals and large language models by simultaneously **modeling both semantic and acoustic information**. It operates at a bitrate of **1 kbps** (1000 bps), using **8-layer Residual Vector Quantization (RVQ8)** at a **12.5 Hz** frame rate.
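As a quick sanity check on the figures above (my arithmetic, not from the paper): 1 kbps at 12.5 Hz leaves 80 bits per frame, or 10 bits per RVQ layer, which would correspond to 1024-entry codebooks. The codebook size is inferred here, not stated above.

```python
# Sanity-check the codec's stated numbers.
# The codebook size is inferred from the bitrate, not given in the text.
frame_rate_hz = 12.5   # frames per second
num_rvq_layers = 8     # 8-layer residual vector quantization (RVQ8)
bitrate_bps = 1000     # 1 kbps

bits_per_frame = bitrate_bps / frame_rate_hz          # 80 bits per frame
bits_per_quantizer = bits_per_frame / num_rvq_layers  # 10 bits per codebook index
codebook_size = 2 ** int(bits_per_quantizer)          # 1024 entries (inferred)

print(bits_per_frame, bits_per_quantizer, codebook_size)  # 80.0 10.0 1024
```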
+
+ At this ultra-low bitrate, **XY-Tokenizer** achieves performance comparable to state-of-the-art speech codecs that specialize in a single aspect, either semantic or acoustic, while performing strongly on both. For detailed information about the model and demos, please refer to our [paper](https://huggingface.co/papers/2506.23325).
+
+ ## Highlights ✨
+
+ - **Low frame rate, low bitrate with high fidelity and text alignment**: Achieves strong semantic alignment and acoustic quality at 12.5 Hz and 1 kbps.
+ - **Multilingual training on the full Emilia dataset**: Trained on a large-scale multilingual dataset, supporting robust performance across diverse languages.
+ - **Designed for Speech LLMs**: Can be used for zero-shot TTS, dialogue TTS (e.g., [MOSS-TTSD](https://github.com/OpenMOSS/MOSS-TTSD)), and speech large language models.
+
+ <div align="center">
+ <p>
+     <img src="https://huggingface.co/fdugyt/XY_Tokenizer/resolve/main/assets/XY-Tokenizer-Architecture.png" alt="XY-Tokenizer" width="1000">
+ </p>
+ </div>
+
+ ## News 📢
+
+ - **[2025-06-28]** We released the code and checkpoints of XY-Tokenizer. See our [paper](https://huggingface.co/papers/2506.23325) for details and demos!
+
+ ## Installation 🛠️
+
+ To use XY-Tokenizer, install the required dependencies. The steps below create a conda environment and install the Python packages with pip.
+
+ ### Using conda
+
+ ```bash
+ # Clone the repository
+ git clone git@github.com:gyt1145028706/XY-Tokenizer.git && cd XY-Tokenizer
+
+ # Create and activate a conda environment
+ conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer
+
+ # Install dependencies
+ pip install -r requirements.txt
+ ```
+
+ ## Available Models 🗂️
+
+ | Model Name | Hugging Face | Training Data |
+ |:----------:|:------------:|:-------------:|
+ | XY-Tokenizer | [🤗](https://huggingface.co/fdugyt/XY_Tokenizer) | Emilia |
+ | XY-Tokenizer-TTSD-V0 (used in [MOSS-TTSD](https://github.com/OpenMOSS/MOSS-TTSD)) | [🤗](https://huggingface.co/fnlp/XY_Tokenizer_TTSD_V0/) | Emilia + Internal Data (containing general audio) |
+
+ ## Usage 🚀
+
+ ### Download XY Tokenizer
+
+ Download the XY Tokenizer model weights from the [XY_Tokenizer Hugging Face repository](https://huggingface.co/fdugyt/XY_Tokenizer):
+
+ ```bash
+ mkdir -p ./weights && huggingface-cli download fdugyt/XY_Tokenizer xy_tokenizer.ckpt --local-dir ./weights/
+ ```
+
+ ### Local Inference
+
+ First, add this repository to the Python path:
+
+ ```bash
+ export PYTHONPATH=$PYTHONPATH:./
+ ```
+
+ Then tokenize audio into speech tokens and reconstruct audio from those tokens by running:
+
+ ```bash
+ python inference.py
+ ```
+
+ The reconstructed audio files will be written to the `output_wavs/` directory.
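When budgeting inputs for a downstream speech LLM, the frame rate and RVQ depth stated above determine how many discrete tokens a clip produces. A back-of-the-envelope sketch (my arithmetic from the numbers in the Overview, not from the repository's code):

```python
# Illustrative token-budget estimate using the codec's stated numbers
# (12.5 Hz frame rate, 8 RVQ codebooks); not part of the official API.
frame_rate_hz = 12.5
num_rvq_layers = 8

def tokens_for_clip(duration_s: float) -> tuple[int, int]:
    """Return (frames, total codebook indices) for a clip of the given length."""
    frames = int(duration_s * frame_rate_hz)
    return frames, frames * num_rvq_layers

print(tokens_for_clip(10.0))  # a 10-second clip -> (125, 1000)
```

So a 10-second utterance yields only 125 frames (1000 codebook indices in total), which is what makes the 12.5 Hz rate attractive for language-model contexts.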
+
+ ## License 📜
+
+ XY-Tokenizer is released under the Apache 2.0 license.
+
+ ## Citation 📚
+
+ ```bibtex
+ @misc{gong2025xytokenizermitigatingsemanticacousticconflict,
+       title={XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs},
+       author={Yitian Gong and Luozhijie Jin and Ruifan Deng and Dong Zhang and Xin Zhang and Qinyuan Cheng and Zhaoye Fei and Shimin Li and Xipeng Qiu},
+       year={2025},
+       eprint={2506.23325},
+       archivePrefix={arXiv},
+       primaryClass={cs.SD},
+       url={https://arxiv.org/abs/2506.23325},
+ }
+ ```