nielsr HF Staff committed on
Commit 3b704cb · verified · 1 Parent(s): a643119

Enhance model card for XY-Tokenizer with metadata and content


This PR enhances the model card for the XY-Tokenizer model. It adds the `pipeline_tag: audio-to-audio` to the metadata, improving discoverability on the Hub.

The content now includes comprehensive information ported from the project's GitHub README, such as an overview, highlights, installation instructions, usage examples, and citation details. Links to the official paper on Hugging Face and the GitHub repository are also provided for easy access to relevant resources. Placeholder links for demos/blogs have been updated to point to the paper URL for consolidated information.

Files changed (1)
  1. README.md +100 -3
README.md CHANGED
@@ -1,3 +1,100 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: audio-to-audio
+ ---
+
+ # XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs
+
+ This repository contains the model presented in the paper [XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs](https://huggingface.co/papers/2506.23325).
+
+ The official code is available at [https://github.com/gyt1145028706/XY-Tokenizer](https://github.com/gyt1145028706/XY-Tokenizer).
+
+ ## Overview 🔍
+
+ **XY-Tokenizer** is a novel speech codec designed to bridge the gap between speech signals and large language models by simultaneously **modeling both semantic and acoustic information**. It operates at a bitrate of **1 kbps** (1000 bps), using **8-layer Residual Vector Quantization (RVQ8)** at a **12.5 Hz** frame rate.
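As a quick sanity check on the figures above (my arithmetic, not from the paper): 1 kbps at 12.5 Hz leaves 80 bits per frame, or 10 bits per RVQ layer, which would correspond to 1024-entry codebooks. The codebook size is inferred here, not stated above.

```python
# Sanity-check the codec's stated numbers.
# The codebook size is inferred from the bitrate, not given in the text.
frame_rate_hz = 12.5   # frames per second
num_rvq_layers = 8     # 8-layer residual vector quantization (RVQ8)
bitrate_bps = 1000     # 1 kbps

bits_per_frame = bitrate_bps / frame_rate_hz          # 80 bits per frame
bits_per_quantizer = bits_per_frame / num_rvq_layers  # 10 bits per codebook index
codebook_size = 2 ** int(bits_per_quantizer)          # 1024 entries (inferred)

print(bits_per_frame, bits_per_quantizer, codebook_size)  # 80.0 10.0 1024
```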
+
+ At this ultra-low bitrate, **XY-Tokenizer** achieves performance comparable to state-of-the-art speech codecs that specialize in a single aspect, either semantic or acoustic, while performing strongly on both. For detailed information about the model and demos, please refer to our [paper](https://huggingface.co/papers/2506.23325).
+
+ ## Highlights ✨
+
+ - **Low frame rate, low bitrate with high fidelity and text alignment**: Achieves strong semantic alignment and acoustic quality at 12.5 Hz and 1 kbps.
+ - **Multilingual training on the full Emilia dataset**: Trained on a large-scale multilingual dataset, supporting robust performance across diverse languages.
+ - **Designed for Speech LLMs**: Can be used for zero-shot TTS, dialogue TTS (e.g., [MOSS-TTSD](https://github.com/OpenMOSS/MOSS-TTSD)), and speech large language models.
+
+ <div align="center">
+ <p>
+     <img src="https://huggingface.co/fdugyt/XY_Tokenizer/resolve/main/assets/XY-Tokenizer-Architecture.png" alt="XY-Tokenizer" width="1000">
+ </p>
+ </div>
+
+ ## News 📢
+
+ - **[2025-06-28]** We released the code and checkpoints of XY-Tokenizer. See our [paper](https://huggingface.co/papers/2506.23325) for details and demos!
+
+ ## Installation 🛠️
+
+ To use XY-Tokenizer, install the required dependencies. The steps below create a conda environment and install the Python packages with pip.
+
+ ### Using conda
+
+ ```bash
+ # Clone the repository
+ git clone git@github.com:gyt1145028706/XY-Tokenizer.git && cd XY-Tokenizer
+
+ # Create and activate a conda environment
+ conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer
+
+ # Install dependencies
+ pip install -r requirements.txt
+ ```
+
+ ## Available Models 🗂️
+
+ | Model Name | Hugging Face | Training Data |
+ |:----------:|:------------:|:-------------:|
+ | XY-Tokenizer | [🤗](https://huggingface.co/fdugyt/XY_Tokenizer) | Emilia |
+ | XY-Tokenizer-TTSD-V0 (used in [MOSS-TTSD](https://github.com/OpenMOSS/MOSS-TTSD)) | [🤗](https://huggingface.co/fnlp/XY_Tokenizer_TTSD_V0/) | Emilia + Internal Data (containing general audio) |
+
+ ## Usage 🚀
+
+ ### Download XY Tokenizer
+
+ Download the XY Tokenizer model weights from the [XY_Tokenizer Hugging Face repository](https://huggingface.co/fdugyt/XY_Tokenizer):
+
+ ```bash
+ mkdir -p ./weights && huggingface-cli download fdugyt/XY_Tokenizer xy_tokenizer.ckpt --local-dir ./weights/
+ ```
+
+ ### Local Inference
+
+ First, add this repository to the Python path:
+
+ ```bash
+ export PYTHONPATH=$PYTHONPATH:./
+ ```
+
+ Then tokenize audio into speech tokens and reconstruct audio from those tokens by running:
+
+ ```bash
+ python inference.py
+ ```
+
+ The reconstructed audio files will be written to the `output_wavs/` directory.
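When budgeting inputs for a downstream speech LLM, the frame rate and RVQ depth stated above determine how many discrete tokens a clip produces. A back-of-the-envelope sketch (my arithmetic from the numbers in the Overview, not from the repository's code):

```python
# Illustrative token-budget estimate using the codec's stated numbers
# (12.5 Hz frame rate, 8 RVQ codebooks); not part of the official API.
frame_rate_hz = 12.5
num_rvq_layers = 8

def tokens_for_clip(duration_s: float) -> tuple[int, int]:
    """Return (frames, total codebook indices) for a clip of the given length."""
    frames = int(duration_s * frame_rate_hz)
    return frames, frames * num_rvq_layers

print(tokens_for_clip(10.0))  # a 10-second clip -> (125, 1000)
```

So a 10-second utterance yields only 125 frames (1000 codebook indices in total), which is what makes the 12.5 Hz rate attractive for language-model contexts.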
+
+ ## License 📜
+
+ XY-Tokenizer is released under the Apache 2.0 license.
+
+ ## Citation 📚
+
+ ```bibtex
+ @misc{gong2025xytokenizermitigatingsemanticacousticconflict,
+       title={XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs},
+       author={Yitian Gong and Luozhijie Jin and Ruifan Deng and Dong Zhang and Xin Zhang and Qinyuan Cheng and Zhaoye Fei and Shimin Li and Xipeng Qiu},
+       year={2025},
+       eprint={2506.23325},
+       archivePrefix={arXiv},
+       primaryClass={cs.SD},
+       url={https://arxiv.org/abs/2506.23325},
+ }
+ ```