## Speech Semantic Tokenizer
As illustrated below, this tokenizer is trained with supervised learning: the phoneme sequences corresponding to the text serve as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of Chinese and English speech-text data sampled from open-source datasets, with a 1:1 ratio between the two languages. The speech encoder is a `hubert-large` model trained on about 450K hours of unlabeled speech data with the recipe provided by [fairseq](https://github.com/facebookresearch/fairseq). The decoder, by contrast, is deliberately simple, consisting of only four CNN layers. We believe a simple, weak decoder is the key to training the tokenizer, as it encourages the semantic information to be captured by the tokens themselves rather than by the decoder.
<p align="center"><img src="../../figs/tokenizer.jpg" width="800"></p>
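The decoder described above can be sketched as follows. This is a minimal illustration, not the released code: the layer widths, kernel sizes, and phoneme-vocabulary size are assumptions chosen for the example, and only the overall shape (a small four-layer CNN mapping encoder features to phoneme logits) reflects the text.

```python
# Hypothetical sketch of a weak 4-layer CNN decoder over encoder features.
# feat_dim=1024 matches hubert-large's output dimension; hidden=512 and
# num_phonemes=200 are illustrative assumptions.
import torch
import torch.nn as nn

class CNNPhonemeDecoder(nn.Module):
    def __init__(self, feat_dim=1024, hidden=512, num_phonemes=200):
        super().__init__()
        dims = [feat_dim, hidden, hidden, hidden]
        layers = []
        for i in range(3):
            layers += [
                nn.Conv1d(dims[i], dims[i + 1], kernel_size=3, padding=1),
                nn.ReLU(),
            ]
        # fourth CNN layer projects to phoneme logits
        layers.append(nn.Conv1d(hidden, num_phonemes, kernel_size=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, time, feat_dim) encoder features
        # Conv1d expects (batch, channels, time), so transpose in and out
        return self.net(x.transpose(1, 2)).transpose(1, 2)

feats = torch.randn(2, 50, 1024)          # dummy hubert-style features
logits = CNNPhonemeDecoder()(feats)       # (2, 50, 200) phoneme logits
print(tuple(logits.shape))                # (2, 50, 200)
```

In such a setup, the phoneme-labeled cross-entropy loss on these logits is what supervises the tokenizer; keeping this decoder shallow limits how much of that mapping it can absorb.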
To run this semantic tokenizer on its own, first install the required packages:
```bash
# install requirements for this semantic tokenizer on Ascend 910B
# for GPUs, just remove torch-npu==2.5.1 from the requirements file
pip install -r requirements_npu.txt
```