## Speech Semantic Tokenizer
As illustrated below, this tokenizer is trained with supervised learning: the phoneme sequences corresponding to the text serve as labels, and the grapheme-to-phoneme (G2P) conversion module is located in `thirdparty/G2P`. The tokenizer was trained on roughly 4,000 hours of Chinese and English speech-text data sampled from open-source datasets, with a 1:1 ratio between the two languages. The speech encoder is a `hubert-large` model trained on about 450K hours of unlabeled speech data with the recipe provided by [fairseq](https://github.com/facebookresearch/fairseq). The decoder, by contrast, is deliberately simple, consisting of only four CNN layers. We believe a simple, weak decoder is the key to training the tokenizer, as it encourages the semantic information to be captured by the tokens themselves rather than by the decoder.
<p align="center"><img src="../../figs/tokenizer.jpg" width="800"></p>
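The decoder described above can be sketched as follows. This is a minimal illustration, not the released code: the layer widths, kernel sizes, and phoneme-vocabulary size are assumptions chosen for the example, and only the overall shape (a small four-layer CNN mapping encoder features to phoneme logits) reflects the text.

```python
# Hypothetical sketch of a weak 4-layer CNN decoder over encoder features.
# feat_dim=1024 matches hubert-large's output dimension; hidden=512 and
# num_phonemes=200 are illustrative assumptions.
import torch
import torch.nn as nn

class CNNPhonemeDecoder(nn.Module):
    def __init__(self, feat_dim=1024, hidden=512, num_phonemes=200):
        super().__init__()
        dims = [feat_dim, hidden, hidden, hidden]
        layers = []
        for i in range(3):
            layers += [
                nn.Conv1d(dims[i], dims[i + 1], kernel_size=3, padding=1),
                nn.ReLU(),
            ]
        # fourth CNN layer projects to phoneme logits
        layers.append(nn.Conv1d(hidden, num_phonemes, kernel_size=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, time, feat_dim) encoder features
        # Conv1d expects (batch, channels, time), so transpose in and out
        return self.net(x.transpose(1, 2)).transpose(1, 2)

feats = torch.randn(2, 50, 1024)          # dummy hubert-style features
logits = CNNPhonemeDecoder()(feats)       # (2, 50, 200) phoneme logits
print(tuple(logits.shape))                # (2, 50, 200)
```

In such a setup, the phoneme-labeled cross-entropy loss on these logits is what supervises the tokenizer; keeping this decoder shallow limits how much of that mapping it can absorb.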
To run this semantic tokenizer on its own, first install the required packages:
```bash
# install requirements for this semantic tokenizer on Ascend 910B
# for GPUs, just remove torch-npu==2.5.1 from the requirements file
pip install -r requirements_npu.txt
```