DiscreteSpeech/DSTK

gooorillax committed on Sep 24, 2025

Commit fbe77ca · 1 Parent(s): 6aff4ff

fix typos

Files changed (1)

README.md +4 -5

@@ -25,10 +25,11 @@ V1.0
 This release of DSTK includes three modules：
 1. Semantic Tokenzier
    - Encode the semantic information of speech into discrete speech tokens.
-   - frame rate: 25Hz; codebook size: 4096，supports both Chinese and English
+   - frame rate: 25Hz; codebook size: 4096;
+   - Support both Chinese and English
 2. Semantic Detokenizer
    - Decode the discrete speech tokens into audible speech waveforms to reconstruct the speech
-   - Supports both Chinese and English
+   - Support both Chinese and English
 3. Text2token (T2U)
    - Convert text content into speech tokens
@@ -40,7 +41,7 @@ As shown in the figure below, the 3 module could form a pipeline for TTS task.
 As shown in figure below, the tokenizer and detokenizer could also form a pipeline for speech reconstruction task.
 <p align="center"><img src="figs/reconstruction.jpg" width="1200"></p>
-These pipelines achieved top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset, with less parameters and less supervised data for training:
+These pipelines achieved top-tier performance on TTS and speech reconstruction on the seed-tts-eval dataset, with less parameters and much less supervised data for training:
 <p align="center"><img src="figs/eval1.jpg" width="1200"></p>
 <p align="center"><img src="figs/eval2.jpg" width="1200"></p>
@@ -77,8 +78,6 @@ sh thirdparty/G2P/patch_for_deps.sh
 ## Usage:
 ### Pipelines
 ```python
 import sys
 import soundfile as sf