Initial text-based TTS tokenizer

Files changed (3)

README.md CHANGED

@@ -4,7 +4,7 @@ This tokenizer is designed for text-based Text-to-Speech models.
 ## Vocabulary Structure
-- **Total Size**: 13,016
+- **Total Size**: 20,831
 - **Text Tokens**: ~8,000 (BPE-trained)
 - **Audio Codes**: 12,801 (`<|code_0|>` to `<|code_12800|>`)
 - **Special Tokens**: 22
@@ -54,7 +54,7 @@ BPE subword units trained on English text:
 - **Tokenizer Type**: BPE (Byte-Pair Encoding)
 - **Text Vocab Size**: 8,000
-- **Training Data**: 1,000,000+ text samples
+- **Training Data**: 100,000+ text samples
 - **Min Frequency**: 2
 ## Model Compatibility
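
The Total Size change can be sanity-checked against the other counts in the README. The sketch below assumes the three categories (text tokens, audio codes, special tokens) are disjoint and together make up the whole vocabulary, which the README does not state explicitly. The helper names `audio_code_tokens` and `implied_text_tokens` are illustrative and not part of this repository.

```python
# Counts as stated in the README diff.
AUDIO_CODE_COUNT = 12_801   # <|code_0|> .. <|code_12800|>
SPECIAL_TOKEN_COUNT = 22
OLD_TOTAL_SIZE = 13_016     # value removed by this commit
NEW_TOTAL_SIZE = 20_831     # value added by this commit


def audio_code_tokens(count: int = AUDIO_CODE_COUNT) -> list:
    """Build audio code token strings following the README's naming pattern."""
    return [f"<|code_{i}|>" for i in range(count)]


def implied_text_tokens(total: int) -> int:
    """Text-token count left over after removing audio codes and special tokens.

    Assumes the three categories are disjoint and exhaustive.
    """
    return total - AUDIO_CODE_COUNT - SPECIAL_TOKEN_COUNT


codes = audio_code_tokens()
print(codes[0], codes[-1], len(codes))
print("implied text tokens (old total):", implied_text_tokens(OLD_TOTAL_SIZE))
print("implied text tokens (new total):", implied_text_tokens(NEW_TOTAL_SIZE))
```

Under that assumption, the old total leaves only 193 slots for text tokens, while the new total leaves 8,008, which agrees with the stated "~8,000" BPE text tokens.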

tokenizer.json CHANGED

The diff for this file is too large to render.

vocab.json CHANGED

The diff for this file is too large to render.
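
Since the `vocab.json` diff is not rendered, one way to review the change is to summarize the file locally. The sketch below assumes `vocab.json` is a flat JSON object mapping token strings to integer ids, which is common for BPE tokenizers but is not confirmed here. It also assumes that any non-audio token written in `<|...|>` form is a special token. The functions `summarize_vocab` and `load_vocab`, and the sample tokens, are hypothetical.

```python
import json
import re
from collections import Counter

# Audio codes follow the README's pattern; other <|...|> tokens are
# assumed (not confirmed) to be special tokens.
CODE_PATTERN = re.compile(r"^<\|code_(\d+)\|>$")
TAGGED_PATTERN = re.compile(r"^<\|.+\|>$")


def summarize_vocab(vocab: dict) -> Counter:
    """Count tokens per category in a token -> id mapping."""
    counts = Counter()
    for token in vocab:
        if CODE_PATTERN.match(token):
            counts["audio_code"] += 1
        elif TAGGED_PATTERN.match(token):
            counts["other_tagged"] += 1
        else:
            counts["text"] += 1
    counts["total"] = len(vocab)
    return counts


def load_vocab(path: str) -> dict:
    """Load a vocab file, assuming a flat JSON object of token -> id."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Tiny stand-in vocabulary with made-up entries, used instead of the real file.
sample = {
    "<|code_0|>": 0,
    "<|code_1|>": 1,
    "<|hypothetical_special|>": 2,
    "hello": 3,
    "world": 4,
}
summary = summarize_vocab(sample)
print(dict(summary))
```

Running `summarize_vocab(load_vocab("vocab.json"))` on a local checkout would then give counts to compare against the README's Vocabulary Structure section.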