zuhri025 committed on
Commit be2d387 · verified · 1 Parent(s): 0539bd6

Initial text-based TTS tokenizer

Files changed (3)
  1. README.md +2 -2
  2. tokenizer.json +0 -0
  3. vocab.json +0 -0
README.md CHANGED
@@ -4,7 +4,7 @@ This tokenizer is designed for text-based Text-to-Speech models.
 
 ## Vocabulary Structure
 
-- **Total Size**: 13,016
+- **Total Size**: 20,831
 - **Text Tokens**: ~8,000 (BPE-trained)
 - **Audio Codes**: 12,801 (`<|code_0|>` to `<|code_12800|>`)
 - **Special Tokens**: 22
@@ -54,7 +54,7 @@ BPE subword units trained on English text:
 
 - **Tokenizer Type**: BPE (Byte-Pair Encoding)
 - **Text Vocab Size**: 8,000
-- **Training Data**: 1,000,000+ text samples
+- **Training Data**: 100,000+ text samples
 - **Min Frequency**: 2
 
 ## Model Compatibility
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
vocab.json CHANGED
The diff for this file is too large to render. See raw diff
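
Because the `tokenizer.json` and `vocab.json` diffs are not rendered, the only visible figures are the ones in the README change. The sketch below is one way to sanity-check them. It assumes (this is not stated in the commit) that text tokens, audio codes, and special tokens are disjoint and together make up the whole vocabulary. The function names `implied_text_tokens` and `audio_code_tokens` are hypothetical helpers, not part of the repository.

```python
# Figures copied from the README diff in this commit.
AUDIO_CODES = 12_801     # <|code_0|> through <|code_12800|>
SPECIAL_TOKENS = 22
OLD_TOTAL = 13_016       # "Total Size" before the commit
NEW_TOTAL = 20_831       # "Total Size" after the commit


def implied_text_tokens(total: int) -> int:
    """Text-token count left once audio codes and special tokens are removed.

    Assumption: the three categories are disjoint and exhaustive.
    """
    return total - AUDIO_CODES - SPECIAL_TOKENS


def audio_code_tokens(last_index: int = 12_800) -> list[str]:
    """Generate the audio-code token strings in the README's stated format."""
    return [f"<|code_{i}|>" for i in range(last_index + 1)]


if __name__ == "__main__":
    codes = audio_code_tokens()
    print("audio codes:", len(codes), codes[0], codes[-1])
    print("implied text tokens (old total):", implied_text_tokens(OLD_TOTAL))
    print("implied text tokens (new total):", implied_text_tokens(NEW_TOTAL))
```

Under that assumption the new total implies 8,008 text tokens, which agrees with the "~8,000" and "8,000" figures in the README, while the old total implied only 193. That suggests the earlier 13,016 did not reflect the full 8,000-token BPE vocabulary.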