# Text-Based TTS Tokenizer

This tokenizer is designed for text-based Text-to-Speech (TTS) models.
## Vocabulary Structure

- **Total Size:** 20,831
- **Text Tokens:** ~8,000 (BPE-trained)
- **Audio Codes:** 12,801 (`<|code_0|>` to `<|code_12800|>`)
- **Special Tokens:** 22
## Tokenization Method

BPE (Byte-Pair Encoding) trained on text data from:

- humair025/ultra-unified
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zuhri025/text-tts-tokenizer")

# Encode text
text = "Hello world"
tokens = tokenizer.encode(text)

# Decode
decoded = tokenizer.decode(tokens)
```
## Vocabulary Breakdown

### Special Tokens (22)

- `<|pad|>`, `<|unk|>`, `<|bos|>`, `<|eos|>`
- `<|start_of_speech|>`, `<|end_of_speech|>`, `<|speech|>`
- `<|start_of_text|>`, `<|end_of_text|>`
- `<|accent_us|>`, `<|accent_uk|>`
- `<|male|>`, `<|female|>`, `<|other|>`
- Voice tokens, conversational tokens, etc.
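The card does not document how these special tokens are arranged in a training sequence, but a typical layout wraps the text and audio spans in their start/end markers, with style tokens (accent, gender) up front. The sketch below is an assumption for illustration only; `build_sequence` and its default token ordering are hypothetical, not the model's confirmed prompt format:

```python
def build_sequence(text, audio_tokens,
                   accent="<|accent_us|>", gender="<|female|>"):
    """Hypothetical TTS sequence layout using the card's special tokens.

    The actual ordering used during training may differ; this only
    illustrates how the start/end marker pairs could frame each span.
    """
    return (
        "<|bos|>" + accent + gender
        + "<|start_of_text|>" + text + "<|end_of_text|>"
        + "<|start_of_speech|>" + "".join(audio_tokens) + "<|end_of_speech|>"
        + "<|eos|>"
    )

seq = build_sequence("Hello world", ["<|code_0|>", "<|code_5|>"])
```

A sequence built this way would tokenize entirely into special tokens, BPE text tokens, and audio code tokens from the 20,831-entry vocabulary.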
### Text Tokens (~8,000)

BPE subword units trained on English text:
- Common words (e.g., "the", "and", "is")
- Subwords (e.g., "ing", "tion", "er")
- Byte-level fallback for any character
### Audio Tokens (12,801)

- `<|code_0|>` to `<|code_12800|>`
- Represent audio codec token indices
- Used in `<|speech|>`...`<|end_of_speech|>` sections
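Since each audio token is just a codec index rendered as `<|code_N|>`, converting between raw codec output and token strings is simple string formatting. A minimal sketch (the helper names `codes_to_tokens` / `tokens_to_codes` are illustrative, not part of the tokenizer's API):

```python
NUM_AUDIO_CODES = 12801  # <|code_0|> ... <|code_12800|>

def codes_to_tokens(indices):
    """Map raw audio codec indices to the tokenizer's audio token strings."""
    assert all(0 <= i < NUM_AUDIO_CODES for i in indices), "index out of range"
    return [f"<|code_{i}|>" for i in indices]

def tokens_to_codes(tokens):
    """Recover codec indices from audio token strings (inverse mapping)."""
    return [int(t[len("<|code_"):-2]) for t in tokens]
```

Round-tripping a codec frame sequence through these two helpers returns the original indices unchanged.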
## Training Details

- **Tokenizer Type:** BPE (Byte-Pair Encoding)
- **Text Vocab Size:** 8,000
- **Training Data:** 100,000+ text samples
- **Min Frequency:** 2
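To make the BPE training settings above concrete, here is a toy, pure-Python BPE trainer: it repeatedly merges the most frequent adjacent symbol pair, stopping once no pair reaches the minimum frequency (mirroring the card's "Min Frequency: 2"). This is a sketch of the algorithm only, not the actual training code, which would typically use a library such as Hugging Face `tokenizers` with byte-level fallback:

```python
from collections import Counter

def train_bpe(words, num_merges, min_frequency=2):
    """Toy BPE trainer: learns merge rules from a list of words."""
    # Start with each word as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best, freq = pairs.most_common(1)[0]
        if freq < min_frequency:
            break  # mirrors the card's "Min Frequency: 2" cutoff
        merges.append(best)
        # Apply the new merge rule to every word in the corpus.
        merged = Counter()
        for symbols, f in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += f
        corpus = merged
    return merges
```

Running `train_bpe(["the", "the", "then", "and", "and"], num_merges=5)` learns merges like `("t", "h")` and `("th", "e")` first, since "th" and "the" are the most frequent adjacent pairs in that tiny corpus.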
## Model Compatibility

This tokenizer is designed for:

- Text-based TTS training
- Models using BPE subword tokenization
- Audio codec models with 12,801 codes