# Text-Based TTS Tokenizer

This tokenizer is designed for text-based Text-to-Speech (TTS) models.
## Vocabulary Structure

- **Total Size:** 20,831
- **Text Tokens:** ~8,000 (BPE-trained)
- **Audio Codes:** 12,801 (`<|code_0|>` to `<|code_12800|>`)
- **Special Tokens:** 22
## Tokenization Method

BPE (Byte-Pair Encoding) trained on text data from:

- humair025/ultra-unified
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zuhri025/text-tts-tokenizer")

# Encode text
text = "Hello world"
tokens = tokenizer.encode(text)

# Decode
decoded = tokenizer.decode(tokens)
```
## Vocabulary Breakdown

### Special Tokens (22)

- `<|pad|>`, `<|unk|>`, `<|bos|>`, `<|eos|>`
- `<|start_of_speech|>`, `<|end_of_speech|>`, `<|speech|>`
- `<|start_of_text|>`, `<|end_of_text|>`
- `<|accent_us|>`, `<|accent_uk|>`
- `<|male|>`, `<|female|>`, `<|other|>`
- Voice tokens, conversational tokens, etc.
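The card does not document how these special tokens are arranged in a training sequence, but a typical layout wraps the text and audio spans in their start/end markers, with style tokens (accent, gender) up front. The sketch below is an assumption for illustration only; `build_sequence` and its default token ordering are hypothetical, not the model's confirmed prompt format:

```python
def build_sequence(text, audio_tokens,
                   accent="<|accent_us|>", gender="<|female|>"):
    """Hypothetical TTS sequence layout using the card's special tokens.

    The actual ordering used during training may differ; this only
    illustrates how the start/end marker pairs could frame each span.
    """
    return (
        "<|bos|>" + accent + gender
        + "<|start_of_text|>" + text + "<|end_of_text|>"
        + "<|start_of_speech|>" + "".join(audio_tokens) + "<|end_of_speech|>"
        + "<|eos|>"
    )

seq = build_sequence("Hello world", ["<|code_0|>", "<|code_5|>"])
```

A sequence built this way would tokenize entirely into special tokens, BPE text tokens, and audio code tokens from the 20,831-entry vocabulary.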
### Text Tokens (~8,000)

BPE subword units trained on English text:
- Common words (e.g., "the", "and", "is")
- Subwords (e.g., "ing", "tion", "er")
- Byte-level fallback for any character
### Audio Tokens (12,801)

- `<|code_0|>` to `<|code_12800|>`
- Represent audio codec token indices
- Used in `<|speech|>`...`<|end_of_speech|>` sections
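Since each audio token is just a codec index rendered as `<|code_N|>`, converting between raw codec output and token strings is simple string formatting. A minimal sketch (the helper names `codes_to_tokens` / `tokens_to_codes` are illustrative, not part of the tokenizer's API):

```python
NUM_AUDIO_CODES = 12801  # <|code_0|> ... <|code_12800|>

def codes_to_tokens(indices):
    """Map raw audio codec indices to the tokenizer's audio token strings."""
    assert all(0 <= i < NUM_AUDIO_CODES for i in indices), "index out of range"
    return [f"<|code_{i}|>" for i in indices]

def tokens_to_codes(tokens):
    """Recover codec indices from audio token strings (inverse mapping)."""
    return [int(t[len("<|code_"):-2]) for t in tokens]
```

Round-tripping a codec frame sequence through these two helpers returns the original indices unchanged.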
## Training Details

- **Tokenizer Type:** BPE (Byte-Pair Encoding)
- **Text Vocab Size:** 8,000
- **Training Data:** 100,000+ text samples
- **Min Frequency:** 2
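To make the BPE training settings above concrete, here is a toy, pure-Python BPE trainer: it repeatedly merges the most frequent adjacent symbol pair, stopping once no pair reaches the minimum frequency (mirroring the card's "Min Frequency: 2"). This is a sketch of the algorithm only, not the actual training code, which would typically use a library such as Hugging Face `tokenizers` with byte-level fallback:

```python
from collections import Counter

def train_bpe(words, num_merges, min_frequency=2):
    """Toy BPE trainer: learns merge rules from a list of words."""
    # Start with each word as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best, freq = pairs.most_common(1)[0]
        if freq < min_frequency:
            break  # mirrors the card's "Min Frequency: 2" cutoff
        merges.append(best)
        # Apply the new merge rule to every word in the corpus.
        merged = Counter()
        for symbols, f in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += f
        corpus = merged
    return merges
```

Running `train_bpe(["the", "the", "then", "and", "and"], num_merges=5)` learns merges like `("t", "h")` and `("th", "e")` first, since "th" and "the" are the most frequent adjacent pairs in that tiny corpus.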
## Model Compatibility

This tokenizer is designed for:

- Text-based TTS training
- Models using BPE subword tokenization
- Audio codec models with 12,801 codes