
Text-Based TTS Tokenizer

This tokenizer is designed for text-based Text-to-Speech (TTS) models: it combines BPE text tokens and discrete audio codec tokens in a single vocabulary.

Vocabulary Structure

  • Total Size: 20,831
  • Text Tokens: ~8,000 (BPE-trained)
  • Audio Codes: 12,801 (<|code_0|> to <|code_12800|>)
  • Special Tokens: 22
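The sizes above are consistent: subtracting the audio codes and special tokens from the total leaves 8,008 text tokens, matching the "~8,000" figure. A quick arithmetic check:

```python
# Stated vocabulary breakdown from the card.
TOTAL_VOCAB = 20831
AUDIO_CODES = 12801     # <|code_0|> .. <|code_12800|>
SPECIAL_TOKENS = 22

# The remainder is the BPE text vocabulary ("~8,000").
text_tokens = TOTAL_VOCAB - AUDIO_CODES - SPECIAL_TOKENS
print(text_tokens)  # → 8008
```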

Tokenization Method

BPE (Byte-Pair Encoding) trained on text data from:

  • humair025/ultra-unified

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zuhri025/text-tts-tokenizer")

# Encode text
text = "Hello world"
tokens = tokenizer.encode(text)

# Decode
decoded = tokenizer.decode(tokens)
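A TTS model trained with this tokenizer would typically interleave text and audio tokens in one sequence. The sketch below assembles such a sequence from raw codec indices; the exact layout and the helper name are illustrative assumptions, not documented behavior:

```python
def codes_to_tokens(codec_indices):
    """Map raw codec indices to their <|code_N|> token strings."""
    return [f"<|code_{i}|>" for i in codec_indices]

# Hypothetical target layout: a text section followed by a speech section.
text = "Hello world"
speech = codes_to_tokens([0, 42, 12800])
sequence = (
    "<|start_of_text|>" + text + "<|end_of_text|>"
    + "<|start_of_speech|>" + "".join(speech) + "<|end_of_speech|>"
)
print(sequence)
```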

Vocabulary Breakdown

Special Tokens (22)

  • <|pad|>, <|unk|>, <|bos|>, <|eos|>
  • <|start_of_speech|>, <|end_of_speech|>, <|speech|>
  • <|start_of_text|>, <|end_of_text|>
  • <|accent_us|>, <|accent_uk|>
  • <|male|>, <|female|>, <|other|>
  • Voice tokens, conversational tokens, etc.

Text Tokens (~8,000)

BPE subword units trained on English text:

  • Common words (e.g., "the", "and", "is")
  • Subwords (e.g., "ing", "tion", "er")
  • Byte-level fallback for any character
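Byte-level fallback means any input, including characters never seen during training, can still be tokenized: the text is decomposed into UTF-8 bytes, and each of the 256 possible byte values has a base token. A sketch of the idea (not this tokenizer's actual internals):

```python
# Any string decomposes into UTF-8 bytes, each in 0..255,
# so 256 base tokens are enough to cover arbitrary input.
text = "café ☕"
raw_bytes = list(text.encode("utf-8"))
print(raw_bytes)  # → [99, 97, 102, 195, 169, 32, 226, 152, 149]
assert all(0 <= b <= 255 for b in raw_bytes)
```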

Audio Tokens (12,801)

  • <|code_0|> to <|code_12800|>
  • Represent audio codec token indices
  • Used in <|speech|> ... <|end_of_speech|> sections
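Because the audio tokens follow a fixed naming scheme, codec indices can be converted to and from token strings without loading the tokenizer. The helper names below are illustrative:

```python
def code_to_token(index: int) -> str:
    """Format a codec index as its <|code_N|> token string."""
    if not 0 <= index <= 12800:
        raise ValueError(f"codec index out of range: {index}")
    return f"<|code_{index}|>"

def token_to_code(token: str) -> int:
    """Recover the codec index from a <|code_N|> token string."""
    prefix, suffix = "<|code_", "|>"
    if not (token.startswith(prefix) and token.endswith(suffix)):
        raise ValueError(f"not an audio code token: {token}")
    return int(token[len(prefix):-len(suffix)])
```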

Training Details

  • Tokenizer Type: BPE (Byte-Pair Encoding)
  • Text Vocab Size: 8,000
  • Training Data: 100,000+ text samples
  • Min Frequency: 2
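A setup like the one described above can be reproduced with the Hugging Face `tokenizers` library. The sketch below trains on a toy in-memory corpus with a handful of illustrative special tokens, using the card's stated vocab size and minimum frequency; it is an approximation, not the exact training script:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE tokenizer (gives the byte fallback described earlier).
tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=8000,   # text vocab size from the card
    min_frequency=2,   # minimum pair frequency from the card
    special_tokens=["<|pad|>", "<|unk|>", "<|bos|>", "<|eos|>"],
)

# Toy stand-in for the 100,000+ training samples.
corpus = ["hello world", "text to speech training data"] * 100
tokenizer.train_from_iterator(corpus, trainer)
print(tokenizer.get_vocab_size())
```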

Model Compatibility

This tokenizer is designed for:

  • Text-based TTS training
  • Models using BPE subword tokenization
  • Audio codec models with 12,801 codes