Bilingual TTS Tokenizer (English + Urdu)

BPE tokenizer for bilingual Text-to-Speech supporting English and Urdu.

Vocabulary Structure

  • Total Size: 30,000
  • Text Tokens: ~17,169 (BPE-trained on English + Urdu)
  • Audio Codes: 12,801 (<|code_0|> to <|code_12800|>)
  • Special Tokens: ~30
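
The figures above add up to the stated total. A minimal sketch of that arithmetic follows; it treats the approximate counts (~17,169 text tokens, ~30 special tokens) as exact for illustration, and the constant names are hypothetical rather than part of the tokenizer:

```python
# Vocabulary budget from the list above. The special-token count is given
# as approximate in this card; it is treated as exact here for illustration.
TOTAL_VOCAB = 30_000
AUDIO_CODES = 12_801   # <|code_0|> through <|code_12800|>, inclusive
SPECIAL_TOKENS = 30    # approximate

# Whatever is left after audio codes and special tokens is the BPE text vocabulary.
text_tokens = TOTAL_VOCAB - AUDIO_CODES - SPECIAL_TOKENS
print(text_tokens)  # 17169
```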

Languages

  • English - Latin script (left-to-right)
  • Urdu (اردو) - Arabic-Persian script (right-to-left)

Training Data

English:

  • Dataset: humair025/ultra-unified
  • Samples: 250,000

Urdu:

  • Dataset: humair025/Munch-Audio-Annoted
  • Samples: 245,972

Total: 495,972 text samples

Tokenization Method

BPE (Byte-Pair Encoding) with ByteLevel encoding:

  • Handles both English and Urdu characters
  • Operates on UTF-8 bytes, so any character can be represented without an unknown token
  • Efficient subword tokenization
  • Lossless round-trip encoding/decoding of the input text
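
To illustrate why a byte-level scheme covers both scripts, the sketch below looks only at the UTF-8 byte representation that byte-level BPE starts from. It does not load this tokenizer or reproduce its learned merges, and the helper name `utf8_profile` is hypothetical:

```python
def utf8_profile(text: str) -> tuple[int, int]:
    """Return (character count, UTF-8 byte count) after checking the round trip."""
    raw = text.encode("utf-8")
    # Every byte is in 0-255, so a byte-level vocabulary can cover any input.
    assert all(0 <= b <= 255 for b in raw)
    # Decoding the same bytes restores the original string exactly.
    assert raw.decode("utf-8") == text
    return len(text), len(raw)

for sample in ["Hello world", "یہ ایک ٹیسٹ ہے"]:
    chars, nbytes = utf8_profile(sample)
    # ASCII letters take one byte each; letters in the Arabic block take two.
    print(f"{chars} characters -> {nbytes} bytes")
```

Urdu text therefore starts from roughly twice as many bytes per character as English, which is one reason the BPE merges are trained on both languages.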

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zuhri025/bilingual-tts-tokenizer")

# English
en_text = "Hello world"
en_tokens = tokenizer.encode(en_text)

# Urdu
ur_text = "یہ ایک ٹیسٹ ہے"
ur_tokens = tokenizer.encode(ur_text)

# Both languages are handled by the same tokenizer.

Special Tokens

Language Tokens:

  • <|en|> - English language marker
  • <|ur|> - Urdu language marker

Speech Structure:

  • <|start_of_speech|>, <|end_of_speech|>, <|speech|>
  • <|start_of_text|>, <|end_of_text|>

Voice/Gender:

  • <|male|>, <|female|>, <|other|>
  • <|puck|>, <|kore|> (Gemini voices)

English Accents:

  • <|accent_us|>, <|accent_uk|>
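
As an illustration of how these tokens might be combined, the sketch below wraps input text in a language marker and the text-boundary tokens. The ordering of tokens is an assumption made for this example, not a format documented in this card, and `build_prompt` is a hypothetical helper:

```python
# Language markers listed above.
LANG_TOKENS = {"en": "<|en|>", "ur": "<|ur|>"}

def build_prompt(text: str, lang: str) -> str:
    """Wrap text in special tokens. The layout is assumed, not documented."""
    if lang not in LANG_TOKENS:
        raise ValueError(f"unsupported language: {lang!r}")
    # Assumed layout: language marker, then the text between its boundary
    # tokens, then the token that opens the speech segment.
    return (
        f"{LANG_TOKENS[lang]}"
        f"<|start_of_text|>{text}<|end_of_text|>"
        f"<|start_of_speech|>"
    )

print(build_prompt("Hello world", "en"))
```

A model trained with this tokenizer may expect a different arrangement, so check the training recipe before relying on any particular layout.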

Model Compatibility

Designed for:

  • Bilingual English-Urdu TTS models
  • BPE subword tokenization
  • Audio codec models with 12,801 codes
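
Codec indices have to be turned into the `<|code_N|>` token strings listed under Vocabulary Structure before they can be tokenized. A minimal sketch of that mapping and its inverse follows; the helper names are hypothetical and the example indices are arbitrary:

```python
import re

NUM_AUDIO_CODES = 12_801  # valid indices are 0 through 12800
_CODE_RE = re.compile(r"<\|code_(\d+)\|>")

def code_to_token(index: int) -> str:
    """Format a codec index as its audio-code token string."""
    if not 0 <= index < NUM_AUDIO_CODES:
        raise ValueError(f"code index out of range: {index}")
    return f"<|code_{index}|>"

def token_to_code(token: str) -> int:
    """Parse an audio-code token string back into its codec index."""
    match = _CODE_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not an audio-code token: {token!r}")
    index = int(match.group(1))
    if index >= NUM_AUDIO_CODES:
        raise ValueError(f"code index out of range: {index}")
    return index

frames = [0, 42, 12800]  # arbitrary example indices
tokens = [code_to_token(i) for i in frames]
print("".join(tokens))  # <|code_0|><|code_42|><|code_12800|>
```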

Character Support

English:

  • Latin alphabet (26 letters)
  • Numbers, punctuation
  • Diacritics (é, ñ, etc.)

Urdu:

  • Urdu alphabet (39 or 40 letters, depending on how they are counted)
  • Arabic-Persian script
  • Diacritical marks (zabar, zer, pesh, etc.)
  • Urdu numerals