# Bilingual TTS Tokenizer (English + Urdu)

A BPE tokenizer for bilingual Text-to-Speech, supporting English and Urdu.
## Vocabulary Structure

- Total size: 30,000
- Text tokens: ~17,169 (BPE-trained on English + Urdu)
- Audio codes: 12,801 (`<|code_0|>` to `<|code_12800|>`)
- Special tokens: ~30
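The audio-code range can be addressed programmatically. A minimal sketch (the `<|code_N|>` naming follows the vocabulary listing above; the helper function itself is hypothetical, not part of this tokenizer's API):

```python
NUM_AUDIO_CODES = 12_801  # <|code_0|> through <|code_12800|>

def audio_code_token(code: int) -> str:
    """Format a codec code index as its token string (hypothetical helper)."""
    if not 0 <= code < NUM_AUDIO_CODES:
        raise ValueError(f"code {code} outside 0..{NUM_AUDIO_CODES - 1}")
    return f"<|code_{code}|>"

print(audio_code_token(0))      # → <|code_0|>
print(audio_code_token(12800))  # → <|code_12800|>
```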
## Languages

- English - Latin script (left-to-right)
- Urdu (اردو) - Arabic-Persian script (right-to-left)
## Training Data

English:

- Dataset: humair025/ultra-unified
- Samples: 250,000

Urdu:

- Dataset: humair025/Munch-Audio-Annoted
- Samples: 245,972

Total: 495,972 text samples
## Tokenization Method

BPE (Byte-Pair Encoding) with ByteLevel pre-tokenization:

- Handles both English and Urdu characters
- Supports all UTF-8 characters
- Efficient subword tokenization
- Lossless round-trip encoding/decoding
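The lossless round-trip property comes from the byte-level base alphabet: every string is first decomposed into raw UTF-8 bytes, so no character in either script can fall outside the vocabulary. A minimal sketch of that idea (the actual tokenizer then merges byte units into subwords):

```python
def to_byte_units(text: str) -> list[int]:
    """Decompose text into its UTF-8 bytes (the byte-level base alphabet)."""
    return list(text.encode("utf-8"))

def from_byte_units(units: list[int]) -> str:
    """Reassemble the original string from its bytes."""
    return bytes(units).decode("utf-8")

# Round-trips hold for Latin and Arabic-Persian script alike
for sample in ["Hello world", "یہ ایک ٹیسٹ ہے"]:
    assert from_byte_units(to_byte_units(sample)) == sample
```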
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zuhri025/bilingual-tts-tokenizer")

# English
en_text = "Hello world"
en_tokens = tokenizer.encode(en_text)

# Urdu
ur_text = "یہ ایک ٹیسٹ ہے"
ur_tokens = tokenizer.encode(ur_text)

# Both work with the same tokenizer
```
## Special Tokens

Language:

- `<|en|>` - English language marker
- `<|ur|>` - Urdu language marker

Speech structure:

- `<|start_of_speech|>`, `<|end_of_speech|>`, `<|speech|>`
- `<|start_of_text|>`, `<|end_of_text|>`

Voice/gender:

- `<|male|>`, `<|female|>`, `<|other|>`
- `<|puck|>`, `<|kore|>` (Gemini voices)

English accents:

- `<|accent_us|>`, `<|accent_uk|>`
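These markers are typically concatenated around the input text to condition a TTS model. The card does not document the exact ordering, so the layout below is only an illustration, and `build_tts_prompt` is a hypothetical helper:

```python
def build_tts_prompt(text: str, lang: str, gender: str) -> str:
    """Wrap input text with the tokenizer's special tokens (hypothetical layout)."""
    return (
        f"<|{lang}|><|{gender}|>"
        f"<|start_of_text|>{text}<|end_of_text|>"
        f"<|start_of_speech|>"  # the model would continue with <|code_N|> tokens
    )

prompt = build_tts_prompt("Hello world", "en", "female")
# → "<|en|><|female|><|start_of_text|>Hello world<|end_of_text|><|start_of_speech|>"
```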
## Model Compatibility

Designed for:

- Bilingual English-Urdu TTS models
- BPE subword tokenization
- Audio codec models with 12,801 codes
## Character Support

English:

- Latin alphabet (26 letters)
- Numbers and punctuation
- Diacritics (é, ñ, etc.)

Urdu:

- Urdu alphabet (38+ letters)
- Arabic-Persian script
- Diacritical marks (zabar, zer, pesh, etc.)
- Urdu numerals