# T5 Extended Tokenizer for Dhivehi
This tokenizer extends google/flan-t5-base to support Dhivehi (Thaana script) characters, while preserving English subword tokenization.
- Base model: google/flan-t5-base
- Extension: adds characters in the Thaana Unicode range (U+0780–U+07BF)
- Purpose: Dhivehi-English tasks such as translation, summarization, or instruction tuning
- English tokens remain unchanged (`▁This`, `▁is`, etc.)
## How to Use

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alakxender/dhivehi-T5-tokenizer-extended")

# Dhivehi example
text = "ހެނބަދޫ މުދިންބެ: ކުދިން ހަތިމްކުރާ ކޮންމެ ފަހަރަކު ލޮލުން ކަރުނަ!"
tokens = tokenizer.tokenize(text)
print(tokens)
```
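Running the snippet above requires network access and the model files. To see why the extension matters without downloading anything, you can check which characters of a string fall in the Thaana block (U+0780–U+07BF) — these are exactly the characters a tokenizer without Thaana coverage maps to `<unk>`. The helper below is a hypothetical illustration, not part of the tokenizer:

```python
# Hypothetical helper: report which characters of a string fall in the
# Thaana Unicode block (U+0780-U+07BF). A tokenizer whose vocabulary
# lacks these characters maps each of them to <unk>.
THAANA_START, THAANA_END = 0x0780, 0x07BF

def find_thaana_chars(text: str) -> list[str]:
    """Return the characters of `text` that lie in the Thaana block."""
    return [ch for ch in text if THAANA_START <= ord(ch) <= THAANA_END]

sample = "ހެނބަދޫ މުދިންބެ"
print(f"{len(find_thaana_chars(sample))} of {len(sample)} characters are Thaana")
```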
## What’s Different?

| Feature | Stock Flan-T5 Tokenizer | Extended Dhivehi Tokenizer |
|---|---|---|
| Dhivehi support | ❌ Maps Thaana to `<unk>` | ✅ Proper tokenization |
| English tokenization | ✅ Yes | ✅ Preserved |
| Added tokens | ❌ None | ✅ Thaana characters |
## Comparison with Stock Flan-T5

The stock flan-t5-base tokenizer does not support Dhivehi text properly:

```python
from transformers import AutoTokenizer

stock_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
tokens = stock_tokenizer.tokenize("ހެނބަދޫ މުދިންބެ: ކުދިން ހަތިމްކުރާ ކޮންމެ ފަހަރަކު ލޮލުން ކަރުނަ!")
print(tokens)
# Output: ['<unk>', '<unk>']
```

In contrast, the extended tokenizer tokenizes Thaana characters individually or as learned units, preserving the text and avoiding `<unk>` tokens.
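For reference, an extension like this can be reproduced from the stock tokenizer by adding the Thaana code points as tokens and resizing the model's embeddings. The sketch below is an assumed reconstruction, not the exact recipe used for this repository; `add_tokens` and `resize_token_embeddings` are the standard Hugging Face APIs for vocabulary extension:

```python
# Sketch: extending the stock Flan-T5 tokenizer with Thaana characters.
# Illustrative reconstruction only; requires `transformers` and network
# access for the download steps, which are skipped gracefully if absent.
thaana_chars = [chr(cp) for cp in range(0x0780, 0x07C0)]  # U+0780-U+07BF, 64 code points

try:
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    num_added = tokenizer.add_tokens(thaana_chars)  # skips tokens already in the vocab
    print(f"Added {num_added} Thaana tokens")

    # The embedding matrix must grow to match the enlarged vocabulary
    # before the model can be fine-tuned on Dhivehi text.
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
    model.resize_token_embeddings(len(tokenizer))
except Exception as exc:  # transformers not installed or no network access
    print(f"Skipping download: {exc}")
```

Note that U+0780–U+07BF covers the full Unicode block, including a few unassigned code points; adding only the assigned Thaana letters and vowel signs would also work.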