
T5 Extended Tokenizer for Dhivehi

This tokenizer extends google/flan-t5-base to support Dhivehi (Thaana script) characters, while preserving English subword tokenization.

  • Base model: google/flan-t5-base
  • Extension: Adds characters in the Thaana Unicode range (U+0780–U+07BF)
  • Purpose: For Dhivehi-English tasks like translation, summarization, or instruction tuning
  • English tokens remain unchanged (SentencePiece pieces such as ▁This, ▁is, etc.)
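The Thaana block mentioned above (U+0780–U+07BF) is easy to test for in plain Python. The helper below is not part of the tokenizer itself, just a small illustration of the code-point range the extension covers:

```python
# Check whether a character falls in the Thaana Unicode block (U+0780–U+07BF).
def is_thaana(ch: str) -> bool:
    return 0x0780 <= ord(ch) <= 0x07BF

# Mixed Dhivehi/Latin string: only the Thaana characters pass the check.
text = "ހެނބަދޫ hello"
thaana_chars = [c for c in text if is_thaana(c)]
```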

How to Use

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alakxender/dhivehi-T5-tokenizer-extended")

# Dhivehi example
text = "ހެނބަދޫ މުދިންބެ: ކުދިން ހަތިމްކުރާ ކޮންމެ ފަހަރަކު ލޮލުން ކަރުނަ!"
tokens = tokenizer.tokenize(text)
print(tokens)

What’s Different?

| Feature | Stock Flan-T5 Tokenizer | Extended Dhivehi Tokenizer |
|---|---|---|
| Dhivehi support | ❌ Emits `<unk>` | ✅ Proper tokenization |
| English tokenization | ✅ Yes | ✅ Preserved |
| Added tokens | ❌ No | ✅ Thaana characters |

Comparison with Stock Flan-T5

The stock flan-t5-base tokenizer does not support Dhivehi text properly:

from transformers import AutoTokenizer

stock_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
tokens = stock_tokenizer.tokenize("ހެނބަދޫ މުދިންބެ: ކުދިން ހަތިމްކުރާ ކޮންމެ ފަހަރަކު ލޮލުން ކަރުނަ!")
print(tokens)
# Output: ['<unk>', '<unk>']

In contrast, the extended tokenizer tokenizes Thaana characters individually or as learned units, so Dhivehi text survives a tokenize/decode round trip instead of collapsing into `<unk>` tokens.
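A similar extension can be reproduced by adding the Thaana code points to a base tokenizer. The sketch below is an assumption about how such an extension might be built, not the author's actual script; the `extend_tokenizer` function is hypothetical and downloads the base model when called, so it is defined here without being run:

```python
# The 64 characters of the Thaana Unicode block (U+0780-U+07BF).
thaana_chars = [chr(cp) for cp in range(0x0780, 0x07C0)]

def extend_tokenizer(base: str = "google/flan-t5-base"):
    """Hypothetical sketch: add Thaana characters as new tokens to a base
    tokenizer. Returns the extended tokenizer and the number of tokens added."""
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained(base)
    added = tok.add_tokens(thaana_chars)  # skips tokens already in the vocab
    return tok, added
```

If the extended tokenizer is paired with a model for fine-tuning, the model's embedding matrix must be resized to the new vocabulary size, e.g. `model.resize_token_embeddings(len(tok))`.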
