# T5 Extended Tokenizer for Dhivehi
This tokenizer extends google/flan-t5-base to support Dhivehi (Thaana script) characters, while preserving English subword tokenization.
- Base model: google/flan-t5-base
- Extension: adds characters in the Thaana Unicode range (U+0780–U+07BF)
- Purpose: Dhivehi-English tasks such as translation, summarization, or instruction tuning
- English tokens remain unchanged (`▁This`, `▁is`, etc.)
## How to Use

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alakxender/dhivehi-T5-tokenizer-extended")

# Dhivehi example
text = "ހެނބަދޫ މުދިންބެ: ކުދިން ހަތިމްކުރާ ކޮންމެ ފަހަރަކު ލޮލުން ކަރުނަ!"
tokens = tokenizer.tokenize(text)
print(tokens)
```
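Running the snippet above requires network access and the model files. To see why the extension matters without downloading anything, you can check which characters of a string fall in the Thaana block (U+0780–U+07BF) — these are exactly the characters a tokenizer without Thaana coverage maps to `<unk>`. The helper below is a hypothetical illustration, not part of the tokenizer:

```python
# Hypothetical helper: report which characters of a string fall in the
# Thaana Unicode block (U+0780-U+07BF). A tokenizer whose vocabulary
# lacks these characters maps each of them to <unk>.
THAANA_START, THAANA_END = 0x0780, 0x07BF

def find_thaana_chars(text: str) -> list[str]:
    """Return the characters of `text` that lie in the Thaana block."""
    return [ch for ch in text if THAANA_START <= ord(ch) <= THAANA_END]

sample = "ހެނބަދޫ މުދިންބެ"
print(f"{len(find_thaana_chars(sample))} of {len(sample)} characters are Thaana")
```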
## What’s Different?

| Feature | Stock Flan-T5 Tokenizer | Extended Dhivehi Tokenizer |
|---|---|---|
| Dhivehi support | ❌ Maps Thaana to `<unk>` | ✅ Proper tokenization |
| English tokenization | ✅ Yes | ✅ Preserved |
| Added tokens | ❌ None | ✅ Thaana characters |
## Comparison with Stock Flan-T5

The stock flan-t5-base tokenizer does not support Dhivehi text properly:

```python
from transformers import AutoTokenizer

stock_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
tokens = stock_tokenizer.tokenize("ހެނބަދޫ މުދިންބެ: ކުދިން ހަތިމްކުރާ ކޮންމެ ފަހަރަކު ލޮލުން ކަރުނަ!")
print(tokens)
# Output: ['<unk>', '<unk>']
```

In contrast, the extended tokenizer tokenizes Thaana characters individually or as learned units, preserving the text and avoiding `<unk>` tokens.
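For reference, an extension like this can be reproduced from the stock tokenizer by adding the Thaana code points as tokens and resizing the model's embeddings. The sketch below is an assumed reconstruction, not the exact recipe used for this repository; `add_tokens` and `resize_token_embeddings` are the standard Hugging Face APIs for vocabulary extension:

```python
# Sketch: extending the stock Flan-T5 tokenizer with Thaana characters.
# Illustrative reconstruction only; requires `transformers` and network
# access for the download steps, which are skipped gracefully if absent.
thaana_chars = [chr(cp) for cp in range(0x0780, 0x07C0)]  # U+0780-U+07BF, 64 code points

try:
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    num_added = tokenizer.add_tokens(thaana_chars)  # skips tokens already in the vocab
    print(f"Added {num_added} Thaana tokens")

    # The embedding matrix must grow to match the enlarged vocabulary
    # before the model can be fine-tuned on Dhivehi text.
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
    model.resize_token_embeddings(len(tokenizer))
except Exception as exc:  # transformers not installed or no network access
    print(f"Skipping download: {exc}")
```

Note that U+0780–U+07BF covers the full Unicode block, including a few unassigned code points; adding only the assigned Thaana letters and vowel signs would also work.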