# Dhivehi-Roberta Tokenizer (Extended Vocabulary)

This repository contains a custom extension of the `roberta-base` tokenizer, enhanced with 10,000 frequently used Dhivehi tokens.
## Overview

- Base tokenizer: `roberta-base` (vocab size: 50,265)
- Dhivehi tokens added: 10,000
- Source: top 424,319 unique Dhivehi tokens (filtered to 3+ characters)
- Use case: improves tokenization of Dhivehi text for models such as TrOCR, mBERT, or RoBERTa
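The token-selection step described above (frequent tokens, 3+ characters) can be sketched in plain Python. This is an illustrative reconstruction, not the exact build script: the tiny `corpus` list is a placeholder for the large Dhivehi text collection actually used.

```python
from collections import Counter

# Placeholder corpus; the real build used a large Dhivehi text collection.
corpus = [
    "އީދުގެ ހަރަކާތް ފެށިއްޖެ",
    "އީދުގެ މުނާސަބަތުގައި",
]

# Count whitespace-separated tokens, keeping only those of 3+ characters,
# matching the filter described above.
counts = Counter(
    tok for line in corpus for tok in line.split() if len(tok) >= 3
)

# Take the most frequent tokens (10,000 in the released tokenizer).
new_tokens = [tok for tok, _ in counts.most_common(10_000)]

# These would then be appended to the base vocabulary, e.g.:
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# tokenizer.add_tokens(new_tokens)  # returns the number of tokens actually added
print(new_tokens[:3])
```

`add_tokens` silently skips entries that already exist in the vocabulary, so the final count of added tokens can be lower than the size of the candidate list.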
## Comparison Results

### On Dhivehi Text: އީދުގެ ހަރަކާތް
🔹 Stock roberta-base
Tokens: ['Þ', 'ĩ', 'Þ', '©', 'Þ', 'ĭ', 'Þ', 'ª', 'Þ', 'İ', 'Þ', '¬', 'Ġ', ...]
Token IDs: [49944, 6382, 49944, 15375, 49944, 13859, ...]
🔸 Custom Dhivehi Extended
Tokens: ['އީދުގެ', 'Ġ', 'ހަރަކާތް']
Token IDs: [59825, 1437, 53039]
The extended tokenizer yields whole, language-aware Thaana tokens, whereas the stock tokenizer falls back to byte-level pieces that fragment each Dhivehi word into many meaningless tokens.
### On English Text (No Change Expected)
Tokens: ['The', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġjumps', 'Ġover', 'Ġthe', 'Ġlazy', 'Ġdog', '.']
Token IDs: [133, 2119, 6219, 23602, 13855, 81, 5, 22414, 2335, 4]
English tokenization remains identical to the original `roberta-base`.
## How to Use

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alakxender/dhivehi-roberta-tokenizer-extended")
tokens = tokenizer.tokenize("އީދުގެ ހަރަކާތް")
print(tokens)
```
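When pairing the extended tokenizer with a pretrained model, the model's embedding matrix must be resized to the new vocabulary size. A minimal sketch follows; a tiny randomly initialised RoBERTa stands in for `roberta-base` so it runs offline, and in practice you would load the real weights with `RobertaForMaskedLM.from_pretrained("roberta-base")`.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Tiny randomly initialised RoBERTa standing in for roberta-base; the
# illustrative sizes below are not the real model's dimensions.
config = RobertaConfig(
    vocab_size=50_265,
    hidden_size=64,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=128,
)
model = RobertaForMaskedLM(config)

# Grow the embedding matrix to the extended vocabulary
# (50,265 base + 10,000 Dhivehi tokens); in real use this
# would be len(tokenizer) for the extended tokenizer.
new_vocab_size = 60_265
model.resize_token_embeddings(new_vocab_size)
print(model.get_input_embeddings().num_embeddings)
```

The rows added by `resize_token_embeddings` are randomly initialised, so the model should be fine-tuned on Dhivehi text before the new tokens carry useful representations.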