
Dhivehi-Roberta Tokenizer (Extended Vocabulary)

This repository contains an extended version of the roberta-base tokenizer, with 10,000 frequently used Dhivehi tokens added to its vocabulary.

Overview

  • Base tokenizer: roberta-base (vocab size: 50265)
  • Dhivehi tokens added: 10,000
  • Source: selected from 424,319 unique Dhivehi tokens (filtered to a minimum length of 3 characters)
  • Use case: Improves tokenization of Dhivehi text for models like TrOCR, mBERT, or RoBERTa.
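
For reference, a vocabulary extension like this can be built with the standard add_tokens API. The sketch below is illustrative only: the input file dhivehi_tokens.txt and the selection step are hypothetical, not necessarily the exact script used to build this repository.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Hypothetical input: one frequency-sorted Dhivehi token per line.
with open("dhivehi_tokens.txt", encoding="utf-8") as f:
    candidates = [line.strip() for line in f if len(line.strip()) >= 3]

# Add the 10,000 most frequent candidates as whole tokens.
added = tokenizer.add_tokens(candidates[:10000])
print(f"Added {added} tokens; new vocab size: {len(tokenizer)}")

tokenizer.save_pretrained("dhivehi-roberta-tokenizer-extended")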

Comparison Results

On Dhivehi Text: އީދުގެ ހަރަކާތް (roughly, “Eid activities”)

🔹 Stock roberta-base

Tokens: ['Þ', 'ĩ', 'Þ', '©', 'Þ', 'ĭ', 'Þ', 'ª', 'Þ', 'İ', 'Þ', '¬', 'Ġ', ...]
Token IDs: [49944, 6382, 49944, 15375, 49944, 13859, ...]

🔸 Custom Dhivehi Extended

Tokens: ['އީދުގެ', 'Ġ', 'ހަރަކާތް']
Token IDs: [59825, 1437, 53039]

The extended tokenizer produces clean, word-level tokens for the Thaana script, whereas the base tokenizer falls back to byte-level pieces, splitting every Thaana character into multiple meaningless tokens.
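
The comparison can be reproduced in a few lines (exact outputs may vary slightly across transformers versions):

from transformers import AutoTokenizer

text = "އީދުގެ ހަރަކާތް"
base = AutoTokenizer.from_pretrained("roberta-base")
extended = AutoTokenizer.from_pretrained("alakxender/dhivehi-roberta-tokenizer-extended")

print("base:    ", base.tokenize(text))
print("extended:", extended.tokenize(text))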

On English Text (No Change Expected)

Tokens: ['The', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġjumps', 'Ġover', 'Ġthe', 'Ġlazy', 'Ġdog', '.']
Token IDs: [133, 2119, 6219, 23602, 13855, 81, 5, 22414, 2335, 4]

English tokenization is unchanged: the extended tokenizer produces exactly the same tokens and IDs as the original roberta-base.
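
A quick way to check this claim on your own text:

from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("roberta-base")
extended = AutoTokenizer.from_pretrained("alakxender/dhivehi-roberta-tokenizer-extended")

sentence = "The quick brown fox jumps over the lazy dog."
print(base.tokenize(sentence) == extended.tokenize(sentence))  # True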

How to Use

from transformers import AutoTokenizer

# Load the extended tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("alakxender/dhivehi-roberta-tokenizer-extended")

# Tokenize a Dhivehi phrase
tokens = tokenizer.tokenize("އީދުގެ ހަރަކާތް")
print(tokens)  # ['އީދުގެ', 'Ġ', 'ހަރަކާތް']
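
When pairing this tokenizer with a pretrained model (e.g. RoBERTa or a TrOCR decoder), the model's embedding matrix must be resized to the enlarged vocabulary before training. A minimal sketch, assuming a roberta-base masked-LM backbone:

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alakxender/dhivehi-roberta-tokenizer-extended")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# The base model only has embeddings for the original 50,265 entries;
# this grows the matrix to cover the 10,000 added Dhivehi tokens.
# The new rows are randomly initialized and require further training.
model.resize_token_embeddings(len(tokenizer))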