This repository contains a custom extension of the microsoft/deberta-v3-base tokenizer, extended with 10,000 frequent Dhivehi tokens. It reduces fragmentation and unknown tokens on Dhivehi (Thaana) text while leaving English tokenization unchanged.
Base tokenizer: microsoft/deberta-v3-base (vocab size: 128001)
Input: The quick brown fox jumps over the lazy dog.
Tokens (STOCK & CUSTOM):
['▁The', '▁quick', '▁brown', '▁fox', '▁jumps', '▁over', '▁the', '▁lazy', '▁dog', '.']
✔️ Token IDs identical — English tokenization is preserved.
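The English check above can be reproduced with a small comparison helper. This is a minimal sketch, not the author's test script: `find_mismatches` and `load_encoders` are hypothetical names introduced here, and loading the tokenizers assumes `transformers` (plus its tokenizer dependencies) is installed and the Hub is reachable.

```python
from typing import Callable, Iterable, List

# An encoder maps a string to a list of token IDs (e.g. tokenizer.encode).
Encoder = Callable[[str], List[int]]


def find_mismatches(
    stock_encode: Encoder, custom_encode: Encoder, texts: Iterable[str]
) -> List[str]:
    """Return the texts whose token IDs differ between the two encoders."""
    return [t for t in texts if stock_encode(t) != custom_encode(t)]


def load_encoders():
    """Hypothetical helper: load both tokenizers and return their encode methods.

    Needs network access, so it is not called automatically.
    """
    from transformers import AutoTokenizer

    stock = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
    custom = AutoTokenizer.from_pretrained(
        "alakxender/deberta-dhivehi-tokenizer-extended"
    )
    return stock.encode, custom.encode


# Example (requires network):
#   stock_encode, custom_encode = load_encoders()
#   print(find_mismatches(stock_encode, custom_encode,
#                         ["The quick brown fox jumps over the lazy dog."]))
```

An empty list from `find_mismatches` would mean every sample sentence encodes to the same IDs under both tokenizers.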
Input: އީދުގެ ހަރަކާތް ފެށުމަށް މިރޭ ހުޅުމާލޭގައި އީދު މަޅި ރޯކުރަނީ
STOCK Tokens (fragmented):
['▁', 'އ', 'ީދ', 'ު', 'ގ', 'ެ', '▁', 'ހ', 'ަ', 'ރ', 'ަ', ...]
CUSTOM Tokens (clean and meaningful):
['އީދުގެ', '▁', 'ހަރަކާތް', '▁', 'ފެށުމަށް', ...]
✔️ Long, language-meaningful tokens reduce fragmentation and UNKs.
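One way to put a number on fragmentation is the average token count per whitespace-separated word. The sketch below is illustrative only: `tokens_per_word` is a hypothetical helper, and skipping bare `▁` tokens is an assumption made here so that standalone word-boundary markers do not inflate the count.

```python
from typing import Callable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], text: str) -> float:
    """Average tokens per whitespace-separated word (lower means less fragmented).

    Tokens that are only the SentencePiece boundary marker "▁" are ignored
    (an assumption of this sketch, not something the tokenizer does itself).
    """
    words = text.split()
    if not words:
        return 0.0
    tokens = [tok for tok in tokenize(text) if tok != "▁"]
    return len(tokens) / len(words)


# Example (requires transformers and network access):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained(
#       "alakxender/deberta-dhivehi-tokenizer-extended")
#   print(tokens_per_word(tok.tokenize, "އީދުގެ ހަރަކާތް"))
```

Running the same function with the stock and the extended tokenizer on identical text gives a direct side-by-side comparison.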
STOCK (fragmented):
Token IDs: [..., 3, 3, 3, ...] # many unknowns
CUSTOM (extended):
Token IDs: [137561, 130775, 129048, ...]
Clean and consistent token IDs for Thaana tokens.
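The share of unknown tokens in an ID sequence can be measured directly. This is a minimal sketch with a hypothetical helper, `unk_fraction`; the unknown-token ID 3 in the sample call follows the listing above, and in practice it is safer to read it from `tokenizer.unk_token_id` than to hard-code it.

```python
from typing import Iterable


def unk_fraction(token_ids: Iterable[int], unk_id: int) -> float:
    """Fraction of IDs in the sequence that equal the unknown-token ID."""
    ids = list(token_ids)
    if not ids:
        return 0.0
    return sum(1 for i in ids if i == unk_id) / len(ids)


# Illustrative IDs only: the second list reuses the three CUSTOM IDs shown above.
print(unk_fraction([5, 3, 3, 3, 7], unk_id=3))
print(unk_fraction([137561, 130775, 129048], unk_id=3))

# With a real tokenizer (requires transformers and network access):
#   ids = tokenizer.encode(text, add_special_tokens=False)
#   print(unk_fraction(ids, tokenizer.unk_token_id))
```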
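Usage:
from transformers import AutoTokenizer
# Load the extended tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("alakxender/deberta-dhivehi-tokenizer-extended")
# Split a Dhivehi phrase into tokens
tokens = tokenizer.tokenize("އީދުގެ ހަރަކާތް")
print(tokens)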