---
library_name: transformers
tokenizer_class: M2M100Tokenizer
tags:
- tokenizer
- sentencepiece
- roman-urdu
- urdu
- transliteration
---
# m2m100_rup_tokenizer_both
This repository hosts the shared tokenizer used for our Roman Urdu ↔ Urdu transliteration models.
It is based on `M2M100Tokenizer` and extended with custom language tokens:

- `__ur__` for Urdu
- `__roman-ur__` for Roman Urdu
These tokens are stored in `added_tokens.json` and are required for correct transliteration.

When preparing input for the models, prepend the correct language token (`__roman-ur__` or `__ur__`) to the text.
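A minimal sketch of this preparation step, assuming the token is prepended with a single space; the example sentence and the `prepare_input` helper are illustrative, not part of this repository:

```python
def prepare_input(text: str, lang_token: str) -> str:
    """Prepend one of the custom language tokens stored in added_tokens.json."""
    # Only the two tokens added to this tokenizer are valid here.
    assert lang_token in ("__roman-ur__", "__ur__")
    return f"{lang_token} {text}"

# Roman Urdu source text, marked for the Roman Urdu -> Urdu direction.
src = prepare_input("mera naam ali hai", "__roman-ur__")
print(src)  # __roman-ur__ mera naam ali hai
```

The resulting string can then be passed to the tokenizer loaded with `M2M100Tokenizer.from_pretrained(...)` as usual.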
## Citation

```bibtex
@inproceedings{butt2025romanurdu,
  title     = {Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models},
  author    = {Umer Butt and Stalin Varanasi and Günter Neumann},
  year      = {2025},
  booktitle = {LoResMT Workshop @ NAACL 2025}
}
```