Muhammad Umer Tariq Butt
metadata
library_name: transformers
tokenizer_class: M2M100Tokenizer
tags:
  - tokenizer
  - sentencepiece
  - roman-urdu
  - urdu
  - transliteration

m2m100_rup_tokenizer_both

This repository hosts the shared tokenizer used for our Roman Urdu ↔ Urdu transliteration models.

It is based on M2M100Tokenizer and extended with custom language tokens:

  • __ur__ for Urdu
  • __roman-ur__ for Roman Urdu

These tokens are stored in added_tokens.json and are required for correct transliteration.
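Because both tokens are required, it can help to verify a downloaded copy of added_tokens.json before using it. The following is a minimal sketch, assuming the file uses the usual layout of a JSON object mapping each token string to its id. The function names are hypothetical helpers and not part of this repository.

```python
import json
from pathlib import Path

# Language tokens named in this README.
REQUIRED_LANGUAGE_TOKENS = ("__ur__", "__roman-ur__")


def missing_language_tokens(added_tokens):
    """Return the required language tokens absent from a token -> id mapping."""
    return [tok for tok in REQUIRED_LANGUAGE_TOKENS if tok not in added_tokens]


def check_added_tokens_file(path):
    """Load added_tokens.json and report which required tokens are missing.

    Assumes the file is a JSON object of the form {token: id, ...}.
    """
    with Path(path).open(encoding="utf-8") as handle:
        added_tokens = json.load(handle)
    return missing_language_tokens(added_tokens)
```

An empty list from check_added_tokens_file means both language tokens are present.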


When preparing input for the models, prepend the correct language token (__roman-ur__ or __ur__) to the text.
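The prepending step can be sketched as a small helper. This is an illustration under stated assumptions: the helper name and the short language keys are hypothetical, and separating the token from the text with a single space is an assumption, not something this README specifies.

```python
# Map short language keys (hypothetical) to the tokens named in this README.
LANGUAGE_TOKENS = {"ur": "__ur__", "roman-ur": "__roman-ur__"}


def with_language_token(text, lang):
    """Prepend the language token for `lang` to `text`.

    Assumption: the token and the text are separated by one space.
    """
    if lang not in LANGUAGE_TOKENS:
        raise ValueError(f"unknown language {lang!r}; expected one of {sorted(LANGUAGE_TOKENS)}")
    return f"{LANGUAGE_TOKENS[lang]} {text.strip()}"
```

For example, with_language_token("aap kaise hain", "roman-ur") yields the input string with __roman-ur__ in front, ready to pass to the tokenizer.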

@inproceedings{butt2025romanurdu,
  title     = {Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models},
  author    = {Umer Butt and Stalin Varanasi and Günter Neumann},
  year      = {2025},
  booktitle = {LoResMT Workshop @ NAACL 2025}
}