---
library_name: transformers
tokenizer_class: M2M100Tokenizer
tags:
- tokenizer
- sentencepiece
- roman-urdu
- urdu
- transliteration
---
# m2m100_rup_tokenizer_both
This repository hosts the **shared tokenizer** used for our Roman Urdu ↔ Urdu transliteration models:
- [Mavkif/m2m100_rup_ur_to_rur](https://huggingface.co/Mavkif/m2m100_rup_ur_to_rur)
- [Mavkif/m2m100_rup_rur_to_ur](https://huggingface.co/Mavkif/m2m100_rup_rur_to_ur)
It is based on [M2M100Tokenizer](https://huggingface.co/docs/transformers/model_doc/m2m_100) and extended with **custom language tokens**:
- `__ur__` for Urdu
- `__roman-ur__` for Roman Urdu

These tokens are stored in `added_tokens.json` and are required for correct transliteration.

---
When preparing input for either model, prepend the correct language token (`__roman-ur__` or `__ur__`) to the text.
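
The snippet below is a minimal sketch of that preparation step, not the authors' reference code. It rests on several assumptions: the repository id `Mavkif/m2m100_rup_tokenizer_both` is inferred from the model links above, the helper names `add_language_token`, `load_tokenizer`, and `encode_for_model` are hypothetical, and a single space between the language token and the text is a guess. Which of the two tokens a given model expects should be checked against that model's own card.

```python
# Language tokens documented in this README.
LANG_TOKENS = {"ur": "__ur__", "roman-ur": "__roman-ur__"}

# Assumed repository id, inferred from the model links above.
DEFAULT_REPO_ID = "Mavkif/m2m100_rup_tokenizer_both"


def add_language_token(text: str, lang: str) -> str:
    """Prepend the language token for `lang` to `text`.

    The single-space separator is an assumption of this sketch.
    """
    if lang not in LANG_TOKENS:
        raise ValueError(f"unknown language {lang!r}; expected one of {sorted(LANG_TOKENS)}")
    return f"{LANG_TOKENS[lang]} {text.strip()}"


def load_tokenizer(repo_id: str = DEFAULT_REPO_ID):
    """Download the shared tokenizer from the Hugging Face Hub."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def encode_for_model(tokenizer, text: str, lang: str):
    """Prefix `text` with its language token and tokenize it."""
    return tokenizer(add_language_token(text, lang), return_tensors="pt")
```

With network access, `encode_for_model(load_tokenizer(), "aap kaise hain", "roman-ur")` would produce a batch that can be passed to a model's `generate` method.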
@inproceedings{butt2025romanurdu,
title = {Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models},
author = {Umer Butt and Stalin Varanasi and Günter Neumann},
year = {2025},
booktitle = {LoResMT Workshop @ NAACL 2025}
}