---
library_name: transformers
tokenizer_class: M2M100Tokenizer
tags:
- tokenizer
- sentencepiece
- roman-urdu
- urdu
- transliteration
---

# m2m100_rup_tokenizer_both

This repository hosts the **shared tokenizer** used for our Roman Urdu ↔ Urdu transliteration models:

- [Mavkif/m2m100_rup_ur_to_rur](https://huggingface.co/Mavkif/m2m100_rup_ur_to_rur)
- [Mavkif/m2m100_rup_rur_to_ur](https://huggingface.co/Mavkif/m2m100_rup_rur_to_ur)

It is based on the [M2M100Tokenizer](https://huggingface.co/docs/transformers/model_doc/m2m_100) and extended with **custom language tokens**:

- `__ur__` for Urdu
- `__roman-ur__` for Roman Urdu

These tokens are stored in `added_tokens.json` and are required for correct transliteration.

---

When preparing input for the models, prepend the correct language token (`__roman-ur__` or `__ur__`) to the text.

# Citation

```bibtex
@inproceedings{butt2025romanurdu,
  title     = {Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models},
  author    = {Umer Butt and Stalin Varanasi and Günter Neumann},
  year      = {2025},
  booktitle = {LoResMT Workshop @ NAACL 2025}
}
```
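The language-token convention described above can be sketched as follows. This is a minimal illustration: `prepare_input` is a hypothetical helper, not part of the tokenizer API, and the commented-out `from_pretrained` call assumes this repository's Hub ID and requires network access.

```python
def prepare_input(text: str, lang_token: str) -> str:
    """Prepend the custom language token expected by the shared tokenizer.

    lang_token must be one of the tokens from added_tokens.json:
    "__ur__" (Urdu) or "__roman-ur__" (Roman Urdu).
    """
    return f"{lang_token} {text}"


# Roman Urdu -> Urdu direction: mark the input as Roman Urdu.
src = prepare_input("mein theek hoon", "__roman-ur__")
print(src)  # __roman-ur__ mein theek hoon

# In practice, the prepared string would then be tokenized, e.g.
# (hedged example, requires network and the actual Hub repo ID):
#
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("Mavkif/m2m100_rup_tokenizer_both")
# batch = tok(src, return_tensors="pt")
```

Prepending the token as plain text works because the language markers are registered as added tokens, so the tokenizer maps each one to a single ID rather than splitting it into subwords.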