---
library_name: transformers
tokenizer_class: M2M100Tokenizer
tags:
- tokenizer
- sentencepiece
- roman-urdu
- urdu
- transliteration
---
# m2m100_rup_tokenizer_both
This repository hosts the **shared tokenizer** used for our Roman Urdu ↔ Urdu transliteration models:
- [Mavkif/m2m100_rup_ur_to_rur](https://huggingface.co/Mavkif/m2m100_rup_ur_to_rur)
- [Mavkif/m2m100_rup_rur_to_ur](https://huggingface.co/Mavkif/m2m100_rup_rur_to_ur)

It is based on [M2M100Tokenizer](https://huggingface.co/docs/transformers/model_doc/m2m_100) and extended with **custom language tokens**:
- `__ur__` for Urdu
- `__roman-ur__` for Roman Urdu

These tokens are stored in `added_tokens.json` and are required for correct transliteration.

---
When preparing input for the models, prepend the correct language token (`__roman-ur__` or `__ur__`) to the text.
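As a minimal sketch, the token-prepending step above can be written as a small helper. The function name, the example sentence, and the dictionary keys are illustrative assumptions; only the token strings `__ur__` and `__roman-ur__` come from this card.

```python
# Language tokens defined in this tokenizer's added_tokens.json.
LANG_TOKENS = {"urdu": "__ur__", "roman-urdu": "__roman-ur__"}

def tag_source(text: str, source_lang: str) -> str:
    """Prepend the custom language token required by the tokenizer.

    `tag_source` is a hypothetical helper, not part of the released code.
    """
    token = LANG_TOKENS[source_lang]
    return f"{token} {text}"

# Example: Roman Urdu input for the Roman Urdu -> Urdu model.
tagged = tag_source("ap kaise hain", "roman-urdu")
print(tagged)  # __roman-ur__ ap kaise hain
```

The tagged string is then encoded as usual, e.g. `M2M100Tokenizer.from_pretrained("Mavkif/m2m100_rup_tokenizer_both")` followed by `tokenizer(tagged, return_tensors="pt")`.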
## Citation

```bibtex
@inproceedings{butt2025romanurdu,
  title     = {Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models},
  author    = {Umer Butt and Stalin Varanasi and Günter Neumann},
  year      = {2025},
  booktitle = {LoResMT Workshop @ NAACL 2025}
}
```