---
library_name: transformers
tokenizer_class: M2M100Tokenizer
tags:
- tokenizer
- sentencepiece
- roman-urdu
- urdu
- transliteration
---

# m2m100_rup_tokenizer_both

This repository hosts the **shared tokenizer** used for our Roman Urdu ↔ Urdu transliteration models:  

- [Mavkif/m2m100_rup_ur_to_rur](https://huggingface.co/Mavkif/m2m100_rup_ur_to_rur)  
- [Mavkif/m2m100_rup_rur_to_ur](https://huggingface.co/Mavkif/m2m100_rup_rur_to_ur)  

It is based on [M2M100Tokenizer](https://huggingface.co/docs/transformers/model_doc/m2m_100) and extended with **custom language tokens**:  

- `__ur__` for Urdu  
- `__roman-ur__` for Roman Urdu  

These tokens are stored in `added_tokens.json` and are required for correct transliteration.
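As a quick sanity check, a local copy of `added_tokens.json` can be inspected to confirm both language tokens are present. The sketch below assumes the file uses the usual Transformers layout, a JSON object mapping each added token to its ID; the helper name `missing_language_tokens` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path

# The two custom language tokens named in this README.
REQUIRED_TOKENS = ("__ur__", "__roman-ur__")


def missing_language_tokens(added_tokens_path):
    """Return the required language tokens absent from added_tokens.json.

    Assumes the file is a JSON object whose keys are the added tokens.
    """
    added = json.loads(Path(added_tokens_path).read_text(encoding="utf-8"))
    return [token for token in REQUIRED_TOKENS if token not in added]
```

An empty list means both tokens were found.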

---

When preparing input for either model, prepend the appropriate language token (`__roman-ur__` or `__ur__`) to the text.
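The prepending step might look like the minimal sketch below. Several details are assumptions rather than facts from this README: the repo ID `Mavkif/m2m100_rup_tokenizer_both` is inferred from the title and the linked model names, the single space between the token and the text is a guess, and the helper names are hypothetical. The README also does not say whether the token should mark the input language or the output language, so check the model cards before relying on either.

```python
# Language tokens named in this README, keyed by a short label (labels are hypothetical).
LANGUAGE_TOKENS = {"ur": "__ur__", "roman-ur": "__roman-ur__"}


def prepend_language_token(text, lang):
    """Return `text` with the language token for `lang` placed in front.

    Assumption: one space separates the token from the text.
    """
    if lang not in LANGUAGE_TOKENS:
        raise ValueError(f"unknown language {lang!r}; expected one of {sorted(LANGUAGE_TOKENS)}")
    return f"{LANGUAGE_TOKENS[lang]} {text.strip()}"


def load_tokenizer(repo_id="Mavkif/m2m100_rup_tokenizer_both"):
    """Load the shared tokenizer from the Hub (repo ID is an assumption)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import M2M100Tokenizer

    return M2M100Tokenizer.from_pretrained(repo_id)


def encode(text, lang, tokenizer):
    """Prepend the language token, then tokenize the result."""
    return tokenizer(prepend_language_token(text, lang))
```

With network access, `encode("aap kaise hain", "roman-ur", load_tokenizer())` would return the tokenizer's usual encoding for the prefixed string.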

@inproceedings{butt2025romanurdu,
  title = {Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models},
  author = {Umer Butt and Stalin Varanasi and Günter Neumann},
  year = {2025},
  booktitle = {LoResMT Workshop @ NAACL 2025}
}