Muhammad Umer Tariq Butt committed on
Commit 7fa4edc · 1 Parent(s): af72f19

Rename readme.md to README.md

Files changed (1)
  1. README.md +35 -0
README.md ADDED
@@ -0,0 +1,35 @@
+ ---
+ library_name: transformers
+ tokenizer_class: M2M100Tokenizer
+ tags:
+ - tokenizer
+ - sentencepiece
+ - roman-urdu
+ - urdu
+ - transliteration
+ ---
+
+ # m2m100_rup_tokenizer_both
+
+ This repository hosts the **shared tokenizer** used for our Roman Urdu ↔ Urdu transliteration models:
+
+ - [Mavkif/m2m100_rup_ur_to_rur](https://huggingface.co/Mavkif/m2m100_rup_ur_to_rur)
+ - [Mavkif/m2m100_rup_rur_to_ur](https://huggingface.co/Mavkif/m2m100_rup_rur_to_ur)
+
+ It is based on [M2M100Tokenizer](https://huggingface.co/docs/transformers/model_doc/m2m_100) and extended with **custom language tokens**:
+
+ - `__ur__` for Urdu
+ - `__roman-ur__` for Roman Urdu
+
+ These tokens are stored in `added_tokens.json` and are required for correct transliteration.
+
+ ---
+
+ When preparing input for these models, prepend the appropriate language token (`__roman-ur__` or `__ur__`) to the text.
+
+ @inproceedings{butt2025romanurdu,
+ title = {Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models},
+ author = {Umer Butt and Stalin Varanasi and Günter Neumann},
+ year = {2025},
+ booktitle = {LoResMT Workshop @ NAACL 2025}
+ }
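
The README's instruction to prepend a language token can be sketched as a small helper. This is only an illustration, not the authors' preprocessing code. The helper names `add_language_token` and `load_tokenizer`, the single-space separator, and the repo id `Mavkif/m2m100_rup_tokenizer_both` (the `Mavkif` namespace from the model links combined with the README title) are all assumptions. The README also does not say whether the token should name the source or the target language, so the helper leaves that choice to the caller.

```python
# Language tokens named in the README.
LANG_TOKENS = {
    "ur": "__ur__",
    "roman-ur": "__roman-ur__",
}


def add_language_token(text: str, lang: str, sep: str = " ") -> str:
    """Prepend the language token for `lang` to `text`.

    Assumption: token and text are joined by a single space; the README
    does not specify a separator.
    """
    if lang not in LANG_TOKENS:
        raise ValueError(
            f"unknown language {lang!r}; expected one of {sorted(LANG_TOKENS)}"
        )
    return f"{LANG_TOKENS[lang]}{sep}{text.strip()}"


def load_tokenizer(repo_id: str = "Mavkif/m2m100_rup_tokenizer_both"):
    """Load the shared tokenizer from the Hub.

    Assumption: `repo_id` is inferred, not confirmed by the README.
    Requires the optional `transformers` and `sentencepiece` packages.
    """
    from transformers import AutoTokenizer  # lazy import: optional dependency

    return AutoTokenizer.from_pretrained(repo_id)


prepared = add_language_token("aap kaise hain", "roman-ur")
print(prepared)  # __roman-ur__ aap kaise hain
```

The prepared string could then be encoded with something like `load_tokenizer()(prepared, return_tensors="pt")` before being passed to one of the transliteration models. Check the model cards for the exact input format they expect.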