---
library_name: transformers
tokenizer_class: M2M100Tokenizer
tags:
- tokenizer
- sentencepiece
- roman-urdu
- urdu
- transliteration
---

# m2m100_rup_tokenizer_both

This repository hosts the **shared tokenizer** used by our Roman Urdu ↔ Urdu transliteration models:

- [Mavkif/m2m100_rup_ur_to_rur](https://huggingface.co/Mavkif/m2m100_rup_ur_to_rur)
- [Mavkif/m2m100_rup_rur_to_ur](https://huggingface.co/Mavkif/m2m100_rup_rur_to_ur)

It is based on [M2M100Tokenizer](https://huggingface.co/docs/transformers/model_doc/m2m_100) and extended with **custom language tokens**:

- `__ur__` for Urdu
- `__roman-ur__` for Roman Urdu

These tokens are stored in `added_tokens.json` and are required for correct transliteration.
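
Because both tokens must be present for transliteration to work, it can be useful to sanity-check a downloaded copy of `added_tokens.json`. The sketch below assumes the usual layout of that file (a JSON object mapping each added token string to its integer id). The function name `missing_language_tokens` and the local file path are hypothetical and are not part of this repository.

```python
import json
from pathlib import Path

# The two custom language tokens named in this card.
REQUIRED_LANGUAGE_TOKENS = ("__ur__", "__roman-ur__")


def missing_language_tokens(added_tokens_path):
    """Return the required language tokens that are absent from the file."""
    # Assumption: added_tokens.json is a JSON object of {token: id} pairs.
    raw = Path(added_tokens_path).read_text(encoding="utf-8")
    added_tokens = json.loads(raw)
    return [tok for tok in REQUIRED_LANGUAGE_TOKENS if tok not in added_tokens]


if __name__ == "__main__":
    # Hypothetical path to a local copy of the file from this repository.
    path = Path("added_tokens.json")
    if path.exists():
        missing = missing_language_tokens(path)
        print("missing tokens:", missing if missing else "none")
    else:
        print("added_tokens.json not found in the current directory")
```

An empty result means both language tokens are registered. A non-empty result suggests the tokenizer files are incomplete or come from a different checkpoint.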

---

When preparing input for these models, prepend the appropriate language token (`__roman-ur__` or `__ur__`) to the text.
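
A minimal sketch of that preparation step follows. It rests on several assumptions not stated in this card: that a single space separates the token from the text, that the tokenizer repo id is `Mavkif/m2m100_rup_tokenizer_both` (inferred from the title and the linked model repos), and that the helper names `prepend_language_token` and `encode_for_model` are purely illustrative. Which of the two tokens each model expects is not spelled out here, so check the individual model cards before relying on this.

```python
ROMAN_URDU_TOKEN = "__roman-ur__"
URDU_TOKEN = "__ur__"
LANGUAGE_TOKENS = (ROMAN_URDU_TOKEN, URDU_TOKEN)

# Assumption: repo id inferred from this card's title and the linked models.
TOKENIZER_REPO = "Mavkif/m2m100_rup_tokenizer_both"


def prepend_language_token(text: str, language_token: str) -> str:
    """Return the text with one of the two custom language tokens in front."""
    if language_token not in LANGUAGE_TOKENS:
        raise ValueError(f"unknown language token: {language_token!r}")
    # Assumption: one space between the language token and the text.
    return f"{language_token} {text.strip()}"


def encode_for_model(text: str, language_token: str):
    """Tokenize the prepared text with the shared tokenizer (needs network)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(TOKENIZER_REPO)
    return tokenizer(prepend_language_token(text, language_token))


if __name__ == "__main__":
    # Prints: __roman-ur__ aap kaise hain
    print(prepend_language_token("aap kaise hain", ROMAN_URDU_TOKEN))
```

Keeping the string preparation separate from tokenizer loading makes it possible to test the prefixing logic offline.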

    @inproceedings{butt2025romanurdu,
      title = {Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models},
      author = {Umer Butt and Stalin Varanasi and Günter Neumann},
      year = {2025},
      booktitle = {LoResMT Workshop @ NAACL 2025}
    }