---
library_name: transformers
license: cc-by-nc-4.0
tags:
- nllb
- uzs
- Southern Uzbek
- Afghani Uzbek
language:
- en
- uz
- uzs
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation
datasets:
- tahrirchi/lutfiy
---
# Lutfiy: Southern Uzbek Machine Translation Model
This repository contains an initial machine translation model for the Southern Uzbek language, developed as part of the research paper "Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek".
## Model details
| Model | Tokenizer Length | Parameter Count |
|-------|------------------|-----------------|
| [`lutfiy`](https://huggingface.co/tahrirchi/lutfiy) | 256,204 | 615M |
**Common attributes:**
- **Base Model:** [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- **Languages:** Southern Uzbek, Northern Uzbek, English
## Intended uses & limitations
This model is designed for machine translation tasks involving the Southern Uzbek language. It can be used to translate between Southern Uzbek and either Northern Uzbek or English.
### How to use
You can use this model with the Transformers library. Install the helper package first, then run the quick example that follows.
#### Install the `lutfiy` library for fixing ZWNJ
```bash
pip install lutfiy
```
```python
from lutfiy import fix_zwnj
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_ckpt = "tahrirchi/lutfiy"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)
# Example translation
input_text = "O'zbekiston kelajagi buyuk davlatdir."
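# Language codes: Northern Uzbek (Latin script) -> Southern Uzbek (Arabic script)
tokenizer.src_lang = "uzn_Latn"
tokenizer.tgt_lang = "uzs_Arab"
inputs = tokenizer(input_text, return_tensors="pt")
# Explicitly force the target-language token as the first generated token,
# as is usual for NLLB-based models
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("uzs_Arab"),
)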
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(fix_zwnj(translated_text)) # اۉزبېکستان کېلهجگی بویوک دولت دیر.
```
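ZWNJ (zero-width non-joiner, U+200C) is an invisible character, so it is hard to tell by eye whether a translated string contains it. The sketch below is a small debugging aid for making ZWNJ visible. The helper names `count_zwnj` and `reveal_zwnj` are hypothetical and are not part of the `lutfiy` package; the sketch does not reproduce what `fix_zwnj` does internally.

```python
# ZERO WIDTH NON-JOINER, Unicode code point U+200C
ZWNJ = "\u200c"


def count_zwnj(text: str) -> int:
    """Return how many ZWNJ characters the string contains."""
    return text.count(ZWNJ)


def reveal_zwnj(text: str, marker: str = "<ZWNJ>") -> str:
    """Replace each invisible ZWNJ with a visible marker for inspection."""
    return text.replace(ZWNJ, marker)


# Hypothetical sample string (not model output) with one ZWNJ in the middle
sample = "abc" + ZWNJ + "def"
print(count_zwnj(sample))   # 1
print(reveal_zwnj(sample))  # abc<ZWNJ>def
```

Running the same helpers on a string before and after `fix_zwnj` is one way to see what the function changed.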
## Training data
The model was trained on a parallel corpus of about 40,000 sentence pairs, including:
- Northern Uzbek - Southern Uzbek (37,415 pairs)
- English - Southern Uzbek (2,579 pairs)
The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmash).
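As a quick check on the corpus composition, the pair counts listed above can be summed and turned into shares. This is only an arithmetic sketch; it assumes the two listed subsets make up the whole training corpus, which the model card does not state explicitly.

```python
# Pair counts as listed in the training data section
pairs = {
    "Northern Uzbek - Southern Uzbek": 37_415,
    "English - Southern Uzbek": 2_579,
}

# Assumption: these two subsets are the entire corpus
total = sum(pairs.values())
shares = {name: count / total for name, count in pairs.items()}

print(total)  # 39994
for name, share in shares.items():
    # Share of each language pair, as a percentage with one decimal place
    print(f"{name}: {share:.1%}")
```

Under that assumption, the Northern Uzbek - Southern Uzbek pairs account for roughly 94% of the data, so the English direction is trained on far fewer examples.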
## Training procedure
For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2508.14586).
## Citation
If you use this model in your research, please cite our paper:
```bibtex
@misc{mamasaidov2025fillinggapuzbekcreating,
title={Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek},
author={Mukhammadsaid Mamasaidov and Azizullah Aral and Abror Shopulatov and Mironshoh Inomjonov},
year={2025},
eprint={2508.14586},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.14586},
}
```
## Contacts
We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Southern Uzbek.
For questions, issues, or further development, please contact m.mamasaidov@tahrirchi.uz or a.shopulatov@tahrirchi.uz.