---
library_name: transformers
license: cc-by-nc-4.0
tags:
- nllb
- uzs
- Southern Uzbek
- Afghani Uzbek
language:
- en
- uz
- uzs
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation
datasets:
- tahrirchi/lutfiy
---
# Lutfiy: Southern Uzbek Machine Translation Model
This repository contains an initial machine translation model for the Southern Uzbek language, developed as part of the research paper "Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek".
## Model details
| Model | Tokenizer Length | Parameter Count |
|-------|------------|-------------------|
| [`lutfiy`](https://huggingface.co/tahrirchi/lutfiy) | 256,204 | 615M |
**Common attributes:**
- **Base Model:** [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- **Languages:** Southern Uzbek, Northern Uzbek, English
## Intended uses & limitations
These models are designed for machine translation tasks involving the Southern Uzbek language. They can be used to translate between Southern Uzbek, Northern Uzbek, and English.
### How to use
You can use these models with the Transformers library. Here's a quick example:
#### Install `lutfiy` library for fixing ZWNJ
```bash
pip install lutfiy
```
```python
from lutfiy import fix_zwnj
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = "tahrirchi/lutfiy"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

# Example translation: Northern Uzbek (Latin) -> Southern Uzbek (Arabic script)
input_text = "O'zbekiston kelajagi buyuk davlatdir."
tokenizer.src_lang = "uzn_Latn"
tokenizer.tgt_lang = "uzs_Arab"
inputs = tokenizer(input_text, return_tensors="pt")

# Explicitly force the target-language tag as the first generated token,
# as is usual for NLLB-based models.
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("uzs_Arab"),
)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# fix_zwnj fixes ZWNJ (zero-width non-joiner) characters in the output
print(fix_zwnj(translated_text)) # اۉزبېکستان کېلهجگی بویوک دولت دیر.
```
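To translate several sentences at once, the single-sentence example can be wrapped in a small helper. The following is a minimal sketch, not part of the `lutfiy` package: `translate_batch` and its `max_new_tokens` default are hypothetical, and it assumes the tokenizer and model behave like standard NLLB checkpoints in Transformers.

```python
from typing import List


def translate_batch(
    texts: List[str],
    tokenizer,
    model,
    src_lang: str = "uzn_Latn",
    tgt_lang: str = "uzs_Arab",
    max_new_tokens: int = 128,  # hypothetical default, tune for your inputs
) -> List[str]:
    """Translate a list of sentences in one padded batch."""
    # NLLB-style tokenizers prepend the source-language tag chosen here.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    # Force the target-language tag as the first generated token.
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the `tokenizer` and `model` loaded as in the example above, `translate_batch(["O'zbekiston kelajagi buyuk davlatdir."], tokenizer, model)` returns a list of translations, each of which can then be passed through `fix_zwnj`. For English input, `src_lang="eng_Latn"` follows the NLLB naming convention; that this checkpoint keeps that code is an assumption.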
## Training data
The models were trained on a parallel corpus of about 40,000 sentence pairs, including:
- Northern Uzbek - Southern Uzbek (37,415 pairs)
- English - Southern Uzbek (2,579 pairs)
The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmash).
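The two listed counts sum to 39,994 pairs, and the split between them is heavily skewed toward the Northern Uzbek - Southern Uzbek direction. A short calculation using only the figures stated above makes the proportions explicit:

```python
# Pair counts exactly as listed in this section.
pairs = {
    "Northern Uzbek - Southern Uzbek": 37_415,
    "English - Southern Uzbek": 2_579,
}

total = sum(pairs.values())
shares = {name: count / total for name, count in pairs.items()}

for name, count in pairs.items():
    print(f"{name}: {count:,} pairs ({shares[name]:.1%})")
print(f"Total: {total:,} pairs")
```

English - Southern Uzbek makes up roughly 6% of the listed pairs, which is worth keeping in mind when comparing translation quality across directions.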
## Training procedure
For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2508.14586).
## Citation
If you use these models in your research, please cite our paper:
```bibtex
@misc{mamasaidov2025fillinggapuzbekcreating,
title={Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek},
author={Mukhammadsaid Mamasaidov and Azizullah Aral and Abror Shopulatov and Mironshoh Inomjonov},
year={2025},
eprint={2508.14586},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.14586},
}
```
## Contacts
We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Southern Uzbek.
For questions about further development or issues with the dataset, please contact m.mamasaidov@tahrichi.uz or a.shopulatov@tahrirchi.uz.