---
library_name: transformers
license: cc-by-nc-4.0
tags:
- nllb
- uzs
- Southern Uzbek
- Afghani Uzbek
language:
- en
- uz
- uzs
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation
datasets:
- tahrirchi/lutfiy
---
# Lutfiy: Southern Uzbek Machine Translation Model

This repository contains an initial machine translation model for the Southern Uzbek language, developed as part of the research paper "Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek".

## Model details

| Model | Tokenizer Length | Parameter Count |
|-------|------------|-------------------|
| [`lutfiy`](https://huggingface.co/tahrirchi/lutfiy) | 256,204 | 615M |

**Common attributes:**
- **Base Model:** [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- **Languages:** Southern Uzbek, Northern Uzbek, English

## Intended uses & limitations

This model is designed for machine translation tasks involving the Southern Uzbek language. It can be used for translation between Southern Uzbek, Northern Uzbek, and English.

### How to use

You can use this model with the Transformers library. Here's a quick example:

#### Install the `lutfiy` library for fixing ZWNJ (zero-width non-joiner) characters
```bash
pip install lutfiy
```
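
ZWNJ is the zero-width non-joiner (U+200C), an invisible character used in Arabic-script text to control how letters join. The sketch below is not part of the `lutfiy` package and does not reproduce what `fix_zwnj` does; it only shows one way to make ZWNJ characters visible when inspecting a string. The helper `show_zwnj`, the `<ZWNJ>` marker, and the placement of the ZWNJ in the sample string are assumptions made for illustration.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner


def show_zwnj(text: str, marker: str = "<ZWNJ>") -> str:
    """Return `text` with every invisible ZWNJ replaced by a visible marker."""
    return text.replace(ZWNJ, marker)


# Illustrative sample only: the ZWNJ position is assumed, not taken from model output.
sample = "کېله" + ZWNJ + "جگی"

print(unicodedata.name(ZWNJ))  # ZERO WIDTH NON-JOINER
print(sample.count(ZWNJ))      # 1
print(show_zwnj(sample))
```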

```python
from lutfiy import fix_zwnj
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = "tahrirchi/lutfiy"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

# Example translation
input_text = "O'zbekiston kelajagi buyuk davlatdir."

# NLLB-style language codes: Northern Uzbek (Latin script) to Southern Uzbek (Arabic script)
tokenizer.src_lang = "uzn_Latn"
tokenizer.tgt_lang = "uzs_Arab"

inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(fix_zwnj(translated_text)) # اۉزبېکستان کېلهجگی بویوک دولت دیر.

```
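
To translate several sentences at once, the single-sentence snippet above can be wrapped in a small helper. This is a minimal sketch, not part of the released code: the function name `translate_batch` and the `max_new_tokens` default are assumptions, and it sets the language codes the same way as the snippet above rather than introducing any other mechanism.

```python
from typing import List


def translate_batch(
    texts: List[str],
    tokenizer,
    model,
    src_lang: str = "uzn_Latn",
    tgt_lang: str = "uzs_Arab",
    max_new_tokens: int = 128,  # assumed default, tune for your inputs
) -> List[str]:
    """Translate a list of sentences with an already loaded tokenizer and model."""
    # Set the language codes on the tokenizer, as in the single-sentence example.
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang

    # Pad to the longest sentence so the batch forms one rectangular tensor.
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Usage, assuming `tokenizer`, `model`, and `fix_zwnj` were set up as shown earlier:
# for line in translate_batch(["O'zbekiston kelajagi buyuk davlatdir."], tokenizer, model):
#     print(fix_zwnj(line))
```

Keeping the tokenizer and model as arguments means they are loaded once and reused across calls.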

## Training data

The model was trained on a parallel corpus of roughly 40,000 sentence pairs, including:
- Northern Uzbek - Southern Uzbek (37,415 pairs)
- English - Southern Uzbek (2,579 pairs)

The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmash).
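
As a quick sanity check on the figures above, the two listed subsets can be summed and expressed as shares of the total. The snippet uses only the pair counts stated in this section; the dictionary keys are just labels.

```python
# Pair counts as listed in the "Training data" section.
pairs = {
    "Northern Uzbek - Southern Uzbek": 37_415,
    "English - Southern Uzbek": 2_579,
}

total = sum(pairs.values())
print(f"Total listed pairs: {total:,}")  # Total listed pairs: 39,994

for name, count in pairs.items():
    # Share of each language pair among the listed sentence pairs.
    print(f"{name}: {count:,} pairs ({count / total:.1%})")
```

The listed subsets add up to 39,994 pairs, with the Northern Uzbek - Southern Uzbek portion making up about 93.6% of them.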

## Training procedure

For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2508.14586).

## Citation

If you use this model in your research, please cite our paper:

```bibtex
@misc{mamasaidov2025fillinggapuzbekcreating,
      title={Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek}, 
      author={Mukhammadsaid Mamasaidov and Azizullah Aral and Abror Shopulatov and Mironshoh Inomjonov},
      year={2025},
      eprint={2508.14586},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14586}, 
}
```

## Contacts

We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Southern Uzbek.

For questions about further development or issues with the dataset, please contact m.mamasaidov@tahrirchi.uz or a.shopulatov@tahrirchi.uz.