---
library_name: transformers
license: cc-by-nc-4.0
tags:
- nllb
- uzs
- Southern Uzbek
- Afghani Uzbek
language:
- en
- uz
- uzs
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation
datasets:
- tahrirchi/lutfiy
---

# Lutfiy: Southern Uzbek Machine Translation Model

This repository contains an initial machine translation model for the Southern Uzbek language, developed as part of the research paper "Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek".

## Model details

| Model | Tokenizer Length | Parameter Count |
|-------|------------------|-----------------|
| [`lutfiy`](https://huggingface.co/tahrirchi/lutfiy) | 256,204 | 615M |

**Common attributes:**
- **Base Model:** [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- **Languages:** Southern Uzbek, Northern Uzbek, English

## Intended uses & limitations

This model is designed for machine translation tasks involving the Southern Uzbek language. It can be used to translate between Southern Uzbek, Northern Uzbek, and English.

### How to use

You can use the model with the Transformers library. The example below translates a Northern Uzbek sentence into Southern Uzbek and uses the `lutfiy` package to fix ZWNJ (zero-width non-joiner) characters in the output.

#### Install the `lutfiy` library for fixing ZWNJ
```bash
pip install lutfiy
```

```python
from lutfiy import fix_zwnj
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = "tahrirchi/lutfiy"

# Load the tokenizer and the fine-tuned NLLB model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

# Example translation: Northern Uzbek (Latin script) to Southern Uzbek (Arabic script)
input_text = "O'zbekiston kelajagi buyuk davlatdir."

tokenizer.src_lang = "uzn_Latn"
tokenizer.tgt_lang = "uzs_Arab"

inputs = tokenizer(input_text, return_tensors="pt")
# Note: standard NLLB usage also passes
# forced_bos_token_id=tokenizer.convert_tokens_to_ids("uzs_Arab") to generate();
# consider adding it if the output is not in the target language.
outputs = model.generate(**inputs)

# Decode the generated tokens and fix ZWNJ characters in the output
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(fix_zwnj(translated_text))  # اۉزبېکستان کېلهجگی بویوک دولت دیر.
```

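Because ZWNJ (U+200C, zero-width non-joiner) is invisible when printed, it can be hard to tell whether a string contains it. The following is a minimal sketch, using only the Python standard library, for making ZWNJ visible when inspecting text. `show_zwnj` is a hypothetical helper introduced here for illustration; it is not part of the `lutfiy` package, and the sample string is a neutral placeholder rather than real Southern Uzbek text.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner


def show_zwnj(text: str) -> str:
    """Replace each invisible ZWNJ with a visible marker (hypothetical helper)."""
    return text.replace(ZWNJ, "<ZWNJ>")


# Placeholder sample with one ZWNJ between two letter groups
sample = "ab" + ZWNJ + "cd"

print(unicodedata.name(ZWNJ))  # ZERO WIDTH NON-JOINER
print(sample.count(ZWNJ))      # 1
print(show_zwnj(sample))       # ab<ZWNJ>cd
```

Applying a helper like this to a translation before and after a ZWNJ fix is one way to see what changed.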
## Training data

The model was trained on a parallel corpus of roughly 40,000 sentence pairs (39,994 in total), consisting of:
- Northern Uzbek - Southern Uzbek (37,415 pairs)
- English - Southern Uzbek (2,579 pairs)

The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmash).

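The composition of the corpus can be checked with a few lines of arithmetic. The sketch below uses only the pair counts reported in this section and computes the total and each subset's share; the variable names are illustrative.

```python
# Pair counts as reported in this model card
pair_counts = {
    "Northern Uzbek - Southern Uzbek": 37_415,
    "English - Southern Uzbek": 2_579,
}

total = sum(pair_counts.values())
print(f"Total pairs: {total:,}")

# Share of each language pair in the whole corpus
for name, count in pair_counts.items():
    share = 100 * count / total
    print(f"{name}: {count:,} pairs ({share:.1f}%)")
```

The Northern Uzbek - Southern Uzbek portion makes up roughly 94% of the corpus, so the English - Southern Uzbek direction is trained on far less data.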
## Training procedure

For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2508.14586).

## Citation

If you use this model in your research, please cite our paper:

```bibtex
@misc{mamasaidov2025fillinggapuzbekcreating,
      title={Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek},
      author={Mukhammadsaid Mamasaidov and Azizullah Aral and Abror Shopulatov and Mironshoh Inomjonov},
      year={2025},
      eprint={2508.14586},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14586},
}
```

## Contacts

We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Southern Uzbek.

For questions about further development or issues with the dataset, please contact m.mamasaidov@tahrirchi.uz or a.shopulatov@tahrirchi.uz.