	---
	library_name: transformers
	license: cc-by-nc-4.0
	tags:
	- nllb
	- uzs
	- Southern Uzbek
	- Afghani Uzbek
	language:
	- en
	- uz
	- uzs
	base_model: facebook/nllb-200-distilled-600M
	pipeline_tag: translation
	datasets:
	- tahrirchi/lutfiy
	---
	# Lutfiy: Southern Uzbek Machine Translation Model

	This repository contains an initial machine translation model for the Southern Uzbek language, developed as part of the research paper "Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek".

	## Model details

	| Model | Tokenizer Length | Parameter Count |
	|-------|------------------|-----------------|
	| [`lutfiy`](https://huggingface.co/tahrirchi/lutfiy) | 256,204 | 615M |

	Common attributes:
	- Base Model: [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
	- Languages: Southern Uzbek, Northern Uzbek, English
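
The tokenizer length and parameter count in the table can be checked against a loaded checkpoint. The following is a minimal sketch, assuming only that `len(tokenizer)` works and that `model.parameters()` yields tensors with `numel()`, which holds for Transformers tokenizers and PyTorch models. `summarize_checkpoint` and `format_millions` are hypothetical helpers, not part of this repository.

```python
def summarize_checkpoint(tokenizer, model) -> dict:
    """Return tokenizer length and total parameter count (hypothetical helper)."""
    # Sum the element counts of every parameter tensor the model exposes.
    parameter_count = sum(p.numel() for p in model.parameters())
    return {
        "tokenizer_length": len(tokenizer),
        "parameter_count": parameter_count,
    }


def format_millions(n: int) -> str:
    """Format a raw count as a rounded number of millions, e.g. 615M."""
    return f"{round(n / 1e6)}M"


# Assumed usage with the checkpoint from this card (requires network access):
# from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# tokenizer = AutoTokenizer.from_pretrained("tahrirchi/lutfiy")
# model = AutoModelForSeq2SeqLM.from_pretrained("tahrirchi/lutfiy")
# stats = summarize_checkpoint(tokenizer, model)
# print(stats["tokenizer_length"], format_millions(stats["parameter_count"]))
```

The helper is duck-typed on purpose, so it can be tried on any tokenizer and model pair without changes.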

	## Intended uses & limitations

	This model is designed for machine translation tasks involving the Southern Uzbek language. It can be used to translate between Southern Uzbek, Northern Uzbek, and English.
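
To make the translation directions concrete, the sketch below enumerates every direction that has Southern Uzbek on one side. `uzn_Latn` and `uzs_Arab` are the codes used in this card's usage example; `eng_Latn` is the stock NLLB-200 code for English and is assumed, not confirmed, to be what this checkpoint expects. `directions_involving` is a hypothetical helper.

```python
from itertools import permutations

# Language codes: two come from this card's usage example,
# "eng_Latn" is an assumption based on the NLLB-200 convention.
LANG_CODES = {
    "Southern Uzbek": "uzs_Arab",
    "Northern Uzbek": "uzn_Latn",
    "English": "eng_Latn",
}


def directions_involving(code: str) -> list[tuple[str, str]]:
    """List (source, target) code pairs where `code` is the source or the target."""
    return [
        (src, tgt)
        for src, tgt in permutations(LANG_CODES.values(), 2)
        if code in (src, tgt)
    ]


for src, tgt in directions_involving("uzs_Arab"):
    print(f"{src} -> {tgt}")
```

Each printed pair corresponds to one `tokenizer.src_lang` / `tokenizer.tgt_lang` setting.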

	### How to use

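	You can use this model with the Transformers library. The example below also uses the `lutfiy` package, whose `fix_zwnj` function fixes zero-width non-joiner (ZWNJ) characters in the translated text.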

	#### Install `lutfiy` library for fixing ZWNJ
	```bash
	pip install lutfiy
	```

	```python
	from lutfiy import fix_zwnj
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	model_ckpt = "tahrirchi/lutfiy"

	tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
	model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

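	# Example sentence in Northern Uzbek (Latin script)
	input_text = "O'zbekiston kelajagi buyuk davlatdir."

	# Source: Northern Uzbek (Latin script); target: Southern Uzbek (Arabic script)
	tokenizer.src_lang = "uzn_Latn"
	tokenizer.tgt_lang = "uzs_Arab"

	inputs = tokenizer(input_text, return_tensors="pt")
	# Note: stock NLLB checkpoints usually also need
	# forced_bos_token_id=tokenizer.convert_tokens_to_ids(<target code>) in generate();
	# whether this fine-tuned checkpoint needs it is not stated in this card.
	outputs = model.generate(**inputs)
	translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

	# Fix ZWNJ characters in the output before printing
	print(fix_zwnj(translated_text))  # اۉزبېکستان کېلهجگی بویوک دولت دیر.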
	```
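
The zero-width non-joiner (ZWNJ, U+200C) is invisible when printed, which makes it hard to tell whether it is present in a translated string. The sketch below does not reproduce what `fix_zwnj` does internally, which this card does not describe; it only shows one way to make ZWNJ characters visible for inspection. `show_zwnj` and `count_zwnj` are hypothetical helpers, and the position of the ZWNJ in the sample word is assumed for illustration rather than taken from real model output.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def show_zwnj(text: str, marker: str = "<ZWNJ>") -> str:
    """Replace each invisible ZWNJ with a visible marker (hypothetical helper)."""
    return text.replace(ZWNJ, marker)


def count_zwnj(text: str) -> int:
    """Count the ZWNJ characters in a string."""
    return text.count(ZWNJ)


# Illustrative word with a ZWNJ inserted at an assumed position.
sample = "کېله" + ZWNJ + "جگی"
print(count_zwnj(sample))
print(show_zwnj(sample))
```

Comparing `count_zwnj` on the raw and post-processed translation is a quick way to see whether the post-processing step changed anything.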

	## Training data

	The model was trained on a parallel corpus of about 40,000 sentence pairs, including:
	- Northern Uzbek - Southern Uzbek (37,415 pairs)
	- English - Southern Uzbek (2,579 pairs)

	The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmash).
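
As a quick sanity check on the figures above, the two listed subsets add up to 39,994 pairs, which matches the rounded total of 40,000. The sketch below recomputes the total and each subset's share; the dictionary keys are illustrative labels, not official configuration names of the dataset.

```python
# Pair counts as listed in this card; the keys are illustrative labels only.
pair_counts = {
    "Northern Uzbek - Southern Uzbek": 37_415,
    "English - Southern Uzbek": 2_579,
}

total = sum(pair_counts.values())
shares = {name: count / total for name, count in pair_counts.items()}

print(f"total pairs: {total}")  # total pairs: 39994
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The split is heavily skewed toward the Northern Uzbek - Southern Uzbek pairs, which is worth keeping in mind when comparing quality across translation directions.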

	## Training procedure

	For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2508.14586).

	## Citation

	If you use this model in your research, please cite our paper:

	```bibtex
	@misc{mamasaidov2025fillinggapuzbekcreating,
	title={Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek},
	author={Mukhammadsaid Mamasaidov and Azizullah Aral and Abror Shopulatov and Mironshoh Inomjonov},
	year={2025},
	eprint={2508.14586},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2508.14586},
	}
	```

	## Contacts

	We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Southern Uzbek.

	For questions about further development or issues with the model and dataset, please contact m.mamasaidov@tahrirchi.uz or a.shopulatov@tahrirchi.uz.