ArabovMK

Update README.md

8db23a7 verified about 1 month ago

	---
	title: TajikNLPWorld - Persian NLP & Low-Resource Languages
	emoji: 🏔️
	colorFrom: green
	colorTo: red
	sdk: gradio
	pinned: true
	license: mit
	short_description: Tajik & Persian NLP for low-resource languages.
	sdk_version: 6.6.0
	---

	# TajikNLPWorld – Persian NLP & Low‑Resource Languages Research Hub

	![Status](https://img.shields.io/badge/Status-Active-brightgreen)
	![Focus](https://img.shields.io/badge/Focus-Tajik_Language-red)
	![Focus](https://img.shields.io/badge/Focus-Persian_NLP-orange)
	![Focus](https://img.shields.io/badge/Focus-Low_Resource_Languages-blue)

	TajikNLPWorld is a collaborative research initiative dedicated to advancing natural language processing for Tajik, Persian/Dari, and other low‑resource Iranian languages. We develop language models, machine translation systems, linguistic resources, and educational tools to support under‑represented languages in the digital age.

	---

	## 🎯 Our Mission

	- Build open‑source language models for Tajik and other Persian varieties (Dari, Hazaragi).
	- Create high‑quality linguistic resources (corpora, lexicons, evaluation benchmarks).
	- Advance machine translation between Tajik, Persian, Dari, and major world languages.
	- Develop educational materials and interactive demos to lower the entry barrier for low‑resource NLP.
	- Foster a community of researchers, developers, and native speakers working together on language technology.

	---

	## 🚀 Interactive Demos

	Explore our live Hugging Face Spaces and try out our models directly in your browser:

	### 🔤 Language Models
	- [TajikGPT Playground]() – Generate and analyze Tajik text with our latest causal LM.
	- [PersianBERT Explorer]() – Masked language modelling for Tajik, Persian, and Dari.
	- [Multilingual Embeddings]() – Compare word/sentence vectors across Iranian languages.

	### 🌐 Machine Translation
	- [Tajik ↔ Persian Translator]() – Neural translation between Tajik and Persian (Farsi).
	- [Tajik ↔ Russian Translator]() – Translation demo aimed at the Central Asian context.
	- [Persian Multi-Dialect Translation]() – Translate between Tajik, Dari, and Iranian Persian.

	### 📚 Linguistic Tools
	- [Tajik Morphological Analyzer]() – Interactive segmentation and POS tagging (Cyrillic & Perso-Arabic scripts).
	- [Named Entity Recognition for Tajik]() – Identify persons, locations, organizations.
	- [Script Converter]() – Convert between Cyrillic Tajik and Perso-Arabic script.
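
Because Tajik text can arrive in either script, tools like these usually need to know which script they are looking at before doing anything else. The sketch below is one way to do that check with Unicode block ranges. It is an illustration only, not the code behind the demos; the function name `detect_script` and the returned labels are assumptions made for this example.

```python
from collections import Counter

# Unicode block ranges used for the check (assumption: the basic blocks are enough).
CYRILLIC = (0x0400, 0x04FF)  # includes the Tajik letters ғ, ӣ, қ, ӯ, ҳ, ҷ
ARABIC = (0x0600, 0x06FF)    # includes the Persian letters پ, چ, ژ, گ


def detect_script(text: str) -> str:
    """Return 'cyrillic', 'perso-arabic', 'mixed', or 'unknown' for a string."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, and combining marks
        cp = ord(ch)
        if CYRILLIC[0] <= cp <= CYRILLIC[1]:
            counts["cyrillic"] += 1
        elif ARABIC[0] <= cp <= ARABIC[1]:
            counts["perso-arabic"] += 1
    if not counts:
        return "unknown"  # no Cyrillic or Arabic-block letters found
    if counts["cyrillic"] and counts["perso-arabic"]:
        return "mixed"
    return "cyrillic" if counts["cyrillic"] else "perso-arabic"
```

A caller could use the result to route text to a script-specific tokenizer or to decide whether conversion is needed, for example `detect_script("Забони тоҷикӣ")` for Cyrillic input.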

	### 📊 Data & Benchmarks
	- [Tajik Corpus Explorer]() – Browse and query our curated text collections.
	- [Persian NLP Leaderboard]() – Compare model performance on Tajik, Persian, and Dari tasks.
	- [Annotation Tools]() – Help us improve datasets with your feedback.

	Click on any demo to start experimenting – no installation required!

	---

	## 🧠 Research Focus Areas

	### 🏔️ Tajik Language Technologies
	- Creation of the first large‑scale pretrained models for Tajik (both Cyrillic and Perso-Arabic scripts).
	- Morphological disambiguation and syntactic parsing for Tajik.
	- Speech recognition and synthesis for Tajik (coming soon).

	### 📜 Persian & Iranian NLP
	- Cross‑dialectal transfer learning among Tajik, Dari, and Iranian Persian.
	- Unified tokenization and subword models for the Persian language continuum.
	- Machine translation between Persian varieties and major languages.
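
A shared tokenizer benefits from consistent character encoding, since Perso-Arabic text often mixes Arabic and Persian code points for the same letter. The following is a minimal normalization sketch under that assumption. It is not the project's preprocessing pipeline; the name `normalize_perso_arabic` and the choice of rules are illustrative.

```python
import unicodedata

# Arabic code points commonly replaced by their Persian counterparts,
# plus tatweel (kashida), which is purely typographic and is removed.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u0640": None,      # ARABIC TATWEEL -> removed
})


def normalize_perso_arabic(text: str) -> str:
    """Apply NFC, unify yeh/kaf variants, drop tatweel, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(CHAR_MAP)
    return " ".join(text.split())
```

Running this before training or applying a subword model keeps one surface form per word, so the same word is not split into different subword sequences depending on how it was typed.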

	### 📉 Low‑Resource NLP
	- Data augmentation and semi‑supervised learning techniques.
	- Leveraging multilingual models (e.g., mT5, XLM‑R) for under‑represented languages.
	- Few‑shot and zero‑shot learning for tasks like NER and sentiment analysis.
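
As a concrete illustration of data augmentation, the sketch below produces noisy variants of a sentence by random token deletion and random token swapping. These are two simple, generic techniques chosen for the example; the README does not say which augmentation methods the project uses, and all function names and default values here are assumptions.

```python
import random


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def augment(sentence, n_variants=4, p_delete=0.1, n_swaps=1, seed=0):
    """Return n_variants noisy copies, alternating deletion and swap."""
    rng = random.Random(seed)  # seeded so results are reproducible
    tokens = sentence.split()
    variants = []
    for k in range(n_variants):
        if k % 2 == 0:
            new_tokens = random_deletion(tokens, p_delete, rng)
        else:
            new_tokens = random_swap(tokens, n_swaps, rng)
        variants.append(" ".join(new_tokens))
    return variants
```

Calling `augment("ман ба мактаб меравам")` returns four variants of the sentence. Whitespace tokenization is a simplification; word order and morphology matter in Tajik, so variants like these are usually better suited to classification tasks than to translation or parsing.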

	### 🤖 Language Models
	- Pretraining from scratch and continued pretraining on Persian/Tajik corpora.
	- Efficient architectures (ALBERT, DistilBERT) for low‑resource settings.
	- Evaluation and bias analysis of Persian language models.

	### 📖 Linguistic Resources
	- Corpora: News, literature, web‑crawled texts, social media.
	- Lexicons: Morphological dictionaries, wordnets, sentiment lexicons.
	- Benchmarks: Named entity recognition, part‑of‑speech tagging, machine translation test sets.

	---

	## 📦 Models & Datasets

	We release all our models and datasets on the Hugging Face Hub under open licenses.

	| Model / Dataset | Description | Link |
	|-----------------|-------------|------|
	| TajikBERT | BERT‑base model pretrained on Tajik Cyrillic and Perso-Arabic texts | [🤗 Hub]() |
	| Persian‑mT5 | Multilingual T5 fine‑tuned on Tajik, Persian, and Dari | [🤗 Hub]() |
	| Tajik‑MT‑TgFa | Transformer‑based translation model (Tajik ↔ Persian/Farsi) | [🤗 Hub]() |
	| Tajik‑NER | Named entity recognition model for Tajik | [🤗 Hub]() |
	| TajikCorpus v1.0 | 150M token corpus from news, books, and websites (bilingual scripts) | [🤗 Dataset]() |
	| Persian‑Dialect‑Bench | Parallel sentences for Tajik, Dari, and Iranian Persian | [🤗 Dataset]() |

	More models and datasets are added regularly. Follow our [organization page](https://huggingface.co/TajikNLPWorld) for updates.
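
Once a masked language model is published on the Hub, it can typically be queried through the `transformers` fill-mask pipeline. The sketch below shows that pattern. The repository ids for the models in the table are not given here, so `model_id` is left as a parameter; the `___` placeholder convention and the helper names are assumptions made for this example.

```python
PLACEHOLDER = "___"  # hypothetical marker for the position to predict


def top_predictions(model_id: str, text: str, top_k: int = 5):
    """Return (token, score) pairs for the placeholder position in `text`."""
    if PLACEHOLDER not in text:
        raise ValueError(f"text must contain the placeholder {PLACEHOLDER!r}")
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    # Each checkpoint defines its own mask token (for example [MASK] or <mask>).
    masked = text.replace(PLACEHOLDER, fill_mask.tokenizer.mask_token, 1)
    return format_predictions(fill_mask(masked, top_k=top_k))


def format_predictions(results):
    """Reduce pipeline output dicts to (token, rounded score) pairs."""
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in results]
```

With `transformers` installed and network access, any public fill-mask checkpoint can stand in until a project model id is known, for example `top_predictions("xlm-roberta-base", "Душанбе пойтахти ___ аст.")`. Prediction quality for Tajik will depend on how much Tajik text the chosen checkpoint saw during pretraining.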

	---

	## 📚 Educational Resources

	We believe in open education and reproducible research. All our tutorials and teaching materials are freely available.

	- [Interactive Notebooks]() – Hands‑on tutorials for building low‑resource NLP systems (in Python, using Hugging Face libraries).
	- [Video Lectures]() – Recorded talks on Persian/Tajik NLP, data collection, and model training.
	- [Course Materials]() – Slides, readings, and assignments from our university courses.
	- [Blog Posts]() – Deep dives into challenges and solutions for Tajik and Persian languages.

	---

	## 📝 Selected Publications

	1. "TajikBERT: A Pretrained Language Model for Tajik in Cyrillic and Perso-Arabic Scripts" – LREC 2024
	2. "Bridging Dialects: Machine Translation for Tajik, Dari, and Persian" – WMT 2023
	3. "Building a Named Entity Recognition Dataset for Tajik" – IranNLP 2023
	4. "Multilingual Representations for Iranian Languages: A Comparative Study" – EMNLP 2022
	5. "Tajik Corpus: Collection, Annotation, and Baseline Experiments" – Dialogue 2022

	The full list, with links to PDFs, is available on our [Publications Page]().

	---

	## 🤝 Get Involved

	We welcome contributions from the community – whether you are a researcher, developer, student, or native speaker.

	### For Researchers
	- Use our models and datasets in your work (and cite us!).
	- Collaborate on joint papers and grant proposals.
	- Contribute new benchmarks or evaluation tasks.

	### For Developers
	- Integrate our models into your applications.
	- Report bugs or suggest improvements via GitHub Issues.
	- Submit pull requests to our open‑source repositories.

	### For Native Speakers & Linguists
	- Help us validate translations and annotations.
	- Share texts or corpora (with permission) to enrich our data.
	- Provide feedback on model outputs to reduce errors.

	### For Students
	- Use our demos and tutorials for learning.
	- Participate in our mentorship program or summer schools.
	- Start your own research project with our support.

	---

	## 🌐 Connect With Us

	- 🤗 Hugging Face: [TajikNLPWorld](https://huggingface.co/TajikNLPWorld) – Models, datasets, and spaces.
	- 💻 GitHub: [TajikNLPWorld](https://github.com/TajikNLPWorld) – Source code, development, and issue tracking.
	- 📧 Email: [contact@tajiknlp.world](mailto:contact@tajiknlp.world) – General inquiries and collaboration.
	- 🐦 Twitter/X: [@TajikNLP](https://twitter.com/TajikNLP) – News and updates.
	- 📝 Blog: [Medium/TajikNLPWorld](https://medium.com/tajiknlpworld) – In‑depth articles.

	---

	## 🔄 Ecosystem Integration

	Our work is integrated with the broader Hugging Face ecosystem:

	- Models on the Hub with easy‑to‑use pipelines.
	- Datasets with streaming and evaluation scripts.
	- Spaces for interactive demos and educational tools.
	- Gradio apps for user‑friendly interfaces.
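
Dataset streaming makes it possible to inspect a large corpus without downloading it in full. The sketch below counts whitespace tokens in the first few examples of a streamed dataset using the `datasets` library. The dataset id, the `text` field name, and both helper names are assumptions, since the actual dataset schemas are not described here.

```python
from itertools import islice


def count_tokens(examples, text_field="text", limit=1000):
    """Count documents and whitespace tokens in the first `limit` examples.

    `examples` can be any iterable of dicts, including a streamed dataset.
    """
    n_docs = n_tokens = 0
    for example in islice(examples, limit):
        n_docs += 1
        n_tokens += len(example[text_field].split())
    return n_docs, n_tokens


def stream_token_count(dataset_id, split="train", text_field="text", limit=1000):
    """Stream `dataset_id` from the Hub and count tokens in a sample of it."""
    # Imported lazily so this module loads even without datasets installed.
    from datasets import load_dataset

    # streaming=True yields examples lazily instead of downloading the whole split.
    stream = load_dataset(dataset_id, split=split, streaming=True)
    return count_tokens(stream, text_field=text_field, limit=limit)
```

Because `count_tokens` accepts any iterable of dicts, it can be tried on a small in-memory list first, such as `count_tokens([{"text": "як ду се"}])`, before pointing `stream_token_count` at a real dataset id.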

	---

	Empowering Tajik and Persian languages through open science and community collaboration.

	<div align="center">

	[![Hugging Face](https://img.shields.io/badge/🤗-TajikNLPWorld-yellow)](https://huggingface.co/TajikNLPWorld)
	[![GitHub](https://img.shields.io/badge/GitHub-Repository-black)](https://github.com/TajikNLPWorld)
	[![Twitter](https://img.shields.io/badge/Twitter-@TajikNLP-blue)](https://twitter.com/TajikNLP)

	© 2026 TajikNLPWorld – Open source for low‑resource languages.

	</div>