---
title: TajikNLPWorld - Persian NLP & Low-Resource Languages
emoji: 🏔️
colorFrom: green
colorTo: red
sdk: gradio
pinned: true
license: mit
short_description: Tajik & Persian NLP for low-resource languages.
sdk_version: 6.6.0
---
# TajikNLPWorld – Persian NLP & Low‑Resource Languages Research Hub
![Status](https://img.shields.io/badge/Status-Active-brightgreen)
![Focus](https://img.shields.io/badge/Focus-Tajik_Language-red)
![Focus](https://img.shields.io/badge/Focus-Persian_NLP-orange)
![Focus](https://img.shields.io/badge/Focus-Low_Resource_Languages-blue)
**TajikNLPWorld** is a collaborative research initiative advancing natural language processing for **Tajik**, **Persian/Dari**, and other **low‑resource languages** of the Iranian language family. We develop language models, machine translation systems, linguistic resources, and educational tools for these under‑represented languages.
---
## 🎯 Our Mission
- Build **open‑source language models** for Tajik and other Persian varieties (Dari, Hazaragi).
- Create **high‑quality linguistic resources** (corpora, lexicons, evaluation benchmarks).
- Advance **machine translation** between Tajik, Persian, Dari, and major world languages.
- Develop **educational materials** and interactive demos to lower the entry barrier for low‑resource NLP.
- Foster a community of researchers, developers, and native speakers working together on language technology.
---
## 🚀 Interactive Demos
Explore our live Hugging Face Spaces and try out our models directly in your browser:
### **🔤 Language Models**
- **[TajikGPT Playground]()** – Generate and analyze Tajik text with our latest causal LM.
- **[PersianBERT Explorer]()** – Masked language modelling for Tajik, Persian, and Dari.
- **[Multilingual Embeddings]()** – Compare word/sentence vectors across Iranian languages.
### **🌐 Machine Translation**
- **[Tajik ↔ Persian Translator]()** – Neural translation between Tajik and Persian (Farsi).
- **[Tajik ↔ Russian Translator]()** – Translation demo for the Central Asian context.
- **[Persian Multi-Dialect Translation]()** – Translate between Tajik, Dari, and Iranian Persian.
### **📚 Linguistic Tools**
- **[Tajik Morphological Analyzer]()** – Interactive segmentation and POS tagging (Cyrillic & Perso-Arabic scripts).
- **[Named Entity Recognition for Tajik]()** – Identify persons, locations, organizations.
- **[Script Converter]()** – Convert between Cyrillic Tajik and Perso-Arabic script.
### **📊 Data & Benchmarks**
- **[Tajik Corpus Explorer]()** – Browse and query our curated text collections.
- **[Persian NLP Leaderboard]()** – Compare model performance on Tajik, Persian, and Dari tasks.
- **[Annotation Tools]()** – Help us improve datasets with your feedback.
*Click on any demo to start experimenting – no installation required!*
---
## 🧠 Research Focus Areas
### **🏔️ Tajik Language Technologies**
- Creation of the first large‑scale pretrained models for Tajik (both Cyrillic and Perso-Arabic scripts).
- Morphological disambiguation and syntactic parsing for Tajik.
- Speech recognition and synthesis for Tajik (coming soon).
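
Since the Tajik work listed above spans both Cyrillic and Perso-Arabic scripts, mixed text collections usually need to be routed by script before further processing. The snippet below is an illustrative sketch of one way to do that, not the project's actual implementation. The function name `detect_script` and its three labels are assumptions made for this example; it relies only on Unicode character names from Python's standard library.

```python
import unicodedata
from collections import Counter


def detect_script(text: str) -> str:
    """Return the dominant script of `text`: 'cyrillic', 'perso-arabic', or 'other'.

    Hypothetical helper for illustration: it counts letters by the prefix of
    their Unicode character name and returns the most frequent label.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("ARABIC"):
            counts["perso-arabic"] += 1
        else:
            counts["other"] += 1
    if not counts:
        return "other"  # no letters at all
    return counts.most_common(1)[0][0]
```

A majority vote over letters keeps the check robust to stray Latin tokens such as URLs or brand names inside otherwise Tajik text.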
### **📜 Persian & Iranian NLP**
- Cross‑dialectal transfer learning among Tajik, Dari, and Iranian Persian.
- Unified tokenization and subword models for the Persian language continuum.
- Machine translation between Persian varieties and major languages.
### **📉 Low‑Resource NLP**
- Data augmentation and semi‑supervised learning techniques.
- Leveraging multilingual models (e.g., mT5, XLM‑R) for under‑represented languages.
- Few‑shot and zero‑shot learning for tasks like NER and sentiment analysis.
### **🤖 Language Models**
- Pretraining from scratch and continued pretraining on Persian/Tajik corpora.
- Efficient architectures (ALBERT, DistilBERT) for low‑resource settings.
- Evaluation and bias analysis of Persian language models.
### **📖 Linguistic Resources**
- **Corpora**: News, literature, web‑crawled texts, social media.
- **Lexicons**: Morphological dictionaries, wordnets, sentiment lexicons.
- **Benchmarks**: Named entity recognition, part‑of‑speech tagging, machine translation test sets.
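
For the named entity recognition benchmarks mentioned above, results are commonly reported as entity‑level precision, recall, and F1. The sketch below shows that computation under a simple assumption: each entity is a `(start, end, label)` tuple and only exact matches count. The function name `entity_f1` and the span format are illustrative choices, not the scoring script used by the project.

```python
from typing import Dict, Iterable, Tuple

# Assumed span format for this example: (start_token, end_token, label)
Span = Tuple[int, int, str]


def entity_f1(gold: Iterable[Span], pred: Iterable[Span]) -> Dict[str, float]:
    """Exact-match entity-level precision, recall, and F1."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Exact matching is strict: a prediction with the right boundaries but the wrong label counts as both a false positive and a false negative.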
---
## 📦 Models & Datasets
We release all our models and datasets on the Hugging Face Hub under open licenses.
| Model / Dataset | Description | Link |
|-----------------|-------------|------|
| **TajikBERT** | BERT‑base model pretrained on Tajik Cyrillic and Perso-Arabic texts | [🤗 Hub]() |
| **Persian‑mT5** | Multilingual T5 fine‑tuned on Tajik, Persian, and Dari | [🤗 Hub]() |
| **Tajik‑MT‑TgFa** | Transformer‑based translation model (Tajik ↔ Persian/Farsi) | [🤗 Hub]() |
| **Tajik‑NER** | Named entity recognition model for Tajik | [🤗 Hub]() |
| **TajikCorpus v1.0** | 150M‑token corpus from news, books, and websites (Cyrillic and Perso-Arabic scripts) | [🤗 Dataset]() |
| **Persian‑Dialect‑Bench** | Parallel sentences for Tajik, Dari, and Iranian Persian | [🤗 Dataset]() |
*More models and datasets are added regularly. Follow our [organization page](https://huggingface.co/TajikNLPWorld) for updates.*
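
As a rough sketch of how a masked language model from the table might be loaded with the `transformers` library: the repository name `TajikBERT` under the `TajikNLPWorld` organization is an assumption, since the Hub links in the table are not filled in, and the helper names `hub_id` and `load_fill_mask` are hypothetical. Check the organization page for the actual model IDs before running it.

```python
ORG = "TajikNLPWorld"  # organization name as used in this README's links


def hub_id(repo_name: str) -> str:
    """Build a Hub repository ID of the form '<org>/<repo>'."""
    return f"{ORG}/{repo_name}"


def load_fill_mask(repo_name: str = "TajikBERT"):
    """Return a fill-mask pipeline for the given repository.

    'TajikBERT' is an assumed repository name, used for illustration only.
    The import is done lazily so that defining this helper does not require
    `transformers` to be installed.
    """
    from transformers import pipeline

    return pipeline("fill-mask", model=hub_id(repo_name))


# Example usage (downloads the model, so it needs network access):
# fill = load_fill_mask()
# masked = f"Душанбе пойтахти {fill.tokenizer.mask_token} аст."
# for candidate in fill(masked):
#     print(candidate["token_str"], candidate["score"])
```

Reading the mask token from the tokenizer avoids hard‑coding `[MASK]` or `<mask>`, which differ between model families.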
---
## 📚 Educational Resources
We believe in **open education** and **reproducible research**. All our tutorials and teaching materials are freely available.
- **[Interactive Notebooks]()** – Hands‑on tutorials for building low‑resource NLP systems (in Python, using Hugging Face libraries).
- **[Video Lectures]()** – Recorded talks on Persian/Tajik NLP, data collection, and model training.
- **[Course Materials]()** – Slides, readings, and assignments from our university courses.
- **[Blog Posts]()** – Deep dives into challenges and solutions for Tajik and Persian languages.
---
## 📝 Selected Publications
1. *"TajikBERT: A Pretrained Language Model for Tajik in Cyrillic and Perso-Arabic Scripts"* – LREC 2024
2. *"Bridging Dialects: Machine Translation for Tajik, Dari, and Persian"* – WMT 2023
3. *"Building a Named Entity Recognition Dataset for Tajik"* – IranNLP 2023
4. *"Multilingual Representations for Iranian Languages: A Comparative Study"* – EMNLP 2022
5. *"Tajik Corpus: Collection, Annotation, and Baseline Experiments"* – Dialogue 2022
*The full list, with links to PDFs, is available on our [Publications Page]().*
---
## 🤝 Get Involved
We welcome contributions from the community – whether you are a researcher, developer, student, or native speaker.
### **For Researchers**
- Use our models and datasets in your work (and cite us!).
- Collaborate on joint papers and grant proposals.
- Contribute new benchmarks or evaluation tasks.
### **For Developers**
- Integrate our models into your applications.
- Report bugs or suggest improvements via GitHub Issues.
- Submit pull requests to our open‑source repositories.
### **For Native Speakers & Linguists**
- Help us validate translations and annotations.
- Share texts or corpora (with permission) to enrich our data.
- Provide feedback on model outputs to reduce errors.
### **For Students**
- Use our demos and tutorials for learning.
- Participate in our mentorship program or summer schools.
- Start your own research project with our support.
---
## 🌐 Connect With Us
- **🤗 Hugging Face**: [TajikNLPWorld](https://huggingface.co/TajikNLPWorld) – Models, datasets, and spaces.
- **💻 GitHub**: [TajikNLPWorld](https://github.com/TajikNLPWorld) – Source code, development, and issue tracking.
- **📧 Email**: [contact@tajiknlp.world](mailto:contact@tajiknlp.world) – General inquiries and collaboration.
- **🐦 Twitter/X**: [@TajikNLP](https://twitter.com/TajikNLP) – News and updates.
- **📝 Blog**: [Medium/TajikNLPWorld](https://medium.com/tajiknlpworld) – In‑depth articles.
---
## 🔄 Ecosystem Integration
Our work is integrated with the broader Hugging Face ecosystem:
- **Models** on the Hub with easy‑to‑use pipelines.
- **Datasets** with streaming and evaluation scripts.
- **Spaces** for interactive demos and educational tools.
- **Gradio** apps for user‑friendly interfaces.
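
The dataset streaming mentioned above can be sketched with the `datasets` library, which can iterate over a Hub dataset without downloading it in full. The helper names `stream_dataset` and `preview` are hypothetical, and the repository ID in the usage comment is a placeholder assumption rather than a confirmed dataset name.

```python
from itertools import islice


def stream_dataset(repo_id: str, split: str = "train"):
    """Open a Hub dataset in streaming mode (returns an iterable dataset).

    The import is lazy so that this module can be imported without
    `datasets` installed.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def preview(examples, n: int = 3) -> list:
    """Collect the first `n` items from any iterable of examples."""
    return list(islice(examples, n))


# Example usage (needs network access; the repository ID is an assumption):
# corpus = stream_dataset("TajikNLPWorld/TajikCorpus")
# for example in preview(corpus, n=3):
#     print(example)
```

Streaming is useful for a corpus of this size because the first examples arrive immediately and nothing has to fit on local disk.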
---
**Empowering Tajik and Persian languages through open science and community collaboration.**
<div align="center">
[![Hugging Face](https://img.shields.io/badge/🤗-TajikNLPWorld-yellow)](https://huggingface.co/TajikNLPWorld)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black)](https://github.com/TajikNLPWorld)
[![Twitter](https://img.shields.io/badge/Twitter-@TajikNLP-blue)](https://twitter.com/TajikNLP)
**© 2026 TajikNLPWorld** – Open source for low‑resource languages.
</div>