| title: BashkirNLP - Turkic & Low-Resource Languages | |
| emoji: 🏔️ | |
| colorFrom: green | |
| colorTo: blue | |
| sdk: gradio | |
| pinned: true | |
| license: mit | |
| short_description: Bashkir & Turkic NLP for low-resource languages. | |
| sdk_version: 6.6.0 | |
| # BashkirNLP – Turkic & Low‑Resource Languages Research Hub | |
| **BashkirNLP** is a collaborative research initiative dedicated to advancing natural language processing for **Bashkir**, **Turkic languages**, and **low‑resource languages** in the Ural-Volga region and beyond. We develop state‑of‑the‑art language models, machine translation systems, linguistic resources, and educational tools to empower under‑represented languages in the digital age. | |
| --- | |
| ## 🎯 Our Mission | |
| - Build **open‑source language models** for Bashkir and other Turkic varieties (Tatar, Kazakh, Chuvash, etc.). | |
| - Create **high‑quality linguistic resources** (corpora, lexicons, evaluation benchmarks). | |
| - Advance **machine translation** between Bashkir, Russian, English, and major Turkic languages. | |
| - Develop **educational materials** and interactive demos to lower the entry barrier for low‑resource NLP. | |
| - Foster a community of researchers, developers, and native speakers working together on language technology. | |
| --- | |
| ## 🚀 Interactive Demos | |
| Explore our live Hugging Face Spaces and try out our models directly in your browser: | |
| ### **🔤 Language Models** | |
| - **[BashkirGPT Playground]()** – Generate and analyze Bashkir text with our latest causal LM. | |
| - **[TurkicBERT Explorer]()** – Masked language modelling for Bashkir, Tatar, and other Turkic languages. | |
| - **[Multilingual Embeddings]()** – Compare word/sentence vectors across Turkic languages. | |
| ### **🌐 Machine Translation** | |
| - **[Bashkir ↔ Russian Translator]()** – Neural translation between Bashkir and Russian. | |
| - **[Bashkir ↔ Tatar Translator]()** – Translation demo for closely related Turkic languages. | |
| - **[Bashkir ↔ English Translator]()** – Experimental translation for low-resource pairs. | |
| ### **📚 Linguistic Tools** | |
| - **[Bashkir Morphological Analyzer]()** – Interactive segmentation and POS tagging (Cyrillic & Latin scripts). | |
| - **[Named Entity Recognition for Bashkir]()** – Identify persons, locations, organizations. | |
| - **[Script Converter]()** – Convert between Cyrillic Bashkir and Latin-based orthographies. | |
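As a rough illustration of what a script converter does, the sketch below transliterates Bashkir Cyrillic into a Latin form with a per-character lookup table. The table `CYR_TO_LAT` and the function `to_latin` are hypothetical: this README does not specify a Latin orthography, so the letter choices are placeholders and not necessarily what the demo uses.

```python
# Illustrative only: a partial, hypothetical Cyrillic-to-Latin table.
# A real converter may follow a different orthography and need context rules.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "ғ": "ğ", "д": "d", "ҙ": "ź",
    "е": "e", "ж": "j", "з": "z", "и": "i", "й": "y", "к": "k", "ҡ": "q",
    "л": "l", "м": "m", "н": "n", "ң": "ñ", "о": "o", "ө": "ö", "п": "p",
    "р": "r", "с": "s", "ҫ": "ś", "т": "t", "у": "u", "ү": "ü", "ф": "f",
    "х": "x", "һ": "h", "ш": "ş", "ы": "ı", "э": "e", "ә": "ä",
}


def to_latin(text: str) -> str:
    """Transliterate character by character, preserving case."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)  # unmapped characters pass through unchanged
        elif ch.isupper():
            out.append(lat.capitalize())
        else:
            out.append(lat)
    return "".join(out)


print(to_latin("Башҡортостан"))  # Başqortostan
```

A plain lookup table keeps the mapping easy to audit by native speakers; letters whose Latin form depends on neighbouring sounds would need rules beyond this sketch.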
| ### **📊 Data & Benchmarks** | |
| - **[Bashkir Corpus Explorer]()** – Browse and query our curated text collections. | |
| - **[Turkic NLP Leaderboard]()** – Compare model performance on Bashkir, Tatar, and other Turkic tasks. | |
| - **[Annotation Tools]()** – Help us improve datasets with your feedback. | |
| *Click on any demo to start experimenting – no installation required!* | |
| --- | |
| ## 🧠 Research Focus Areas | |
| ### **🏞️ Bashkir Language Technologies** | |
| - Creation of the first large‑scale pretrained models for Bashkir (Cyrillic script, with Latin adaptation). | |
| - Morphological disambiguation and syntactic parsing for Bashkir (agglutinative morphology). | |
| - Speech recognition and synthesis for Bashkir (coming soon). | |
| ### **📜 Turkic NLP** | |
| - Cross‑lingual transfer learning among Bashkir, Tatar, Kazakh, and other Kipchak languages. | |
| - Unified tokenization and subword models for the Turkic language family. | |
| - Machine translation between Turkic languages and major world languages. | |
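To make the idea of shared subword models concrete, here is a minimal sketch of byte-pair-encoding (BPE) merge learning on toy word counts. It is a generic textbook-style procedure, not the project's tokenizer; the function name `learn_bpe` and the sample words are assumptions chosen for illustration.

```python
from collections import Counter


def learn_bpe(word_freqs: dict, num_merges: int) -> list:
    """Learn up to `num_merges` BPE merges from a {word: frequency} mapping."""
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


# Toy counts: both words end in the same suffix, which BPE picks up first.
print(learn_bpe({"китаплар": 3, "балалар": 2}, num_merges=2))
# [('л', 'а'), ('ла', 'р')]
```

On agglutinative languages, frequent suffixes tend to surface as early merges, which is one reason a vocabulary trained jointly on related languages can share subword units.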
| ### **📉 Low‑Resource NLP** | |
| - Data augmentation and semi‑supervised learning techniques. | |
| - Leveraging multilingual models (e.g., mT5, XLM‑R) for under‑represented languages. | |
| - Few‑shot and zero‑shot learning for tasks like NER and sentiment analysis. | |
| ### **🤖 Language Models** | |
| - Pretraining from scratch and continued pretraining on Bashkir/Turkic corpora. | |
| - Efficient architectures (ALBERT, DistilBERT) for low‑resource settings. | |
| - Evaluation and bias analysis of Turkic language models. | |
| ### **📖 Linguistic Resources** | |
| - **Corpora**: News, literature, web‑crawled texts, social media (e.g., VK, Telegram). | |
| - **Lexicons**: Morphological dictionaries, wordnets, sentiment lexicons. | |
| - **Benchmarks**: Named entity recognition, part‑of‑speech tagging, machine translation test sets. | |
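NER benchmarks are commonly scored with entity-level precision, recall, and F1, where a prediction counts only if its span and label both match exactly. The sketch below shows that computation under an assumed `(start, end, label)` span format; `span_f1` is a hypothetical helper, not the project's evaluation script.

```python
from typing import Dict, Iterable, Tuple

# Assumed span format: (start token index, end token index, entity label).
Span = Tuple[int, int, str]


def span_f1(gold: Iterable[Span], pred: Iterable[Span]) -> Dict[str, float]:
    """Exact-match entity-level precision, recall, and F1."""
    gold_set, pred_set = set(gold), set(pred)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = [(0, 1, "PER"), (3, 4, "LOC")]
pred = [(0, 1, "PER"), (3, 4, "ORG")]  # second span has the wrong label
print(span_f1(gold, pred))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```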
| --- | |
| ## 📦 Models & Datasets | |
| We release all our models and datasets on the Hugging Face Hub under open licenses. | |
| | Model / Dataset | Description | Link | | |
| |-----------------|-------------|------| | |
| | **BashkirBERT** | BERT‑base model pretrained on Bashkir Cyrillic texts | [🤗 Hub]() | | |
| | **Turkic‑mT5** | Multilingual T5 fine‑tuned on Bashkir, Tatar, and Kazakh | [🤗 Hub]() | | |
| | **Bashkir‑MT‑BaRu** | Transformer‑based translation model (Bashkir ↔ Russian) | [🤗 Hub]() | | |
| | **Bashkir‑NER** | Named entity recognition model for Bashkir | [🤗 Hub]() | | |
| | **BashkirCorpus v1.0** | 100M token corpus from news, books, and websites | [🤗 Dataset]() | | |
| | **Turkic‑Parallel‑Bench** | Parallel sentences for Bashkir, Tatar, and Turkish | [🤗 Dataset]() | | |
| *More models and datasets are added regularly. Follow our [organization page](https://huggingface.co/BashkirNLP) for updates.* | |
| --- | |
| ## 📚 Educational Resources | |
| We believe in **open education** and **reproducible research**. All our tutorials and teaching materials are freely available. | |
| - **[Interactive Notebooks]()** – Hands‑on tutorials for building low‑resource NLP systems (in Python, using Hugging Face libraries). | |
| - **[Video Lectures]()** – Recorded talks on Bashkir/Turkic NLP, data collection, and model training. | |
| - **[Course Materials]()** – Slides, readings, and assignments from our university courses. | |
| - **[Blog Posts]()** – Deep dives into challenges and solutions for Bashkir and Turkic languages. | |
| --- | |
| ## 📝 Selected Publications | |
| 1. *"BashkirBERT: A Pretrained Language Model for Bashkir"* – LREC 2025 (planned) | |
| 2. *"Machine Translation for Low-Resource Turkic Languages: Bashkir–Russian Case Study"* – WMT 2024 | |
| 3. *"Building a Named Entity Recognition Dataset for Bashkir"* – TurkicLang 2024 | |
| 4. *"Multilingual Representations for Kipchak Languages: A Comparative Study"* – EMNLP 2023 | |
| 5. *"Bashkir Corpus: Collection, Annotation, and Baseline Experiments"* – Dialogue 2023 | |
| *Full list with links to PDFs available on our [Publications Page]().* | |
| --- | |
| ## 🤝 Get Involved | |
| We welcome contributions from the community – whether you are a researcher, developer, student, or native speaker. | |
| ### **For Researchers** | |
| - Use our models and datasets in your work (and cite us!). | |
| - Collaborate on joint papers and grant proposals. | |
| - Contribute new benchmarks or evaluation tasks. | |
| ### **For Developers** | |
| - Integrate our models into your applications. | |
| - Report bugs or suggest improvements via GitHub Issues. | |
| - Submit pull requests to our open‑source repositories. | |
| ### **For Native Speakers & Linguists** | |
| - Help us validate translations and annotations. | |
| - Share texts or corpora (with permission) to enrich our data. | |
| - Provide feedback on model outputs to reduce errors. | |
| ### **For Students** | |
| - Use our demos and tutorials for learning. | |
| - Participate in our mentorship program or summer schools. | |
| - Start your own research project with our support. | |
| --- | |
| ## 🌐 Connect With Us | |
| - **🤗 Hugging Face**: [BashkirNLP](https://huggingface.co/BashkirNLP) – Models, datasets, and spaces. | |
| - **💻 GitHub**: [BashkirNLP](https://github.com/BashkirNLP) – Source code, development, and issue tracking. | |
| - **📧 Email**: [contact@bashkirnlp.org](mailto:contact@bashkirnlp.org) – General inquiries and collaboration. | |
| - **📝 Blog**: [Medium/BashkirNLP](https://medium.com/bashkirnlp) – In‑depth articles. | |
| --- | |
| ## 🔄 Ecosystem Integration | |
| Our work is integrated with the broader Hugging Face ecosystem: | |
| - **Models** on the Hub with easy‑to‑use pipelines. | |
| - **Datasets** with streaming and evaluation scripts. | |
| - **Spaces** for interactive demos and educational tools. | |
| - **Gradio** apps for user‑friendly interfaces. | |
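The sketch below shows roughly what using a Hub model through a pipeline and streaming a Hub dataset could look like. The repository ids `BashkirNLP/BashkirBERT` and `BashkirNLP/BashkirCorpus`, the `train` split, and both helper functions are assumptions for illustration; check the organization page for the actual names. The `transformers` and `datasets` packages must be installed, and both functions need network access when called.

```python
MODEL_ID = "BashkirNLP/BashkirBERT"      # hypothetical repository id
DATASET_ID = "BashkirNLP/BashkirCorpus"  # hypothetical repository id


def top_fill_mask_tokens(text_with_blank: str, model_id: str = MODEL_ID,
                         blank: str = "___"):
    """Fill the first `blank` in the text and return (token, score) pairs."""
    from transformers import pipeline  # lazy import: only needed when called

    fill_mask = pipeline("fill-mask", model=model_id)
    # Use the model's own mask token rather than hard-coding "[MASK]".
    masked = text_with_blank.replace(blank, fill_mask.tokenizer.mask_token, 1)
    return [(p["token_str"], p["score"]) for p in fill_mask(masked)]


def first_examples(dataset_id: str = DATASET_ID, n: int = 3):
    """Stream the first `n` records without downloading the whole dataset."""
    from datasets import load_dataset  # lazy import: only needed when called

    stream = load_dataset(dataset_id, split="train", streaming=True)
    return list(stream.take(n))
```

Streaming avoids a full download, which matters for a corpus on the scale of 100M tokens.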
| --- | |
| **Empowering Bashkir and Turkic languages through open science and community collaboration.** | |
| <div align="center"> | |
| [Hugging Face](https://huggingface.co/BashkirNLP) | |
| [GitHub](https://github.com/BashkirNLP) | |
| [Twitter](https://twitter.com/BashkirNLP) | |
| **© 2026 BashkirNLP** – Open source for low‑resource languages. | |
| </div> |