---
title: Tatar2Vec Explorer
emoji: ๐
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: true
app_file: app.py
---
# Tatar2Vec Explorer
**Discover the Power of Tatar Language AI**
*High-quality word embeddings for the Tatar language*
[Tatar2Vec on Hugging Face](https://huggingface.co/arabovs-ai-lab/Tatar2Vec) •
[License](LICENSE) •
[Streamlit](https://streamlit.io)
## Overview
Tatar2Vec provides FastText and Word2Vec word embeddings for the Tatar language. In the project's evaluation, the best FastText model reaches a composite score of 0.7019, compared with 0.2000 for Meta's cc.tt.300 vectors. This interactive demo lets you explore the models through similarity search, word analogies, and visualizations.
## Features
### Semantic Search
- **Word Similarity**: Find semantically similar words
- **Vector Operations**: Perform complex word analogies
- **Interactive Visualizations**: Explore results with beautiful charts and word clouds
### Advanced Analytics
- **Model Comparison**: Compare FastText vs Word2Vec performance
- **OOV Handling**: Test out-of-vocabulary word capabilities
- **Performance Metrics**: Detailed model evaluation scores
### Model Variants
- **Best FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch1` (Composite: 0.7019)
- **Alternative FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch3`
- **Best Word2Vec**: `w2v_dim200_win5_min5_sg.epoch4`
- **Compact Word2Vec**: `w2v_dim100_win5_min5_sg`
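The model names above appear to encode their training hyperparameters. The sketch below splits a name into fields and maps them to the corresponding gensim keyword arguments (`vector_size`, `window`, `min_count`, `min_n`, `max_n`, `sg`). The reading of each token (`dim`, `win`, `min`, `ngram`, `sg`, `.epochN`) is an assumption inferred from the names, and `parse_model_name` is a hypothetical helper, not part of the project.

```python
import re

FAMILIES = {"ft": "FastText", "w2v": "Word2Vec"}
# Name token -> gensim keyword argument (assumed mapping).
INT_FIELDS = {"dim": "vector_size", "win": "window", "min": "min_count"}


def parse_model_name(name):
    """Split a name like 'ft_dim100_win5_min5_ngram3-6_sg.epoch1' into fields."""
    stem, _, epoch = name.partition(".epoch")
    tokens = stem.split("_")
    params = {
        "family": FAMILIES[tokens[0]],
        "epoch": int(epoch) if epoch else None,  # None when no epoch suffix
    }
    for token in tokens[1:]:
        if token == "sg":
            params["sg"] = 1  # skip-gram
        elif token.startswith("ngram"):
            low, high = token[len("ngram"):].split("-")
            params["min_n"], params["max_n"] = int(low), int(high)
        else:
            match = re.fullmatch(r"([a-z]+)(\d+)", token)
            if match is None or match.group(1) not in INT_FIELDS:
                raise ValueError(f"unrecognised token: {token!r}")
            params[INT_FIELDS[match.group(1)]] = int(match.group(2))
    return params


print(parse_model_name("ft_dim100_win5_min5_ngram3-6_sg.epoch1"))
```

Under this reading, the best FastText model is a skip-gram model with 100 dimensions, a window of 5, a minimum count of 5, and character n-grams of length 3 to 6.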
## Performance Highlights
| Model | Composite Score | Semantic Similarity | OOV Handling |
|-------|----------------|-------------------|-------------|
| **Best FastText** | 0.7019 | 0.7368 | 1.0000 |
| Meta cc.tt.300 | 0.2000 | - | - |
| **Improvement** | **3.5×** | **Significant** | **Perfect** |
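The improvement factor in the table is the ratio of the two composite scores, which can be checked directly:

```python
# Composite scores copied from the table above.
best_fasttext = 0.7019
meta_cc_tt_300 = 0.2000

improvement = best_fasttext / meta_cc_tt_300
print(f"{improvement:.2f}x")  # prints "3.51x", reported as 3.5× in the table
```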
## Quick Start
### Try These Examples:
#### Word Similarity
```python
# `model` is a gensim model loaded as shown under API Access
# Find words similar to "мәктәп" (school)
similar_words = model.wv.most_similar('мәктәп', topn=10)
```
#### Word Analogies
```python
# doctor - man + woman = ?
analogy = model.wv.most_similar(
    positive=['табиб', 'хатын'],  # doctor, woman
    negative=['ир'],              # man
)
```
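To make the vector arithmetic behind an analogy query concrete, here is a minimal pure-Python sketch: it adds the unit vectors of the positive words, subtracts those of the negative words, and ranks the remaining words by cosine similarity to the result. The 2-D toy vectors and the helper names are invented for illustration; gensim's `most_similar` follows the same general idea on the real embeddings, though its details may differ.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    exclude = set(positive) | set(negative)  # never return the query words
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (invented): axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}
result = analogy(toy, positive=["king", "woman"], negative=["man"])
print(result[0][0])  # prints "queen"
```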
#### OOV Testing (FastText Only)
```python
# FastText builds a vector for an unknown word from its character n-grams
vector = model.wv['технологияләштерү']  # technology-related word
```
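FastText can handle out-of-vocabulary words because it represents each word as a bag of character n-grams and composes the word vector from the n-gram vectors. The sketch below only shows the n-gram extraction step, with `<` and `>` as word-boundary markers. The 3 to 6 range is an assumption taken from the `ngram3-6` part of the model names, and `char_ngrams` is a hypothetical helper; a real implementation also hashes the n-grams into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("мәктәп", min_n=3, max_n=3))
# ['<мә', 'мәк', 'әкт', 'ктә', 'тәп', 'әп>']
```

An unseen word that shares n-grams such as `мәк` or `ктә` with known words still receives a meaningful vector, which a plain Word2Vec model cannot provide.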
## Technical Details
### Training Corpus
- **Total Tokens**: 203.2 million
- **Vocabulary Size**: 637.7K words
- **Unique Words**: 1.8 million
- **Domains**: Wikipedia, news, books, social media
### Model Architecture
- **FastText**: Subword information support
- **Word2Vec**: Classical word embeddings
- **Best configuration**: Skip-gram architecture, 100 dimensions (the best Word2Vec variant uses 200)
## Use Cases
### Education
- Language learning applications
- Educational content analysis
- Academic research
### Business
- Content recommendation systems
- Search engine enhancement
- Customer feedback analysis
### Research
- Linguistic studies
- Cross-lingual comparisons
- AI model development
## Installation
### Local Development
```bash
git clone https://huggingface.co/spaces/arabovs-ai-lab/tatar2vec-demo
cd tatar2vec-demo
pip install -r requirements.txt
streamlit run app.py
```
### Docker Deployment
```bash
docker build -t tatar2vec-demo .
docker run -p 7860:7860 tatar2vec-demo
```
## API Access
```python
from huggingface_hub import snapshot_download
from gensim.models import FastText
# Download and load the best model
model_dir = snapshot_download(repo_id="arabovs-ai-lab/Tatar2Vec")
model = FastText.load(f"{model_dir}/fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/ft_dim100_win5_min5_ngram3-6_sg.epoch1.model")
# Use the model
similar_words = model.wv.most_similar('мәктәп')
```
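When querying a loaded model, it can help to distinguish in-vocabulary words from words the model has never seen. The sketch below assumes a gensim 4 `KeyedVectors`-style object (such as `model.wv`) that exposes `key_to_index` and `most_similar`, and that `most_similar` raises `KeyError` for a word it cannot represent, as Word2Vec vectors do. `describe_lookup` is a hypothetical helper, and `ToyVectors` is a stand-in with invented scores so the example runs without downloading a model.

```python
def describe_lookup(wv, word, topn=5):
    """Report whether `word` is in the vocabulary and list its neighbours."""
    in_vocab = word in wv.key_to_index
    try:
        neighbours = wv.most_similar(word, topn=topn)
    except KeyError:
        # Word2Vec-style vectors cannot represent unseen words.
        neighbours = []
    return {"word": word, "in_vocab": in_vocab, "neighbours": neighbours}


class ToyVectors:
    """Stand-in exposing the two members used above; not part of gensim."""

    def __init__(self, neighbours):
        self._neighbours = neighbours
        self.key_to_index = {w: i for i, w in enumerate(neighbours)}

    def most_similar(self, word, topn=10):
        if word not in self._neighbours:
            raise KeyError(f"Key '{word}' not present in vocabulary")
        return self._neighbours[word][:topn]


# Neighbour scores are invented for illustration.
toy = ToyVectors({"мәктәп": [("укытучы", 0.71), ("дәрес", 0.66)]})
print(describe_lookup(toy, "мәктәп", topn=1))
print(describe_lookup(toy, "юк"))
```

With a real model, pass `model.wv` instead of `toy`. For FastText vectors the `in_vocab` flag is the useful part, since unseen words normally still get neighbours through their n-grams.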
## Evaluation Metrics
Our models were evaluated on multiple dimensions:
- **Semantic Similarity**: Human-judged word pairs
- **Analogy Accuracy**: Word relationship tasks
- **OOV Handling**: Unknown word processing
- **Neighbor Coherence**: Semantic consistency
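As an illustration of the semantic similarity dimension, one common way to score embeddings against human-judged word pairs is the Spearman rank correlation between the human ratings and the model's cosine similarities. This README does not state which statistic the project uses, so treat the choice of Spearman as an assumption. The ratings and similarities below are invented.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


human_scores = [9.0, 7.5, 3.0, 1.0]          # invented human ratings per pair
model_cosines = [0.82, 0.40, 0.65, 0.10]     # invented model similarities
print(round(spearman(human_scores, model_cosines), 2))  # prints 0.8
```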
## Contributing
We welcome contributions from the community! Areas of interest:
- Additional evaluation benchmarks
- New model architectures
- Expanded training data
- Multilingual applications
## Citation
If you use Tatar2Vec in your research, please cite:
```bibtex
@misc{tatar2vec2025,
title = {Tatar2Vec: High-Quality Tatar Word Embeddings},
author = {Arabovs AI Lab},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec},
note = {Version 1.0}
}
```
## License
This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Tatar language speakers and contributors
- Hugging Face for platform support
- Open-source community for tools and libraries
---
**Empowering Tatar Language Technology**
*Brought to you by [Arabovs AI Lab](https://huggingface.co/arabovs-ai-lab)*
[Report Issues](https://github.com/arabovs-ai-lab/tatar2vec/issues) •
[Request Features](https://github.com/arabovs-ai-lab/tatar2vec/discussions) •
[Contact Team](mailto:contact@arabovs-ai-lab.com)