---
title: Tatar2Vec Explorer
emoji: 🏆
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: true
app_file: app.py
---

# 🏆 Tatar2Vec Explorer

<div align="center">

**Discover the Power of Tatar Language AI**

*High-quality word embeddings for the Tatar language*

[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face%20Space-blue)](https://huggingface.co/arabovs-ai-lab/Tatar2Vec)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)

</div>

## 🌟 Overview

Tatar2Vec provides FastText and Word2Vec word embeddings trained for the Tatar language. On the project's composite evaluation score, the best FastText model reaches 0.7019, compared with 0.2000 for Meta's cc.tt.300 vectors. This interactive demo lets you explore word similarity, analogies, and out-of-vocabulary handling with the trained models.

## 🚀 Features

### 🔍 Semantic Search
- **Word Similarity**: Find semantically similar words
- **Vector Operations**: Perform complex word analogies
- **Interactive Visualizations**: Explore results with beautiful charts and word clouds

### 🧠 Advanced Analytics
- **Model Comparison**: Compare FastText vs Word2Vec performance
- **OOV Handling**: Test out-of-vocabulary word capabilities
- **Performance Metrics**: Detailed model evaluation scores

### 🎯 Model Variants
- **🥇 Best FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch1` (Composite: 0.7019)
- **🥈 Alternative FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch3`
- **🥇 Best Word2Vec**: `w2v_dim200_win5_min5_sg.epoch4`
- **🥈 Compact Word2Vec**: `w2v_dim100_win5_min5_sg`
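
The variant names appear to encode training hyperparameters. The sketch below splits a name into fields, assuming `dim` is the vector size, `win` the context window, `min` the minimum word count, `ngram` the character n-gram range, `sg` skip-gram, and `.epochN` the checkpoint epoch. That reading is inferred from the names rather than documented, and `parse_variant_name` is a hypothetical helper, not part of this project.

```python
import re

def parse_variant_name(name):
    """Split a variant name into its apparent hyperparameters (assumed convention)."""
    base, _, epoch_part = name.partition(".epoch")
    fields = base.split("_")
    info = {"family": fields[0], "epoch": int(epoch_part) if epoch_part else None}
    for field in fields[1:]:
        if field == "sg":
            info["skip_gram"] = True
            continue
        ngram = re.fullmatch(r"ngram(\d+)-(\d+)", field)
        if ngram:
            info["ngram_range"] = (int(ngram.group(1)), int(ngram.group(2)))
            continue
        numeric = re.fullmatch(r"(dim|win|min)(\d+)", field)
        if numeric:
            info[numeric.group(1)] = int(numeric.group(2))
    return info

print(parse_variant_name("ft_dim100_win5_min5_ngram3-6_sg.epoch1"))
```

A name without an `.epochN` suffix, such as `w2v_dim100_win5_min5_sg`, yields `epoch` set to `None`.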

## 📊 Performance Highlights

| Model | Composite Score | Semantic Similarity | OOV Handling |
|-------|----------------|-------------------|-------------|
| **Best FastText** | 0.7019 | 0.7368 | 1.0000 |
| Meta cc.tt.300 | 0.2000 | - | - |
| **Improvement** | **3.5×** | - | - |

## 🎮 Quick Start

### Try These Examples:

#### Word Similarity
```python
# Find words similar to "мәктәп" (school); `model` is a loaded gensim model
similar_words = model.wv.most_similar('мәктәп', topn=10)
```

#### Word Analogies
```python
# Doctor - man + woman = ?
analogy = model.wv.most_similar(
    positive=['табиб', 'хатын'],  # doctor, woman
    negative=['ир']               # man
)
```
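
Conceptually, an analogy query adds the positive vectors, subtracts the negative ones, and ranks the remaining vocabulary by cosine similarity to the result. The toy sketch below illustrates that idea with made-up 2-D vectors. The English keys, the numbers, and the `solve_analogy` helper are all illustrative assumptions, not taken from the trained models, and gensim's own implementation differs in detail (for example, it normalizes vectors before combining them).

```python
import math

# Made-up 2-D vectors for illustration only; trained embeddings have 100 or more dimensions.
toy_vectors = {
    "king": (0.9, 0.8),
    "queen": (0.9, 0.2),
    "man": (0.1, 0.8),
    "woman": (0.1, 0.2),
    "apple": (-0.7, 0.1),
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def solve_analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding query words."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    query_words = set(positive) | set(negative)
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in query_words}
    return max(scores, key=scores.get)

result = solve_analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(result)
```

Excluding the query words from the candidates matters: without it, one of the input words is often the nearest neighbor of the combined vector.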

#### OOV Testing (FastText Only)
```python
# FastText builds a vector for an unseen word from its character n-grams
vector = model.wv['технологияләштерү']  # technology-related word
```
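
FastText can return a vector for a word it never saw during training because it represents each word through its character n-grams. The sketch below shows only the n-gram extraction step, assuming the 3 to 6 character range suggested by `ngram3-6` in the model names. `char_ngrams` is a hypothetical helper written for illustration, not a gensim function.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in FastText-style boundary markers."""
    wrapped = f"<{word}>"  # "<" and ">" mark the start and end of the word
    return [
        wrapped[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(wrapped) - n + 1)
    ]

# "ир" (man) is short enough to list every n-gram by hand.
print(char_ngrams("ир"))
```

The model then combines the vectors learned for these n-grams, so an unseen word that shares n-grams with known words still gets a meaningful vector. Word2Vec has no such fallback, which is why OOV testing applies to FastText only.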

## 🏗️ Technical Details

### Training Corpus
- **Total Tokens**: 203.2 million
- **Vocabulary Size**: 637.7K words
- **Unique Words**: 1.8 million
- **Domains**: Wikipedia, news, books, social media

### Model Architecture
- **FastText**: Subword information support
- **Word2Vec**: Classical word embeddings
- **Configuration**: Skip-gram architecture; 100 dimensions for most variants, 200 for the best Word2Vec model

## 📚 Use Cases

### 🎓 Education
- Language learning applications
- Educational content analysis
- Academic research

### 💼 Business
- Content recommendation systems
- Search engine enhancement
- Customer feedback analysis

### 🔬 Research
- Linguistic studies
- Cross-lingual comparisons
- AI model development

## 🛠️ Installation

### Local Development
```bash
git clone https://huggingface.co/spaces/arabovs-ai-lab/tatar2vec-demo
cd tatar2vec-demo
pip install -r requirements.txt
streamlit run app.py
```

### Docker Deployment
```bash
docker build -t tatar2vec-demo .
docker run -p 7860:7860 tatar2vec-demo
```

## 🌐 API Access

```python
from pathlib import Path

from huggingface_hub import snapshot_download
from gensim.models import FastText

# Download the repository and load the best model
model_dir = Path(snapshot_download(repo_id="arabovs-ai-lab/Tatar2Vec"))
name = "ft_dim100_win5_min5_ngram3-6_sg.epoch1"
model = FastText.load(str(model_dir / "fasttext" / name / f"{name}.model"))

# Use the model
similar_words = model.wv.most_similar('мәктәп')
```

## 📊 Evaluation Metrics

Our models were evaluated on multiple dimensions:
- **Semantic Similarity**: Human-judged word pairs
- **Analogy Accuracy**: Word relationship tasks
- **OOV Handling**: Unknown word processing
- **Neighbor Coherence**: Semantic consistency
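
Semantic similarity on human-judged word pairs is commonly scored as the Spearman rank correlation between the human ratings and the model's cosine similarities. Whether Tatar2Vec uses exactly this statistic is an assumption, and the ratings and similarities below are made up for illustration. A minimal pure-Python sketch:

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: one human rating and one model cosine similarity per word pair.
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.82, 0.64, 0.70, 0.05]
rho = spearman(human_scores, model_scores)
print(round(rho, 4))
```

A rank-based statistic is a natural fit here because human ratings and cosine similarities live on different scales; only the ordering of the pairs needs to agree.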

## 🤝 Contributing

We welcome contributions from the community! Areas of interest:
- Additional evaluation benchmarks
- New model architectures
- Expanded training data
- Multilingual applications

## 📜 Citation

If you use Tatar2Vec in your research, please cite:

```bibtex
@misc{tatar2vec2025,
  title = {Tatar2Vec: High-Quality Tatar Word Embeddings},
  author = {Arabovs AI Lab},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec},
  note = {Version 1.0}
}
```

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Tatar language speakers and contributors
- Hugging Face for platform support
- Open-source community for tools and libraries

---

<div align="center">

**Empowering Tatar Language Technology**

*Brought to you by [Arabovs AI Lab](https://huggingface.co/arabovs-ai-lab)*

[Report Issues](https://github.com/arabovs-ai-lab/tatar2vec/issues) • 
[Request Features](https://github.com/arabovs-ai-lab/tatar2vec/discussions) • 
[Contact Team](mailto:contact@arabovs-ai-lab.com)

</div>