---
title: Tatar2Vec Explorer
emoji: 🏆
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: true
app_file: app.py
---
# 🏆 Tatar2Vec Explorer
<div align="center">
**Discover the Power of Tatar Language AI**
*High-quality word embeddings for the Tatar language*
[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face%20Space-blue)](https://huggingface.co/arabovs-ai-lab/Tatar2Vec)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)
</div>
## 🌟 Overview
Tatar2Vec provides high-quality word embeddings for the Tatar language that substantially outperform existing publicly available models. This interactive demo lets you explore the semantic structure of Tatar through two embedding families, FastText and Word2Vec.
## 🚀 Features
### 🔍 Semantic Search
- **Word Similarity**: Find semantically similar words
- **Vector Operations**: Perform complex word analogies
- **Interactive Visualizations**: Explore results with beautiful charts and word clouds
### 🧠 Advanced Analytics
- **Model Comparison**: Compare FastText vs Word2Vec performance
- **OOV Handling**: Test out-of-vocabulary word capabilities
- **Performance Metrics**: Detailed model evaluation scores
### 🎯 Model Variants
- **🥇 Best FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch1` (Composite: 0.7019)
- **🥈 Alternative FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch3`
- **🥇 Best Word2Vec**: `w2v_dim200_win5_min5_sg.epoch4`
- **🥈 Compact Word2Vec**: `w2v_dim100_win5_min5_sg`
## 📊 Performance Highlights
| Model | Composite Score | Semantic Similarity | OOV Handling |
|-------|----------------|-------------------|-------------|
| **Best FastText** | 0.7019 | 0.7368 | 1.0000 |
| Meta cc.tt.300 | 0.2000 | - | - |
| **Improvement** | **3.5×** | **Significant** | **Perfect** |
## 🎮 Quick Start
### Try These Examples:
#### Word Similarity
```python
# Find the 10 words most similar to "мәктәп" (school)
similar_words = model.wv.most_similar('мәктәп', topn=10)
```
#### Word Analogies
```python
# табиб (doctor) - ир (man) + хатын (woman) = ?
analogy = model.wv.most_similar(
    positive=['табиб', 'хатын'],  # doctor, woman
    negative=['ир'],              # man
)
```
#### OOV Testing (FastText Only)
```python
# FastText composes vectors for unseen words from character n-grams
vector = model.wv['технологияләштерү']  # "technologization", possibly out-of-vocabulary
```
## 🏗️ Technical Details
### Training Corpus
- **Total Tokens**: 203.2 million
- **Vocabulary Size**: 637.7K words
- **Unique Words**: 1.8 million
- **Domains**: Wikipedia, news, books, social media
### Model Architecture
- **FastText**: Subword information support
- **Word2Vec**: Classical word embeddings
- **Optimized**: Skip-gram training, 100 or 200 dimensions depending on variant
## 📚 Use Cases
### 🎓 Education
- Language learning applications
- Educational content analysis
- Academic research
### 💼 Business
- Content recommendation systems
- Search engine enhancement
- Customer feedback analysis
### 🔬 Research
- Linguistic studies
- Cross-lingual comparisons
- AI model development
## 🛠️ Installation
### Local Development
```bash
git clone https://huggingface.co/spaces/arabovs-ai-lab/tatar2vec-demo
cd tatar2vec-demo
pip install -r requirements.txt
streamlit run app.py
```
### Docker Deployment
```bash
docker build -t tatar2vec-demo .
docker run -p 7860:7860 tatar2vec-demo
```
## 🌐 API Access
```python
from huggingface_hub import snapshot_download
from gensim.models import FastText
# Download and load the best model
model_dir = snapshot_download(repo_id="arabovs-ai-lab/Tatar2Vec")
model = FastText.load(f"{model_dir}/fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/ft_dim100_win5_min5_ngram3-6_sg.epoch1.model")
# Use the model
similar_words = model.wv.most_similar('мәктәп')
```
## 📊 Evaluation Metrics
Our models were evaluated on multiple dimensions:
- **Semantic Similarity**: Human-judged word pairs
- **Analogy Accuracy**: Word relationship tasks
- **OOV Handling**: Unknown word processing
- **Neighbor Coherence**: Semantic consistency
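As a sketch of how the semantic-similarity dimension is typically scored (the exact protocol and word-pair dataset are not published here, so the vectors and human scores below are illustrative): cosine similarity is computed between each pair's embeddings and rank-correlated against human judgments.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction, for illustration)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Illustrative 2-d vectors and human scores, not real evaluation data
vectors = {'мәктәп': [0.9, 0.1], 'укытучы': [0.8, 0.3], 'дару': [0.1, 0.9]}
pairs = [('мәктәп', 'укытучы', 0.85), ('мәктәп', 'дару', 0.15)]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
print(round(spearman(model_scores, human_scores), 2))  # 1.0 for this toy data
```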
## 🤝 Contributing
We welcome contributions from the community! Areas of interest:
- Additional evaluation benchmarks
- New model architectures
- Expanded training data
- Multilingual applications
## 📜 Citation
If you use Tatar2Vec in your research, please cite:
```bibtex
@misc{tatar2vec2025,
  title     = {Tatar2Vec: High-Quality Tatar Word Embeddings},
  author    = {Arabovs AI Lab},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec},
  note      = {Version 1.0}
}
```
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- Tatar language speakers and contributors
- Hugging Face for platform support
- Open-source community for tools and libraries
---
<div align="center">
**Empowering Tatar Language Technology**
*Brought to you by [Arabovs AI Lab](https://huggingface.co/arabovs-ai-lab)*
[Report Issues](https://github.com/arabovs-ai-lab/tatar2vec/issues) •
[Request Features](https://github.com/arabovs-ai-lab/tatar2vec/discussions) •
[Contact Team](mailto:contact@arabovs-ai-lab.com)
</div>